Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks,
in which features of audio waveforms are matched with class-specific text prompt features, an approach inspired
by advancements in Vision-Language Models (VLMs). Given the sensitivity of zero-shot performance to the choice of
hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the
efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio
Language Models (PALM), which optimizes the feature space of the text encoder
branch. Unlike existing methods that work in the input space, our approach results in greater training
efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing
a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning
setup. Our method is either on par with or outperforms other approaches while being computationally less
demanding.
TLDR: We adapt vision-language prompt learning methods for audio-language models and introduce PALM,
a new method that is computationally efficient and outperforms or matches baselines in audio classification
across 11 datasets.
Zero-shot inference in audio-language models (ALMs) refers to making predictions on new, unseen data without task-specific training. Let an ALM be denoted by \(f_{\theta} = \{f_{_{A}},f_{_{T}}\}\), where \(f_{_{A}}\) and \(f_{_{T}}\) are the audio and text encoders, respectively. For classification in the zero-shot scenario, the audio \(\boldsymbol{\mathrm{x}}\) is first passed to the audio encoder \(f_{_{A}}\), resulting in a \(d\)-dimensional feature vector \(f_{_{A}}(\boldsymbol{\mathrm{x}}) \in \mathbb{R}^{d}\). Similarly, on the text encoder side, each class label \(y_i \in \{\mathit{y}_{1}, \mathit{y}_{2}, \dots, \mathit{y}_{C} \}\) is wrapped in a class-specific text template, such as: $$t_i = \mathrm{''An~audio~recording~of~\{CLASS~y_i\}.''}$$ Each text prompt \(t_i\) is fed to the text encoder \(f_{_{T}}\), yielding a text feature vector \(f_{_{T}}(t_i) \in \mathbb{R}^{d}\). The alignment between the audio feature vector and each text prompt feature vector is quantified using cosine similarity, \(\mathtt{sim}\big(f_{_{A}}(\boldsymbol{\mathrm{x}}),f_{_{T}}(t_i)\big)\). The class with the highest similarity score is selected as the predicted class label \(\hat{y}\), i.e. $$\hat{y} = \underset{ i\in \{1,2,\dots,C\} }{\mathbf{argmax}} ~~~ \mathtt{sim}\big(f_{_{A}}(\boldsymbol{\mathrm{x}})~,~f_{_{T}}(t_i)\big)$$
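To make this pipeline concrete, below is a minimal sketch of the zero-shot classification step in PyTorch. The `audio_encoder` and `text_encoder` callables stand in for the ALM's two branches and are assumptions for illustration, not a specific model API.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(audio_encoder, text_encoder, waveform, class_names):
    # Wrap each class name in the hand-crafted template t_i.
    prompts = [f"An audio recording of {name}." for name in class_names]

    with torch.no_grad():
        a = audio_encoder(waveform)                           # (d,)   audio feature f_A(x)
        t = torch.stack([text_encoder(p) for p in prompts])   # (C, d) text features f_T(t_i)

    # Cosine similarity between the audio feature and every class prompt.
    sims = F.cosine_similarity(a.unsqueeze(0), t, dim=-1)     # (C,)
    return class_names[sims.argmax().item()]
```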
Zero-shot inference in vision-language models (VLMs) and audio-language models (ALMs) relies on manually crafted text prompts, which significantly impact performance. Prompt Learning, as explored by Gu et al. 2023, automates this by learning text prompts from training data, eliminating manual effort. The first notable method, COOP, learns the context of text prompts in the token-embedding space using a few-shot training setup. This compute-efficient approach improves VLMs' performance on downstream tasks while requiring only a small subset of data to learn the prompts.
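As a rough illustration of the COOP idea in code, the sketch below prepends a small set of learnable context embeddings to each class name's token embeddings before they enter a frozen text encoder. The `encode_from_token_embeddings` interface is a hypothetical stand-in; real encoders expose their token-embedding layer differently.

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """COOP-style learnable context in the token-embedding space (sketch)."""
    def __init__(self, num_ctx: int, embed_dim: int):
        super().__init__()
        # Shared context vectors, randomly initialized (COOP also allows
        # class-specific context; omitted here for brevity).
        self.ctx = nn.Parameter(0.02 * torch.randn(num_ctx, embed_dim))

    def forward(self, class_token_embeds, encode_from_token_embeddings):
        # class_token_embeds: list of (L_i, embed_dim) tensors, one per class.
        feats = []
        for tok in class_token_embeds:
            prompt = torch.cat([self.ctx, tok], dim=0)   # "[CTX]*M {CLASS y_i}"
            feats.append(encode_from_token_embeddings(prompt))
        return torch.stack(feats)                        # (C, d) text features
```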
Prompt learning in ALMs is a relatively new and under-explored area of research. In this work, we first show the efficacy of prompt learning methods (originally introduced for VLMs) in ALMs and then propose a novel method, PALM, that optimizes the feature space of the text encoder branch in ALMs. Our method is either on par with or outperforms baseline approaches while being computationally less demanding.
The PALM (Prompt Learning for Audio Language Models) method does not require hand-crafted prompts; instead, it simply uses class names as the input to the text encoder, i.e. \(t_i =\mathrm{''\{CLASS~y_i\}''}\). Moreover, unlike COOP, which learns the context of input text prompts in the token-embedding space, PALM learns the context in the feature space of the prompts. Specifically, after obtaining the feature vector of the \(i^{\text{th}}\) class text prompt via the text encoder, i.e. \(f_{_{T}}(t_i) \in \mathbb{R}^{d}\), it adds a learnable vector \(z_i \in \mathbb{R}^{d}\) to obtain the updated text feature vector: $$f_{_{T}}^{\prime}(t_i) = (1-\lambda_i)\cdot f_{_{T}}(t_i)~+~\lambda_i \cdot z_i$$ where \(\lambda_i \in [0,1]\) is a learnable parameter that determines the contribution of each vector. Assuming \(\boldsymbol{\mathrm{t}}=\{t_1,t_2,\dots,t_C\}\) denotes the text prompts of all classes, the raw/un-normalized prediction scores (logits) for an audio waveform \(\boldsymbol{\mathrm{x}}\), denoted \(f_{_{\theta}}(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{t}}) \in \mathbb{R}^{C}\), are obtained as follows: $$f_{_{\theta}}(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{t}}) = \bigg\{~\mathtt{sim}\bigg(f_{_{A}}(\boldsymbol{\mathrm{x}})~,~f_{_{T}}^{\prime}(t_i)\bigg)~\bigg\}_{i=1}^{C},$$ where \(\texttt{sim}(\cdot)\) is the cosine-similarity function and \(C\) is the number of classes; \(f_{_{A}}(\boldsymbol{\mathrm{x}})\) is the feature vector from the audio encoder, and \(f_{_{T}}^{\prime}(t_i)\) is the updated text feature vector of the \(i^{\text{th}}\) class. The following objective is optimized to learn the feature-space context embeddings \(\boldsymbol{\mathrm{z}}=\{z_1,z_2,\dots,z_C\}\) and their corresponding contributions \(\lambda=\{\lambda_1,\lambda_2,\dots,\lambda_C\}\): $$\underset{ \boldsymbol{\mathrm{z}}~,~\lambda }{\mathbf{minimize}}~~ \sum_{(\boldsymbol{\mathrm{x}},y)\in\mathcal{D}} \mathcal{L}\big(f_{_{\theta}}(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{t}}),y\big),$$ where \(\mathcal{D}=\{\boldsymbol{\mathrm{x}}_i,y_i\}_{i=1}^{N}\) is the training dataset of \(N\) audio-class pairs and \(\mathcal{L}(\cdot)\) denotes the cross-entropy loss. After learning these parameters, the following rule is used for classification at inference time: $$\hat{y} = \underset{ i\in \{1,2,\dots,C\} }{\mathbf{argmax}} ~~~ \mathtt{sim}\big(f_{_{A}}(\boldsymbol{\mathrm{x}})~,~f_{_{T}}^{\prime}(t_i)\big)$$ An overview of our proposed approach (PALM) can be found in the following figure, and a minimal code sketch of the procedure is given below.
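The sketch below follows the equations above. It assumes the class-name text features \(f_{_{T}}(t_i)\) are precomputed once with the frozen text encoder; the sigmoid reparameterization that keeps \(\lambda_i\) in \([0,1]\) is our assumption for this sketch, not a detail taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PALMHead(nn.Module):
    def __init__(self, text_feats):                  # text_feats: (C, d) from the frozen text encoder
        super().__init__()
        self.register_buffer("text_feats", text_feats)
        C, d = text_feats.shape
        self.z = nn.Parameter(torch.zeros(C, d))     # learnable context vectors z_i (zero init is an assumption)
        self.lam_raw = nn.Parameter(torch.zeros(C))  # lambda_i = sigmoid(lam_raw_i) stays in [0, 1]

    def updated_text_feats(self):
        lam = torch.sigmoid(self.lam_raw).unsqueeze(-1)       # (C, 1)
        return (1.0 - lam) * self.text_feats + lam * self.z   # f'_T(t_i)

    def forward(self, audio_feats):                  # audio_feats: (B, d) from the audio encoder
        t = F.normalize(self.updated_text_feats(), dim=-1)
        a = F.normalize(audio_feats, dim=-1)
        return a @ t.t()                             # (B, C) cosine-similarity logits

# Usage sketch: only z and lambda receive gradients; cross-entropy is applied to the logits.
# head = PALMHead(precomputed_text_feats)            # (C, d) features of the class-name prompts
# optimizer = torch.optim.SGD(head.parameters(), lr=0.05)
# loss = F.cross_entropy(head(audio_encoder(x_batch)), y_batch)
```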
Baselines Our baselines include ZERO-SHOT, COOP, and COCOOP. COOP and COCOOP are prompt-learning methods for VLMs, which we adapt to audio-language models by replacing the vision encoder with an audio encoder. Both methods optimize the text encoder's input space, with COCOOP additionally adding a feedback loop from audio features to the text encoder's input. For all baselines, we use PENGI (a multimodal-to-text generation model), using only its audio and text encoders.
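For completeness, here is a sketch of the COCOOP-style feedback loop as we adapted it: a small meta-network maps the audio feature to a shift that is added to the learnable context tokens, so the prompt fed to the text encoder is conditioned on the input audio. Layer sizes and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioConditionedContext(nn.Module):
    """COCOOP-style audio-conditioned context tokens (sketch)."""
    def __init__(self, num_ctx: int, embed_dim: int, audio_dim: int):
        super().__init__()
        self.ctx = nn.Parameter(0.02 * torch.randn(num_ctx, embed_dim))
        self.meta_net = nn.Sequential(                 # audio feature -> context shift
            nn.Linear(audio_dim, audio_dim // 16),
            nn.ReLU(),
            nn.Linear(audio_dim // 16, embed_dim),
        )

    def forward(self, audio_feat):                     # audio_feat: (audio_dim,)
        shift = self.meta_net(audio_feat)              # (embed_dim,)
        return self.ctx + shift                        # (num_ctx, embed_dim), broadcast add
```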
Datasets We evaluate our method on 11 audio classification datasets, covering a range of tasks from Emotion Recognition to Music Analysis. The following table lists all the datasets used in our experiments.
DATASETS | TYPE | CLASSES | SPLIT |
---|---|---|---|
Beijing-Opera | Instrument Classification | 4 | Five Fold |
NS-Instruments | Instrument Classification | 10 | Train-Test |
ESC50 | Sound Event Classification | 50 | Five Fold |
ESC50-Actions | Sound Event Classification | 10 | Five Fold |
UrbanSound8K | Sound Event Classification | 10 | Ten Fold |
CREMA-D | Emotion Recognition | 6 | Train-Test |
RAVDESS | Emotion Recognition | 8 | Train-Test |
VocalSound | Vocal Sound Classification | 6 | Train-Test |
SESA | Surveillance Sound Classification | 4 | Train-Test |
TUT2017 | Acoustic Scene Classification | 15 | Four Fold |
GT-Music-Genre | Music Analysis | 10 | Train-Test |
Experimental Settings All experiments are run for 50 epochs, using 16 randomly selected samples per class from the training set for the few-shot setup and the full test set for inference. We use the SGD optimizer with a learning rate of 0.05 and accuracy as the evaluation metric.
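As an example, a few-shot subset of this kind can be drawn as shown below; representing the training split as a list of (waveform, label) pairs is a simplifying assumption for illustration.

```python
import random
from collections import defaultdict

def sample_few_shot(train_set, shots=16, seed=0):
    """Draw `shots` random examples per class from (waveform, label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for waveform, label in train_set:
        by_class[label].append((waveform, label))
    subset = []
    for label, items in by_class.items():
        subset.extend(rng.sample(items, min(shots, len(items))))
    rng.shuffle(subset)
    return subset
```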
Results (Comparison of \(\mathrm{PALM}\) with \(\mathrm{Baselines}\)) The accuracy scores of the baselines (ZERO-SHOT, COOP, and COCOOP) and our proposed method PALM across 11 datasets are presented below. For each method except ZERO-SHOT, experiments were performed with three different seeds; the accuracy scores for all seeds are reported, along with their average. Bold values indicate the best average score in each row. Compared to the baselines, our proposed method achieves favorable results, with an average improvement of 5.5% over COOP and 3.1% over COCOOP. Note that both COOP and COCOOP are computationally expensive, as they require loss gradients to flow through the text encoder, and COCOOP additionally has a feedback loop from audio features to the input space of the text encoder, making it even more expensive. PALM, in contrast, is less computationally demanding: the text encoder's output features can be computed once, and gradients flow only through the added feature-space parameters.
METHODS → | ZERO SHOT | COOP | | | | COCOOP | | | | PALM (ours) | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DATASETS ↓ | –– | SEED-0 | SEED-1 | SEED-2 | AVG | SEED-0 | SEED-1 | SEED-2 | AVG | SEED-0 | SEED-1 | SEED-2 | AVG |
Beijing-Opera | 0.2881 | 0.9323 | 0.9660 | 0.9619 | 0.9534 | 0.9577 | 0.9830 | 0.9916 | **0.9774** | 0.9747 | 0.9066 | 0.9787 | 0.9533 |
CREMA-D | 0.2310 | 0.3130 | 0.4197 | 0.2760 | 0.3362 | 0.2539 | 0.3358 | 0.3156 | 0.3018 | 0.4453 | 0.3580 | 0.2344 | **0.3459** |
ESC50-Actions | 0.6525 | 0.9625 | 0.9400 | 0.9550 | 0.9525 | 0.9631 | 0.9620 | 0.9648 | 0.9634 | 0.9700 | 0.9625 | 0.9650 | **0.9658** |
ESC50 | 0.4965 | 0.9410 | 0.9390 | 0.9345 | 0.9382 | 0.9460 | 0.9370 | 0.9450 | 0.9427 | 0.9560 | 0.9600 | 0.9620 | **0.9593** |
GT-Music-Genre | 0.3250 | 0.7250 | 0.6950 | 0.7350 | 0.7183 | 0.7500 | 0.7450 | 0.7607 | 0.7520 | 0.7900 | 0.7850 | 0.8250 | **0.8000** |
NS-Instruments | 0.3291 | 0.5728 | 0.5562 | 0.6177 | 0.5822 | 0.5996 | 0.5740 | 0.6438 | 0.6058 | 0.6394 | 0.6108 | 0.6648 | **0.6383** |
RAVDESS | 0.1222 | 0.3849 | 0.2688 | 0.3422 | 0.3320 | 0.3727 | 0.4399 | 0.3523 | 0.3883 | 0.4562 | 0.4603 | 0.4623 | **0.4596** |
SESA | 0.7238 | 0.9143 | 0.8953 | 0.8762 | **0.8952** | 0.8381 | 0.8762 | 0.8952 | 0.8698 | 0.8857 | 0.9143 | 0.8857 | **0.8952** |
TUT2017 | 0.2435 | 0.6391 | 0.6667 | 0.6525 | 0.6528 | 0.7499 | 0.7215 | 0.7312 | 0.7342 | 0.7959 | 0.8047 | 0.7729 | **0.7912** |
UrbanSound8K | 0.5349 | 0.7607 | 0.7378 | 0.7666 | 0.7544 | 0.7576 | 0.7748 | 0.7597 | 0.7652 | 0.8120 | 0.8037 | 0.8074 | **0.8077** |
VocalSound | 0.4197 | 0.7162 | 0.7485 | 0.6642 | 0.7096 | 0.8081 | 0.7825 | 0.7463 | 0.7790 | 0.8101 | 0.8168 | 0.7964 | **0.8078** |
AVERAGE | 0.3969 | 0.7146 | 0.7121 | 0.7074 | 0.7114 | 0.7276 | 0.7396 | 0.7369 | 0.7347 | 0.7759 | 0.7621 | 0.7595 | **0.7658** |
In this study, we investigate the application of prompt learning techniques, originally developed for vision-language models (VLMs), in the context of audio-language models (ALMs). We introduce PALM, a novel method that optimizes the feature space of the text encoder branch, enhancing training efficiency compared to existing methods that operate in the input space. Evaluated on 11 diverse audio recognition datasets, PALM consistently matches or surpasses established baselines in a few-shot learning setup while reducing computational demands. PALM offers a promising direction for enhancing the performance of ALMs in zero-shot and few-shot learning scenarios, contributing to the broader field of audio recognition and paving the way for future research in multimodal tasks.
For additional details about PALM, the datasets, and results, please refer to our main paper and GitHub code repository. Thank you!
For any query related to our work, contact asif dot hanif at mbzuai dot ac dot ae
@article{hanif2024palm,
title={PALM: Few-Shot Prompt Learning for Audio Language Models},
author={Hanif, Asif and Agro, Maha Tufail and Qazi, Mohammad Areeb and Aldarmaki, Hanan},
journal={arXiv preprint arXiv:2409.19806},
year={2024}
}