PALM: Few-Shot Prompt Learning for Audio Language Models

Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE

Abstract

Audio-Language Models (ALMs), inspired by advancements in Vision-Language Models (VLMs), have recently achieved remarkable success in zero-shot audio recognition tasks, in which features of audio waveforms are matched against class-specific text prompt features. Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is either on par with or outperforms other approaches while being computationally less demanding.

TLDR: We adapt vision-language prompt learning methods for audio-language models and introduce PALM, a new method that is computationally efficient and outperforms or matches baselines in audio classification across 11 datasets.


Zero-Shot Inference in ALMs - Primer

Zero-shot inference in audio-language models (ALMs) refers to making predictions on new, unseen data without task-specific training. Let us denote an ALM by \(f_{\theta} = \{f_{_{A}},f_{_{T}}\}\), where \(f_{_{A}}\) and \(f_{_{T}}\) are the audio and text encoders, respectively. For classification in the zero-shot scenario, the audio \(\boldsymbol{\mathrm{x}}\) is first passed to the audio encoder \(f_{_{A}}\), resulting in a \(d\)-dimensional feature vector \(f_{_{A}}(\boldsymbol{\mathrm{x}}) \in \mathbb{R}^{d}\). Similarly, on the text encoder side, each class label \(y_i \in \{\mathit{y}_{1}, \mathit{y}_{2}, \dots, \mathit{y}_{C} \}\) is wrapped in a class-specific text template, such as: $$t_i = \mathrm{''An~audio~recording~of~\{CLASS~y_i\}.''}$$ Each text prompt \(t_i\) is fed to the text encoder \(f_{_{T}}\), yielding the text feature vector \(f_{_{T}}(t_i) \in \mathbb{R}^{d}\). The alignment between the audio feature vector and the \(i_{\text{th}}\) class prompt feature vector is quantified using cosine similarity, \(\mathtt{sim}\big(f_{_{A}}(\boldsymbol{\mathrm{x}}),f_{_{T}}(t_i)\big)\). The class with the highest similarity score is selected as the predicted class label \(\hat{y}\), i.e. $$\hat{y} = \underset{ i\in \{1,2,\dots,C\} }{\mathbf{argmax}} ~~~ \mathtt{sim}\big(f_{_{A}}(\boldsymbol{\mathrm{x}})~,~f_{_{T}}(t_i)\big)$$
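To make the pipeline concrete, below is a minimal sketch of this zero-shot procedure in PyTorch-style code. The `audio_encoder` and `text_encoder` callables are placeholders for the ALM's two branches (not an actual PENGI API); each is assumed to return a \(d\)-dimensional feature vector.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(audio_encoder, text_encoder, waveform, class_names):
    """Hypothetical zero-shot classification with an ALM's two frozen encoders."""
    # Wrap each class name in the hand-crafted template.
    prompts = [f"An audio recording of {c}." for c in class_names]

    with torch.no_grad():
        a = audio_encoder(waveform)                           # (d,) audio feature
        t = torch.stack([text_encoder(p) for p in prompts])   # (C, d) text features

    # Cosine similarity between the audio feature and every class prompt feature.
    sims = F.cosine_similarity(a.unsqueeze(0), t, dim=-1)     # (C,)
    return class_names[int(sims.argmax())]
```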



ZERO SHOT: The audio feature vector is compared with the text prompt feature vector of each class using cosine similarity. The class with the highest similarity score is then assigned to the input audio.




Prompt Learning

Zero-shot inference in vision-language models (VLMs) and audio-language models (ALMs) relies on manually crafted text prompts, which significantly impact performance. Prompt Learning, as explored by Gu et al. 2023, automates this by learning text prompts from training data, eliminating manual effort. The first notable method, COOP, learns the context of text prompts in the token-embedding space using a few-shot training setup. This compute-efficient approach improves VLMs' performance on downstream tasks, requiring only a small subset of data to learn the prompts.




COOP learns the context of text prompts in the token-embedding space by minimizing the cross-entropy loss in a few-shot training setup, improving performance on downstream tasks.
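For illustration, the sketch below shows the core idea of COOP as it would be adapted to this setting, under the assumption that the text branch exposes its token-embedding layer: a set of learnable context vectors is prepended to the class-name token embeddings before the sequence enters the frozen text encoder. All names are illustrative, not the original implementation.

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """COOP-style learnable context in the token-embedding space (sketch)."""

    def __init__(self, n_ctx: int, embed_dim: int):
        super().__init__()
        # Shared context vectors, optimized with cross-entropy on few-shot data.
        # The text encoder itself stays frozen, but gradients must flow through it.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, class_token_embeddings: torch.Tensor) -> torch.Tensor:
        # class_token_embeddings: (C, L, embed_dim) token embeddings of class names.
        C = class_token_embeddings.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)            # (C, n_ctx, embed_dim)
        # Prompt becomes "[ctx_1] ... [ctx_M] {CLASS}" for every class.
        return torch.cat([ctx, class_token_embeddings], dim=1)
```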

Prompt learning in ALMs is a relatively new and under-explored area of research. In this work, we first show the efficacy of prompt learning methods (originally introduced for VLMs) in ALMs and then propose a novel method, PALM, that optimizes the feature space of the text encoder branch in ALMs. Our method is either on par with or outperforms baseline approaches while being computationally less demanding.




PALM

The PALM (Prompt Learning for Audio Language Models) method does not require hand-crafted prompts; instead, it simply uses class names as the input to the text encoder, i.e. \(t_i =\mathrm{''\{CLASS~y_i\}''}\). Moreover, unlike COOP, which learns the context of input text prompts in the token embedding space, PALM learns the context in the feature space of the prompts. Specifically, after obtaining the feature vector of the \(i_{\text{th}}\) class text prompt via the text encoder, i.e. \(f_{_{T}}(t_i) \in \mathbb{R}^{d}\), it combines it with a learnable vector \(z_i \in \mathbb{R}^{d}\) to obtain the updated text feature vector as follows: $$f_{_{T}}^{\prime}(t_i) = (1-\lambda_i)\cdot f_{_{T}}(t_i)~+~\lambda_i \cdot z_i$$ where \(\lambda_i \in [0,1]\) is a learnable parameter that determines the contribution of each vector. Assuming \(\boldsymbol{\mathrm{t}}=\{t_1,t_2,\dots,t_C\}\) denotes the text prompts of all classes, the raw/un-normalized prediction scores (logits), denoted as \(f_{_{\theta}}(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{t}}) \in \mathbb{R}^{C}\), for an audio waveform \(\boldsymbol{\mathrm{x}}\) are obtained as follows: $$f_{_{\theta}}(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{t}}) = \bigg\{~\mathtt{sim}\bigg(f_{_{A}}(\boldsymbol{\mathrm{x}})~,~f_{_{T}}^{\prime}(t_i)\bigg)~\bigg\}_{i=1}^{C},$$ where \(\texttt{sim}(\cdot)\) is the cosine-similarity function and \(C\) is the number of classes. \(f_{_{A}}(\boldsymbol{\mathrm{x}})\) is the feature vector from the audio encoder, and \(f_{_{T}}^{\prime}(t_i)\) is the updated text feature vector of the \(i_{\text{th}}\) class. The following objective function is optimized to learn the feature-space context embeddings \(\boldsymbol{\mathrm{z}}=\{z_1,z_2,\dots,z_C\}\) and their corresponding contributions \(\lambda=\{\lambda_1,\lambda_2,\dots,\lambda_C\}\): $$\underset{ \boldsymbol{\mathrm{z}}~,~\lambda }{\mathbf{minimize}}~~ \sum_{(\boldsymbol{\mathrm{x}},y)\in\mathcal{D}} \mathcal{L}\big(f_{_{\theta}}(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{t}}),y\big),$$ where \(\mathcal{D}=\{\boldsymbol{\mathrm{x}}_i,y_i\}_{i=1}^{N}\) is the training dataset consisting of \(N\) audio-class pairs and \(\mathcal{L}(\cdot)\) denotes the cross-entropy loss. After learning the parameters, the following equation is used for audio classification during the inference stage: $$\hat{y} = \underset{ i\in \{1,2,\dots,C\} }{\mathbf{argmax}} ~~~ \mathtt{sim}\big(f_{_{A}}(\boldsymbol{\mathrm{x}})~,~f_{_{T}}^{\prime}(t_i)\big)$$ An overview of our proposed approach (PALM) can be found in the following figure.




PALM optimizes the feature space of text prompts in audio-language models by combining learnable vectors with the text features and minimizing the cross-entropy loss in a few-shot training setup, enhancing performance without the need for hand-crafted prompts while remaining computationally efficient.
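The following is a minimal PyTorch sketch of this formulation, assuming the class text features \(f_{_{T}}(t_i)\) are precomputed once with the frozen text encoder (which is what keeps training cheap: gradients never flow through either encoder, only through \(\boldsymbol{\mathrm{z}}\) and \(\lambda\)). The sigmoid parameterization of \(\lambda_i\) is our own illustrative choice for constraining it to \([0,1]\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PALMHead(nn.Module):
    """Sketch of PALM's feature-space prompt learning over frozen text features."""

    def __init__(self, text_features: torch.Tensor):
        super().__init__()
        C, d = text_features.shape
        # Frozen class text features f_T(t_i), precomputed once.
        self.register_buffer("text_features", text_features.detach())
        self.z = nn.Parameter(torch.randn(C, d) * 0.02)   # learnable context z_i
        self.lam_raw = nn.Parameter(torch.zeros(C))       # sigmoid -> lambda_i in [0, 1]

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        lam = torch.sigmoid(self.lam_raw).unsqueeze(-1)                   # (C, 1)
        # f'_T(t_i) = (1 - lambda_i) * f_T(t_i) + lambda_i * z_i
        t_prime = (1 - lam) * self.text_features + lam * self.z           # (C, d)
        # Logits: cosine similarity between each audio feature and each class feature.
        return F.cosine_similarity(
            audio_features.unsqueeze(1), t_prime.unsqueeze(0), dim=-1)    # (B, C)

# Training minimizes cross-entropy over the few-shot set D = {(x_i, y_i)}, e.g.:
#   loss = F.cross_entropy(head(audio_encoder(x)), y)
```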


Contributions

  1. First Demonstration of Prompt Learning Efficacy in ALMs: Inspired by the success of few-shot prompt learning in VLMs, we are the first (to the best of our knowledge) to demonstrate its efficacy in ALMs. We show that prompt learning techniques, initially developed for VLMs, can significantly enhance performance when adapted for ALMs.
  2. Introduction of PALM: We introduce a novel few-shot based prompt learning method, PALM, for ALMs that optimizes the feature space of the text encoder, outperforming existing baselines.
  3. Comprehensive Evaluation: We demonstrate our approach's effectiveness on 11 audio recognition datasets, comparing it to three baselines in a few-shot learning setup. Our method matches or outperforms others while being less computationally demanding, establishing a benchmark for prompt learning in audio-language models and paving the way for future research.




Baselines & Results

Baselines Our baselines include ZERO-SHOT, COOP, and COCOOP. COOP and COCOOP are prompt-learning methods for VLMs, which we adapt for audio-language models by replacing the vision encoder with an audio encoder. Both methods optimize the text encoder's input space, with COCOOP adding a feedback loop from the audio features to the text encoder's input. For all baselines, we use PENGI (a multimodal-to-text generation model), from which we use only the audio and text encoders.
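As a rough illustration of why COCOOP is heavier, the sketch below mirrors its meta-network design in this audio setting: a small network maps each audio feature to a shift on the learnable context tokens, so the text encoder must be re-run for every audio sample. Module and dimension names are hypothetical, not taken from the original code.

```python
import torch
import torch.nn as nn

class ConditionalContext(nn.Module):
    """COCOOP-style audio-conditioned context tokens (sketch)."""

    def __init__(self, n_ctx: int, embed_dim: int, audio_dim: int):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # Meta-network: audio feature -> per-sample shift applied to the context.
        self.meta_net = nn.Sequential(
            nn.Linear(audio_dim, audio_dim // 16),
            nn.ReLU(),
            nn.Linear(audio_dim // 16, embed_dim),
        )

    def forward(self, audio_feature: torch.Tensor) -> torch.Tensor:
        # audio_feature: (audio_dim,) -> conditioned context (n_ctx, embed_dim),
        # which must then be pushed through the text encoder for this sample.
        return self.ctx + self.meta_net(audio_feature).unsqueeze(0)
```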


Datasets We evaluate our method on 11 audio classification datasets, covering a range of tasks from Emotion Recognition to Music Analysis. The following table lists all the datasets used in our experiments.

DATASETS TYPE CLASSES SPLIT
Beijing-Opera Instrument Classification 4 Five Fold
NS-Instruments Instrument Classification 10 Train-Test
ESC50 Sound Event Classification 50 Five Fold
ESC50-Actions Sound Event Classification 10 Five Fold
UrbanSound8K Sound Event Classification 10 Ten Fold
CREMA-D Emotion Recognition 6 Train-Test
RAVDESS Emotion Recognition 8 Train-Test
VocalSound Vocal Sound Classification 6 Train-Test
SESA Surveillance Sound Classification 4 Train-Test
TUT2017 Acoustic Scene Classification 15 Four Fold
GT-Music-Genre Music Analysis 10 Train-Test



Experimental Settings All experiments are run for 50 epochs, with 16 randomly selected samples per class from the training set for the few-shot setup and the full test set for inference. We use SGD with a learning rate of 0.05 and accuracy as the evaluation metric.
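For reference, here is a minimal sketch of this few-shot setup, assuming audio features are precomputed with the frozen audio encoder and the classifier head (e.g. the PALM head sketched above) returns per-class logits. The data handling is illustrative, not the exact experimental code.

```python
import random
from collections import defaultdict

import torch
import torch.nn.functional as F

def sample_few_shot(train_set, shots=16, seed=0):
    """Randomly pick `shots` (feature, label) pairs per class from the training set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for feat, label in train_set:
        by_class[label].append((feat, label))
    subset = []
    for items in by_class.values():
        subset.extend(rng.sample(items, min(shots, len(items))))
    return subset

def train(head, few_shot_set, epochs=50, lr=0.05):
    """50 epochs of SGD (lr = 0.05) minimizing cross-entropy on the few-shot set."""
    optim = torch.optim.SGD(head.parameters(), lr=lr)
    for _ in range(epochs):
        for feat, label in few_shot_set:
            logits = head(feat.unsqueeze(0))                      # (1, C)
            loss = F.cross_entropy(logits, torch.tensor([label]))
            optim.zero_grad()
            loss.backward()
            optim.step()

def accuracy(head, test_set):
    """Evaluate on the full test set."""
    correct = 0
    with torch.no_grad():
        for feat, label in test_set:
            correct += int(head(feat.unsqueeze(0)).argmax(dim=-1).item() == label)
    return correct / len(test_set)
```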



Results (Comparison of \(\mathrm{PALM}\) with \(\mathrm{Baselines}\)) The accuracy scores of the baselines (ZERO-SHOT, COOP, and COCOOP) and of our proposed method PALM across 11 datasets are presented below. For each method (except ZERO SHOT), experiments were performed with three different seeds; the accuracy scores for all seeds are reported, along with their average. Compared to the baselines, our proposed method achieves favorable results, with an average improvement of 5.5% over COOP and 3.1% over COCOOP. It should be noted that both COOP and COCOOP are computationally expensive, as both require loss gradients to flow through the text encoder. COCOOP additionally has a feedback loop from the audio features to the input space of the text encoder, making it even more expensive. PALM, in contrast, is less computationally demanding.

METHODS → ZERO SHOT COOP COCOOP PALM (ours)
DATASETS ↓ –– SEED-0 SEED-1 SEED-2 AVG SEED-0 SEED-1 SEED-2 AVG SEED-0 SEED-1 SEED-2 AVG
Beijing-Opera 0.2881 0.9323 0.9660 0.9619 0.9534 0.9577 0.9830 0.9916 0.9774 0.9747 0.9066 0.9787 0.9533
CREMA-D 0.2310 0.3130 0.4197 0.2760 0.3362 0.2539 0.3358 0.3156 0.3018 0.4453 0.3580 0.2344 0.3459
ESC50-Actions 0.6525 0.9625 0.9400 0.9550 0.9525 0.9631 0.9620 0.9648 0.9634 0.9700 0.9625 0.9650 0.9658
ESC50 0.4965 0.9410 0.9390 0.9345 0.9382 0.9460 0.9370 0.9450 0.9427 0.9560 0.9600 0.9620 0.9593
GT-Music-Genre 0.3250 0.7250 0.6950 0.7350 0.7183 0.7500 0.7450 0.7607 0.7520 0.7900 0.7850 0.8250 0.8000
NS-Instruments 0.3291 0.5728 0.5562 0.6177 0.5822 0.5996 0.5740 0.6438 0.6058 0.6394 0.6108 0.6648 0.6383
RAVDESS 0.1222 0.3849 0.2688 0.3422 0.3320 0.3727 0.4399 0.3523 0.3883 0.4562 0.4603 0.4623 0.4596
SESA 0.7238 0.9143 0.8953 0.8762 0.8952 0.8381 0.8762 0.8952 0.8698 0.8857 0.9143 0.8857 0.8952
TUT2017 0.2435 0.6391 0.6667 0.6525 0.6528 0.7499 0.7215 0.7312 0.7342 0.7959 0.8047 0.7729 0.7912
UrbanSound8K 0.5349 0.7607 0.7378 0.7666 0.7544 0.7576 0.7748 0.7597 0.7652 0.8120 0.8037 0.8074 0.8077
VocalSound 0.4197 0.7162 0.7485 0.6642 0.7096 0.8081 0.7825 0.7463 0.7790 0.8101 0.8168 0.7964 0.8078
AVERAGE 0.3969 0.7146 0.7121 0.7074 0.7114 0.7276 0.7396 0.7369 0.7347 0.7759 0.7621 0.7595 0.7658



\(\textbf{Comparison of $\mathrm{PALM}^{\dagger}$ and $\mathrm{PALM}$}\) Here, \(\mathrm{PALM}^{\dagger}\) refers to the \(\mathrm{PALM}\) method with the learnable context embeddings removed from the feature space of the text encoder. Removing these context embeddings drastically degrades performance, highlighting their importance.




Conclusion

In this study, we investigate the application of prompt learning techniques, originally developed for vision-language models (VLMs), in the context of audio-language models (ALMs). We introduce PALM, a novel method that optimizes the feature space of the text encoder branch, enhancing training efficiency compared to existing methods that operate in the input space. Evaluated on 11 diverse audio recognition datasets, PALM consistently matches or surpasses established baselines in a few-shot learning setup while reducing computational demands. PALM offers a promising direction for enhancing the performance of ALMs in zero-shot and few-shot learning scenarios, contributing to the broader field of audio recognition and paving the way for future research in multimodal tasks.


For additional details about PALM, the datasets, and the results, please refer to our main paper and GitHub code repository. Thank you!



Contact

For any query related to our work, contact asif dot hanif at mbzuai dot ac dot ae



BibTeX


@article{hanif2024palm,
  title={PALM: Few-Shot Prompt Learning for Audio Language Models},
  author={Hanif, Asif and Agro, Maha Tufail and Qazi, Mohammad Areeb and Aldarmaki, Hanan},
  journal={arXiv preprint arXiv:2409.19806},
  year={2024}
}

              