Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks,
in which features of audio waveforms are matched with class-specific text prompt features, an approach inspired
by advancements in Vision-Language Models (VLMs). Given the sensitivity of zero-shot performance to the choice of
hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the
efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio
Language Models (PALM), which optimizes the feature space of the text encoder
branch. Unlike existing methods that work in the input space, our approach results in greater training
efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing
a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning
setup. Our method is either on par with or outperforms other approaches while being computationally less
demanding.
TLDR: We adapt vision-language prompt learning methods for audio-language models and introduce PALM,
a new method that is computationally efficient and outperforms or matches baselines in audio classification
across 11 datasets.
Zero-shot inference in audio-language models (ALMs) refers to making predictions on new, unseen data without task-specific training. Let an ALM be denoted by \(f_{\theta} = \{f_{_{A}},f_{_{T}}\}\), where \(f_{_{A}}\) and \(f_{_{T}}\) are the audio and text encoders, respectively. For classification in the zero-shot scenario, the audio \(\boldsymbol{\mathrm{x}}\) is first passed to the audio encoder \(f_{_{A}}\), resulting in a \(d\)-dimensional feature vector \(f_{_{A}}(\boldsymbol{\mathrm{x}}) \in \mathbb{R}^{d}\). Similarly, on the text encoder side, each class label \(y_i \in \{\mathit{y}_{1}, \mathit{y}_{2}, \dots, \mathit{y}_{C} \}\) is wrapped in a class-specific text template, such as: $$t_i = \mathrm{''An~audio~recording~of~\{CLASS~y_i\}.''}$$ Each text prompt \(t_i\) is fed to the text encoder \(f_{_{T}}\), yielding a text feature vector \(f_{_{T}}(t_i) \in \mathbb{R}^{d}\). The alignment between the audio feature vector and each text prompt feature vector is quantified using cosine similarity, \(\mathtt{sim}\big(f_{_{A}}(\boldsymbol{\mathrm{x}}),f_{_{T}}(t_i)\big)\). The class with the highest similarity score is selected as the predicted class label \(\hat{y}\), i.e. $$\hat{y} = \underset{ i\in \{1,2,\dots,C\} }{\mathbf{argmax}} ~~~ \mathtt{sim}\big(f_{_{A}}(\boldsymbol{\mathrm{x}})~,~f_{_{T}}(t_i)\big)$$
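To make this pipeline concrete, below is a minimal sketch of the zero-shot classification step in PyTorch. The `audio_encoder` and `text_encoder` callables stand in for the ALM's two branches and are assumptions for illustration, not a specific model API.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(audio_encoder, text_encoder, waveform, class_names):
    # Wrap each class name in the hand-crafted template t_i.
    prompts = [f"An audio recording of {name}." for name in class_names]

    with torch.no_grad():
        a = audio_encoder(waveform)                           # (d,)   audio feature f_A(x)
        t = torch.stack([text_encoder(p) for p in prompts])   # (C, d) text features f_T(t_i)

    # Cosine similarity between the audio feature and every class prompt.
    sims = F.cosine_similarity(a.unsqueeze(0), t, dim=-1)     # (C,)
    return class_names[sims.argmax().item()]
```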
Zero-shot inference in vision-language models (VLMs) and audio-language models (ALMs) relies on manually crafted text prompts, which significantly impact performance. Prompt Learning, as explored by Gu et al. 2023, automates this by learning text prompts from training data, eliminating manual effort. The first notable method, COOP, learns the context of text prompts in the token-embedding space using a few-shot training setup. This compute-efficient approach improves VLMs' performance on downstream tasks while requiring only a small subset of data to learn the prompts.
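As a rough illustration of the COOP idea in code, the sketch below prepends a small set of learnable context embeddings to each class name's token embeddings before they enter a frozen text encoder. The `encode_from_token_embeddings` interface is a hypothetical stand-in; real encoders expose their token-embedding layer differently.

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """COOP-style learnable context in the token-embedding space (sketch)."""
    def __init__(self, num_ctx: int, embed_dim: int):
        super().__init__()
        # Shared context vectors, randomly initialized (COOP also allows
        # class-specific context; omitted here for brevity).
        self.ctx = nn.Parameter(0.02 * torch.randn(num_ctx, embed_dim))

    def forward(self, class_token_embeds, encode_from_token_embeddings):
        # class_token_embeds: list of (L_i, embed_dim) tensors, one per class.
        feats = []
        for tok in class_token_embeds:
            prompt = torch.cat([self.ctx, tok], dim=0)   # "[CTX]*M {CLASS y_i}"
            feats.append(encode_from_token_embeddings(prompt))
        return torch.stack(feats)                        # (C, d) text features
```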
Prompt learning in ALMs is a relatively new and under-explored area of research. In this work, we first show the efficacy of prompt learning methods (originally introduced for VLMs) in ALMs and then propose a novel method, PALM, that optimizes the feature space of the text encoder branch in ALMs. Our method is either on par with or outperforms baseline approaches while being computationally less demanding.
The PALM (Prompt Learning for Audio Language Models) method does not require hand-crafted prompts; instead, it simply uses class names as the input to the text encoder, i.e. \(t_i =\mathrm{''\{CLASS~y_i\}''}\). Moreover, unlike COOP, which learns the context of input text prompts in the token-embedding space, PALM learns the context in the feature space of the prompts. Specifically, after obtaining the feature vector of the \(i^{\text{th}}\) class text prompt via the text encoder, i.e. \(f_{_{T}}(t_i) \in \mathbb{R}^{d}\), it adds a learnable vector \(z_i \in \mathbb{R}^{d}\) to obtain the updated text feature vector: $$f_{_{T}}^{\prime}(t_i) = (1-\lambda_i)\cdot f_{_{T}}(t_i)~+~\lambda_i \cdot z_i$$ where \(\lambda_i \in [0,1]\) is a learnable parameter that determines the contribution of each vector. Assuming \(\boldsymbol{\mathrm{t}}=\{t_1,t_2,\dots,t_C\}\) denotes the text prompts of all classes, the raw/un-normalized prediction scores (logits) for an audio waveform \(\boldsymbol{\mathrm{x}}\), denoted \(f_{_{\theta}}(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{t}}) \in \mathbb{R}^{C}\), are obtained as follows: $$f_{_{\theta}}(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{t}}) = \bigg\{~\mathtt{sim}\bigg(f_{_{A}}(\boldsymbol{\mathrm{x}})~,~f_{_{T}}^{\prime}(t_i)\bigg)~\bigg\}_{i=1}^{C},$$ where \(\texttt{sim}(\cdot)\) is the cosine-similarity function and \(C\) is the number of classes; \(f_{_{A}}(\boldsymbol{\mathrm{x}})\) is the feature vector from the audio encoder, and \(f_{_{T}}^{\prime}(t_i)\) is the updated text feature vector of the \(i^{\text{th}}\) class. The following objective is optimized to learn the feature-space context embeddings \(\boldsymbol{\mathrm{z}}=\{z_1,z_2,\dots,z_C\}\) and their corresponding contributions \(\lambda=\{\lambda_1,\lambda_2,\dots,\lambda_C\}\): $$\underset{ \boldsymbol{\mathrm{z}}~,~\lambda }{\mathbf{minimize}}~~ \sum_{(\boldsymbol{\mathrm{x}},y)\in\mathcal{D}} \mathcal{L}\big(f_{_{\theta}}(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{t}}),y\big),$$ where \(\mathcal{D}=\{\boldsymbol{\mathrm{x}}_i,y_i\}_{i=1}^{N}\) is the training dataset of \(N\) audio-class pairs and \(\mathcal{L}(\cdot)\) denotes the cross-entropy loss. After learning these parameters, the following rule is used for classification at inference time: $$\hat{y} = \underset{ i\in \{1,2,\dots,C\} }{\mathbf{argmax}} ~~~ \mathtt{sim}\big(f_{_{A}}(\boldsymbol{\mathrm{x}})~,~f_{_{T}}^{\prime}(t_i)\big)$$ An overview of our proposed approach (PALM) can be found in the following figure, and a minimal code sketch of the procedure is given below.
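The sketch below follows the equations above. It assumes the class-name text features \(f_{_{T}}(t_i)\) are precomputed once with the frozen text encoder; the sigmoid reparameterization that keeps \(\lambda_i\) in \([0,1]\) is our assumption for this sketch, not a detail taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PALMHead(nn.Module):
    def __init__(self, text_feats):                  # text_feats: (C, d) from the frozen text encoder
        super().__init__()
        self.register_buffer("text_feats", text_feats)
        C, d = text_feats.shape
        self.z = nn.Parameter(torch.zeros(C, d))     # learnable context vectors z_i (zero init is an assumption)
        self.lam_raw = nn.Parameter(torch.zeros(C))  # lambda_i = sigmoid(lam_raw_i) stays in [0, 1]

    def updated_text_feats(self):
        lam = torch.sigmoid(self.lam_raw).unsqueeze(-1)       # (C, 1)
        return (1.0 - lam) * self.text_feats + lam * self.z   # f'_T(t_i)

    def forward(self, audio_feats):                  # audio_feats: (B, d) from the audio encoder
        t = F.normalize(self.updated_text_feats(), dim=-1)
        a = F.normalize(audio_feats, dim=-1)
        return a @ t.t()                             # (B, C) cosine-similarity logits

# Usage sketch: only z and lambda receive gradients; cross-entropy is applied to the logits.
# head = PALMHead(precomputed_text_feats)            # (C, d) features of the class-name prompts
# optimizer = torch.optim.SGD(head.parameters(), lr=0.05)
# loss = F.cross_entropy(head(audio_encoder(x_batch)), y_batch)
```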
Baselines Our baselines include ZERO-SHOT, COOP, and COCOOP. COOP and COCOOP are prompt-learning methods for VLMs, which we adapt to audio-language models by replacing the vision encoder with an audio encoder. Both methods optimize the text encoder's input space, with COCOOP additionally adding a feedback loop from audio features to the text encoder's input. For all baselines, we use PENGI (a multimodal-to-text generation model), using only its audio and text encoders.
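For completeness, here is a sketch of the COCOOP-style feedback loop as we adapted it: a small meta-network maps the audio feature to a shift that is added to the learnable context tokens, so the prompt fed to the text encoder is conditioned on the input audio. Layer sizes and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioConditionedContext(nn.Module):
    """COCOOP-style audio-conditioned context tokens (sketch)."""
    def __init__(self, num_ctx: int, embed_dim: int, audio_dim: int):
        super().__init__()
        self.ctx = nn.Parameter(0.02 * torch.randn(num_ctx, embed_dim))
        self.meta_net = nn.Sequential(                 # audio feature -> context shift
            nn.Linear(audio_dim, audio_dim // 16),
            nn.ReLU(),
            nn.Linear(audio_dim // 16, embed_dim),
        )

    def forward(self, audio_feat):                     # audio_feat: (audio_dim,)
        shift = self.meta_net(audio_feat)              # (embed_dim,)
        return self.ctx + shift                        # (num_ctx, embed_dim), broadcast add
```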
Datasets We evaluate our method on 11 audio classification datasets, covering a range of tasks from Emotion Recognition to Music Analysis. The following table lists all the datasets used in our experiments.
DATASETS | TYPE | CLASSES | SPLIT |
---|---|---|---|
Beijing-Opera | Instrument Classification | 4 | Five Fold |
NS-Instruments | Instrument Classification | 10 | Train-Test |
ESC50 | Sound Event Classification | 50 | Five Fold |
ESC50-Actions | Sound Event Classification | 10 | Five Fold |
UrbanSound8K | Sound Event Classification | 10 | Ten Fold |
CREMA-D | Emotion Recognition | 6 | Train-Test |
RAVDESS | Emotion Recognition | 8 | Train-Test |
VocalSound | Vocal Sound Classification | 6 | Train-Test |
SESA | Surveillance Sound Classification | 4 | Train-Test |
TUT2017 | Acoustic Scene Classification | 15 | Four Fold |
GT-Music-Genre | Music Analysis | 10 | Train-Test |
Experimental Settings All experiments are run for 50 epochs, using 16 randomly selected samples per class from the training set for the few-shot setup and the full test set for inference. We use the SGD optimizer with a learning rate of 0.05 and accuracy as the evaluation metric.
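As an example, a few-shot subset of this kind can be drawn as shown below; representing the training split as a list of (waveform, label) pairs is a simplifying assumption for illustration.

```python
import random
from collections import defaultdict

def sample_few_shot(train_set, shots=16, seed=0):
    """Draw `shots` random examples per class from (waveform, label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for waveform, label in train_set:
        by_class[label].append((waveform, label))
    subset = []
    for label, items in by_class.items():
        subset.extend(rng.sample(items, min(shots, len(items))))
    rng.shuffle(subset)
    return subset
```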
Results (Comparison of \(\mathrm{PALM}\) with \(\mathrm{Baselines}\)) The accuracy scores of the baselines (ZERO-SHOT, COOP, and COCOOP) and our proposed method PALM across 11 datasets are presented below. For each method except ZERO-SHOT, experiments were performed with three different seeds; the accuracy scores for all seeds are reported, along with their average. Bold values indicate the best average score in each row. Compared to the baselines, our proposed method achieves favorable results, with an average improvement of 5.5% over COOP and 3.1% over COCOOP. Note that both COOP and COCOOP are computationally expensive, as they require loss gradients to flow through the text encoder, and COCOOP additionally has a feedback loop from audio features to the input space of the text encoder, making it even more expensive. PALM, in contrast, is less computationally demanding: the text encoder's output features can be computed once, and gradients flow only through the added feature-space parameters.
METHODS → | ZERO SHOT | COOP | | | | COCOOP | | | | PALM (ours) | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DATASETS ↓ | –– | SEED-0 | SEED-1 | SEED-2 | AVG | SEED-0 | SEED-1 | SEED-2 | AVG | SEED-0 | SEED-1 | SEED-2 | AVG |
Beijing-Opera | 0.2881 | 0.9323 | 0.9660 | 0.9619 | 0.9534 | 0.9577 | 0.9830 | 0.9916 | **0.9774** | 0.9747 | 0.9066 | 0.9787 | 0.9533 |
CREMA-D | 0.2310 | 0.3130 | 0.4197 | 0.2760 | 0.3362 | 0.2539 | 0.3358 | 0.3156 | 0.3018 | 0.4453 | 0.3580 | 0.2344 | **0.3459** |
ESC50-Actions | 0.6525 | 0.9625 | 0.9400 | 0.9550 | 0.9525 | 0.9631 | 0.9620 | 0.9648 | 0.9634 | 0.9700 | 0.9625 | 0.9650 | **0.9658** |
ESC50 | 0.4965 | 0.9410 | 0.9390 | 0.9345 | 0.9382 | 0.9460 | 0.9370 | 0.9450 | 0.9427 | 0.9560 | 0.9600 | 0.9620 | **0.9593** |
GT-Music-Genre | 0.3250 | 0.7250 | 0.6950 | 0.7350 | 0.7183 | 0.7500 | 0.7450 | 0.7607 | 0.7520 | 0.7900 | 0.7850 | 0.8250 | **0.8000** |
NS-Instruments | 0.3291 | 0.5728 | 0.5562 | 0.6177 | 0.5822 | 0.5996 | 0.5740 | 0.6438 | 0.6058 | 0.6394 | 0.6108 | 0.6648 | **0.6383** |
RAVDESS | 0.1222 | 0.3849 | 0.2688 | 0.3422 | 0.3320 | 0.3727 | 0.4399 | 0.3523 | 0.3883 | 0.4562 | 0.4603 | 0.4623 | **0.4596** |
SESA | 0.7238 | 0.9143 | 0.8953 | 0.8762 | **0.8952** | 0.8381 | 0.8762 | 0.8952 | 0.8698 | 0.8857 | 0.9143 | 0.8857 | **0.8952** |
TUT2017 | 0.2435 | 0.6391 | 0.6667 | 0.6525 | 0.6528 | 0.7499 | 0.7215 | 0.7312 | 0.7342 | 0.7959 | 0.8047 | 0.7729 | **0.7912** |
UrbanSound8K | 0.5349 | 0.7607 | 0.7378 | 0.7666 | 0.7544 | 0.7576 | 0.7748 | 0.7597 | 0.7652 | 0.8120 | 0.8037 | 0.8074 | **0.8077** |
VocalSound | 0.4197 | 0.7162 | 0.7485 | 0.6642 | 0.7096 | 0.8081 | 0.7825 | 0.7463 | 0.7790 | 0.8101 | 0.8168 | 0.7964 | **0.8078** |
AVERAGE | 0.3969 | 0.7146 | 0.7121 | 0.7074 | 0.7114 | 0.7276 | 0.7396 | 0.7369 | 0.7347 | 0.7759 | 0.7621 | 0.7595 | **0.7658** |
In this study, we investigate the application of prompt learning techniques, originally developed for vision-language models (VLMs), in the context of audio-language models (ALMs). We introduce PALM, a novel method that optimizes the feature space of the text encoder branch, enhancing training efficiency compared to existing methods that operate in the input space. Evaluated on 11 diverse audio recognition datasets, PALM consistently matches or surpasses established baselines in a few-shot learning setup while reducing computational demands. PALM offers a promising direction for enhancing the performance of ALMs in zero-shot and few-shot learning scenarios, contributing to the broader field of audio recognition and paving the way for future research in multimodal tasks.
For additional details about PALM, the datasets, and results, please refer to our main paper and GitHub code repository. Thank you!
For any query related to our work, contact asif dot hanif at mbzuai dot ac dot ae
@article{hanif2024palm,
title={PALM: Few-Shot Prompt Learning for Audio Language Models},
author={Hanif, Asif and Agro, Maha Tufail and Qazi, Mohammad Areeb and Aldarmaki, Hanan},
journal={arXiv preprint arXiv:2409.19806},
year={2024}
}