Qwen-Audio is trained on over 30 tasks spanning 8 languages and various types of audio.
Qwen-Audio contains an audio encoder and a large language model.
Audio Encoder - Whisper Large v2 Encoder.
LLM - Qwen 7B.
Initialized from the Whisper large-v2 encoder (~640M parameters).
Preprocessing:
Resample to 16kHz.
Convert into an 80-channel mel-spectrogram.
Window size of 25ms and a hop size of 10ms.
SpecAugment is applied.
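The front end above (16 kHz audio, 80 mel channels, 25 ms window = 400 samples, 10 ms hop = 160 samples) can be sketched in plain NumPy. This is a standard log-mel recipe matching the stated parameters, not necessarily byte-identical to Whisper's implementation; the filterbank construction is an assumption.

```python
import numpy as np

# Whisper-style front-end parameters from the notes above.
SR, N_FFT, HOP, N_MELS = 16_000, 400, 160, 80

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    # Triangular filters spaced evenly on the mel scale (HTK-style recipe;
    # an assumption, since the notes only specify the channel count).
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for j in range(left, center):
            fb[i, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fb[i, j] = (right - j) / max(right - center, 1)
    return fb

def log_mel_spectrogram(audio):
    # Frame the signal with a 25 ms Hann window and 10 ms hop, take the
    # power spectrum, project onto the mel filterbank, and take the log.
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank().T
    return np.log10(np.maximum(mel, 1e-10)).T  # shape: (80, n_frames)

spec = log_mel_spectrogram(np.random.randn(SR))  # 1 second of audio
# spec.shape == (80, 98): 1 + (16000 - 400) // 160 = 98 frames
```

SpecAugment would then be applied on top of this representation during training (e.g., masking bands of mel channels and spans of frames).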
Pretrained weights from Qwen-7B.
32-layer transformer decoder.
Hidden size of 4096.
7.7B parameters.
Similar tasks can benefit from knowledge sharing during co-training.
Tasks that rely on lower-level perceptual abilities can assist tasks that require higher-level understanding or reasoning capabilities.
Simply mixing datasets introduces interference.
Following Whisper, tasks and conditioning information are specified as input special tokens.
e.g. VAD, lang id, and timestamp tags.
Transcription Tag:
<|startoftranscripts|> for speech recognition and speech translation tasks.
<|startofanalysis|> tag is utilized for other tasks.
Audio Language Tag:
<|en|>, <|zh|>, <|de|>, <|es|>, <|ko|>, <|fr|>, <|ja|>, <|it|>.
<|unknown|> for non-speech, such as natural sounds and music.
Task Tag:
<|transcribe|>, <|translate|>, <|caption|>, <|analysis|>, <|question-answer|>.
For QA tasks, append the corresponding question after the tag.
Text Language Tag: the language of output text sequences.
Timestamps Tag: <|timestamps|>, <|notimestamps|>
SRWT (speech recognition with word-level timestamps): predict fine-grained word-level timestamps.
Output Instruction: provide output instruction for different subtasks.
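The hierarchical tag prefix above can be assembled mechanically. A minimal sketch, assuming the tags are concatenated in the order listed (transcription tag, audio language, task, text language, timestamps) and that the QA question directly follows the task tag; the exact serialization is an assumption:

```python
def build_tag_prefix(task, audio_lang, text_lang, timestamps, question=None):
    # <|startoftranscripts|> for recognition/translation, <|startofanalysis|>
    # for everything else, per the notes above.
    start = ("<|startoftranscripts|>" if task in ("transcribe", "translate")
             else "<|startofanalysis|>")
    tags = [start, f"<|{audio_lang}|>", f"<|{task}|>"]
    if question is not None:
        tags.append(question)  # QA: question appended after the task tag
    tags.append(f"<|{text_lang}|>")
    tags.append("<|timestamps|>" if timestamps else "<|notimestamps|>")
    return "".join(tags)

prefix = build_tag_prefix("transcribe", "en", "en", timestamps=True)
# -> "<|startoftranscripts|><|en|><|transcribe|><|en|><|timestamps|>"
```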
Manually create demonstrations consisting of raw text labels, questions, and answers for each task.
Utilize GPT-3.5 to generate further questions and answers based on the provided raw text labels.
Label different audios with "Audio id:" to handle multiple audio inputs.
Each statement is marked with <im_start> and <im_end>.
Include pure text instruction data.
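The dialogue format described above can be sketched as follows. The template details (role names, "Audio 1:" labeling, newline placement) are assumptions; only the <im_start>/<im_end> statement markers and the audio-id labeling come from the notes:

```python
def format_dialogue(turns):
    # turns: list of (role, text); each statement is wrapped in the
    # <im_start>/<im_end> markers described above.
    return "\n".join(f"<im_start>{role}\n{text}<im_end>" for role, text in turns)

example = format_dialogue([
    # Hypothetical multi-audio turn: each clip is labeled with an audio id.
    ("user", "Audio 1: <audio>clip1.wav</audio>\nWhat sound is in the clip?"),
    ("assistant", "A dog barking."),
])
```

Pure-text instruction examples would use the same wrapper, just without any audio-labeled segments.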
Multi-task Pretraining:
Freeze the LLM.
Optimize the audio encoder.
Supervised Finetuning:
Freeze the audio encoder.
Optimize the LLM.
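The two-stage schedule above can be sketched with plain dicts standing in for parameter groups (the names are hypothetical; the real model freezes and unfreezes the Whisper encoder and Qwen-7B weights, e.g., via `requires_grad` in PyTorch):

```python
def set_trainable(group, flag):
    # Mimic param.requires_grad_(flag) on a group of parameters.
    for p in group.values():
        p["requires_grad"] = flag

def trainable(*groups):
    return sorted(k for g in groups for k, p in g.items() if p["requires_grad"])

encoder = {f"encoder.{i}": {"requires_grad": True} for i in range(2)}
llm = {f"llm.{i}": {"requires_grad": True} for i in range(2)}

# Stage 1 (multi-task pretraining): freeze the LLM, optimize the encoder.
set_trainable(llm, False); set_trainable(encoder, True)
stage1 = trainable(encoder, llm)   # only encoder.* parameters remain trainable

# Stage 2 (supervised finetuning): freeze the encoder, optimize the LLM.
set_trainable(encoder, False); set_trainable(llm, True)
stage2 = trainable(encoder, llm)   # only llm.* parameters remain trainable
```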