Introduction
- Qwen-Audio is trained on over 30 tasks, 8 languages, and various types of audio.
- Qwen-Audio consists of an audio encoder and a large language model.
- Audio encoder: the Whisper-large-v2 encoder.
- LLM: Qwen-7B.
Audio Encoder
- Initialized from the Whisper-large-v2 encoder, with 640M parameters.
- Preprocessing (see the sketch after this list):
  - Resample the audio to 16 kHz.
  - Convert it into an 80-channel mel-spectrogram.
  - Window size of 25 ms and hop size of 10 ms.
  - SpecAugment is applied for data augmentation.
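A minimal sketch of this preprocessing pipeline using torchaudio (an assumption; the source does not name a library). The 400-sample window and 160-sample hop correspond to 25 ms and 10 ms at 16 kHz, while the SpecAugment mask widths below are illustrative placeholders, not the paper's settings:

```python
import torch
import torchaudio

def preprocess(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    # Resample the input audio to 16 kHz.
    waveform = torchaudio.functional.resample(waveform, orig_sr, 16_000)

    # 80-channel log-mel spectrogram with a 25 ms window (400 samples)
    # and a 10 ms hop (160 samples) at 16 kHz.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16_000, n_fft=400, hop_length=160, n_mels=80
    )(waveform)
    log_mel = torch.log(mel.clamp(min=1e-10))

    # SpecAugment: random frequency and time masking for augmentation
    # (mask widths here are illustrative assumptions).
    log_mel = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)(log_mel)
    log_mel = torchaudio.transforms.TimeMasking(time_mask_param=100)(log_mel)
    return log_mel
```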
Language Model
- Pretrained weights from Qwen-7B.
- 32-layer transformer decoder.
- Hidden size of 4096.
- 7.7B parameters in total.
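For reference, the base LLM can be loaded through Hugging Face transformers (Qwen-7B ships custom model code, so trust_remote_code is required); the parameter count printed below should come out to roughly 7.7B:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# 32 decoder layers, hidden size 4096, ~7.7B parameters in total.
print(sum(p.numel() for p in model.parameters()))
```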
Multitask Pretraining
- Similar tasks can benefit from knowledge sharing during co-training.
- Tasks that rely on lower-level perceptual abilities can assist tasks that require higher-level understanding or reasoning capabilities.
- Simply mixing all datasets introduces interference between tasks.
- Following Whisper, tasks and conditioning information are specified as special tokens in the input.
  - e.g., voice activity detection, language identification, and timestamp tags.
Multi-task Training Format Framework
- Transcription Tag:
  - <|startoftranscripts|> for speech recognition and speech translation tasks.
  - <|startofanalysis|> for all other tasks.
- Audio Language Tag:
  - <|en|>, <|zh|>, <|de|>, <|es|>, <|ko|>, <|fr|>, <|ja|>, <|it|>.
  - <|unknown|> for non-speech audio, such as natural sounds and music.
- Task Tag:
  - <|transcribe|>, <|translate|>, <|caption|>, <|analysis|>, <|question-answer|>.
  - For QA tasks, the corresponding question is appended after the tag.
- Text Language Tag: specifies the language of the output text sequence.
- Timestamps Tag: <|timestamps|> or <|notimestamps|>.
  - <|timestamps|> enables fine-grained word-level timestamp prediction (SRWT, speech recognition with word-level timestamps).
- Output Instruction: provides the output instruction for different subtasks. (A sketch assembling these tags follows.)
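A sketch of how these tags could be assembled into a single decoder prefix; build_prefix is a hypothetical helper that only illustrates the tag ordering described above (transcription tag, audio language, task, text language, timestamps):

```python
def build_prefix(task: str, audio_lang: str, text_lang: str,
                 timestamps: bool, question: str = "") -> str:
    # Speech recognition and translation start with <|startoftranscripts|>;
    # all other tasks start with <|startofanalysis|>.
    start = ("<|startoftranscripts|>" if task in ("transcribe", "translate")
             else "<|startofanalysis|>")
    return "".join([
        start,
        f"<|{audio_lang}|>",                     # audio language tag
        f"<|{task}|>" + question,                # task tag (+ question for QA)
        f"<|{text_lang}|>",                      # text language tag
        "<|timestamps|>" if timestamps else "<|notimestamps|>",
    ])

# Example: Mandarin speech recognition with word-level timestamps (SRWT).
print(build_prefix("transcribe", "zh", "zh", timestamps=True))
# -> <|startoftranscripts|><|zh|><|transcribe|><|zh|><|timestamps|>
```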
Supervised Finetuning
- Manually create demonstrations consisting of raw text labels, questions, and answers for each task.
- Utilize GPT-3.5 to generate further questions and answers based on the provided raw text labels.
- Label different audios with "Audio id:" to handle multiple audio inputs.
- Each statement is marked with <im_start> and <im_end>.
- Include pure-text instruction data.
Chat Template
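An illustrative dialogue in the ChatML-style format implied above; the <im_start>/<im_end> delimiters and the audio-id labeling come from the source, while the prompt wording and file path are made-up placeholders:

```
<im_start>user
Audio 1: <audio>path/to/clip.wav</audio>
What is the speaker saying in this audio?<im_end>
<im_start>assistant
The speaker says: "Good morning, everyone."<im_end>
```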
Experiment Setup
- Multi-task Pretraining:
  - Freeze the LLM.
  - Optimize the audio encoder.
- Supervised Finetuning (see the sketch after this list):
  - Freeze the audio encoder.
  - Optimize the LLM.
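A minimal PyTorch sketch of this two-stage scheme; the QwenAudioLike class and its audio_encoder/llm attribute names are hypothetical stand-ins for the real model:

```python
import torch.nn as nn

class QwenAudioLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_encoder = nn.Linear(80, 4096)  # stand-in for the Whisper encoder
        self.llm = nn.Linear(4096, 4096)          # stand-in for Qwen-7B

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for param in module.parameters():
        param.requires_grad = trainable

model = QwenAudioLike()

# Stage 1 (multi-task pretraining): train the audio encoder, freeze the LLM.
set_trainable(model.audio_encoder, True)
set_trainable(model.llm, False)

# Stage 2 (supervised finetuning): freeze the audio encoder, train the LLM.
set_trainable(model.audio_encoder, False)
set_trainable(model.llm, True)
```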
Evaluation Benchmarks
Automatic Speech Recognition
Speech-To-Text Translation
Analysis of Word-level Timestamp Prediction
Qwen Audio
By Penut Chen (陳威廷)
Paper: Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models