Qwen-Audio is trained on over 30 tasks spanning 8 languages and various types of audio.
Qwen-Audio contains an audio encoder and a large language model.
Audio Encoder - Whisper Large v2 Encoder.
LLM - Qwen 7B.
Initialized from the Whisper large-v2 encoder (~640M parameters).
Preprocessing:
Resample to 16kHz.
Convert into an 80-channel mel-spectrogram.
Window size of 25ms and a hop size of 10ms.
SpecAugment is applied.
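The front end above (16 kHz audio, 80 mel channels, 25 ms window = 400 samples, 10 ms hop = 160 samples) can be sketched in plain NumPy. This is a standard log-mel recipe matching the stated parameters, not necessarily byte-identical to Whisper's implementation; the filterbank construction is an assumption.

```python
import numpy as np

# Whisper-style front-end parameters from the notes above.
SR, N_FFT, HOP, N_MELS = 16_000, 400, 160, 80

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    # Triangular filters spaced evenly on the mel scale (HTK-style recipe;
    # an assumption, since the notes only specify the channel count).
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for j in range(left, center):
            fb[i, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fb[i, j] = (right - j) / max(right - center, 1)
    return fb

def log_mel_spectrogram(audio):
    # Frame the signal with a 25 ms Hann window and 10 ms hop, take the
    # power spectrum, project onto the mel filterbank, and take the log.
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank().T
    return np.log10(np.maximum(mel, 1e-10)).T  # shape: (80, n_frames)

spec = log_mel_spectrogram(np.random.randn(SR))  # 1 second of audio
# spec.shape == (80, 98): 1 + (16000 - 400) // 160 = 98 frames
```

SpecAugment would then be applied on top of this representation during training (e.g., masking bands of mel channels and spans of frames).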
Pretrained weights from Qwen-7B.
32-layer transformer decoder.
Hidden size of 4096.
7.7B parameters.
Similar tasks can benefit from knowledge sharing during co-training.
Tasks that rely on lower-level perceptual abilities can assist tasks that require higher-level understanding or reasoning capabilities.
Simply mixing datasets introduces interference.
Following Whisper, tasks and conditioning information are specified as input special tokens.
e.g. VAD, lang id, and timestamp tags.
Transcription Tag:
<|startoftranscripts|> for speech recognition and speech translation tasks.
<|startofanalysis|> tag is utilized for other tasks.
Audio Language Tag:
<|en|>, <|zh|>, <|de|>, <|es|>, <|ko|>, <|fr|>, <|ja|>, <|it|>.
<|unknown|> for non-speech, such as natural sounds and music.
Task Tag:
<|transcribe|>, <|translate|>, <|caption|>, <|analysis|>, <|question-answer|>.
For QA tasks, append the corresponding question after the tag.
Text Language Tag: the language of output text sequences.
Timestamps Tag: <|timestamps|>, <|notimestamps|>
SRWT (speech recognition with word-level timestamps): predict fine-grained word-level timestamps.
Output Instruction: provide output instruction for different subtasks.
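The hierarchical tag prefix above can be assembled mechanically. A minimal sketch, assuming the tags are concatenated in the order listed (transcription tag, audio language, task, text language, timestamps) and that the QA question directly follows the task tag; the exact serialization is an assumption:

```python
def build_tag_prefix(task, audio_lang, text_lang, timestamps, question=None):
    # <|startoftranscripts|> for recognition/translation, <|startofanalysis|>
    # for everything else, per the notes above.
    start = ("<|startoftranscripts|>" if task in ("transcribe", "translate")
             else "<|startofanalysis|>")
    tags = [start, f"<|{audio_lang}|>", f"<|{task}|>"]
    if question is not None:
        tags.append(question)  # QA: question appended after the task tag
    tags.append(f"<|{text_lang}|>")
    tags.append("<|timestamps|>" if timestamps else "<|notimestamps|>")
    return "".join(tags)

prefix = build_tag_prefix("transcribe", "en", "en", timestamps=True)
# -> "<|startoftranscripts|><|en|><|transcribe|><|en|><|timestamps|>"
```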
Manually create demonstrations consisting of raw text labels, questions, and answers for each task.
Utilize GPT-3.5 to generate further questions and answers based on the provided raw text labels.
Label different audios with "Audio id:" to handle multiple audio inputs.
Each statement is marked with <im_start> and <im_end>.
Include pure text instruction data.
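The dialogue format described above can be sketched as follows. The template details (role names, "Audio 1:" labeling, newline placement) are assumptions; only the <im_start>/<im_end> statement markers and the audio-id labeling come from the notes:

```python
def format_dialogue(turns):
    # turns: list of (role, text); each statement is wrapped in the
    # <im_start>/<im_end> markers described above.
    return "\n".join(f"<im_start>{role}\n{text}<im_end>" for role, text in turns)

example = format_dialogue([
    # Hypothetical multi-audio turn: each clip is labeled with an audio id.
    ("user", "Audio 1: <audio>clip1.wav</audio>\nWhat sound is in the clip?"),
    ("assistant", "A dog barking."),
])
```

Pure-text instruction examples would use the same wrapper, just without any audio-labeled segments.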
Multi-task Pretraining:
Freeze the LLM.
Optimize the audio encoder.
Supervised Finetuning:
Freeze the audio encoder.
Optimize the LLM.
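The two-stage schedule above can be sketched with plain dicts standing in for parameter groups (the names are hypothetical; the real model freezes and unfreezes the Whisper encoder and Qwen-7B weights, e.g., via `requires_grad` in PyTorch):

```python
def set_trainable(group, flag):
    # Mimic param.requires_grad_(flag) on a group of parameters.
    for p in group.values():
        p["requires_grad"] = flag

def trainable(*groups):
    return sorted(k for g in groups for k, p in g.items() if p["requires_grad"])

encoder = {f"encoder.{i}": {"requires_grad": True} for i in range(2)}
llm = {f"llm.{i}": {"requires_grad": True} for i in range(2)}

# Stage 1 (multi-task pretraining): freeze the LLM, optimize the encoder.
set_trainable(llm, False); set_trainable(encoder, True)
stage1 = trainable(encoder, llm)   # only encoder.* parameters remain trainable

# Stage 2 (supervised finetuning): freeze the encoder, optimize the LLM.
set_trainable(encoder, False); set_trainable(llm, True)
stage2 = trainable(encoder, llm)   # only llm.* parameters remain trainable
```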