Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

 

Alibaba Group

 

21 Dec 2023

 

Paper | HF Hub

Introduction

  • Qwen-Audio is trained on over 30 tasks covering 8 languages and various types of audio.

  • Qwen-Audio contains an audio encoder and a large language model.

    • Audio Encoder - Whisper Large v2 Encoder.

    • LLM - Qwen 7B.

Audio Encoder

  • Initialized from the Whisper-large-v2 encoder, with 640M parameters.

  • Preprocessing:

    • Resample to 16kHz.

    • Convert into an 80-channel mel-spectrogram.

    • Window size of 25ms and a hop size of 10ms.

    • SpecAugment is applied.
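The window and hop sizes above determine the encoder's frame rate. A minimal sketch, assuming simple framing with no padding (Whisper's actual front end pads or truncates audio to 30 seconds first):

```python
def spectrogram_frames(num_samples, sr=16000, win_ms=25, hop_ms=10):
    """Number of mel-spectrogram frames for an audio clip.

    Assumes plain framing with no padding or centering; the window
    and hop sizes default to the values used by the Whisper front end.
    """
    win = sr * win_ms // 1000   # 400 samples per 25 ms window at 16 kHz
    hop = sr * hop_ms // 1000   # 160 samples per 10 ms hop at 16 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop

# A 30-second clip at 16 kHz:
print(spectrogram_frames(30 * 16000))  # 2998 frames, i.e. ~100 frames/s
```

So the mel-spectrogram runs at roughly 100 frames per second before the encoder processes it.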

Language Model

  • Pretrained weights from Qwen-7B.

    • 32-layer transformer decoder.

    • Hidden size of 4096.

    • 7.7B parameters.

       

Multitask Pretraining

  • Similar tasks can benefit from knowledge sharing during co-training.

  • Tasks that rely on lower-level perceptual abilities can assist tasks that require higher-level understanding or reasoning capabilities.

  • Simply mixing all datasets introduces interference between tasks.

  • Following Whisper, task and condition information is specified as special tokens in the input sequence.

    • e.g. voice activity detection, language identification, and timestamp tags.

Multi-task Training Format Framework

  • Transcription Tag:

    • <|startoftranscripts|> for speech recognition and speech translation tasks.

    • <|startofanalysis|> tag is utilized for other tasks.

  • Audio Language Tag:

    • <|en|>, <|zh|>, <|de|>, <|es|>, <|ko|>, <|fr|>, <|ja|>, <|it|>.

    • <|unknown|> for non-speech audio, such as natural sounds and music.

  • Task Tag:

    • <|transcribe|>, <|translate|>, <|caption|>, <|analysis|>, <|question-answer|>.

    • For the QA task, the corresponding question is appended after the tag.

  • Text Language Tag: the language of output text sequences.

  • Timestamps Tag: <|timestamps|>, <|notimestamps|>

    • Enables fine-grained speech recognition with word-level timestamps (SRWT).

  • Output Instruction: provide output instruction for different subtasks.
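Putting the tags together, the per-example training prefix can be assembled as sketched below. The token strings and ordering follow the bullets above; the helper name and signature are hypothetical:

```python
def build_task_prefix(task, audio_lang, text_lang, timestamps=False, question=None):
    """Assemble the multi-task prefix: transcription tag, audio language
    tag, task tag, text language tag, timestamps tag, then an optional
    question for QA tasks. A sketch of the format, not the released code."""
    transcript_tasks = {"transcribe", "translate"}
    start = ("<|startoftranscripts|>" if task in transcript_tasks
             else "<|startofanalysis|>")
    parts = [
        start,
        f"<|{audio_lang}|>",     # audio language tag, or <|unknown|>
        f"<|{task}|>",           # task tag
        f"<|{text_lang}|>",      # language of the output text
        "<|timestamps|>" if timestamps else "<|notimestamps|>",
    ]
    if task == "question-answer" and question:
        parts.append(question)   # question appended after the tags
    return "".join(parts)

print(build_task_prefix("transcribe", "en", "en", timestamps=True))
# <|startoftranscripts|><|en|><|transcribe|><|en|><|timestamps|>
```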

Supervised Finetuning

  • Manually create demonstrations consisting of raw text labels, questions, and answers for each task.

  • Utilize GPT-3.5 to generate further questions and answers based on the provided raw text labels.

  • Label each audio with an "Audio id:" prefix to handle multiple audio inputs.

  • Each statement is marked with <im_start> and <im_end>.

  • Include pure text instruction data.
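The finetuning bullets above can be sketched as a ChatML-style prompt builder. The <im_start>/<im_end> markers and "Audio id:" labeling come from the notes; the <audio> placeholder format and function name are assumptions for illustration:

```python
def format_chat(turns, audio_paths=None):
    """Render a dialogue in ChatML-style markup: each statement is
    wrapped in <im_start>...<im_end>, and multiple audio inputs are
    labeled "Audio {id}:". The <audio> placeholder is illustrative."""
    lines = []
    if audio_paths:
        for i, path in enumerate(audio_paths, start=1):
            lines.append(f"Audio {i}: <audio>{path}</audio>")
    for role, text in turns:
        lines.append(f"<im_start>{role}\n{text}<im_end>")
    return "\n".join(lines)

prompt = format_chat(
    [("user", "What sound is in the clip?"), ("assistant", "A dog barking.")],
    audio_paths=["dog.wav"],
)
```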

Chat Template

Experiment Setup

  • Multi-task Pretraining:

    • Freeze the LLM.

    • Optimize the audio encoder.

  • Supervised Finetuning:

    • Freeze the audio encoder.

    • Optimize the LLM.
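The two training stages above alternate which component is trainable. A toy pure-Python sketch of that stage-dependent freezing (a real implementation would toggle requires_grad on PyTorch parameters instead):

```python
class Component:
    """Toy stand-in for a model component with a trainable flag."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def configure_stage(audio_encoder, llm, stage):
    # Multi-task pretraining: freeze the LLM, optimize the audio encoder.
    # Supervised finetuning: freeze the audio encoder, optimize the LLM.
    if stage == "pretrain":
        audio_encoder.trainable, llm.trainable = True, False
    elif stage == "sft":
        audio_encoder.trainable, llm.trainable = False, True
    else:
        raise ValueError(f"unknown stage: {stage}")

enc = Component("whisper-large-v2-encoder")
llm = Component("qwen-7b")
configure_stage(enc, llm, "pretrain")   # encoder trainable, LLM frozen
configure_stage(enc, llm, "sft")        # encoder frozen, LLM trainable
```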

Evaluation Benchmarks

Automatic Speech Recognition

Speech-To-Text Translation

The Analysis of Word-level Timestamps Prediction

Qwen Audio

By Penut Chen (陳威廷)
