A Robust Method for Spoken-Style Text Segmentation

林榮顯, 黃純敏*, 陳硯楷, 林揚展
TANET 2023 Oral Session
1. INTRODUCTION
2. RELATED WORK
3. METHOD
4. RESULT & CONCLUSION
1. INTRODUCTION
Text segmentation
Words
Sentences
Paragraphs
Spoken-style text segmentation
STS (ASR/Human)
What is the difference?
Text segmentation vs. spoken-style text segmentation
Q: Can we apply a model that has been trained only on written data directly to spoken data?
Text segmentation (written) timeline, 2016 to 2020:
SegBot
Text Segmentation as a Supervised Learning Task
Text Segmentation by Cross Segment Attention
Datasets: Choi (0.7k) to Wiki-727K (727k), about 1000x larger
Style: Spoken expression is usually more casual and fragmented than written text, and most utterances carry sparser information than written sentences.

Noise: Manual transcription by humans is costly and time-consuming, so existing methods usually rely on ASR transcripts, which still contain a certain amount of error.
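Datasets: AMI, the spoken-style text segmentation dataset in common use as of August 2023, contains only about 100 hours of transcribed speech, whereas Wiki-727K, the commonly used written text segmentation dataset, has about 727k English Wikipedia pages. AMI's diversity and volume are therefore insufficient to support the supervised learning methods commonly used for written text segmentation.
AMI meeting corpus: 1000K words
Wiki-727K: 34442K words (about 34x larger)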
Q: Can we apply a model that has been trained only on written data directly to spoken data?
Ans: No, we can't.
2. RELATED WORK
Semantic TextTiling

Problem
1. There is no guarantee that S is a complete sentence.
S1: win and lose condition in our game will
S2: depend on how many score points can the
S1: … mounted from floor
S2: all the way to the ceiling …
S1: … splice method needs at least two
S2: arguments the first argument is the …
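Problem 1 can be reproduced with a minimal sketch of fixed-length windowing, assuming the transcript arrives as an unpunctuated word stream (as ASR output typically does). The helper name `fixed_windows` and the window size of 8 words are illustrative assumptions, not values taken from the slides.

```python
def fixed_windows(words, size):
    """Split a word stream into consecutive windows of `size` words.

    Hypothetical helper: the window size is an assumption for illustration.
    """
    return [words[i:i + size] for i in range(0, len(words), size)]


# Unpunctuated transcript text, taken from the first S1/S2 example.
transcript = (
    "win and lose condition in our game will "
    "depend on how many score points can the"
)

windows = fixed_windows(transcript.split(), size=8)
for idx, window in enumerate(windows, start=1):
    # Each window is cut purely by word count, so it can end mid-sentence.
    print(f"S{idx}: {' '.join(window)}")
```

Because the cut points depend only on word count, nothing prevents a window from ending in the middle of a clause, which is exactly what the S1/S2 pairs show.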
2. It is unclear whether the embedding space is isotropic.
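One rough way to probe the second problem is to look at the mean pairwise cosine similarity of a set of embeddings: values near 0 suggest directions are spread evenly, while values near 1 suggest the vectors sit in a narrow cone. The sketch below uses synthetic vectors only; the function names, dimensions, and the shared offset are assumptions for illustration and do not come from the slides or from any particular encoder.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    sims = [
        cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    return sum(sims) / len(sims)


rng = random.Random(0)
dim, n = 32, 100  # toy sizes, chosen only to keep the example fast

# Toy "isotropic" embeddings: directions spread evenly around the origin.
isotropic = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

# Toy "anisotropic" embeddings: the same vectors plus a shared offset,
# which pushes them all toward one direction.
offset = 5.0
anisotropic = [[x + offset for x in vec] for vec in isotropic]

print(f"isotropic:   {mean_pairwise_cosine(isotropic):.3f}")
print(f"anisotropic: {mean_pairwise_cosine(anisotropic):.3f}")
```

If the similarity between adjacent windows is measured with cosine similarity, a strongly anisotropic space would push all scores toward 1, which would presumably make topic-shift dips harder to tell apart.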
3. METHOD

4. RESULT & CONCLUSION
Metric
