Text-based audio editing in radio production

Chris Baume

Senior Research Engineer

Current radio production practice

Scale

9 national stations, 34m listeners
40 local stations, 8m listeners
1 global station in 29 languages, 269m listeners
Mostly live, but lots of pre-production and post-production

Manual transcription

Producers writing transcripts
- rough
- slow
- waste of time
Paying others to do it
- slow
- expensive
No integration with audio editors
- Alt-tab
- Two computers
- Printouts
Not much use of speech-to-text (yet!)

Academic prototypes

Whittaker, Steve, et al. "SCANMail: a voicemail interface that makes speech browsable, readable and searchable." Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 2002.

Casares, Juan, et al. "Simplifying video editing with SILVER." CHI'02 Extended Abstracts on Human Factors in Computing Systems. ACM, 2002.

Whittaker, Steve, and Brian Amento. "Semantic speech editing." Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 2004.

Berthouzoz, Floraine, Wilmot Li, and Maneesh Agrawala. "Tools for placing cuts and transitions in interview video." ACM Trans. Graph. 31.4 (2012): 67-1.

Rubin, Steve, et al. "Content-based tools for editing audio stories." Proceedings of the 26th annual ACM symposium on User interface software and technology. ACM, 2013.

Sivaraman, Venkatesh, Dongwook Yoon, and Piotr Mitros. "Simplified Audio Production in Asynchronous Voice-Based Discussions." Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 2016.

Shin, Hijung Valentina, Wilmot Li, and Frédo Durand. "Dynamic Authoring of Audio with Linked Scripts." Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM, 2016.

BBC prototypes

Prototype One

"Speech Editor Prototype"

Prototype Two

"Discourse" / "Dialogger"

Prototype Three

"The Magic Pen"

Evaluation

Method

Real radio producers creating real programmes
Contextual inquiry (interview, task observation, interview)
Two rounds:

Prototype 1 vs. Normal process (5 producers)
Prototype 2 vs. Prototype 3 vs. Paper transcript (8 producers)

Transcripts

Transcripts themselves are most useful
Automated speech-to-text is good enough for editing own content
Light correction needed for sharing
Heavy correction needed for publishing
Annotation is very important
- Highlighting (bold, underline)
- Ranking (star ratings)
- Notes (comments)
- Segmentation (paragraphs)
Drag'n'drop interface doesn't scale well

Listening

'Radio is made with your ears'
Speech-to-text is lossy
Need to spot good/bad sound 'quality'
Listening is very difficult to substitute, but slow
Faster than real-time playback
Easy ways to navigate, preview edits
Thoughtful keyboard shortcuts

Portability

Open-plan offices
- Noisy
- Distracting
Lots of screen-based working
- Boring
- Hard on the eyes
Want to work outside the office
- Cafés
- Commuter train

Collaboration

Needed to get feedback/approval
Transcripts often shared without audio
Easy way to share and listen to edits and receive feedback
Could allow new forms of content?

Software

github.com/bbc/dialogger

Dialogger design

Transcript representation: static

[
  {
    "start": 5.58,
    "end": 5.83,
    "confidence": 0.4,
    "word": "hello",
    "punct": "Hello"
  },
  {
    "start": 5.85,
    "end": 6.08,
    "confidence": 0.49,
    "word": "world",
    "punct": "world."
  }
]

JSON

<a data-start="5580" data-end="5830" data-next="5850" data-content="00:00:05">Hello </a>
<a data-start="5850" data-end="6080" data-next="6120" data-content="00:00:05">world. </a>

HTML
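The mapping between the two static representations above can be sketched as follows. This is a minimal illustration, not Dialogger's actual code: it assumes that `data-start`, `data-end` and `data-next` are milliseconds, that `data-next` is the start of the following word, and that `data-content` is the word's start time as HH:MM:SS. The function names are hypothetical, and the fallback used for the last word (its own end time) is a guess, since the slide shows a following word that is not in the JSON.

```python
import html


def to_ms(seconds: float) -> int:
    """Convert seconds to whole milliseconds."""
    return round(seconds * 1000)


def to_clock(seconds: float) -> str:
    """Format seconds as HH:MM:SS, truncating the fraction."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def words_to_html(words: list[dict]) -> str:
    """Render a list of timed words as one <a> element per word."""
    lines = []
    for i, w in enumerate(words):
        # Assumption: data-next is the start of the following word;
        # the last word falls back to its own end time.
        nxt = words[i + 1]["start"] if i + 1 < len(words) else w["end"]
        lines.append(
            f'<a data-start="{to_ms(w["start"])}" data-end="{to_ms(w["end"])}" '
            f'data-next="{to_ms(nxt)}" data-content="{to_clock(w["start"])}">'
            f'{html.escape(w["punct"])} </a>'
        )
    return "\n".join(lines)


words = [
    {"start": 5.58, "end": 5.83, "confidence": 0.4, "word": "hello", "punct": "Hello"},
    {"start": 5.85, "end": 6.08, "confidence": 0.49, "word": "world", "punct": "world."},
]
print(words_to_html(words))
```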

Transcript representation: streaming

{
  "grain_type": "event",
  "source_id": "fa15e306-ede6-4f7f-8025-3ef4191c9e13",
  "flow_id": "f058bf49-fc5b-4a05-86be-5fd4e0bf8b9a",
  "origin_timestamp": "1471604633:632000000",
  "sync_timestamp": "1468420295:0",
  "creation_timestamp": "1471604633:632000000",
  "event_payload": {
    "type": "urn:x-ipstudio:format:event.transcript",
    "topic": "/sources/64313f2c-fd6f-46a5-9f3b-1ce92a97c20f",
    "path": "/segments/db40db76-532f-48e1-93ec-bf0f6a8f1730/utterance/cc7e04cc-5aed-44c7-851a-79414aee565f",
    "pre": {},
    "post": {
      "word": "hello",
      "punctuated_word": "Hello",
      "confidence": 0.4
    }
  }
}

JSON
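A consumer of the streaming event above might pull out the word like this. The sketch makes two assumptions that the slide does not confirm: that timestamps are written as "seconds:nanoseconds", and that `post` holds the state after the event (with an empty `post` meaning no word to show). The function names are hypothetical.

```python
from decimal import Decimal
from typing import Optional

TRANSCRIPT_TYPE = "urn:x-ipstudio:format:event.transcript"


def parse_timestamp(ts: str) -> Decimal:
    """Parse 'seconds:nanoseconds' into seconds (assumed format)."""
    seconds, nanos = ts.split(":")
    return Decimal(seconds) + Decimal(nanos) / Decimal(10**9)


def word_from_grain(grain: dict) -> Optional[dict]:
    """Return the word carried by a transcript event, or None."""
    payload = grain.get("event_payload", {})
    if payload.get("type") != TRANSCRIPT_TYPE:
        return None
    post = payload.get("post") or {}
    if "word" not in post:
        return None
    return {
        "time": parse_timestamp(grain["origin_timestamp"]),
        "word": post["word"],
        "punct": post.get("punctuated_word", post["word"]),
        "confidence": post.get("confidence"),
    }
```

Using `Decimal` rather than `float` keeps the nanosecond part exact, which matters when timestamps are large epoch values.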

Text editing

Requirements:

Correction (i.e. text editing)
while retaining timestamps
Editing (i.e. audio editing)
Live preview of edits
Display/edit speaker segments
Display timestamps
Export EDL

Solution:

Use CKEditor for text editing with restricted functionality
- No cut/copy/paste or dragging
- Replace only
- Selections jump to start/end of words
- Generate new timestamps when replacing >1 word
Edit audio using bold/underline
Use HTML5-video-compositor for preview
JSON->HTML, HTML->JSON
JSON->EDL
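Generating new timestamps when a replacement spans more than one word could work along these lines. The slides do not say how Dialogger distributes the time, so this sketch assumes one plausible scheme: the replacement words share the time span of the words they replace, in proportion to their character counts. The function name `retime` is hypothetical.

```python
def retime(replaced: list[dict], new_words: list[str]) -> list[dict]:
    """Spread new words across the time span of the replaced words."""
    if not replaced or not new_words or not all(new_words):
        raise ValueError("need at least one replaced word and non-empty new words")
    start = replaced[0]["start"]
    end = replaced[-1]["end"]
    total_chars = sum(len(text) for text in new_words)

    result = []
    cursor = start
    for i, text in enumerate(new_words):
        is_last = i == len(new_words) - 1
        share = (end - start) * len(text) / total_chars
        # Pin the last word to the original end to avoid rounding drift.
        word_end = end if is_last else cursor + share
        result.append({
            "start": cursor,
            "end": word_end,
            "word": text.strip(".,!?;:").lower(),
            "punct": text,
        })
        cursor = word_end
    return result
```

The outer boundaries of the replaced span are preserved, so audio edits made on either side of the correction still line up.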

Next steps

Idea braindump

in no particular order...

Common base UI element for timed transcript editing
Google Docs style collaborative timed transcript editor/player
Better, meaningful annotations (e.g. rate segments, export >4*)
Template for EDL file generation
Embed transcript and annotations in audio file
Umm detection/removal (STT with umms?)
Automatic segmentation with tagging and summaries
Better time compression
Tools for recording multiple versions of a script
Digital pen with audio playback, natural annotation and live bidirectional sync
Smart correction by exposing STT graphs
Fast clipping of a live audio stream using transcripts
Bidirectional integration with a proper audio editing system

textav
