Text-based audio editing in radio production

Chris Baume

Senior Research Engineer

Current radio production practice


  • 9 national stations, 34m listeners
  • 40 local stations, 8m listeners
  • 1 global station in 29 languages, 269m listeners
  • Mostly live, but lots of pre-production and post-production

Manual transcription

  • Producers writing transcripts
    • rough
    • slow
    • waste of time
  • Paying others to do it
    • slow
    • expensive
  • No integration with audio editors
    • Alt-tab
    • Two computers
    • Printouts
  • Not much use of speech-to-text (yet!)

Academic prototypes

Whittaker, Steve, et al. "SCANMail: a voicemail interface that makes speech browsable, readable and searchable." Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 2002.

Casares, Juan, et al. "Simplifying video editing with SILVER." CHI'02 Extended Abstracts on Human Factors in Computing Systems. ACM, 2002.

Whittaker, Steve, and Brian Amento. "Semantic speech editing." Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 2004.

Berthouzoz, Floraine, Wilmot Li, and Maneesh Agrawala. "Tools for placing cuts and transitions in interview video." ACM Trans. Graph. 31.4 (2012): Article 67.

Rubin, Steve, et al. "Content-based tools for editing audio stories." Proceedings of the 26th annual ACM symposium on User interface software and technology. ACM, 2013.

Sivaraman, Venkatesh, Dongwook Yoon, and Piotr Mitros. "Simplified Audio Production in Asynchronous Voice-Based Discussions." Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 2016.

Shin, Hijung Valentina, Wilmot Li, and Frédo Durand. "Dynamic Authoring of Audio with Linked Scripts." Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM, 2016.

BBC prototypes

Prototype One

"Speech Editor Prototype"

Prototype Two

"Discourse" / "Dialogger"

Prototype Three

"The Magic Pen"



  • Real radio producers creating real programmes
  • Contextual inquiry (interview, task observation, interview)
  • Two rounds:
    1. Prototype 1 vs. normal process (5 producers)
    2. Prototype 2 vs. Prototype 3 vs. paper transcript (8 producers)


  • Transcripts themselves are most useful
  • Automated speech-to-text is good enough for editing own content
  • Light correction needed for sharing
  • Heavy correction needed for publishing
  • Annotation is very important
    • Highlighting (bold, underline)
    • Ranking (star ratings)
    • Notes (comments)
    • Segmentation (paragraphs)
  • Drag'n'drop interface doesn't scale well


  • 'Radio is made with your ears'
  • Speech-to-text is lossy
  • Need to spot good/bad sound 'quality'
  • Listening is very difficult to substitute, but slow
  • Faster than real-time playback
  • Easy ways to navigate, preview edits
  • Thoughtful keyboard shortcuts


  • Open-plan offices
    • Noisy
    • Distracting
  • Lots of screen-based working
    • Boring
    • Hard on the eyes
  • Want to work outside the office
    • Cafés
    • Commuter train


  • Needed to get feedback/approval
  • Transcripts often shared without audio
  • Easy way to share and listen to edits and receive feedback
  • Could allow new forms of content?



Dialogger design

Transcript representation: static

    [
      {
        "start": 5.58,
        "end": 5.83,
        "confidence": 0.4,
        "word": "hello",
        "punct": "Hello"
      },
      {
        "start": 5.85,
        "end": 6.08,
        "confidence": 0.49,
        "word": "world",
        "punct": "world."
      }
    ]


<a data-start="5580" data-end="5830" data-next="5850" data-content="00:00:05">Hello </a>
<a data-start="5850" data-end="6080" data-next="6120" data-content="00:00:05">world. </a>
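
As a rough illustration of how the word-level JSON could map onto the anchor markup, here is a minimal Python sketch. The function names `timecode` and `render_words` are hypothetical, and two details are assumptions rather than anything the slides confirm: that `data-next` holds the start time of the following word, and that the last word falls back to its own end time. It is not Dialogger's actual code.

```python
from html import escape


def timecode(seconds: float) -> str:
    """Format a time in seconds as hh:mm:ss (fractions are dropped)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def render_words(words: list[dict]) -> list[str]:
    """Turn word dicts (start/end in seconds, punct) into <a> elements."""
    lines = []
    for i, w in enumerate(words):
        # Times are stored in seconds in the JSON and in milliseconds in the HTML.
        start_ms = round(w["start"] * 1000)
        end_ms = round(w["end"] * 1000)
        # Assumption: data-next is the start of the following word; the last
        # word has no successor, so this sketch falls back to its own end.
        if i + 1 < len(words):
            next_ms = round(words[i + 1]["start"] * 1000)
        else:
            next_ms = end_ms
        lines.append(
            f'<a data-start="{start_ms}" data-end="{end_ms}" '
            f'data-next="{next_ms}" data-content="{timecode(w["start"])}">'
            f'{escape(w["punct"])} </a>'
        )
    return lines


words = [
    {"start": 5.58, "end": 5.83, "confidence": 0.4, "word": "hello", "punct": "Hello"},
    {"start": 5.85, "end": 6.08, "confidence": 0.49, "word": "world", "punct": "world."},
]
for line in render_words(words):
    print(line)
```

Keeping the timings in `data-*` attributes means the text of a word can be corrected without touching its timestamps.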


Transcript representation: streaming

  "grain_type": "event",
  "source_id": "fa15e306-ede6-4f7f-8025-3ef4191c9e13",
  "flow_id": "f058bf49-fc5b-4a05-86be-5fd4e0bf8b9a",
  "origin_timestamp": "1471604633:632000000",
  "sync_timestamp": "1468420295:0",
  "creation_timestamp": "1471604633:632000000",
  "event_payload": {
    "type": "urn:x-ipstudio:format:event.transcript",
    "topic": "/sources/64313f2c-fd6f-46a5-9f3b-1ce92a97c20f",
    "path": "/segments/db40db76-532f-48e1-93ec-bf0f6a8f1730/
    "pre": {},
    "post": {
      "word": "hello",
      "punctuated_word": "Hello",
      "confidence": 0.4
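
A consumer of such events might pull out the word and its timestamp along the following lines. This is a sketch under stated assumptions: the timestamp strings are read as `seconds:nanoseconds` (inferred only from the values shown), the `post` object is taken to carry the word delivered by the event, and the helper names `parse_timestamp` and `word_from_grain` are hypothetical. The sample grain keeps only the fields the sketch reads.

```python
def parse_timestamp(value: str) -> tuple[int, int]:
    """Split a 'seconds:nanoseconds' string into two integers (assumed format)."""
    seconds, nanoseconds = value.split(":")
    return int(seconds), int(nanoseconds)


def word_from_grain(grain: dict):
    """Return (timestamp, punctuated word, confidence), or None if no word.

    Assumption: the 'post' object holds the word carried by the event.
    """
    payload = grain.get("event_payload", {})
    post = payload.get("post") or {}
    if grain.get("grain_type") != "event" or "word" not in post:
        return None
    return (
        parse_timestamp(grain["origin_timestamp"]),
        post.get("punctuated_word", post["word"]),
        post.get("confidence"),
    )


# Reduced sample built from the fields shown on the slide.
grain = {
    "grain_type": "event",
    "origin_timestamp": "1471604633:632000000",
    "event_payload": {
        "type": "urn:x-ipstudio:format:event.transcript",
        "pre": {},
        "post": {"word": "hello", "punctuated_word": "Hello", "confidence": 0.4},
    },
}
print(word_from_grain(grain))
```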


Text editing


  • Correction (i.e. text editing) while retaining timestamps
  • Editing (i.e. audio editing)
  • Live preview of edits
  • Display/edit speaker segments
  • Display timestamps
  • Export EDL
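
One way to picture the step from an edited transcript to an edit decision list is to merge runs of kept words into time segments. The sketch below assumes each word carries a hypothetical `keep` flag set by the editor; the function name `kept_segments` and the sample words after "world" are made up for illustration. It does not target any particular EDL file format.

```python
def kept_segments(words: list[dict]) -> list[tuple[float, float]]:
    """Merge runs of consecutive kept words into (start, end) segments."""
    segments = []
    run_start = None
    run_end = None
    for w in words:
        if w["keep"]:
            # Open a new run if needed, then extend it to this word's end.
            if run_start is None:
                run_start = w["start"]
            run_end = w["end"]
        elif run_start is not None:
            # A removed word closes the current run.
            segments.append((run_start, run_end))
            run_start = None
    if run_start is not None:
        segments.append((run_start, run_end))
    return segments


# Illustrative data: the last two words and their timings are invented.
words = [
    {"word": "hello", "start": 5.58, "end": 5.83, "keep": True},
    {"word": "world", "start": 5.85, "end": 6.08, "keep": True},
    {"word": "um", "start": 6.12, "end": 6.30, "keep": False},
    {"word": "again", "start": 6.40, "end": 6.75, "keep": True},
]
print(kept_segments(words))
```

Each resulting pair could then be written out as one event in whichever EDL format the target audio editor expects.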




  • Use CKEditor for text editing with restricted functionality
    • No cut/copy/paste or dragging
    • Replace only
    • Selections jump to start/end of words
    • Generate new timestamps when replacing >1 word
  • Edit audio using bold/underline
  • Use HTML5-video-compositor for preview
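
The slides do not say how new timestamps are generated when several words are replaced. One simple possibility, shown here purely as an assumption, is to spread the replacement words evenly across the time span of the words they replace. The function name `retime` is hypothetical.

```python
def retime(replaced: list[dict], new_words: list[str]) -> list[dict]:
    """Spread new words evenly over the time span of the replaced words."""
    if not replaced or not new_words:
        return []
    span_start = replaced[0]["start"]
    span_end = replaced[-1]["end"]
    # Assumption: every replacement word gets an equal share of the span.
    step = (span_end - span_start) / len(new_words)
    return [
        {
            "word": text,
            "start": round(span_start + i * step, 3),
            "end": round(span_start + (i + 1) * step, 3),
        }
        for i, text in enumerate(new_words)
    ]


replaced = [
    {"word": "hello", "start": 5.58, "end": 5.83},
    {"word": "world", "start": 5.85, "end": 6.08},
]
for w in retime(replaced, ["hello", "wide", "world"]):
    print(w)
```

Weighting each word by its character count instead of splitting evenly would be an equally plausible variant. Either way, the outer boundaries of the span stay fixed, so the surrounding audio edit points are unaffected.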

Next steps

Idea braindump

in no particular order...

  • Common base UI element for timed transcript editing
  • Google Docs-style collaborative timed transcript editor/player
  • Better, meaningful annotations (e.g. rate segments, export >4*)
  • Template for EDL file generation
  • Embed transcript and annotations in audio file
  • Umm detection/removal (STT with umms?)
  • Automatic segmentation with tagging and summaries
  • Better time compression
  • Tools for recording multiple versions of a script
  • Digital pen with audio playback, natural annotation and live bidirectional sync
  • Smart correction by exposing STT graphs
  • Fast clipping of a live audio stream using transcripts
  • Bidirectional integration with a proper audio editing system


By chrisbaume

