Text-based audio editing in radio production
Chris Baume
Senior Research Engineer
Current radio production practice
Scale
- 9 national stations, 34m listeners
- 40 local stations, 8m listeners
- 1 global station in 29 languages, 269m listeners
- Mostly live, but lots of pre-production and post-production
Manual transcription
- Producers writing transcripts
  - rough
  - slow
  - waste of time
- Paying others to do it
  - slow
  - expensive
- No integration with audio editors
  - Alt-tab
  - Two computers
  - Printouts
- Not much use of speech-to-text (yet!)
Academic prototypes
Whittaker, Steve, et al. "SCANMail: a voicemail interface that makes speech browsable, readable and searchable." Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 2002.
Casares, Juan, et al. "Simplifying video editing with SILVER." CHI'02 Extended Abstracts on Human Factors in Computing Systems. ACM, 2002.
Whittaker, Steve, and Brian Amento. "Semantic speech editing." Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 2004.
Berthouzoz, Floraine, Wilmot Li, and Maneesh Agrawala. "Tools for placing cuts and transitions in interview video." ACM Trans. Graph. 31.4 (2012): 67-1.
Rubin, Steve, et al. "Content-based tools for editing audio stories." Proceedings of the 26th annual ACM symposium on User interface software and technology. ACM, 2013.
Sivaraman, Venkatesh, Dongwook Yoon, and Piotr Mitros. "Simplified Audio Production in Asynchronous Voice-Based Discussions." Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 2016.
Shin, Hijung Valentina, Wilmot Li, and Frédo Durand. "Dynamic Authoring of Audio with Linked Scripts." Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM, 2016.
BBC prototypes
Prototype One
"Speech Editor Prototype"
Prototype Two
"Discourse" / "Dialogger"
Prototype Three
"The Magic Pen"
Evaluation
Method
- Real radio producers creating real programmes
- Contextual inquiry (interview, task observation, interview)
- Two rounds:
  - Prototype 1 vs. Normal process (5 producers)
  - Prototype 2 vs. Prototype 3 vs. Paper transcript (8 producers)
Transcripts
- Transcripts themselves are most useful
- Automated speech-to-text is good enough for editing own content
  - Light correction needed for sharing
  - Heavy correction needed for publishing
- Annotation is very important
  - Highlighting (bold, underline)
  - Ranking (star ratings)
  - Notes (comments)
  - Segmentation (paragraphs)
- Drag'n'drop interface doesn't scale well
Listening
- 'Radio is made with your ears'
  - Speech-to-text is lossy
  - Need to spot good/bad sound 'quality'
- Very difficult to substitute, but slow
  - Faster than real-time playback
  - Easy ways to navigate, preview edits
  - Thoughtful keyboard shortcuts
Portability
- Open-plan offices
  - Noisy
  - Distracting
- Lots of screen-based working
  - Boring
  - Hard on the eyes
- Want to work outside the office
  - Cafés
  - Commuter train
Collaboration
- Needed to get feedback/approval
- Transcripts often shared without audio
- Easy way to share and listen to edits and receive feedback
- Could allow new forms of content?
Software
github.com/bbc/dialogger
Dialogger design
Transcript representation: static
[
{
"start": 5.58,
"end": 5.83,
"confidence": 0.4,
"word": "hello",
"punct": "Hello"
},
{
"start": 5.85,
"end": 6.08,
"confidence": 0.49,
"word": "world",
"punct": "world."
}
]
JSON
<a data-start="5580" data-end="5830" data-next="5850" data-content="00:00:05">Hello </a>
<a data-start="5850" data-end="6080" data-next="6120" data-content="00:00:05">world. </a>
HTML
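The mapping from the JSON word list to the HTML markup above can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not Dialogger's own code: the helper names `ms`, `clock` and `words_to_html` are hypothetical, `data-next` is assumed to hold the start time of the following word, and `data-content` is assumed to be the word's start time formatted as HH:MM:SS.

```python
import html
import json

def ms(seconds):
    """Convert seconds to whole milliseconds, rounding away float error."""
    return round(seconds * 1000)

def clock(seconds):
    """Format a time in seconds as HH:MM:SS, dropping the fraction."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"

def words_to_html(words):
    """Render each timed word as an <a> element carrying its timings."""
    lines = []
    for i, w in enumerate(words):
        # Assumption: data-next is the next word's start time. The last word
        # has no successor here, so fall back to its own end time.
        nxt = words[i + 1]["start"] if i + 1 < len(words) else w["end"]
        lines.append(
            f'<a data-start="{ms(w["start"])}" data-end="{ms(w["end"])}" '
            f'data-next="{ms(nxt)}" data-content="{clock(w["start"])}">'
            f'{html.escape(w["punct"])} </a>'
        )
    return "\n".join(lines)

transcript = json.loads("""
[
  {"start": 5.58, "end": 5.83, "confidence": 0.4,  "word": "hello", "punct": "Hello"},
  {"start": 5.85, "end": 6.08, "confidence": 0.49, "word": "world", "punct": "world."}
]
""")

print(words_to_html(transcript))
```

The first line of output matches the slide's markup. The second differs only in `data-next`, because the slide's transcript continues past "world." while this two-word sample does not.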
Transcript representation: streaming
{
"grain_type": "event",
"source_id": "fa15e306-ede6-4f7f-8025-3ef4191c9e13",
"flow_id": "f058bf49-fc5b-4a05-86be-5fd4e0bf8b9a",
"origin_timestamp": "1471604633:632000000",
"sync_timestamp": "1468420295:0",
"creation_timestamp": "1471604633:632000000",
"event_payload": {
"type": "urn:x-ipstudio:format:event.transcript",
"topic": "/sources/64313f2c-fd6f-46a5-9f3b-1ce92a97c20f",
"path": "/segments/db40db76-532f-48e1-93ec-bf0f6a8f1730/utterance/cc7e04cc-5aed-44c7-851a-79414aee565f",
"pre": {},
"post": {
"word": "hello",
"punctuated_word": "Hello",
"confidence": 0.4
}
}
}
JSON
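A consumer of this stream has to pull the word and its timing out of each event. The sketch below shows one way that could look in Python. It rests on assumptions the slide does not confirm: timestamps are read as `seconds:nanoseconds`, `post` is taken to be the state after the event, and `parse_timestamp` and `grain_to_word` are hypothetical helper names.

```python
import json

TRANSCRIPT_TYPE = "urn:x-ipstudio:format:event.transcript"
NANOS_PER_SECOND = 1_000_000_000

def parse_timestamp(ts):
    """Convert a 'seconds:nanoseconds' string to integer nanoseconds.
    Assumption: this is the layout of the timestamp fields."""
    seconds, nanos = ts.split(":")
    return int(seconds) * NANOS_PER_SECOND + int(nanos)

def grain_to_word(grain):
    """Return a flat word record, or None if the grain carries no word."""
    payload = grain.get("event_payload", {})
    if grain.get("grain_type") != "event" or payload.get("type") != TRANSCRIPT_TYPE:
        return None
    # Assumption: "post" holds the new state; an empty "post" has no word.
    post = payload.get("post") or {}
    if "word" not in post:
        return None
    return {
        "origin_ns": parse_timestamp(grain["origin_timestamp"]),
        "path": payload["path"],
        "word": post["word"],
        "punct": post["punctuated_word"],
        "confidence": post["confidence"],
    }

grain = json.loads("""
{
  "grain_type": "event",
  "source_id": "fa15e306-ede6-4f7f-8025-3ef4191c9e13",
  "flow_id": "f058bf49-fc5b-4a05-86be-5fd4e0bf8b9a",
  "origin_timestamp": "1471604633:632000000",
  "sync_timestamp": "1468420295:0",
  "creation_timestamp": "1471604633:632000000",
  "event_payload": {
    "type": "urn:x-ipstudio:format:event.transcript",
    "topic": "/sources/64313f2c-fd6f-46a5-9f3b-1ce92a97c20f",
    "path": "/segments/db40db76-532f-48e1-93ec-bf0f6a8f1730/utterance/cc7e04cc-5aed-44c7-851a-79414aee565f",
    "pre": {},
    "post": {"word": "hello", "punctuated_word": "Hello", "confidence": 0.4}
  }
}
""")

word = grain_to_word(grain)
print(word["punct"], word["origin_ns"])
```

Keeping the timestamp as integer nanoseconds avoids the rounding that a float would introduce at this magnitude.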
Text editing
Requirements:
- Correction (i.e. text editing) while retaining timestamps
- Editing (i.e. audio editing)
- Live preview of edits
- Display/edit speaker segments
- Display timestamps
- Export EDL
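The audio-editing and EDL-export requirements amount to turning a selection of timed words into a list of source ranges. A minimal Python sketch of that idea follows. The per-word `keep` flag, the millisecond integer times, the sample words after "world" and the output field names (`source_in`, `record_in`, and so on) are all assumptions made for illustration; the output is a generic list of cuts, not any specific EDL file format.

```python
def words_to_segments(words):
    """Merge runs of consecutive kept words into (start, end) source ranges."""
    segments = []
    run_open = False
    for w in words:
        if w.get("keep"):
            if run_open:
                # Extend the current run, keeping the pause between the words.
                segments[-1]["end"] = w["end"]
            else:
                segments.append({"start": w["start"], "end": w["end"]})
                run_open = True
        else:
            run_open = False
    return segments

def segments_to_edl(segments):
    """Place the segments back to back on the output timeline."""
    edl = []
    position = 0
    for s in segments:
        duration = s["end"] - s["start"]
        edl.append({
            "source_in": s["start"],
            "source_out": s["end"],
            "record_in": position,
            "record_out": position + duration,
        })
        position += duration
    return edl

# Times are integer milliseconds. "um" and "again" are made-up sample words.
words = [
    {"start": 5580, "end": 5830, "punct": "Hello", "keep": True},
    {"start": 5850, "end": 6080, "punct": "world.", "keep": True},
    {"start": 6120, "end": 6300, "punct": "Um,", "keep": False},
    {"start": 6350, "end": 6700, "punct": "again.", "keep": True},
]

edl = segments_to_edl(words_to_segments(words))
for cut in edl:
    print(cut)
```

Merging adjacent kept words into one range means the natural pauses inside a run survive the edit, and only the boundaries of a run become cut points.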
Solution:
- Use CKEditor for text editing with restricted functionality
  - No cut/copy/paste or dragging
  - Replace only
  - Selections jump to start/end of words
  - Generate new timestamps when replacing >1 word
- Edit audio using bold/underline
- Use HTML5-video-compositor for preview
- JSON->HTML, HTML->JSON
- JSON->EDL
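The slides do not say how new timestamps are generated when several words are replaced. One plausible approach, sketched below in Python, spreads the time span of the replaced words across the new words in proportion to their character length. The function name `retime`, the proportional rule and the integer-millisecond times are assumptions made for illustration, not a description of Dialogger's behaviour.

```python
def retime(replaced, new_words):
    """Assign timestamps to new_words within the span covered by replaced.

    replaced:  the original timed words being overwritten (in order)
    new_words: the replacement text, one string per word
    Assumption: time is shared out in proportion to character count.
    """
    total_chars = sum(len(w) for w in new_words)
    if not replaced or total_chars == 0:
        raise ValueError("replacement needs at least one old and one new word")

    span_start = replaced[0]["start"]
    span_end = replaced[-1]["end"]
    result = []
    cursor = span_start
    used = 0
    for w in new_words:
        used += len(w)
        # Integer arithmetic keeps the final word ending exactly at span_end.
        end = span_start + (span_end - span_start) * used // total_chars
        result.append({"start": cursor, "end": end, "punct": w})
        cursor = end
    return result

# Times are integer milliseconds, taken from the two-word example transcript.
old = [
    {"start": 5580, "end": 5830, "punct": "Hello"},
    {"start": 5850, "end": 6080, "punct": "world."},
]

for word in retime(old, ["Hi", "there", "everyone."]):
    print(word)
```

Because the replacement always fills the original span exactly, the words before and after the edit keep their timestamps, which is what lets correction and audio editing coexist.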
Next steps
Idea braindump
in no particular order...
- Common base UI element for timed transcript editing
- Google Docs-style collaborative timed transcript editor/player
- Better, meaningful annotations (e.g. rate segments, export >4*)
- Template for EDL file generation
- Embed transcript and annotations in audio file
- Umm detection/removal (STT with umms?)
- Automatic segmentation with tagging and summaries
- Better time compression
- Tools for recording multiple versions of a script
- Digital pen with audio playback, natural annotation and live bidirectional sync
- Smart correction by exposing STT graphs
- Fast clipping of a live audio stream using transcripts
- Bidirectional integration with a proper audio editing system