RE: Automating Multi-Lingual and Multi-Speaker Closed-Captioning and Transcripting Workflow with srt2vtt
Also: srt2vtt @host.srt *.srt *.vtt o:output.vtt output.htm {normalize}
Inputs can be SRT or VTT; "normalize" submits each phrase of the transcript to an AI punctuator.
I think that is not going to give good results, because you're feeding the punctuator only parts of a sentence. Which brings me to the next item: caption standards. Captions for each speaker should start with a sentence.
Well, we can let that be for now, so we don't get carried away before we start building this!
But in essence, well-spoken speakers should start their answers or lines with a meaningful sentence anyway.
However, even the best speaker will need time to think about an answer and will fill the void with words or utterances. Maybe a good machine learning package could mark such parts and delete them for us automagically, so the whole process could be fully automated.
Captions are submitted after the initial transcript is created, which means the text submitted is the entire stream of text between one speaker and the next...
i.e. <speaker1: this whole block of text will be sent as a single block for additional punctuation>
<speaker2: this whole block of text will be sent next>
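To make the "one block per speaker turn" idea concrete, here is a rough Python sketch. It is not srt2vtt's actual code; the (speaker, text) tuple shape and the function name speaker_blocks are just assumptions for illustration. It merges consecutive cues from the same speaker so the punctuator would see whole turns instead of fragments.

```python
from itertools import groupby


def speaker_blocks(cues):
    """Merge consecutive cues from the same speaker into one text block.

    `cues` is a list of (speaker, text) tuples in playback order.
    This input shape is assumed for the sketch, not taken from srt2vtt.
    """
    blocks = []
    # groupby only groups *adjacent* items, which is what we want here:
    # a speaker who talks again later starts a new block.
    for speaker, group in groupby(cues, key=lambda cue: cue[0]):
        text = " ".join(part.strip() for _, part in group)
        blocks.append((speaker, text))
    return blocks


if __name__ == "__main__":
    sample = [
        ("speaker1", "so what I wanted"),
        ("speaker1", "to say is this"),
        ("speaker2", "right I agree"),
    ]
    for speaker, text in speaker_blocks(sample):
        print(f"<{speaker}: {text}>")
```

Each resulting block is what would be handed to the punctuator in one request.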
Well, if you want to go that far, there are some programs I've used that can remove such extraneous "noise", and do quite an amazing job at it. The best ones, however, tend to be quite expensive, and they generally mute/muffle based on a noise print, as opposed to outright cutting text.
For longer words, or even filler phrases, that also starts to enter much more subjective territory. Either way, it still makes the case for pre-filtering the audio tracks before processing to remove as much of this extraneous "noise" as possible. It seems you may have also missed that aspect of my post....
Why don't you remove all the extraneous noise and filler words using a multitrack audio editor over all tracks simultaneously before running through speech-to-text?
Well, it just makes no sense to prefilter when, with speech-to-text, you can see where the actual words are in time and filter the noise out accordingly. That's what you can see in the YouTube web app, but unfortunately it doesn't let you edit the audio. If any multitrack editor could show the words alongside the audio like the YT web app does, that would be excellent.
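For what it's worth, once you have per-word timings, finding the regions to cut is simple. A minimal sketch, assuming the words arrive as (start_sec, end_sec, word) tuples and that the filler list below is what you want removed; both the tuple shape and the FILLERS set are made up for this example. It only computes the time ranges, the actual cutting would still happen in the audio editor.

```python
# Hypothetical filler list; extend or replace to taste.
FILLERS = {"um", "uh", "er", "ahem"}


def filler_ranges(words, pad=0.0):
    """Return merged (start, end) ranges, in seconds, covering filler words.

    `words` is a list of (start_sec, end_sec, word) tuples in time order.
    `pad` widens each range on both sides to catch breaths around a filler.
    """
    ranges = []
    for start, end, word in words:
        if word.lower().strip(".,!?") not in FILLERS:
            continue
        s, e = max(0.0, start - pad), end + pad
        if ranges and s <= ranges[-1][1]:
            # Touches or overlaps the previous range: extend it instead.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], e))
        else:
            ranges.append((s, e))
    return ranges


if __name__ == "__main__":
    timed = [(0.0, 0.4, "Um,"), (0.4, 0.7, "uh"), (0.7, 1.2, "hello")]
    print(filler_ranges(timed))
```

The same ranges would have to be cut from every track at once to keep a multitrack session in sync.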
Just wanted to add these few links we discussed, for reference:
Link: How to Use Truncate Silence and Sound Smarter with Audacity
Link: Howto Truncate Silence in Audacity
Link: Deep Learning 'ahem' detector (github project)
Not sure I understand this, but exporting YouTube captions via WebVTT gave me the individual timings of each word shown in the auto-captioning.
The multitrack editing (cutting across all tracks) phase would take place BEFORE any captioning takes place, and you would have already removed any of those undesirable "artifacts".
Interestingly, though, you might be able to do it in reverse too: get the combined transcript and convert it into Audacity labels. http://wiki.audacityteam.org/wiki/Movie_subtitles_(*.SRT)
Even if all the artifacts aren't there, maybe it makes it easier to find some of them to cut out.
btw, also added a feature to srt2vtt that converts caption files to the Audacity label format: it outputs Audacity-compatible text labels from a captions file.
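For anyone curious what that conversion involves, here is a rough standalone Python sketch. It is not the srt2vtt implementation, and the function names are invented for this example. An Audacity label track file is plain text with one label per line: start time, end time, and label text, separated by tabs, with times in seconds.

```python
import re

# Matches SRT (comma) and VTT (dot) style timestamps: HH:MM:SS,mmm
TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert an 'HH:MM:SS,mmm' timestamp to seconds as a float."""
    h, m, s, ms = map(int, TIME.match(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def srt_to_labels(srt_text):
    """Turn SRT caption text into Audacity label text (tab-separated)."""
    out = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        idx = next((i for i, ln in enumerate(lines) if "-->" in ln), None)
        if idx is None:
            continue  # not a cue (e.g. a header or stray text)
        start, end = lines[idx].split("-->")
        end = end.split()[0]  # ignore anything after the end time
        text = " ".join(ln.strip() for ln in lines[idx + 1:])
        out.append(f"{to_seconds(start):.3f}\t{to_seconds(end):.3f}\t{text}")
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    sample = "1\n00:00:01,500 --> 00:00:04,000\nHello there\n"
    print(srt_to_labels(sample), end="")
```

Save the output as a .txt file and import it into Audacity as a label track; multi-line captions are joined into a single label.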