You are viewing a single comment's thread from:

RE: Automating Multi-Lingual and Multi-Speaker Closed-Captioning and Transcripting Workflow with srt2vtt

in #beyondbitcoin · 7 years ago

Also: srt2vtt @host.srt *.srt *.vtt o:output.vtt output.htm {normalize}. Inputs can be SRT or VTT; "normalize" submits each phrase of the transcript to an AI punctuator.

That is, I think, not going to get good results, because you're feeding the punctuator just parts of a sentence. Which brings me to the next item: caption standards. Captions for each speaker should start with a sentence.

Well, we can let that be for now, so we don't go crazy before we start building this!

But in essence, well-spoken speakers should start their answers or lines with a meaningful sentence anyway.

However, even the best speaker will need time to think about an answer and will fill the void with filler words or utterances. Maybe a good machine learning package could mark such parts and delete them for us automagically... Then the whole process could be fully automated.


captions are submitted after the initial transcript is created, which means the text submitted is the entire stream of text between one speaker and the next...

i.e. <speaker1: this whole block of text will be sent as a single block for additional punctuation>
<speaker2: this whole block of text will be sent next>
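Roughly, that grouping step looks something like the sketch below. This is not srt2vtt's actual code, just an illustration of the idea; restore_punctuation() here is a placeholder for whatever punctuator service actually gets called.

```python
def restore_punctuation(text):
    # Placeholder for the AI punctuator call (whatever service srt2vtt uses).
    return text

def punctuate_by_speaker(cues):
    """cues: list of (speaker, text) pairs in caption order."""
    blocks = []
    for speaker, text in cues:
        if blocks and blocks[-1][0] == speaker:
            blocks[-1][1].append(text)        # same speaker: extend the current block
        else:
            blocks.append((speaker, [text]))  # speaker change: start a new block
    # One punctuator call per speaker block, not one per caption line.
    return [(speaker, restore_punctuation(" ".join(parts)))
            for speaker, parts in blocks]

cues = [
    ("speaker1", "this whole block of text"),
    ("speaker1", "will be sent as a single block for additional punctuation"),
    ("speaker2", "this whole block of text will be sent next"),
]
for speaker, block in punctuate_by_speaker(cues):
    print(f"{speaker}: {block}")
```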

Well, if you want to go that far, there are some programs I've used that can remove such extraneous "noise", and do quite an amazing job at it. The best ones, however, tend to be quite expensive, and they generally mute/muffle based on a noise print, as opposed to outright cutting text.

For longer words, or even filler phrases, that also starts to enter much more subjective territory. Either way, it still makes the case for pre-filtering the audio tracks before processing to remove as much of this extraneous "noise" as possible. It seems you may have also missed that aspect of my post....

Edit the resulting file by removing redundant words

Why don't you remove all the extraneous noise and filler words using a multitrack audio editor over all tracks simultaneously before running through speech-to-text?

Well, it just makes no sense to pre-filter when, with speech-to-text, you can see where the actual words are in time and filter the noise out accordingly. That's what you can see in the YouTube web app, but unfortunately it doesn't let you edit the audio. If any multitrack editor could show the words alongside the audio like the YT web app does, that would be excellent.

not sure I understand this, but exporting YouTube captions via WebVTT gives me the individual timings of each word shown in the auto-captioning. The multitrack editing (cutting across all tracks) phase would take place BEFORE any captioning takes place, and you would have already removed any of those undesirable "artifacts".
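If the goal is to use those word timings to find the fillers to cut, a rough sketch might look like the one below. It assumes the exported VTT puts an inline <HH:MM:SS.mmm> timestamp before each word (roughly what YouTube's word-timed export looks like) and that the filler list is just a guess; adjust both if the real export differs.

```python
import re

# ASSUMPTION: each word is preceded by an inline <HH:MM:SS.mmm> timestamp and
# styling tags like <c>...</c> can simply be stripped; adjust if the real
# export differs. The filler list is just a guess at "utterances" worth cutting.

FILLERS = {"um", "uh", "erm", "hmm"}

def word_timings(vtt_text):
    """Return (start_seconds, word) pairs from the inline word timestamps."""
    text = re.sub(r"</?c[^>]*>", "", vtt_text)  # drop <c> styling tags
    pairs = re.findall(r"<(\d{2}):(\d{2}):(\d{2})\.(\d{3})>([^<\n]+)", text)
    return [(int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0, w.strip())
            for h, m, s, ms, w in pairs]

def filler_ranges(vtt_text, tail=0.3):
    """Yield (start, end, word) for likely fillers; end is the next word's start."""
    timed = word_timings(vtt_text)
    for i, (start, word) in enumerate(timed):
        if word.lower().strip(".,!?") in FILLERS:
            end = timed[i + 1][0] if i + 1 < len(timed) else start + tail
            yield (start, end, word)
```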

Interestingly though, you might be able to do it that way too in reverse: get the combined transcript and convert it into Audacity labels... http://wiki.audacityteam.org/wiki/Movie_subtitles_(*.SRT)

Even if all the artifacts aren't there, maybe it makes it easier to find some of them to cut out.

btw, also added a feature to srt2vtt that converts caption files to the Audacity label format:

  • srt2vtt audacity transcript.{srt|vtt}
    outputs audacity-compatible text labels from captions file
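For anyone curious what that conversion amounts to: the Audacity label format is just tab-separated start seconds, end seconds, and label text, one label per line. A minimal stand-alone sketch of the idea (not srt2vtt's actual code) could be:

```python
import re

# Minimal sketch of an SRT -> Audacity label conversion. Audacity's label file
# is tab-separated start seconds, end seconds, and label text, one label per
# line. This is NOT srt2vtt's implementation, only an illustration of the format.

TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")

def to_seconds(stamp):
    h, m, s, ms = TIME.search(stamp).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def srt_to_labels(srt_text):
    labels = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [l for l in block.splitlines() if l.strip()]
        # Find the timing line (works with or without the leading cue number).
        timing = next((i for i, l in enumerate(lines) if "-->" in l), None)
        if timing is None:
            continue                                   # not a caption cue
        start, end = (to_seconds(p) for p in lines[timing].split("-->"))
        text = " ".join(lines[timing + 1:]).strip()
        labels.append(f"{start:.3f}\t{end:.3f}\t{text}")
    return "\n".join(labels)

if __name__ == "__main__":
    with open("transcript.srt", encoding="utf-8") as f:
        # Redirect the output to a labels.txt file and import it in Audacity.
        print(srt_to_labels(f.read()))
```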
