RE: Automating Multi-Lingual and Multi-Speaker Closed-Captioning and Transcripting Workflow with srt2vtt

I found one that seemed to work quite well, and another one that was staring us in the face. It turns out that YouTube itself generates word-by-word aligned captions!

Wow! I didn't know that!

However, the interesting thing with the YouTube word aligned captions is that they still seem slightly off, perhaps to make them more "fluid" while following the captions on screen. Another interesting issue I found was that while I could download these precise timings in a WebVTT file, if I uploaded the exact same file back to YouTube, it no longer worked correctly!

The story gets murkier by the minute. Besides the YouTube timing issue, are you sure you're not overcomplicating it yourself? ;)
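
For reference, the "English (auto-generated)" track can be pulled down as WebVTT without downloading the video itself. A minimal sketch using yt-dlp's Python API (the URL is a placeholder):

```python
# Sketch: fetch YouTube's auto-generated (word-aligned) captions as
# WebVTT, skipping the media download. Assumes yt-dlp is installed
# (pip install yt-dlp); the URL is a placeholder.
from yt_dlp import YoutubeDL

opts = {
    "skip_download": True,       # captions only, no video/audio
    "writeautomaticsub": True,   # the "English (auto-generated)" track
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
}

with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])
```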

My intended workflow, which still has to be tested:

  1. Record multitrack audio
  2. Get Speech-To-Text with time codes for each track separately
  3. Run an aligner like aeneas or Prosodylab-Aligner to increase the accuracy, i.e. get time stamps for each word (see the sketch after the list)
  4. Merge those files with srt2vtt
  5. Edit the resulting file by removing redundant words
  6. Now run the punctuator, if you haven't already punctuated manually - the preferred way
  7. Cut the audio against the time stamps that are left with videogrep
  8. Realign the timings using one of the aligners above.
  9. Create transcript with srt2vtt
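
To make step 3 concrete, here is a sketch of driving aeneas from Python (the file names are placeholders, and the transcript needs one word per line to get word-level stamps); the same call would cover the realignment in step 8:

```python
# Sketch of step 3: forced alignment with aeneas.
# Assumes aeneas is installed (pip install aeneas); file names are
# placeholders. With the transcript split one word per line, the
# resulting SRT carries one time stamp per word.
import subprocess

subprocess.run(
    [
        "python", "-m", "aeneas.tools.execute_task",
        "speaker1.wav",   # one track of the multitrack recording
        "speaker1.txt",   # that speaker's STT transcript, one word per line
        "task_language=eng|is_text_type=plain|os_task_file_format=srt",
        "speaker1.srt",   # time-coded output, ready for the srt2vtt merge
    ],
    check=True,
)
```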

So this should be it, or can you point out what I'm missing?

It's important to know the difference between Speech-To-Text and aligning words. The latter should be much easier, because you're only guessing where the next known word occurs, instead of guessing what word is being spoken or whether it is just noise.
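
Gentle makes that difference tangible: the aligner is handed both the audio and the finished transcript, so it only has to locate words it already knows. A minimal sketch, assuming a local Gentle server on its default port 8765 and placeholder file names:

```python
# Sketch: forced alignment against a locally running Gentle server
# (default port 8765). File names are placeholders; requires the
# `requests` package.
import requests

with open("speaker1.wav", "rb") as audio, open("speaker1.txt", "rb") as text:
    resp = requests.post(
        "http://localhost:8765/transcriptions?async=false",
        files={"audio": audio, "transcript": text},
    )
resp.raise_for_status()

# The aligner already knows the words; it only reports where they are.
for word in resp.json()["words"]:
    print(word.get("alignedWord"), word.get("start"), word.get("end"))
```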

Also note that when speaker IDs are inserted into the text that we should not try to align those since they are actually not there.

The story gets murkier by the minute. Besides the YouTube timing issue

Did you watch the sample YouTube video in my post? Try switching between the "English" captions and "English (auto-generated)" to see the difference in display and timing (both of which were excellent for most use cases I would envision, at least with this particular audio sample).

Also note that when speaker IDs are inserted into the text that we should not try to align those since they are actually not there.

YouTube's captioning is done before speaker IDs are added, and Gentle simply skips over them. But of course, as I explained, using word-by-word captioning may be mostly unnecessary if all audio tracks are simultaneously "corrected" as follows:

  1. Record multitrack audio
    1a. Load all tracks into a multitrack audio editor, cut noise/"utterances" over all clips simultaneously, export individual edited tracks
  2. Get Speech-To-Text with time codes for each track separately
    ...
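
On the skipping: every transcript token in Gentle's JSON output carries a `case` field, so anything it could not find in the audio (such as inserted speaker IDs) comes back marked rather than force-aligned. A small sketch, assuming `align.json` is Gentle's output:

```python
# Sketch: separate aligned words from tokens Gentle could not find in
# the audio (e.g. inserted speaker IDs). Assumes align.json holds
# Gentle's JSON output.
import json

with open("align.json") as f:
    words = json.load(f)["words"]

aligned = [w for w in words if w["case"] == "success"]
skipped = [w for w in words if w["case"] == "not-found-in-audio"]

print(f"{len(aligned)} words aligned, {len(skipped)} skipped")
```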

Sure, I agree that multitrack cleaning is the way to go, but SteemPowerPics didn't know how, even when I showed that it should be possible...

I'm not sure about the speaker IDs: if the speaker IDs appear as words/names in the transcript, how can you skip them? By marking them with something like square brackets?
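
If it is square brackets (purely an assumption on my part), stripping them from the copy that goes to the aligner could look like this:

```python
# Sketch: remove bracketed speaker IDs like "[Alice]" before handing
# the transcript to an aligner. The square-bracket convention is an
# assumption, not something the tools prescribe.
import re

transcript = "[Alice] hello there [Bob] hi, how are you?"
clean = re.sub(r"\[[^\]]*\]\s*", "", transcript)
print(clean)  # -> "hello there hi, how are you?"
```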

And, for perhaps an even better word-by-word alignment, I came across the amazing Gentle project (based on Kaldi, which may also work for speaker recognition). So I incorporated the ability to convert Gentle's "word-by-word alignment" JSON output file (that even includes the position of each phoneme!) into a WebVTT caption file.

Kind of missed that part, but the idea is to replace step 3 with whatever works for your needs, right?
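
For the record, the conversion from Gentle's word list to WebVTT boils down to something like this minimal sketch (one cue per word; srt2vtt's actual output is richer, and `align.json` is a placeholder name):

```python
# Sketch: turn Gentle's word-by-word JSON into a bare-bones WebVTT
# file, one cue per word. Skips tokens Gentle could not find in the
# audio; align.json is a placeholder name.
import json

def ts(seconds):
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    h, rest = divmod(seconds, 3600)
    m, s = divmod(rest, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

with open("align.json") as f:
    words = json.load(f)["words"]

cues = ["WEBVTT", ""]
for w in words:
    if w["case"] != "success":
        continue
    cues.append(f"{ts(w['start'])} --> {ts(w['end'])}")
    cues.append(w["word"])
    cues.append("")

with open("captions.vtt", "w") as f:
    f.write("\n".join(cues))
```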
