You are viewing a single comment's thread from:
RE: Automating Multi-Lingual and Multi-Speaker Closed-Captioning and Transcripting Workflow with srt2vtt
And, for perhaps an even better word-by-word alignment, I came across the amazing Gentle project (based on Kaldi which may also work for speaker recognition). So I incorporated the ability to convert Gentle's "word-by-word alignment" JSON output file (that even includes the position of each phoneme!) into a WebVTT caption file
Kind of missed that part but replace step 3 with anything which works according to your needs right?