Testing Machine Learning tools for optimizing the Steem experience

in #machinelearning · 7 years ago

Funny TL;DR

For the ADHD inclined

Intro

I've been very busy researching machine learning projects on GitHub and reading papers on optimizing how articles like those on Steem are written, transformed, retrieved, evaluated and analyzed.

One problem I see with Steem, for example, is finding helpful and genuine comments, so sentiment analysis would be great for that. I'd also love to see whether my efforts paid off when commenting on great projects people propose or build, inviting them to the #BeyondBitcoin #Whaletank, a startup incubator you could call it.

So some analytics would be wonderful. To do that by hand, AskSteem would need to support comments, which it should in the future, according to @theKyle when I asked.

It may look like I'm just researching exciting machine learning (ML) projects like [Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source], but let's not forget we also use some of them quite often, like punctuator, which turns, say, the raw transcription YouTube provides into proper sentences.
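As a quick taste, here's a minimal sketch of punctuating a raw transcript in Python. I'm assuming the public punctuator2 demo endpoint at bark.phon.ioc.ee is still reachable; for real workloads you'd host the model yourself:

import requests

# Restore punctuation in a raw YouTube-style transcript via the
# punctuator2 demo endpoint (assumed URL; self-host for real use).
raw = ("hey everyone welcome back to the whaletank today "
       "we talk about bitshares and the dex")
resp = requests.post("http://bark.phon.ioc.ee/punctuator", data={"text": raw})
print(resp.text)  # e.g. "Hey everyone, welcome back to the whaletank. ..."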

There are also projects like Deep Learning cleans podcast episodes from ‘ahem’ sounds (GitHub), which could greatly reduce the work of cleaning up podcasts, as shown in this crude video I recorded yesterday:

Multitrack editing with synchronization in Audacity

But this is still research and not testing : (

Thanks to @officialfuzzy we have enough audio to train such networks, but I don't have that kind of hardware (GPU rigs). Maybe #gridcoin could help out?

Where is the start of a new sentence?

Detecting when sentences start and end in a stream of talk would go a long way towards making podcasts more understandable. But when trying to improve by reading what various 'experts' think, you inevitably come across some who disagree...

What one finds in language processing is that words and sentences are there just to help written language and do not occur in speech. How do I detect the starts and ends of sentences in audio?

At least the punctuator makes a great difference, but it analyzes text, not the audio.
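Still, the two can be combined to get part of the way there: punctuate the transcript first, then map each recovered sentence back onto the per-word timestamps that the caption track already carries. A rough sketch (all words and timestamps invented):

import re

# Estimate sentence-start times by splitting the punctuated transcript
# and walking the (word, start_seconds) pairs from the caption track.
punctuated = "Hey everyone, welcome back. Today we talk about BitShares."
words = [("hey", 0.0), ("everyone", 0.4), ("welcome", 0.9), ("back", 1.3),
         ("today", 2.1), ("we", 2.5), ("talk", 2.7), ("about", 3.0),
         ("bitshares", 3.3)]

idx = 0
for sentence in re.split(r"(?<=[.!?])\s+", punctuated):
    print(f"{words[idx][1]:5.2f}s  {sentence}")
    idx += len(sentence.split())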

Looking further, another project which analyzes text coming out of Automatic Speech Recognition is Sounder:

filter(query, reserved_sub_words=None) is basically a utility provided to you to filter the stop words out of your string; for instance, "Hey Stephanie, what is the time right now?" would filter away ['hey', 'what', 'is', 'the'] since they don't hold higher meaning, leaving behind key_words like ['stephanie', 'time', 'right', 'now']

As you can see, it might help to filter out common stop words. But it is a library, meaning it's made to support your application, not be one. And it seems tailored to appliances like Google Home, Amazon Echo (Alexa) and other virtual assistants.
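To make that concrete, here's what such a filter boils down to in plain Python. This is not Sounder's actual implementation, just the same idea with a made-up stop-word list:

# Illustration of the stop-word filtering described above - not
# Sounder's real code, only the same idea in plain Python.
STOP_WORDS = {"hey", "what", "is", "the", "a", "an", "to"}

def filter_query(query, reserved_sub_words=None):
    """Drop stop words, keeping any words the caller reserves."""
    reserved = set(reserved_sub_words or [])
    tokens = query.lower().replace(",", "").replace("?", "").split()
    return [t for t in tokens if t in reserved or t not in STOP_WORDS]

print(filter_query("Hey Stephanie, what is the time right now?"))
# ['stephanie', 'time', 'right', 'now']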

Tag non-lexical utterances

We also want to make our podcasts nice and speedy for our listeners, so what is needed is a method to catch non-lexical utterances like

"um", "er", "ah", or other vocalisations for reasons that linguists are not entirely sure about.

— Yeah, coming from The Register, I'm sure they are not sure about many things. Like starting with the basic question, which you'd maybe not ask linguists but rather neuroscientists?

And then to automagically filter them out, making the podcasts shorter.

The hardest part of the search was maybe figuring out how to define those utterances. Wikipedia says they are either a (non-lexical) vocable, "an utterance that is not considered a word", or a speech disfluency:

grunts or non-lexical utterances such as "huh", "uh", "erm", "um", "well", "so", and "like"
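Once such fillers are tagged with timestamps, cutting them out is the easy part. Here's a rough sketch that turns filler spans into "keep" segments an editor like ffmpeg or Audacity could act on (the timestamps are invented; in practice they'd come from the ASR output or a forced aligner):

FILLERS = {"huh", "uh", "erm", "um", "ah", "er"}

# (word, start, end) triples from ASR / forced alignment - made up here.
words = [("so", 0.0, 0.3), ("um", 0.3, 0.9), ("today", 0.9, 1.4),
         ("we", 1.4, 1.6), ("uh", 1.6, 2.3), ("launch", 2.3, 2.9)]

keep, cursor = [], 0.0
for word, start, end in words:
    if word in FILLERS:
        if start > cursor:
            keep.append((cursor, start))  # keep audio up to the filler
        cursor = end                      # skip over the filler itself
keep.append((cursor, words[-1][2]))
print(keep)  # [(0.0, 0.3), (0.9, 1.6), (2.3, 2.9)]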

A sad thing I keep seeing is that trained models are rarely included in the projects I reviewed, maybe because they would be too big or too costly to host?

Training is done through creating dialogue matrices (one per speaker in each dialogue), whereby the format of these for each row in the matrix is:

word_idx, pos_idx, word_duration, acoustic_features..., lm_features..., label (GitHub)
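To picture one such row, here's a hypothetical example in the quoted format; every value below is invented, and the label encoding is my assumption:

import numpy as np

# One hypothetical dialogue-matrix row: word_idx, pos_idx, word_duration,
# acoustic features, LM features, label. All values invented.
word_idx = 1042               # index of "um" in some word vocabulary
pos_idx = 7                   # index of its POS tag
word_duration = 0.42          # seconds
acoustic = [0.13, -0.7, 2.1]  # e.g. pitch/energy features
lm = [-5.2, -3.8]             # e.g. language-model log-probs
label = 1                     # assumed: 1 = disfluent, 0 = fluent

row = np.array([word_idx, pos_idx, word_duration, *acoustic, *lm, label])
print(row)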

Also, there's a GUI client available from another source (thanks, DuckDuckGo, not Google!):

A multi-level annotator (part-of-speech tagging, chunking, disfluency detection) for spoken language transcriptions based on Conditional Random Fields (CRF) (GitHub)

An API for managing captions on YouTube

Furthermore, if you want to communicate swiftly with YouTube, you'll want access to their API.

A caption resource represents a YouTube caption track. A caption track is associated with exactly one YouTube video.

The API supports the following methods for captions resources:

  • list

    Retrieve a list of caption tracks that are associated with a specified video. Note that the API response does not contain the actual captions and that the captions.download method provides the ability to retrieve a caption track.

  • insert

    Upload a caption track.

  • update

    Update a caption track. When updating a caption track, you can change the track's draft status, upload a new caption file for the track, or both.

  • delete

    Delete a specified caption track.

  • download

    Download a caption track. The caption track is returned in its original format unless the request specifies a value for the tfmt parameter and in its original language unless the request specifies a value for the tlang parameter.

https://developers.google.com/youtube/v3/docs/captions
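Here's a minimal sketch of the list call with the official google-api-python-client; client_secret.json and VIDEO_ID are placeholders, and note that the captions endpoints want OAuth credentials rather than a bare API key:

from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/youtube.force-ssl"]

# client_secret.json comes from your Google Cloud console project.
flow = InstalledAppFlow.from_client_secrets_file("client_secret.json", SCOPES)
creds = flow.run_local_server(port=0)

youtube = build("youtube", "v3", credentials=creds)
response = youtube.captions().list(part="snippet", videoId="VIDEO_ID").execute()

for item in response["items"]:
    snippet = item["snippet"]
    # trackKind is "ASR" for YouTube's auto-generated caption tracks
    print(item["id"], snippet["language"], snippet["trackKind"])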


Further notable tools

Proselint

proselint, a linter for prose. (A linter is a computer program that, like a spell checker, scans through a document and analyzes it.)

website

proselint ~/Documents/test.md

~/Documents/test.md:2:853: typography.symbols.curly_quotes Use curly quotes “”, not straight quotes "". Found 24 times elsewhere.

~/Documents/test.md:4:271: typography.symbols.ellipsis '...' is an approximation, use the ellipsis symbol '…'. Found 14 times elsewhere.

~/Documents/test.md:6:115: typography.symbols.curly_quotes Use curly quotes “”, not straight quotes "".

~/Documents/test.md:8:275: leonard.exclamation.30ppm More than 30 ppm of exclamations. Keep them under control.

~/Documents/test.md:12:281: typography.symbols.ellipsis '...' is an approximation, use the ellipsis symbol '…'.

~/Documents/test.md:16:226: misc.chatspeak 'btw!' is chatspeak. Write it out.

~/Documents/test.md:16:251: typography.symbols.ellipsis '...' is an approximation, use the ellipsis symbol '…'.

~/Documents/test.md:22:85: consistency.spacing Inconsistent spacing after period (1 vs. 2 spaces).

~/Documents/test.md:34:25: after_the_deadline.redundancy Redundancy. Use 'remains' instead of 'still remains.'.

~/Documents/test.md:38:502: consistency.spacing Inconsistent spacing after period (1 vs. 2 spaces).

~/Documents/test.md:43:297: consistency.spacing Inconsistent spacing after period (1 vs. 2 spaces).

~/Documents/test.md:50:776: misc.chatspeak 'lol)' is chatspeak. Write it out.

~/Documents/test.md:52:487: consistency.spacing Inconsistent spacing after period (1 vs. 2 spaces).

~/Documents/test.md:79:312: misc.chatspeak 'lol' is chatspeak. Write it out.

~/Documents/test.md:81:309: after_the_deadline.redundancy Redundancy. Use 'same' instead of 'exact same*'.

~/Documents/test.md:126:124: misc.chatspeak 'lol' is chatspeak. Write it out.

Test document: Automating Multi-Lingual and Multi-Speaker Closed-Captioning and Transcripting Workflow with srt2vtt by alexpmorris


A cool way to use natural language in JavaScript

This example is so cool I think I can get away with just showing the code:

nlp.statement('She sells seashells').negate().text()
// She doesn't sell seashells

nlp.sentence('I fed the dog').replace('the [Noun]', 'the cat').text()
// I fed the cat

nlp.text("Tony Hawk did a kickflip").people();
// [ Person { text: 'Tony Hawk' ..} ]

But I imagine this being used not only by people wanting to learn the language, but also to extract metadata from posts. I hope AskSteem or any other analytics site will incorporate some of its ideas as well.

HackerNews

GitHub


Summarising Steem posts?

I've also been reading this very interesting paper about making summaries of multiple documents, and I'm getting very excited by its possibilities.

Multi-document Summarization:

The plethora of information available on the web has made the need for efficient automatic summarization tools more urgent over the years.

While extracting sentences to compile a summary appears to be insufficient, abstractive approaches are gradually gaining ground. However, the existing abstractive techniques rely on heavy linguistic resources, making them domain-dependent and language-dependent.
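The extractive approach the authors call insufficient is easy to picture. Here's a naive frequency-based version, my own toy sketch rather than anything from the paper:

import re
from collections import Counter

STOP = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "on", "that"}

def summarize(text, n_sentences=2):
    """Keep the sentences whose non-stop words are most frequent overall."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    freq = Counter(words)
    def score(s):
        return sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP)
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)  # keep original order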


So, sometimes when it looks like you're not doing much or not making any progress: document it.

Comments

If you need machines to run your computation, you can surely use the BOINC and Gridcoin network. I am sure there will be many people willing to crunch your WUs (work units). Look at the Anderson Attack project: volunteers were able to finish its workload in a few months (2, iirc). After fulfilling its purpose and publishing a paper, the project ceased to exist.
There is a catch, though. You must develop your application with data distribution and scaling in mind. This is proving to be a challenge for @dutch.
If you get a suitable app but can't set up a BOINC server, the Gridcoin community and project admins can help you. Or you can be like the Yafu project and run the server from a laptop.
I wish you good luck.

Great, I was just talking about it to Alex while watching a great presentation:

We need an n-gram model to improve the YouTube output to account for unknown words, names like BitShares and DEX for example. An appropriate language model is the best solution to the current transcription problem, because of the different and highly specific terms we use in our crypto space.
KenLM: kheafield.com/code/kenlm
What you do is create a LM to counter weird stuff. They call it the Tchaikovsky problem.

Training time on a TitanX ≈ 30 days. We can feed it all the Mumble talks from the past years; it needs about 10,000 hours of training. If we feed it a good language model by leveraging Steem, BitsharesTalk and BitcoinTalk, then we'll have a better engine than YouTube. We could use the Gridcoin network for this and have it in less than a week.
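For reference, using KenLM is roughly a two-step affair: build an ARPA model with the lmplz binary, then score candidate transcriptions from Python. Corpus name and hypotheses below are made up:

# Shell step first, with KenLM's lmplz binary (placeholder corpus path):
#   bin/lmplz -o 5 < crypto_corpus.txt > crypto.arpa
import kenlm

# Rescore ASR hypotheses with the domain language model, so that
# crypto-specific wordings can win over generic mishearings.
model = kenlm.Model("crypto.arpa")
for hyp in ["we traded on the decks", "we traded on the dex"]:
    print(model.score(hyp, bos=True, eos=True), hyp)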

You should contact the admin of Citizen Science Grid. They are experimenting with neural learning on BOINC. You will soon run into a problem: if it takes 30 days on a TitanX (Nvidia), are you able to split the task across hundreds of computers?

Thanks, I have no idea, well there are also the #gridcoin people I could ask...

that could possibly be a solution, if you really wanted to give it a try ... https://boinc.berkeley.edu/trac/wiki/PythonApps

Thanks Alex, but BOINC is a volunteer project, so you mean to add a new project? I would have thought people already had a speech-to-text machine learning task running on BOINC, but I don't see it in the list of projects...

I was just saying it should be possible. I believe though (if I recall correctly) new projects would have to be proposed to the community, and you have to get people interested in dedicating time to your project.

However, this may be a better question to direct to @tomasbrod, since he originally suggested boinc/gridcoin at the top of this thread!

who's that guy speaking in the tutorial video?! he's gotta stop saying ummm so many times?! gotta run that s**t through some AI! lol

LOL! Alex you're the man! XD

Very nice piece of information, but I still have a long way to go to understand such complicated things :)
Good luck everybody!!
