Bot that detects spam in comments (using a Multinomial Naive Bayes filter)

I created a bot whose purpose is to detect spam comments on the Steem blockchain. It uses the Multinomial Naive Bayes algorithm and can reply to a spam comment or downvote it. I built it for the #polish community, but it can be adapted to any tag (or all tags); that only depends on the training file.

Github repository

Log from console:


In action:



$ POSTING_KEY=<posting_key> config.json

The private posting key is stored as an environment variable.


All parameters are stored in the config.json file.

  • account: account used by the bot
  • nodes: list of Steem nodes
  • tags: tags which are observed
  • probability_threshold: threshold to classify a comment as spam
  • training_file: input training file
  • reply_mode: 0 - without reply, 1 - with reply
  • vote_mode: 0 - without vote, 1 - with vote
  • vote_weight: weight of the vote, from the range [-100.0, 100.0]
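To make the parameters more concrete, here is a minimal sketch of how such a config could be loaded and turned into a decision. This is not the code from the repository: only the key names come from the list above, while all values, the function names load_config and planned_actions, and the choice of ">=" for the threshold comparison are assumptions made for illustration.

```python
import json
import os

# Hypothetical example values; only the key names come from the parameter list.
EXAMPLE_CONFIG = """
{
  "account": "example-bot",
  "nodes": ["https://node.example"],
  "tags": ["polish"],
  "probability_threshold": 0.8,
  "training_file": "training.txt",
  "reply_mode": 1,
  "vote_mode": 0,
  "vote_weight": -10.0
}
"""

REQUIRED_KEYS = {
    "account", "nodes", "tags", "probability_threshold",
    "training_file", "reply_mode", "vote_mode", "vote_weight",
}


def load_config(text):
    """Parse the JSON text and check the documented value ranges."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError("missing config keys: %s" % sorted(missing))
    if not -100.0 <= config["vote_weight"] <= 100.0:
        raise ValueError("vote_weight must be within [-100.0, 100.0]")
    for mode in ("reply_mode", "vote_mode"):
        if config[mode] not in (0, 1):
            raise ValueError("%s must be 0 or 1" % mode)
    return config


def planned_actions(config, spam_probability):
    """Return which actions would be taken for a comment.

    Assumption: a comment counts as spam when its probability is greater
    than or equal to probability_threshold.
    """
    is_spam = spam_probability >= config["probability_threshold"]
    return {
        "reply": is_spam and config["reply_mode"] == 1,
        "vote": is_spam and config["vote_mode"] == 1,
    }


# The posting key is read from the environment, never from config.json.
posting_key = os.environ.get("POSTING_KEY")  # None when the variable is not set

config = load_config(EXAMPLE_CONFIG)
print(planned_actions(config, 0.95))
```

With these example values the bot would only reply to spam; setting vote_mode to 1 would additionally cast a vote with the configured vote_weight (a negative weight meaning a downvote).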

The training file contains rows labelled ham or spam, as shown below:

ham For years I am interested in the Anglo Boer war and their fight against the Commonwealth. W. Churchill's military career was full of hits and misses but he gained huge popularity in England after the war. Great post once again.
ham The development of a language can be a very boring topic, but sometimes it can also be exciting. While I was never interested in the development of the English language, I often liked to learn some facts about the development of the German language (maybe since it's my native language). It's nice to see that German has influenced such a young language. Maybe I'd even be able to understand some Afrikaans and maybe it's even worth a try to learn it.
ham This country is a cultural boiling pot like none, so many interesting stories ; struggle for independence, new discoveries, many different influences makes it a unique place , so also the language carries all this exciting details in it. I will definitely start learning Afrikaans .. one day ;)
ham I am South African born. I also grew up in an Afrikaanse town. Despite my background and being able to speak the language, I was not aware of the history. Probably should have paid more attention in school, lol. Thanks for the lesson!
ham I thought as much. So, you've been in this game since April 2016. I must be joking to think I've arrived when I haven't even started. I'm throwing away my white collar to settle for this. Should you need a minnow to mentor, please, let me be the number one to be considered.
ham Great interview, guys! We haven't met in person yet, but I'd say Tom's one of the most grounded people I've ever come to know. I love your focus and straightforwardness, combined with your endless generousity. For many reasons you're one of the most successful but especially most admirable people on this platform.
spam    Upvote, follow, resteem
spam    Followed and resteemed
spam    Great photo
spam    Follow me I will follow you
spam    Hey Beautyfull love your blog... UPVOTED and RESTEEMED
spam    Hi there, i RESTEEMED & UPVOTED for you! Have a nice day.
spam    Cool pic bro
spam    Thank you for sharing i will resteem it
spam    good post
spam    Super
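To show what the classifier does with such rows, below is a rough pure-Python sketch of Multinomial Naive Bayes with Laplace smoothing. The bot itself relies on scikit-learn, so this is not the repository's code. The function names, the simple regex tokenizer, and the smoothing value alpha=1.0 are assumptions for illustration, and the sample uses only a handful of rows from the example above.

```python
import math
import re
from collections import Counter

# A few rows copied from the training file example above.
SAMPLE_ROWS = [
    "ham I am South African born. I also grew up in an Afrikaanse town. "
    "Despite my background and being able to speak the language, I was not "
    "aware of the history. Probably should have paid more attention in "
    "school, lol. Thanks for the lesson!",
    "spam    Upvote, follow, resteem",
    "spam    Followed and resteemed",
    "spam    Follow me I will follow you",
    "spam    good post",
]


def parse_training_rows(lines):
    """Split each row into (label, text) at the first run of whitespace."""
    rows = []
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0] in ("ham", "spam"):
            rows.append((parts[0], parts[1]))
    return rows


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def train(rows):
    """Count documents and words per label."""
    doc_counts = Counter()
    word_counts = {"ham": Counter(), "spam": Counter()}
    for label, text in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    if not doc_counts["ham"] or not doc_counts["spam"]:
        raise ValueError("training data needs both ham and spam rows")
    vocab = set(word_counts["ham"]) | set(word_counts["spam"])
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}


def spam_probability(model, text, alpha=1.0):
    """Return P(spam | text); words never seen in training are ignored."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for label in ("ham", "spam"):
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # Log prior plus the smoothed log likelihood of every known token.
        score = math.log(model["doc_counts"][label] / total_docs)
        for token in tokenize(text):
            if token in model["vocab"]:
                score += math.log(
                    (counts[token] + alpha) / (total_words + alpha * vocab_size)
                )
        log_scores[label] = score
    # Normalise in a numerically stable way (subtract the largest log score).
    peak = max(log_scores.values())
    weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
    return weights["spam"] / (weights["ham"] + weights["spam"])


model = train(parse_training_rows(SAMPLE_ROWS))
print(spam_probability(model, "Upvote, follow, resteem"))
print(spam_probability(model, "Thanks for the lesson about the history"))
```

The first comment should score above the second one. The resulting probability is the value that would be compared with probability_threshold from config.json. With a training set this small the numbers are only illustrative.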

Technology Stack

  • Python 3.6
  • libraries: steem-python, scikit-learn, pandas, textblob, bs4

The repository contains a requirements.txt file.


There is still a lot that can be done:

  • enlarging the training set
  • adding new algorithms such as a neural network or a support vector machine
  • taking into account previous comments, not only the current one
  • taking into account user reputation
  • adding a blacklist / whitelist

That's awesome. You might want to set aside a percentage of the data as testing data, specifically data not used to train the filter. This way you can get an idea of how accurate the filter is.

Nice work. This really should be part of the core Condenser platform...

Thank you for the contribution. It has been approved. The problem I see here is that you need to keep updating your training file. Shouldn't it instead check the user's past comments to see if that user is spamming?

