I updated a bot whose purpose is to detect spam comments on the Steem blockchain. It uses the Multinomial Naive Bayes algorithm combined with SVM (model stacking). It can reply to a spam comment and downvote it. I built it for the #polish community, but it can be adapted to any tag (or all tags); it is just a matter of the training file.
I have stacked 4 algorithms: Multinomial Naive Bayes and 3 variants of SVM.
    self.model = StackedModel([
        MultinomialNB(),
        SVC(kernel='linear', C=C, probability=True),
        SVC(kernel='rbf', gamma=0.7, C=C, probability=True),
        NuSVC(probability=True),
    ])
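The `StackedModel` class itself is not shown in this post. As a rough sketch of how such a wrapper could work, the version below fits every base model on the same data and averages their `predict_proba` outputs. This is soft voting, a simpler relative of stacking with a meta-learner, and it is swapped in here only for illustration. The class body and the `FixedModel` stand-in are assumptions, not the bot's actual code.

```python
class StackedModel:
    """Hypothetical wrapper: averages class probabilities of several models."""

    def __init__(self, models):
        self.models = list(models)

    def fit(self, X, y):
        # Every base model is trained on the same features and labels.
        for model in self.models:
            model.fit(X, y)
        return self

    def predict_proba(self, X):
        # Collect one probability table per model, then average cell by cell.
        tables = [model.predict_proba(X) for model in self.models]
        n_models = len(tables)
        return [
            [sum(cells) / n_models for cells in zip(*rows)]
            for rows in zip(*tables)
        ]


class FixedModel:
    """Stand-in for a scikit-learn classifier, used only for this demo."""

    def __init__(self, spam_probability):
        self.spam_probability = spam_probability

    def fit(self, X, y):
        return self

    def predict_proba(self, X):
        # Columns are [P(ham), P(spam)] for every input row.
        return [[1 - self.spam_probability, self.spam_probability] for _ in X]


stacked = StackedModel([FixedModel(0.9), FixedModel(0.7), FixedModel(0.8)])
stacked.fit(["nice post"], ["spam"])
probabilities = stacked.predict_proba(["nice post"])
```

Because the wrapper only relies on `fit` and `predict_proba`, scikit-learn classifiers such as `MultinomialNB()` or `SVC(probability=True)` could be passed in place of `FixedModel`.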
To check the accuracy, I calculated a confusion matrix for each algorithm.
    MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
    Confusion matrix:
    [[65  1]
     [ 0 45]]

    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
      max_iter=-1, probability=True, random_state=None, shrinking=True,
      tol=0.001, verbose=False)
    Confusion matrix:
    [[65  1]
     [ 0 45]]

    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma=0.7, kernel='rbf',
      max_iter=-1, probability=True, random_state=None, shrinking=True,
      tol=0.001, verbose=False)
    Confusion matrix:
    [[66  0]
     [ 0 45]]

    NuSVC(cache_size=200, class_weight=None, coef0=0.0,
       decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
       max_iter=-1, nu=0.5, probability=True, random_state=None,
       shrinking=True, tol=0.001, verbose=False)
    Confusion matrix:
    [[65  1]
     [ 1 44]]
The confusion matrix for the stacked model looks as follows.
    Stacked model
    Confusion matrix:
    [[65  1]
     [ 0 45]]
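To make these matrices easier to compare, the sketch below derives accuracy, precision, and recall from the stacked model's matrix. It assumes scikit-learn's layout (rows are true labels, columns are predictions) and that the second class is spam; the post does not state the label order, so treat both as assumptions.

```python
# Stacked model confusion matrix, copied from the log above.
matrix = [[65, 1],
          [0, 45]]

# Assumed layout: matrix[true][predicted], class 0 = ham, class 1 = spam.
true_ham, false_spam = matrix[0]
missed_spam, true_spam = matrix[1]

total = true_ham + false_spam + missed_spam + true_spam
accuracy = (true_ham + true_spam) / total
precision = true_spam / (true_spam + false_spam)  # flagged comments that are spam
recall = true_spam / (true_spam + missed_spam)    # spam comments that were caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Under these assumptions, 110 of the 111 test comments are classified correctly, and the single error is a ham comment flagged as spam.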
As you can see, the results are similar for each algorithm on its own and for the stacked model. You have to experiment a bit here to find the best combination. The results will probably change slightly as the data set grows.
The bot checks not only the current comment but also the user's previous comments. I think a single "nice photo" comment is OK, but if a user posts this type of comment all the time, it is considered spam.
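The post does not show how previous comments enter the decision. One possible rule, sketched here purely as an assumption, is to average the spam probability of the current comment with that of the last `num_previous_comments` comments and compare the result against `probability_threshold`. The function name and the averaging rule are hypothetical.

```python
def is_spam(current_probability, previous_probabilities,
            probability_threshold, num_previous_comments):
    """Hypothetical rule: average the spam probability over recent comments."""
    # Only the most recent comments are considered (newest first is assumed).
    recent = list(previous_probabilities)[:num_previous_comments]
    scores = [current_probability] + recent
    return sum(scores) / len(scores) >= probability_threshold


# A single generic comment from a user with a normal history is let through.
occasional = is_spam(0.95, [0.1, 0.2, 0.1], 0.8, 5)

# The same comment from a user who posts such comments all the time is flagged.
habitual = is_spam(0.95, [0.9, 0.97, 0.92], 0.8, 5)
```

Averaging keeps one-off short comments from being punished while still catching accounts whose whole history looks generic.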
The bot also pays attention to repeated, generic comments, and even to scams (if the user is on the scamlist).
$ POSTING_KEY=<posting_key> spam_detector.py config.json
The private posting key is stored as an environment variable.
All parameters are stored in the config.json file.
| Parameter | Description |
|---|---|
| account | account used by the bot |
| nodes | list of Steem nodes |
| tags | tags which are observed |
| probability_threshold | threshold to classify a comment as spam |
| training_file | input training file |
| blacklist_file | file containing the blacklist |
| whitelist_file | file containing the whitelist |
| scamlist_file | file containing users who post scams |
| reply_mode | 0 - without reply, 1 - with reply |
| vote_mode | 0 - without vote, 1 - with vote |
| vote_weight | weight of the vote from the range [-100.0, 100.0] |
| num_previous_comments | number of user comments that are investigated |
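As an illustration of how these parameters might fit together, here is a minimal sketch that parses a configuration and reads the posting key from the environment. Only the key names come from the table; every value, the `load_config` helper, and its range checks are assumptions made for this example.

```python
import json
import os

# Placeholder values; only the key names come from the table above.
EXAMPLE_CONFIG = """
{
  "account": "example-bot",
  "nodes": ["https://node.example"],
  "tags": ["polish"],
  "probability_threshold": 0.8,
  "training_file": "training.txt",
  "blacklist_file": "blacklist.txt",
  "whitelist_file": "whitelist.txt",
  "scamlist_file": "scamlist.txt",
  "reply_mode": 1,
  "vote_mode": 1,
  "vote_weight": -10.0,
  "num_previous_comments": 5
}
"""


def load_config(text):
    """Parse the JSON text and sanity-check the documented value ranges."""
    config = json.loads(text)
    if not -100.0 <= config["vote_weight"] <= 100.0:
        raise ValueError("vote_weight must be within [-100.0, 100.0]")
    if config["reply_mode"] not in (0, 1) or config["vote_mode"] not in (0, 1):
        raise ValueError("reply_mode and vote_mode must be 0 or 1")
    return config


config = load_config(EXAMPLE_CONFIG)
# The posting key comes from the POSTING_KEY environment variable.
posting_key = os.environ.get("POSTING_KEY")
```

Validating the ranges at load time means a typo in config.json fails at startup instead of in the middle of a vote.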
The training file contains rows labeled ham or spam, like below:
    ham Wow. Even though I was well aware of Churchill's later career, I actually didn't know he was here during the Anglo Boer war, let alone as a prisoner of war. Thank you for a very interesting and informative post!
    ham Yea this post isn't really about fixing all the problems on Steem - it's just that there always seems to be a lot of drama over the trending page, and i think it's a bad thing for new people coming to the site to see first, so just throwing out the idea of getting rid of it for now.
    ham Yea, I believe there was something about notifications in one of the SteemIt, Inc roadmaps but don't quote me on that. Notifications are really important though, can't expect everyone to use
    ham Yeah, I may have to sit down & do a post or two myself! It’s fun to imagine! Other than promoted posts, I do think we should have advertising, albeit in a very user focused & friendly way.
    ham Yes I agree. My suggestion was based on how things actually are currently which as you said is not representative of the best posts. I don’t believe that is going to change any time soon, if ever, so in the mean time I think it would be better to just get rid of that page.
    ham Yes! This thought never occurred to me before, but your idea is perfect!! I think it would help underpaid content creators be noticed. Better yet, don't sort people based on potential payout. Create an algorithm that sorts out such things as grammatical and spelling errors, "articles" that are too short, authors that post 10 times per day, copy/paste content, ect. and only the highest quality bloggers would make it to the top...
    ham Yes, there are only a few flagging because majority is scared. He has already ruined many people's accounts and reps and flagged all of their posts to $0.00 for voicing opinons. People disagree with the rewards of his posts. You are well aware of haejin's 10-12 posts per day reaching an easy $350 per post every time. I don't think anyone is against his predictions in the sense that anyone is able to use common sense and choose if they invest or not based on his predictions. I have not seen any whales helping recover these people's accounts for flagging him. Perhaps this is not an unjustified flag war? I have sacrificed my entire blog and all earnings for six weeks to try and lower the rewards. I am not scared of the consequences as I know what they are. People are scared though so I think if a lot of users delegate a small portion of their Steem power to one of these accounts then the rewards can be lower substantially. I also feel that it would be a more organized approach at flagging him as it will be a scheduled downvote of 10 posts every evening. I feel that if enough people make the delegation's he will be unable to flag every user that delegated down to $0.00 as he would have to use all of his power flagging instead of upvoting himself. You can count on support from whales to resolve unjustified flag wars, if you feel like post are more over-valued than the majority of Steem content then flag them and don't be scared of reprisals.
    ham you are right. As it is now, he's spending a tonne of his vote power flagging anyone who disagrees with his rewards. He cannot flag everyone it would cut into his profits, as his vote power drains to 0. If rancho comes in and starts flagging too, then they are making even less money because now he's wasting his vote power by flagging instead of upvoting the 10 posts a day that he has to.
    ham You know.. I delegated what little SP i can afford exactly because you took the risk. Now if he did wanna go all out flag, he'd had to waste his vp on both you and me. if enough people did it we can even go against the biggest abusers too.
    ham Your concept is very solid, it might seem hard to implement in the start but I know that if you keep at it you will reach your goal!I cannot wait to start using your system!
    spam i follow you
    spam Upvote, follow, resteem
    spam UPVOTED
    spam UPVOTED & RESTEEMED
    spam Upvoted and followed you back
    spam UPVOTED RESTEEMED
    spam very funny
    spam very nice
    spam Write Link, send 0.100 sbd. 3000+ followers can see you (resteem)
    spam Yes very nice post.
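Rows in this shape can be split into labels and texts with a few lines of Python. The sketch below assumes that every row starts with the label, followed by whitespace and then the comment text; the exact separator used in the real training file is not stated in this post, and `parse_training_rows` is a hypothetical helper.

```python
VALID_LABELS = {"ham", "spam"}


def parse_training_rows(lines):
    """Split each row into a label and a text; the first token is the label."""
    labels, texts = [], []
    for line in lines:
        # Split on the first run of whitespace (space or tab).
        parts = line.strip().split(None, 1)
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2 or parts[0] not in VALID_LABELS:
            raise ValueError(f"unexpected row: {line!r}")
        labels.append(parts[0])
        texts.append(parts[1])
    return labels, texts


sample = [
    # First row is shortened from the sample above; the others are verbatim.
    "ham Yes I agree. My suggestion was based on how things actually are currently",
    "spam i follow you",
    "spam UPVOTED & RESTEEMED",
]
labels, texts = parse_training_rows(sample)
```

The resulting `texts` and `labels` lists could then be vectorized (for example with scikit-learn's `CountVectorizer`) and passed to the classifiers, though the post does not say which vectorizer the bot uses.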
- libraries: steem-python, scikit-learn, pandas, textblob, bs4
The repository contains a requirements.txt file.
- enlarging the training set
- adding new algorithm such as Support Vector Machine
- taking into account previous comments, not only current one
- adding to blacklist / whitelist
- taking into account user reputation
- tuning parameters in existing algorithms
- adding new algorithm such as Neural Network and maybe Random Forest
- enlarging the training set (again)
Posted on Utopian.io - Rewarding Open Source Contributors