Steem Sincerity - Improved Anti-Spam API

in #steemdev · 6 years ago (edited)



Steem Sincerity is a project aimed at helping to address the spam problem we have on Steem.

As I explained in my introductory post, there are three aspects to this. This post discusses the most important one in more detail.

Public API for Developers

This is a service hosted on my server(s), which can be queried by any front-end website or app to obtain information about Steem accounts. It uses a database which stores the last 7 days' worth of posts, comments and votes.

Periodically the software extracts meta-data (data about the data) from these accounts, and much of this can be easily accessed by application developers using the methods here. The meta-data for each account is also fed into a kind of artificial intelligence software which looks at how it compares to other known spamming and bot accounts, so it can 'classify' each active account.

What is classification?

In machine learning, classification is an approach in which the computer program learns from the data input given to it and then uses this learning to classify new observations. So from our perspective, we first 'train' the classifier by giving it three lists of Steem accounts that have been manually classified as either Human Content Creator, Spammer or Bot.

It is programmed to extract the relevant meta-data - or what are called features in machine learning - from these accounts. Some of the many features used in the Steem Sincerity software are: number of comments, number of posts, average number of downvoted comments, average word length etc. It looks at how these features vary between the different classes of account, and makes rules for itself to use when deciding how to classify accounts that it hasn't seen before.
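To make the idea of features more concrete, here is a minimal Python sketch that computes a few of the features named above from an account's recent activity. The `Comment` record, its field names and the exact feature definitions (for example, treating "downvoted comments" as the fraction of comments with at least one downvote) are assumptions made for illustration; the real Sincerity feature extraction is not shown in this post.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Comment:
    # Hypothetical record; these field names are illustrative, not the real schema.
    body: str
    is_post: bool   # True for a top-level post, False for a reply
    downvotes: int  # number of downvotes (flags) received


def extract_features(items: list[Comment]) -> dict[str, float]:
    """Compute a small feature vector for one account's recent activity."""
    posts = [c for c in items if c.is_post]
    comments = [c for c in items if not c.is_post]
    words = [w for c in items for w in c.body.split()]
    return {
        "num_posts": float(len(posts)),
        "num_comments": float(len(comments)),
        # Assumed definition: fraction of comments with at least one downvote.
        "downvoted_comment_rate": (
            mean(1.0 if c.downvotes > 0 else 0.0 for c in comments)
            if comments else 0.0
        ),
        # Average length, in characters, of whitespace-separated words.
        "avg_word_length": mean(len(w) for w in words) if words else 0.0,
    }
```

A classifier would receive one such feature vector per account, together with the manually assigned class for the accounts in the training lists.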

The classifier has currently been trained using only around 30 accounts of each type, and has a cross-validation accuracy of around 78%, and very little non-spam is classified as spam. Cross-validation is a standard technique for evaluating the accuracy of a classifier, but of course what constitutes spam is highly personal, so inevitably my preferences will have introduced biases. A larger crowdsourced training set is planned to reduce this bias in the near future.
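For readers unfamiliar with cross-validation, the sketch below shows the general idea in plain Python: the labelled accounts are split into k folds, each fold is held out in turn while the classifier is trained on the rest, and the accuracy is measured over all held-out predictions. The post does not say which algorithm Sincerity uses, so a simple nearest-centroid classifier stands in for it here; every name in the block is hypothetical.

```python
import math
from collections import defaultdict

Sample = tuple[list[float], str]  # (feature vector, class label)


def train_centroids(samples: list[Sample]) -> dict[str, list[float]]:
    """'Train' by averaging the feature vectors of each class."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids: dict[str, list[float]], features: list[float]) -> str:
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


def cross_val_accuracy(samples: list[Sample], k: int = 5) -> float:
    """Hold out each of k folds in turn, train on the rest, and pool the results."""
    correct = 0
    for fold in range(k):
        held_out = samples[fold::k]  # every k-th sample, starting at `fold`
        training = [s for i, s in enumerate(samples) if i % k != fold]
        centroids = train_centroids(training)
        correct += sum(predict(centroids, x) == y for x, y in held_out)
    return correct / len(samples)
```

The samples should be shuffled before calling `cross_val_accuracy`, so that each fold contains a mix of classes. With only around 30 accounts per class, each fold is small, which is one reason an accuracy figure like this should be treated as approximate.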

Rather than making a direct prediction about whether an account belongs to a spammer, the API actually returns the probabilities of the account belonging to each of the three classes. For example, an account may show the following classification scores:

Human Content Creator: 45%
Spammer: 45%
Bot: 10%

Each front-end using the API can make its own decision about what should happen at different spam thresholds. For example, it could fade the comment if the spammer score is between 40% and 70%, and hide it altogether if the score exceeds 70%. It could even leave this up to the user to decide.
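As a rough sketch of the threshold logic just described, a front-end could map the spammer score to a display action like this. The function name and the assumption that the score arrives as a percentage between 0 and 100 are mine; check the API specification for the actual scale before using anything like it.

```python
def display_action(spammer_score: float,
                   fade_at: float = 40.0,
                   hide_at: float = 70.0) -> str:
    """Map a spammer score (assumed to be a percentage, 0-100) to an action.

    The defaults mirror the example thresholds in the post: fade from 40%
    up to and including 70%, hide above 70%. Both are configurable, so an
    app could expose them as user preferences.
    """
    if not 0.0 <= spammer_score <= 100.0:
        raise ValueError("spammer_score must be between 0 and 100")
    if spammer_score > hide_at:
        return "hide"
    if spammer_score >= fade_at:
        return "fade"
    return "show"
```

With the example scores above (Spammer: 45%), `display_action(45.0)` returns `"fade"`.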



This is a very simple illustration of how accounts with comments containing certain combinations of features may be classified as spam. The red dots represent real spamming accounts, and the pink area shows the accounts which are classified as spammers. The accuracy is not perfect, but good enough to be useful. In practice the machine learning algorithm used by the Sincerity software uses far too many features to be able to show in a two-dimensional diagram.

API Specifications

If you are a developer, you can find the API specification here. There are currently 10 methods, and since the main intention is to help improve front-end user experiences, performance is prioritised over having larger amounts of historical data. Currently no API keys are required, and request rate limiting is fairly relaxed, but this may need to change depending on future demand.

Main Methods

/api/accounts-info/account1,account2,account3

This expects a comma-separated list of accounts, and returns various useful meta-data about them. This includes the probability that each account is a Human Content Creator, a Spammer or a Bot. It also includes some metrics about the commenting and voting behaviour of the accounts. Note that only accounts which have commented in the last period will have records in the database. Because up to 100 accounts can be queried at a time, this is the most useful method for hiding or changing the appearance of spam in your application.
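Since a busy comment section can involve more than 100 distinct accounts, a client may need to split its lookups into batches. The sketch below only builds the request URLs and makes no network calls. `BASE_URL` is a placeholder, not the real host (which is given in the API specification), and the helper name is hypothetical.

```python
from urllib.parse import quote

BASE_URL = "https://sincerity.example.invalid"  # placeholder host, NOT the real one
MAX_ACCOUNTS_PER_REQUEST = 100  # limit stated in the post


def accounts_info_urls(accounts: list[str], base_url: str = BASE_URL) -> list[str]:
    """Build one accounts-info URL per batch of up to 100 unique account names."""
    # Strip whitespace, drop empty names, and de-duplicate while keeping order.
    unique = list(dict.fromkeys(a.strip() for a in accounts if a.strip()))
    urls = []
    for start in range(0, len(unique), MAX_ACCOUNTS_PER_REQUEST):
        batch = unique[start:start + MAX_ACCOUNTS_PER_REQUEST]
        names = ",".join(quote(name) for name in batch)
        urls.append(f"{base_url}/api/accounts-info/{names}")
    return urls
```

For example, 250 distinct commenters would produce three URLs covering 100, 100 and 50 accounts. Accounts missing from the response can be treated as unclassified, since only accounts that commented recently have records.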

/account-full-info/account1

This returns the complete analysis information that is held for the specified account. There are many fields, a few of which are unused. You may want to query this when an account profile is clicked, for example.

/account-comments/account1

Returns a time-sorted list of the comments made by the specified account in the last 7 days.

/account-outgoing-votes/account1

Returns a time-sorted list of the votes made by the specified account in the last 7 days.

/account-outgoing-downvotes/account1

Returns a time-sorted list of the flags given by the specified account in the last 7 days.

/account-apps-used/account1

Returns the list of apps the specified user has used to post and comment in the last 7 days.

/biggest-spammers/

Returns the 500 accounts most likely to be spamming accounts. This may be useful for stakeholders employing bots to clean up the platform.

There are a few other methods, and I will add more over time.


I'll be improving the Chrome Sincerity extension soon, to use some of these new methods.

If you have other requirements for a different API method or need to apply machine learning to different data, I'd be delighted to work for STEEM ;)


This post was funded/promoted by @DevFund using a budget of about 360.00 USD on voting bots.

100% of the money sent or earned via upvotes to this account will be powered down and used to give back via promotion bots to Steem ecosystem development initiatives like this one.

https://steemit.com/@devfund/comments

thanks @andybets this will make the steemit community a better place to be . :)

Hi, awesome work! Would you also like to have users input on this?
I am thinking about using this on SteemPlus extension (currently about 1600 active users) and could code something to report spammers / bots to your API if you want to take human feedback into account. You can contact me on Steem.chat/Discord if you're interested.
EDIT: self voting for visibility

That'd be excellent! I was thinking about the possibility of adding that to my very simple extension, but since yours is much better than I could do, and you have lots of active users, it makes great sense. I'll be in touch. :)

Great! Waiting for your message then.

This API sounds awesome!

Maybe MB will use this in the coming days to detect abuse ;)

Hey reggae, did you notice @art-universe made a painting of you?

here's the link to the original post if you wanna go check it out.

definitely noticed :)

What are the use cases of these ? Is it like people can see and upvote or flag accordingly ? or is this meant for @steemcleaners?

It has many uses which app developers will decide, but one is that it can be used for re-rendering comment sections in front-ends to hide spam.

@steemreports will shortly have some tools to display this info for end-users.

One men one account ?

@andybets great idea, but for many people (like me) it could also mean less visibility. For some reason, I was human before but now I am identified as a spammer (which is pretty weird as I haven't been active in the last couple of days) and there's really not much you can do about it.

Sorry for this inaccuracy. It is clear to me you're not a spammer, so I've added your account name to the training data. When the next version is released your scores should improve.

Thank you so much! I was also wondering how the "personal voting option" for the steemplus extension plays a role in it? Like, how much is the voice of a personal voter weighted against the api?

The data from SteemPlus is used to help form the training data that informs the API what spamming and bot accounts look like, so it can make estimates about the thousands of other accounts that it isn't given a classification for. There are various other data sources as well as SteemPlus though.

Oh wow, great! Thanks for making that clear!

All I can say is: wow this is freakin cool! I am going to add this to my list of things to integrate into the post promoter voting bot software!

Great! Let me know if you would like any changes on my side.

Hi @andybets! Although I'm very excited about the API, as a frontend user my perspective is userish: I would rather prefer it "onload" than "onclicked".
chrome.browserAction.onClicked.addListener
If a user installs the extension, she wants it 2 b active by default. Correct me if i'm wrong.

Thanks for the input. I actually asked about this issue here (or maybe that's where you saw it?):
https://github.com/andybets/steem-sincerity-chrome-extension/issues/1

I think I now understand how this should work, and will start working on the next version of the Chrome extension soon. I think I may not even need the background page, but am very new to Chrome development.

Saw it now, sorry, wasn't aware of ur awareness:)
2 your concern of load on the API, i think load is the first indicator of success and worth thinking about. like some sort of incentive 4 users 2 share their comp's resources... but as i'm diving deeper, it bcomes clear 2me, that i'm trying 2 reinvent steem and that job is already done, pretty fucking well.
If i can help u by my old cpu/hd/bandwidth and even 4 redundancyz sake, i'll gladly do.

Hi I'm confused why it said I was 60% spam? All my post are encouragement and from the heart? Is there something I don't know?

Hi, your account @cliffpower is not classified as spam:
http://steemreports.com/sincerity-accounts-info/?accounts=cliffpower

...do you have another that you are referring to?

What about the guy who doesn't pay, is there something I can do? I'm new at steem since January and still figuring this all out. Now we have spam police who just seem to steal your money. @buildawhale and @smartsteem did the same thing to me? do you have any advice :) @smartsteem

THANK YOU, I just want to be a good player :) I'm one man one account.

to remove the spam we need to pay a moderator to eliminate spam accounts but that would no longer be unraveled

Hi, i've just found Steem Sincerity in SteemPlus, I've been using it for 2 days now. This is a great tool i think. But i have a question. How can it be calculated? One of my friend is a newbie steemian, @zitus. She made only some posts but she is considered as a 38.14% human, 34.97% spammer and 26.89% bot. And me, as 58.40% human, 40.00% spammer(!!!)and 1.60% bot. Well, i frequently use the same phrases, like "dear Steemies, today is orange, TuesdayOrange" (and other colors for each days of the week)because it's more comfortable for me than formulating different English sentences. That's a hard effort for me because my English is not so good. And recently i made much more posts, 6-7 a day (but they were all good quality posts) Other question: does it count, that i use upvote bots, 3 times daily?

Hi, these scores are indications or probabilities which application developers can use in their interfaces for excluding or penalising accounts considered to be spammers. Many will not take any action until the spammer score is above 70%, so you don't need to worry about this. New accounts have baseline probabilities, which are 40% human, 30% spammer and 30% bot, and as you interact with the platform they are re-evaluated.

Here you can see your current scores:
http://steemreports.com/sincerity-accounts-info/?accounts=kalemandra%2C+zitus

Only accounts in the 'Spammer' triangle may be penalised by app developers, if they're using the Sincerity API.
