Visualising Spam Scores

personz (66) 8 years ago

This is cool, and love the graph choice.

Is the code publicly available for the classification process?

steemreports (64) 8 years ago

Thanks.

No. I've been considering whether to open source it, and there are two issues:

With prices currently lower, the community rewards, significant though they have been, haven't come close to covering my learning/development time yet.
If this code were made public, it would let organised spammers circumvent adverse classification much more easily, and undermine the effort.

I think this issue is similar to some we've discussed in the past, with no obvious perfect solutions.

$0.00

personz (66) 8 years ago

Sure, and my position remains the same: I can't trust a metric I can't inspect. So while I know you have the best of intentions, you become the point of trust for your algorithm instead of the algorithm standing objectively on its own.

A compromise could be to allow someone you trust to inspect it, someone with some credibility publicly, and they can state their findings without revealing the algorithm. If you had a few of those from people I knew to be competent and honest it would significantly raise my trust in it.

Just something to think about in the interest of claiming these metrics have any meaning.

$0.00

steemreports (64) 8 years ago (edited)

I suppose the community can look at the results from the algorithm to assess whether it has any meaning, and yes, trust me if they think I deserve it. That said, I wouldn't rule out what you have mentioned.

Perhaps you could send me a list of people you know to be competent and honest? ;)

$0.00

personz (66) 8 years ago

The results won't be enough to test that unless you had good knowledge of the entire ecosystem, i.e. what did it leave out? Sounds expensive to verify blindly.

@timcliff seems to fit the bill, and I believe he's taken an interest in your project.

$0.00

steemreports (64) 8 years ago

I'll certainly consider that once I have more training data and a cleaner implementation. That's not a very long list of people though, and you said you'd want 'a few' in order to significantly raise your trust.

$0.00

personz (66) 8 years ago

It's necessarily going to be a small list; I can only think of one other: @jesta. Think about it, and why not pick some others too? It's not specifically for me or anything.

$0.00

andybets (62) 8 years ago

I will think about it, and tidy up my code ;)

$0.00

creativista (54) 8 years ago (edited)

This is a really great tool @steemreports. Unfortunately, resteeming accounts don't only pick content of value, they also resteem everything people pay them to, and this really hurts my feed because it hides my friends' content. So, I was trying this tool out and found that some of my steemit friends lean a tiny bit in the direction of spammers. How do you determine that? Is there a correlation between posting frequency and spamming? Or maybe word length and spamming? I'm really curious to know. I love to play with these genius tools!

steemreports (64) 8 years ago

Thank you. Yes, posting frequency and word length are some of the many factors used in the algorithm. It is interesting, but I don't want to disclose too much as it might undermine the effectiveness of the process.

I haven't been recording resteems, but now you mention it, I suspect that might be useful for this classification too!
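
For readers wondering what factors like these look like in practice, here is a minimal sketch of computing posting frequency and average word length for one account. It is only an illustration: the actual algorithm is not disclosed in this thread, and the function name `comment_features`, the input shape, and the one-day minimum span are all assumptions.

```python
from datetime import datetime
from statistics import mean


def comment_features(comments):
    """Compute two illustrative features for one account.

    comments: list of (timestamp, body) tuples, where timestamp is a
    datetime and body is the comment text. This input shape is an
    assumption for the sketch, not the format used by the real tool.
    """
    if not comments:
        return {"posts_per_day": 0.0, "mean_word_length": 0.0}

    times = sorted(t for t, _ in comments)
    # Treat any activity span shorter than one day as one day, so a
    # burst of comments does not divide by (nearly) zero.
    span_days = max((times[-1] - times[0]).total_seconds() / 86400.0, 1.0)

    # Split on whitespace; punctuation stays attached to words.
    words = [w for _, body in comments for w in body.split()]

    return {
        "posts_per_day": len(comments) / span_days,
        "mean_word_length": mean(len(w) for w in words) if words else 0.0,
    }


example = [
    (datetime(2018, 1, 1), "nice post"),
    (datetime(2018, 1, 3), "great work friend"),
]
features = comment_features(example)
```

On their own these two numbers say little; a classifier would presumably combine them with many other signals.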

$0.00

creativista (54) 8 years ago

Yeah, you are right, thanks for answering my intrusive questions anyway :). Yeah, that's it, maybe if the resteem-to-post ratio in one account is 90/10 I would say it belongs on the spammer side of things. Maybe I'm being too harsh...
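
The 90/10 rule of thumb above could be sketched roughly like this. The function names and the default threshold are hypothetical, and this is not a feature of the tool being discussed (resteems are not currently recorded, per the reply above).

```python
def resteem_ratio(resteems, original_posts):
    """Fraction of an account's activity that is resteems (0.0 if no activity)."""
    total = resteems + original_posts
    return resteems / total if total else 0.0


def looks_resteem_heavy(resteems, original_posts, threshold=0.9):
    """Flag accounts whose resteem share meets the (assumed) 90% threshold."""
    return resteem_ratio(resteems, original_posts) >= threshold
```

For example, `looks_resteem_heavy(90, 10)` flags the account, while `looks_resteem_heavy(50, 50)` does not. A hard cutoff like this would also catch honest curators, so it is probably better treated as one input among many than as a verdict.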

$0.00

antonchanning (59) 8 years ago

Very good report. Love the ternary plot. It's also very effective for telling apart bad and good bots, and bad and good humans. People often talk about the 'bot problem', but there are also many useful and good bots on the Steem network, so the knee-jerk reaction of "why can't we ban bots" fails to capture that nuance.

steemreports (64) 8 years ago

Thanks. I'm undecided about bots in general, but they are certainly a factor which adds complexity to the Steem ecosystem.

$0.04

2 votes

cardboard (61) 8 years ago

This is so cool :) I would like to use it for my voting bot to exclude spammers. But does the classification tell us more about a user's comments or posts? I mean, does it indicate that he creates spammy posts, or more that he writes spam comments?

steemreports (64) 8 years ago

Thanks. This classification tells us more about comment spam than spammy posts because that is currently more represented in the initial training set, and where I think it can have the greatest positive impact. I would think (but don't know) that spammy posts which are written deliberately to promote through vote bots will tend to be harder to classify with machine learning, due to more effort being made, and the plagiarism of good content.

$0.00

cardboard (61) 8 years ago

Aye. Nevertheless, if someone produces spammy comments, his posts are probably also low value :)

$0.00

steemreports (64) 8 years ago

Yes, that's true :)

$0.00

jacek-w (50) 8 years ago

What is your definition of being a bot? Because this is a very important question :D

If someone creates normal content, but votes using a script - is he a bot?
If someone only votes (manually, within a small period of time) - is he a bot?
If someone only makes witness actions - is he a bot?
If someone posts the same comment in response to specific posts - is he a bot or a spammer?

steemreports (64) 8 years ago

Good question.

I don't have any definitions. I have simply trained the initial classifier with a selection of bots that I manually evaluated as such. In arriving at classifications for the initial training set, my biases about what I consider to be spam and bots will inevitably shine through. That bias should diminish over time as the training data expands.

I'll answer with my current opinions anyway though, insofar as they relate to this project :)
If someone creates normal content, but votes using a script - is he a bot? Depends on proportions.
If someone only votes (manually, within a small period of time) - is he a bot? No.
If someone only makes witness actions - is he a bot? No, it doesn't affect comments/posts.
If someone posts the same comment in response to specific posts - is he a bot or a spammer? Depends on whether it's scripted.

It is good to frame the project better though, and maybe I haven't given enough thought to that.

$0.00

scientes (43) 8 years ago

Yep, definitely going to use it for my voting bot. When I'm finished I should have something similar to @trufflepig (most likely not as good).
For my training data, I don't want to weight the votes on vote weight alone; I will also try to filter out vote bots.
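
A rough sketch of that filtering idea, assuming votes arrive as `(voter, weight)` pairs and that a set of known vote-bot account names is available (for instance, built from a classifier's output). The function name, the data shapes, and the account names are all hypothetical.

```python
def organic_vote_score(votes, known_bots):
    """Sum vote weights for a post, skipping votes cast by known vote bots.

    votes: iterable of (voter, weight) pairs (assumed shape).
    known_bots: set of account names treated as vote bots (assumed input).
    """
    return sum(weight for voter, weight in votes if voter not in known_bots)


votes = [("alice", 100), ("votebot", 5000), ("bob", 50)]
score = organic_vote_score(votes, known_bots={"votebot"})
```

Here the large bot vote is ignored, so the post's score reflects only the two organic votes.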

bobcastleman (37) 8 years ago

I really like this project of yours. Is it just you working on it?

The problem of spam and bots is a tough one. They can really screw up the economies of cryptos. Just getting a good training data set is tough.

$0.03

andybets (62) 8 years ago

It is a difficult problem. I think a good training set will go a long way in improving performance and I'm discussing collaboration on this. I'll hopefully have an announcement on that soon!

$0.00

bobcastleman (37) 8 years ago

Perhaps some volunteers to manually classify posts in the training sets?

$0.03