Using Machine Learning to Fight Plagiarism

in #steemit, 8 years ago

Humans appear to be very similar, but if we dig deep enough, we will uncover a vast array of subtle differences.

Plagiarism on Steemit


Another plagiarist bites the dust, after generating thousands of dollars from stolen content.

This seems to be a common theme, and despite the commendable best efforts of the community and groups like #steemcleaners and #steemitabuse, many cheaters get away with stolen content.

@anyx has done some great work on @cheetah, which detects copy/pasted content.

Could we go a step further and build a tool that detects stolen content by analyzing posts by an arbitrary Steemit user?

Looking for patterns

It is well known that individuals can be identified by their physical traits. Several methods have been deployed to take advantage of this, such as:

  • facial and voice recognition
  • fingerprint and retina scanning
  • DNA sequencing

There are also lesser-known but interesting ways to identify people using proxy signals. For instance, it is possible to semi-identify a subject from CCTV footage by analyzing their walking patterns.

Internet companies (such as Google) use browser fingerprinting and behavior analysis to reliably track individuals online, and completely bypass adblock/privacy plugins.

Patterns in writing

Let me start with an anecdote - my writing style. My writing style is probably unique, and here is why. English is not my native language, and I have learned it mostly on the internet.

For instance, I have no idea where commas (,) should go. In fact, I don't really know any of the formal rules of the language.

I also tend to prefix my sentences with things like For instance and In fact quite often.

I am also completely unaware of all the syntax/grammatical errors I repeatedly make (as these are brain coded patterns - a result of neurons firing in the same way each time).

I like to keep sentences short and simple.

Language is an ever-changing thing, and given my age, it is less likely that I am familiar with older or obsolete words.

The region I grew up in determines which words I use, and the people I've talked to and the books I've read have shaped my unique style.

And lastly, the Oxford English Dictionary alone lists 171,476 words in current use. I probably know and use only a tiny subset of these. There are many ways to convey the same meaning using a completely different set of words.

Using Machine Learning to identify a writing style

Unfortunately, I haven't done any NLP-related work, so this topic is a bit foreign to me. I have only considered this problem at a high level and done some quick research.

Hypothesis
I have found several tools and techniques that can be used to determine the original author of a work, given enough source material.
Our algorithm for Steemit has an easier task: we don't care who the original author is, all we care about is whether a person on Steemit is legit.

If someone is using content that originates from different people, the resulting posts will be subtly different. The theory is that if an author's posts are stylistically inconsistent, we can raise a red flag for human inspection, as such an author is likely sourcing their content from other people.
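A minimal sketch of this consistency check, under loud assumptions: the function-word list, the Euclidean distance, and the threshold are all hypothetical choices made for illustration, not a tested method. Each post is reduced to the relative frequency of a few common words, and each post is then compared against the average of the author's other posts.

```python
import math
import re
from collections import Counter

# Assumption: a tiny, hand-picked list of common English function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in",
                  "that", "is", "it", "for", "but", "with"]


def style_vector(text):
    """Relative frequency of each function word in one post."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]


def distance(a, b):
    """Euclidean distance between two style vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inconsistency_scores(posts):
    """Distance of each post from the mean style of the author's other posts."""
    vectors = [style_vector(p) for p in posts]
    scores = []
    for i, vec in enumerate(vectors):
        others = vectors[:i] + vectors[i + 1:]
        centroid = [sum(col) / len(others) for col in zip(*others)]
        scores.append(distance(vec, centroid))
    return scores


def flag_for_review(posts, threshold=0.3):
    """Indices of posts to show a human. The threshold is arbitrary."""
    if len(posts) < 2:
        return []  # nothing to compare against
    scores = inconsistency_scores(posts)
    return [i for i, s in enumerate(scores) if s > threshold]
```

Real posts are short and noisy, so a usable version would likely need many more features and a threshold tuned on known cases. The output is only a list of candidates for human inspection.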

Disclaimer: I have no idea if this would actually work. I'm putting the idea out there, and maybe someone with experience in this field can help out.

Where this method falls short
Impersonation. If someone is impersonating a single person, then the writing style will be consistent, and the theft is thus undetectable under this method.

Resources

I have found several libraries and frameworks that could be used for this problem.

Who wrote the 15th book of Oz?

The tool from the video:
https://github.com/troywatson/Python-Stylometry-Authorship-Ascription-using-Burrow-s-Technique-2.0
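The repository name refers to Burrows' technique. As a rough illustration of the classic Burrows' Delta measure (this sketch is written from the general description of the method and is not taken from that repository), the most frequent words of a corpus are z-scored per author, and the unknown text is assigned to the author with the smallest mean absolute z-score difference. The function and parameter names here are hypothetical.

```python
import re
import statistics
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def relative_freqs(tokens, vocab):
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in vocab]


def burrows_delta(known_texts, unknown_text, top_n=30):
    """Return {author: delta}. A lower delta means a closer style.

    known_texts maps an author name to one (non-empty) text sample.
    """
    tokenized = {a: tokenize(t) for a, t in known_texts.items()}

    # Vocabulary: the top_n most frequent words across the known corpus.
    corpus_counts = Counter(tok for toks in tokenized.values() for tok in toks)
    vocab = [w for w, _ in corpus_counts.most_common(top_n)]

    profiles = {a: relative_freqs(toks, vocab) for a, toks in tokenized.items()}

    # Per-word mean and standard deviation across authors.
    columns = list(zip(*profiles.values()))
    means = [statistics.mean(col) for col in columns]
    # If every author uses a word equally, pstdev is 0; fall back to 1.0.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]

    def z_scores(freqs):
        return [(f - m) / s for f, m, s in zip(freqs, means, stdevs)]

    unknown_z = z_scores(relative_freqs(tokenize(unknown_text), vocab))
    return {
        author: statistics.mean(
            [abs(u - k) for u, k in zip(unknown_z, z_scores(profile))]
        )
        for author, profile in profiles.items()
    }
```

For example, `burrows_delta({"alice": text_a, "bob": text_b}, mystery_text)` returns one score per known author. The method needs a reasonable amount of text per author to be meaningful.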

More Python Stylometry Tools:
https://github.com/mikekestemont/pystyl
https://github.com/jpotts18/stylometry

And finally, an awesome paper that covers everything from pre-processing to applying classical algorithms (Naive Bayes, decision trees and random forests) and how well they perform.
https://brage.bibsys.no/xmlui/bitstream/id/360867/Rune%20Borge%20Kalleberg.pdf


Don't miss out on the next post. Follow me.



It could be similar to the Penguin thing on Google, although I don't know the technical details.
But how would it work for people like Stellabelle who have "guest posts", and other people who have similar content openly posted on their blogs?

That is something we have to think about, but I completely agree. What @msgivings did was really messed up.
Maybe we could somehow do more of a plagiarism search?! Maybe Cheetah bot is too generous? What is the percentage on that?

I plugged @msgivings' most recent article into Grammarly and turned the plagiarism check on. It came up 23% unoriginal, and the Reddit material it was stolen from was cited in the results. Now I'm ashamed I didn't figure it out! Although copying and pasting every article I read every day would be extremely time-consuming.

A decision to 'nuke' someone should never be algorithm based. Despite a fancy name 'machine learning', these algorithms are very primitive.

These tools can be used as an alerting system. For instance, if the bot finds a suspicious post it informs #steemitabuse chat. Humans can then validate the claim. If the author in question is a host of guest posts, then, he/she may be put on a 'whitelist'.
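As a rough sketch of that alert-only flow (the function name, the score scale, and the threshold are hypothetical), the bot would produce at most a message for humans and would never act on a post by itself:

```python
def triage(author, score, whitelist, threshold=0.5):
    """Return an alert message for human reviewers, or None.

    Assumptions: `score` is some suspicion score between 0 and 1 produced
    elsewhere, and `whitelist` is a set of authors known to host guest posts.
    """
    if author in whitelist:
        return None  # known guest-post host: do not alert
    if score < threshold:
        return None  # not suspicious enough to bother reviewers
    return f"Possible plagiarism by @{author} (score {score:.2f}); please review."
```

The returned message could then be posted to a review channel, where humans validate the claim.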

What about news articles? There is no point in posting just a link. Obviously it's not good to copy-paste an entire news article; the poster should quote only the few passages they find most interesting, and always clearly state the source.

The bot can ignore quoted parts of the post.

Markdown uses email-style > characters for blockquoting. If you’re familiar with quoting passages of text in an email message, then you know how to create a blockquote in Markdown. It looks best if you hard wrap the text and put a > before every line:

source
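Ignoring the quoted parts could look roughly like this. It is a simplified sketch that assumes every quoted line starts with `>`, as the passage above recommends; the function name is hypothetical.

```python
def strip_blockquotes(markdown_text):
    """Drop every line that starts with '>' (after optional indentation)."""
    kept = [
        line
        for line in markdown_text.splitlines()
        if not line.lstrip().startswith(">")
    ]
    return "\n".join(kept)
```

This does not handle quotes where only the first line of a paragraph carries the `>`, and it would also drop lines starting with `>` inside code blocks, so a real bot would be better served by a proper Markdown parser.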

There is nothing to stop you from summarizing the highlights of the article and then linking to it. That way, you are posting original material without copying. The exception would be if you wanted to pick up on someone quoted in the article and write some commentary. You'd need the original quote to make sense of the commentary.

@furion
I built such a bot quite a while back, based on the same hypothesis.
The hypothesis is incorrect.

Writing styles are more influenced by recent shared experience than by personal history.
For example, if you spend a goodly amount of time on Reddit, you'll pick up the same mannerisms as other people who have spent too long on there. Same thing with games like WoW.

We leave fingerprints, yes, but they are prints of fingers: the person is left-handed, the person is right-handed, etc. The person is bipolar and concerned with equality, feels vulnerable, etc.

What you really uncover is that two people encountered similar stressors at stages in life which had a similar impact.
This is why @sykochica and I get flagged as the same person under a system like the one you propose, which is what I told @anyx I was building to try and help @cheetah.

Unfortunately, it doesn't work; you aren't as unique as you might hope.
You get more details here though...
https://steemit.com/steemit/@williambanks/are-you-following-me-you-re-about-to-be-featured-nameinlights
Check my replies in there where I'm talking to pilotbot2015

Big Brother is watching. Remember Orwell's novel 1984 :)

I am still one that believes in the human perception.
Yes, machines are good and all, spun content detection is good and all, BUT all of this only starts to matter IF shit content gets $$$$$$. And when it does, as was the case with @msgivings, a lot of eyes are on her. A lot of eyes = a lot of spidey senses that something is amiss.
Pay attention to who's trending and on what: the worst content, the most money, the most dubious should be asked to provide proof.
That this is not the norm now only shows where Steemit's main interest lies... in which case, all this machine detection is useless when a society chooses itself to be blind to the corruption.

...all this machine detection is useless when a society chooses itself to be blind to the corruption.

Exactly! Right now, it seems that not many users (particularly the highest-weighted voters) are concerned about capitalizing off of outright plagiarism or fraud. And why should they be? They have the power to hide it or to smack down anyone attempting to reveal it.

I've been thinking about this lately, and @bacchist recently wrote a related post about spun content (using a computer to spin an article so it's hard to detect as plagiarised).

I understand that you're proposing a technique where each post by a user is parsed to check their "style". If the style differs too much from post to post, the posts could be flagged as plagiarised.

It's an interesting approach. What's nice is that it only concerns itself with what is on Steemit. This is in contrast to @cheetah which, as far as I can tell, searches the net for similar content.

How about a system that checks the top 25 results of a Google search and flags the content if similar text shows up across different domains? For many searches, 24 out of the 25 results will be exactly the same text, each seemingly from a different domain. A single blog post won't get indexed quickly and won't propagate to fill up the search results in a timely manner either.

So, that is one aspect of ML I would incorporate.
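A rough sketch of the comparison step only, assuming the texts of the search results have already been fetched by some other means (no search API is used here). Word shingles and Jaccard similarity are a swapped-in, conventional way to measure near-duplicate text; the function names, shingle size and threshold are hypothetical.

```python
import re


def shingles(text, k=3):
    """Set of all runs of k consecutive words in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(text_a, text_b, k=3):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def count_matching_results(post, result_texts, threshold=0.8):
    """How many fetched search results are near-duplicates of the post."""
    return sum(1 for text in result_texts if jaccard(post, text) >= threshold)
```

If most of the 25 results match, the post is a candidate for human review.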

good idea I'm going to start one at one point it will be done by 2020 probably tho lmao

You are so followed.

Awesome post, and please do get ahold of me on google hangouts at [email protected] -- I'd love to help you with the server end of this and I just happen to have an ingestion engine set up that feeds the chain into elasticsearch-- and soon, the Cayley graph DB :)!

I do not use any communication tools provided by Google/Microsoft/FB/etc due to the data-mining concerns. I have pinged you on steemit.chat

And Google is, by the way, at the cutting edge of writing style analysis. They can already associate any computer-written text with its author with very high confidence.

There has been some research along these lines. Part of the problem of using machine learning to find someone's "writing fingerprint" is that you need enough of a corpus to train the classifier on. Shakespeare, Agatha Christie and Louis L'Amour (to name a few) have pretty big bodies of work to use. But a random Internet poster? That's hard.

It might be better to attack the problem the way financial fraud detection does. Develop a list of features common to posters who plagiarize and use those features to score new posts. High scores can then be followed up with more in-depth (human) study.

What features are those? I don't know. I would begin investigating it by clustering posts around reading level, frequency, topics and maybe payouts, just for starters. Then analyze each cluster to see how many and what kind of plagiarized posts appear in them. If nothing turns up, vary the features and start again.

It's an interesting problem that I think will take a long time to solve.

Surely this would not work if you can only train the model on what the steemit user is posting?

The corpus you use to build the model in the case of a serial plagiarist will consist of several different posts, but when you train your model you'll end up with a model of the kind of posts that person plagiarises - there'll be no way to detect the odd ones out.

I never would have guessed English was not your first language @furion, I wish I was as competent as you at another language.

I think this is a brilliant idea. I recently read something similar about the writing patterns of trolls, so this idea, if it could be implemented, could kill two birds with one stone.

A solution can never solve every problem but this is a great one!
