Hacking The Top 100 New STEEM Posts - First Text Mining Analysis for Fun & Profit!

in #money, 8 years ago (edited)

When I saw that STEEM had become the #35,637 most visited website online today, I just had to take a look at what exactly was going on. Alexa is an Amazon-owned company that provides commercial web traffic data and analytics, and let’s just say they must be really impressed with STEEM.

But I decided to go the extra mile. What exactly is going on INSIDE Steem?

So I wrote a little Python application which allowed me to extract a whole lot of data. For example, here are the top 10 tags sorted by the highest payouts.

Not surprisingly, steem and introduceyourself are at the top, followed by steemit. But here is where we discover the first patterns - travel, life, and photography are making the most money.
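For anyone curious how a ranking like this can be put together, here is a minimal sketch. The sample records and the `top_tags_by_payout` helper are made up for illustration, and it sums payouts per tag, which is an assumption (the chart could just as well have used averages).

```python
from collections import defaultdict

# Hypothetical sample records; the real data came from scraping Steemit.
posts = [
    {"tag": "steem", "payout": 120.0},
    {"tag": "travel", "payout": 80.5},
    {"tag": "steem", "payout": 30.0},
    {"tag": "photography", "payout": 95.0},
    {"tag": "travel", "payout": 10.0},
]


def top_tags_by_payout(posts, n=10):
    """Sum payouts per tag and return the n highest-paying (tag, total) pairs."""
    totals = defaultdict(float)
    for post in posts:
        totals[post["tag"]] += post["payout"]
    # Sort by total payout, largest first, and keep the top n.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


for tag, total in top_tags_by_payout(posts):
    print(f"{tag}: ${total:.2f}")
```

Swapping `post["payout"]` for a reply count would give the "most replies" ranking the same way.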

But you may be wondering, who is getting the most replies?

Interesting! Seems like the three forerunners are again at the top, followed by bitcoin, photography, and money. And it seems that photography and money have reached a healthy medium up until now.

But being the nervous ADHD brain that I am, I wasn’t satisfied by this cursory glance at the inner workings - so I decided to grab the Top 100 Trending Posts! Here is what I found out.

The average earnings for a Top 100 post are: $1,881.43

Woah. That’s a lot more than I thought! What about average votes and responses?

Votes: 141.97

Responses: 33.59

What did the most successful “trending” post earn? $15,951.16
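Numbers like these are just averages and a maximum over the scraped posts. A minimal sketch, assuming each post has been reduced to a dict with hypothetical `payout`, `votes`, and `responses` fields (the sample values are invented):

```python
from statistics import mean

# Hypothetical sample; field names and values are assumptions for illustration.
posts = [
    {"payout": 100.0, "votes": 50, "responses": 10},
    {"payout": 250.0, "votes": 120, "responses": 30},
    {"payout": 40.0, "votes": 10, "responses": 5},
]


def summarize(posts):
    """Return average payout, votes, and responses, plus the largest payout."""
    return {
        "avg_payout": mean(p["payout"] for p in posts),
        "avg_votes": mean(p["votes"] for p in posts),
        "avg_responses": mean(p["responses"] for p in posts),
        "max_payout": max(p["payout"] for p in posts),
    }


summary = summarize(posts)
# Rounding at print time avoids long floating-point tails in the output.
print(f"Average payout: ${summary['avg_payout']:.2f}")
print(f"Top payout: ${summary['max_payout']:.2f}")
```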

But this kind of data isn’t THAT interesting. What I didn’t tell you is that I also grabbed all the text from those posts. So let’s take a look at the most commonly used words in a trending title.

That looks about right, but most of these are common words that don’t tell us anything. What we have to do is get rid of those pesky stopwords (I, me, is, this, that). And a close look at the graph will show you that it’s only 20 examples and not 25 as the title says. Oops! Let’s try again.

Ahh, much better. “Looks like today, new Bitcoin posts on Steemit come first and have the most community importance, especially if they have some nice photography... This could get big. Better check your Ethereum accounts, northerners.”
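The stopword filtering described above can be sketched roughly like this. The stopword list is a tiny made-up one (a real run would use a fuller list), and the titles are invented examples, not the actual trending titles.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a longer one.
STOPWORDS = {"i", "me", "is", "this", "that", "the", "a", "to", "of", "and", "on"}

# Hypothetical titles, not the real scraped data.
titles = [
    "This is the future of Bitcoin",
    "Bitcoin and Steemit: a photography guide",
    "I moved to Steemit",
]


def top_words(texts, n=25, stopwords=STOPWORDS):
    """Count lowercase words across all texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)


for word, count in top_words(titles, n=5):
    print(word, count)
```

The same function works on full post bodies, which is all the "most common words in the posts themselves" chart needs.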

But what about the trending posts themselves? What are the most common words there? Hold on tight, this is a big one.

“I would like one steem, people, and I’d also like to know about those good passwords…”

What I want to know now is whether there is anyone I should be jealous of. Which users have more than one post in the trending top 100?

Looks like simba and thedashguy are kicking ass. Unfortunately, I’ve read somewhere that katecloud has been hacked and can’t get the account back. Sorry to hear that, and I hope it gets resolved.
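Finding users with more than one trending post is a simple count over the author names. A minimal sketch with placeholder usernames (not the real list):

```python
from collections import Counter

# Placeholder author names, one entry per trending post.
authors = ["alice", "bob", "alice", "carol", "bob", "alice"]


def repeat_authors(authors):
    """Return (author, post_count) pairs for authors with more than one post."""
    counts = Counter(authors)
    return [(name, n) for name, n in counts.most_common() if n > 1]


print(repeat_authors(authors))
```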

Which posts are getting the most votes, the most dollars, and the most responses?

What about the lengths of posts?

The average length of a trending top 100 post is 496.36 words. The shortest is only 7 words long, and the longest is 4,472.
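Post lengths can be measured by splitting each body on whitespace, which is an assumption about how "words" were counted here. A minimal sketch with invented post bodies:

```python
from statistics import mean

# Invented post bodies for illustration.
bodies = [
    "one two three",
    "a much longer post body with nine words total",
    "short",
]


def length_stats(bodies):
    """Return (average, shortest, longest) word counts, splitting on whitespace."""
    lengths = [len(body.split()) for body in bodies]
    return mean(lengths), min(lengths), max(lengths)


average, shortest, longest = length_stats(bodies)
print(f"Average: {average:.2f} words, shortest: {shortest}, longest: {longest}")
```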

Thank you for taking a look. Please upvote and comment if you can. I will be uploading a lot more content in the future.

P.S. If you would like to learn how to do this yourself, I recommend starting with the book Learning Python or Learn Python the Hard Way, both of which teach you the basics of Python programming. Then you can look at doing things like HTTP requests and driving a browser with Selenium. It takes a while, but it's worth it.


I like stats reports, and this one is no different. Good job. One thing I'd like to see is the time span for which these stats are relevant. I can recognize the titles of some articles, and thus conclude that they are recent, but the period analyzed should be mentioned.

Those are all great statistics. I think creating an app showing these in addition to Steem & Steem Dollars prices would be fantastic. In the meantime, looking forward to more stat treasures.

This is awesome @filip-martinka
Could you possibly share the Python code?

Thank you. At the moment it's entirely spaghetti code, and a lot was done in the console, so it's not all built into one program yet. But in the future I might release a better, more coherent version of the code if there is enough demand.

edit: I have a work in progress at http://steemread.com/ now, but it's a very rough draft. I will include information such as word counts, etc. soon. It's very basic at the moment.

Awesome, upvoted. I have a question for you. How have you gathered the data? Did you just collect it off the stream using something like Piston, or have you used some other technique to crawl/scrape steemit?

Thanks. I used Selenium and XPath selectors.
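To give a feel for what XPath selection looks like, here is a minimal sketch. It swaps in Python's standard-library `xml.etree.ElementTree` (which supports a limited XPath subset) in place of Selenium so it runs without a browser, and the markup, class names, and `extract_posts` helper are all hypothetical; the real Steemit page structure is different. With Selenium, the same kind of expression would be passed to something like `driver.find_elements(By.XPATH, ...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page.
SAMPLE = """
<div>
  <article class="post">
    <h2 class="title">Hello Steemit</h2>
    <span class="payout">12.50</span>
  </article>
  <article class="post">
    <h2 class="title">Bitcoin photography</h2>
    <span class="payout">7.25</span>
  </article>
</div>
"""


def extract_posts(markup):
    """Pull title and payout out of each <article class="post"> element."""
    root = ET.fromstring(markup.strip())
    posts = []
    # ".//" searches all descendants; "[@class='post']" filters by attribute.
    for article in root.findall(".//article[@class='post']"):
        title = article.find("h2[@class='title']").text
        payout = float(article.find("span[@class='payout']").text)
        posts.append({"title": title, "payout": payout})
    return posts


print(extract_posts(SAMPLE))
```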

Using the steemd API instead of scraping HTML will make your code much more robust. We know people use the steemd API and we try not to introduce breaking changes to it, but the HTML structure of steemit.com pages obviously isn't an API and we feel free to change it as needed, when needed.
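As a rough idea of what an API request could look like, here is a sketch that only builds a JSON-RPC 2.0 request body and does not send it. The method name `condenser_api.get_discussions_by_trending` and its `tag`/`limit` parameters are assumptions based on later public Steem API documentation, not something confirmed in this thread, so check the current docs before relying on them. The `build_trending_request` helper is hypothetical.

```python
import json


def build_trending_request(tag, limit=100, request_id=1):
    """Build a JSON-RPC 2.0 request body asking for trending posts in a tag."""
    return {
        "jsonrpc": "2.0",
        # Assumed method name; verify against the node's API documentation.
        "method": "condenser_api.get_discussions_by_trending",
        "params": [{"tag": tag, "limit": limit}],
        "id": request_id,
    }


# Serialize the request; this string would be POSTed to a Steem API node.
body = json.dumps(build_trending_request("photography", limit=10))
print(body)
```

The response would come back as structured JSON, so there is no HTML to parse and nothing breaks when the page layout changes.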

Thanks, that was the next step I had in mind, but I wanted to experiment.

Well...
This is a quality post!
At least 4 hours to make (with the writing, the editing, and especially the app).

Thank you so much for taking the time to do this. It's really cool to see the numbers in action. It makes it possible for us to also see what categories can be contributed to more.

You even went as far as to provide the titles, categories, and an account of what's going on with Steemit favorite @katecloud. I just had to laugh at the words that English has dwindled down to, but it's useful, informative data nonetheless. Thanks so much, @filip-martinka. This is so upvote-worthy!

Very interesting, well-researched post. Good to see some steemians are still taking time to release quality material. Kudos. webocel, it's kind of bad form to beg for votes and then use borderline emotional blackmail to try and secure them. If your post is worthy, you will have nothing to worry about, my friend. Trust in your post and have trust in our community.

Similar to SP, SMD tokens cannot be purchased directly on an external exchange. SMD are primarily earned through contributing but can be purchased by converting STEEM tokens to SMD tokens.

Actually, Steem Dollars can now be purchased on external exchanges!
https://poloniex.com/exchange#btc_sbd
https://bittrex.com/Market/Index?MarketName=BTC-SBD

P.S. The abbreviation is SBD = Steem Backed Dollars,
or just SD = Steem Dollars (not SMD, please edit).

If you would like to learn how to do this yourself, I recommend starting with the book Learning Python which teaches you the basics of Python programming.

Nice recommendation! Thanks, I'll keep it on my radar.

Have you heard of Learn Python The Hard Way? Most of the content is free in the form of an e-book, but you can also buy it, along with the video explanations, if you want. I don't really care for video lectures though.

I used 'Learn Ruby The Hard Way' by the same author as a starting point after my disappointing experience with CodeCademy. I definitely got my shit pushed in. It was satisfying.

That's a good book too. And after a few projects, reading documentation and examples becomes especially useful as well. In general I believe that it's best to learn by practical example.

I guess people want to talk about steem and steemit more than any other topic!
