Machine learning and Steem #3: Account classification - accuracy improvement up to 95%

in #utopian-io, 7 years ago (edited)

Repository

https://github.com/keras-team/keras

What Will I Learn?

  • Collect data from beem and SteemSQL
  • Build neural network model for multiclass classification problem
  • Build decision tree for multiclass classification problem
  • Visualize decision tree

Requirements

  • python
  • basic concepts of data analysis / machine learning

Tools:

It looks like a lot of libraries, but it's a standard python toolset for data analysis / machine learning.

Difficulty

  • Intermediate

Tutorial Contents

  • Problem description
  • Collecting data
  • Building neural network
  • Building decision tree
  • Visualization of the decision tree

Problem description

The purpose of this tutorial is to improve the accuracy of the model used to classify Steem accounts (content creator vs scammer vs comment spammer vs bid-bot). In the previous part of this tutorial, we achieved an accuracy of 83%. This is what the confusion matrix looked like:



We will try to get improvement through:

  • enlarging the training set
  • increasing the number of iterations
  • using a decision tree instead of a neural network

Collecting data

The data will be downloaded in a similar way to the previous part of the tutorial, but we want to collect as much as possible. Last time, each class had 100 elements, but this is not enough.


Scammers will be collected from the comments of the users guard and arcange, who warn about accounts of this type. For example:






We will use the following queries:

SELECT DISTINCT SUBSTRING(body, CHARINDEX('@', body) + 1, CHARINDEX(' leads', body) - CHARINDEX('@', body) - 1) as account
FROM Comments (NOLOCK)
WHERE depth > 0 AND
      author = 'guard' AND CONTAINS(body, 'phishing')

SELECT DISTINCT SUBSTRING(body, CHARINDEX('@', body) + 1, CHARINDEX(' is a', body) - CHARINDEX('@', body) - 1) as account
FROM Comments (NOLOCK)
WHERE depth > 0 AND
      author = 'arcange' AND
      CONTAINS(body, 'CONFIRMED AND SCAM') AND
      body LIKE '%The message you received from%'

In this way, we've obtained 750 unique accounts. Therefore, we will try to get the same number of elements for other classes.
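
The SUBSTRING/CHARINDEX arithmetic in these queries cuts out the text between the first '@' and a fixed marker (' leads' or ' is a'). A rough Python equivalent may make that easier to follow. The function name extract_account and the sample comment bodies are hypothetical, since the exact wording of the warning comments is not shown in this post.

```python
def extract_account(body, marker=" leads"):
    """Return the text between the first '@' and the first `marker`.

    Mirrors SUBSTRING(body, CHARINDEX('@', body) + 1,
                      CHARINDEX(marker, body) - CHARINDEX('@', body) - 1).
    Returns None when a delimiter is missing or they are out of order
    (a choice made for this sketch, not taken from the SQL).
    """
    at = body.find("@")
    end = body.find(marker)
    if at == -1 or end == -1 or end <= at:
        return None
    return body[at + 1:end]


# Hypothetical comment body, made up for illustration only.
sample = "Warning! The link posted by @some-account leads to a phishing site."
print(extract_account(sample))
```

Passing marker=" is a" gives the behaviour of the second query.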


Content creators will be collected from the SteemSQL database with the following script:

SELECT TOP 750 author
FROM Comments (NOLOCK)
WHERE depth = 0 AND
      category in ('utopian-io', 'dtube', 'dlive', 'steemhunt', 'polish')
ORDER BY NEWID()

To get a list of spammers, first find the most frequent short comments.

SELECT body, COUNT(*) as cnt
FROM Comments (NOLOCK)
WHERE depth = 1 AND LEN(body) < 15 AND created BETWEEN GETUTCDATE() - 60 AND GETUTCDATE()
GROUP BY body
ORDER BY cnt DESC

This gives us the following list:

spam_phrases = [
    'nice', 'nice post', 'good', 'beautiful', 'good post',
    'thanks', 'upvoted', 'very nice', 'great', 'nice blog',
    'thank you', 'wow', 'amazing', 'nice one', 'awesome',
    'great post', 'lol', 'like', 'cool', 'hi',
    'nice, upvoted', 'good job', 'nice article', 'nice pic', 'nice photo',
    'welcome', 'hello', 'good article', 'nice picture', 'nice info',
    'promote me', 'fantastic', 'super', 'nice work', 'nice video',
    'good project', 'wonderful', 'nice bro', 'lovely', 'nice shot'
]

Now we can collect the list of spammers with the following query:

query = """\
SELECT TOP 750 author
FROM Comments (NOLOCK)
WHERE depth = 1 AND
      created BETWEEN GETUTCDATE() - 60 AND GETUTCDATE() AND
      body in """ + to_sql_list(spam_phrases) + """
GROUP BY author
ORDER BY COUNT(*) DESC"""
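
The to_sql_list helper used in this query is not shown in the post. A minimal sketch of what such a helper could look like, assuming it only has to turn a list of Python strings into a parenthesised SQL list with single quotes escaped:

```python
def to_sql_list(values):
    """Render Python strings as a SQL list literal, e.g. ('a', 'b').

    Hypothetical implementation; the author's version may differ.
    Single quotes are doubled, which is how SQL escapes them.
    """
    escaped = ["'" + value.replace("'", "''") + "'" for value in values]
    return "(" + ", ".join(escaped) + ")"


print(to_sql_list(["nice", "nice post"]))  # ('nice', 'nice post')
```

This is fine for a fixed, hand-written phrase list like the one above. For values that come from users, a parameterized query would be the safer choice.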

The list of bid-bots has been collected manually from https://steembottracker.com/. But here's the problem: we only have 100 records. There are two options:

  • leave 100 records and thus have unbalanced classes
  • use some method to add records

We will use the simplest method: we copy the data until we have 750 records. This is not ideal, but it is better than leaving the classes unbalanced.
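
Copying 100 records up to 750 could be sketched as follows. The function name oversample and the placeholder account names are assumptions made for illustration; the author's actual script is linked below and may do this differently.

```python
from itertools import cycle, islice


def oversample(records, target_size):
    """Repeat the records in order until target_size items are collected."""
    if not records:
        raise ValueError("records must not be empty")
    return list(islice(cycle(records), target_size))


# Placeholder names standing in for the 100 manually collected bid-bots.
bid_bots = [f"bot{i}" for i in range(100)]
balanced = oversample(bid_bots, 750)
print(len(balanced))  # 750
```

Because 750 is not a multiple of 100, the first 50 records end up with 8 copies each and the remaining 50 with 7.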

The full script that retrieves the features of accounts can be found here.

As a reminder, the features analyzed are:

['followers', 'followings', 'follow ratio', 'muters',
'reputation', 'effective sp', 'own sp', 'sp ratio', 'curation_rewards',
'posting_rewards', 'witnesses_voted_for', 'posts', 'average_post_len', 'comments',
'average_comment_len', 'comments_with_link_ratio', 'posts_to_comments_ratio']

I will not focus on visualization here, because it was shown in previous parts and I do not want to repeat myself. The sample chart looks as follows and all of them can be seen here.



Building neural network

Let's start with the simplest neural network possible.

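import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix

# One hidden layer with 17 units (one per feature) and a 4-way softmax output.
model = Sequential()
model.add(Dense(17, input_dim=17, activation='relu'))
model.add(Dense(4, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='nadam',
              metrics=['accuracy'])

# y_train and y_test are one-hot encoded, as categorical_crossentropy expects.
model.fit(X_train, y_train, epochs=50, batch_size=1, verbose=0)
score = model.evaluate(X_test, y_test, verbose=0)
# predict_classes() is gone from recent Keras releases; argmax over the
# predicted probabilities gives the same class indices.
y_pred = np.argmax(model.predict(X_test), axis=1)

print('accuracy: %.3f' % score[1])
cm = confusion_matrix(np.argmax(y_test, axis=1), y_pred)
plot_confusion_matrix(cm)  # plotting helper, not shown in this post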


accuracy: 0.847




A short glossary:

Name                      Description
model.add                 adds a new layer
Dense                     fully connected layer
relu                      ReLU activation function
softmax                   softmax activation function (turns the outputs into class probabilities)
categorical_crossentropy  loss used for multiclass classification problems
nadam                     Nesterov-accelerated Adam (Adaptive Moment Estimation) optimizer, basically RMSProp with Nesterov momentum
accuracy                  percentage of correctly classified inputs
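
To make the output layer and the loss more concrete, here is a small pure-Python sketch of what the softmax activation and the categorical cross-entropy loss compute for a single sample. The raw scores are made-up numbers; Keras does the same calculation in vectorized form on whole batches.

```python
import math


def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Loss for one sample: -sum(target_i * log(probability_i))."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, probabilities) if t > 0)


scores = [2.0, 0.5, 0.1, -1.0]  # hypothetical raw outputs for the 4 classes
probs = softmax(scores)
target = [1, 0, 0, 0]           # one-hot label: the first class is the true one
loss = categorical_crossentropy(target, probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

With a one-hot target the loss reduces to minus the logarithm of the probability assigned to the true class, so a confident correct prediction gives a loss close to zero.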

Let's add more layers and neurons and increase the number of iterations.

model = Sequential()
model.add(Dense(85, input_dim=17, activation='relu'))
model.add(Dense(40, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(4, activation='softmax'))

accuracy: 0.851




We see some improvement over the model from the previous part of the tutorial, but it is not very significant yet.

Building decision tree

Let's try to use a different model - a decision tree. It is a decision support tool that uses a tree-like graph of decisions and their consequences.

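from sklearn import tree
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X_cols = columns  # feature column names
y_cols = ['class']
X = df[X_cols]
y = df[y_cols]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
input_dim = len(X_cols)

model = tree.DecisionTreeClassifier(max_depth=8)

model.fit(X_train, y_train)
# Evaluate on the held-out data; calling fit() here would retrain on the test set.
score = model.score(X_test, y_test)
y_pred = model.predict(X_test)

print('accuracy: %.3f' % accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)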

accuracy: 0.951




The accuracy of the model has increased significantly, to 95%!
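
For readers who want to see how the accuracy relates to the confusion matrix, here is a small pure-Python sketch. The labels are hypothetical and the function names are chosen for this example; sklearn's confusion_matrix uses the same layout, with true classes in rows and predicted classes in columns.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Accuracy is the diagonal (correct predictions) divided by all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels for 8 accounts, with classes encoded as 0..3.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 0]
cm = confusion_matrix_counts(y_true, y_pred, 4)
for row in cm:
    print(row)
print(accuracy_from_matrix(cm))  # 0.75
```

Off-diagonal cells show which classes get confused with each other, which the single accuracy number hides.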

Visualization of the decision tree

We will use the graphviz library for visualization.

import graphviz
from sklearn import tree

# class_names: the class labels, in ascending order of the encoded classes.
dot_data = tree.export_graphviz(
    model,
    out_file=None,
    feature_names=X_cols,
    class_names=class_names,
    filled=True,
    rounded=True,
    special_characters=True)

# Render the DOT description to dtree_render.png and open it.
graph = graphviz.Source(dot_data)
graph.format = 'png'
graph.render('dtree_render', view=True)

As we can see, the decision tree is very large (to see the details, just open the image in a new tab).

Let's look at a fragment of this tree. The color of nodes is related to the class. The value gini (Gini impurity) measures how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.
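
The Gini impurity of a node is 1 minus the sum of the squared class shares in that node. A minimal sketch of the calculation, with made-up label lists standing in for the accounts that reach a node:

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class shares p_k."""
    total = len(labels)
    if total == 0:
        return 0.0
    shares = (count / total for count in Counter(labels).values())
    return 1.0 - sum(share ** 2 for share in shares)


pure_node = ["bid-bot"] * 10
mixed_node = ["bid-bot"] * 5 + ["spammer"] * 5
print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
```

A value of 0 means every sample in the node belongs to one class. With four equally represented classes the value is 0.75, the highest possible for this problem.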

Curriculum

  1. Machine learning (Keras) and Steem #1: User vs bid-bot binary classification
  2. Machine learning and Steem #2: Multiclass classification (content creator vs scammer vs comment spammer vs bid-bot)

Conclusions

  • the bigger the dataset, the better we can train our model
  • it's worth trying different algorithms instead of sticking to one
  • the decision tree turned out to be a better choice, both in terms of accuracy and execution time

Proof of Work Done

Collecting data
Building classifiers


Thank you for your contribution.

  • Very good tutorial, thanks for your effort for this tutorial.

Your contribution has been evaluated according to Utopian policies and guidelines, as well as a predefined set of questions pertaining to the category.

[utopian-moderator]


this is really awesome. Well done. I am thinking I would love to work with you on something. You have fantastic skills and your posts are a breath of fresh air.

Hope you don't mind me asking, but did you sign up to steemit via utopian?

Thanks. I came to steemit about a year ago, but I only wrote in the #polish community using the @jacekw account. Recently I have some ideas for using Machine Learning (eg account classification), so I decided to start creating contributions for #utopian-io.

I am thinking I would love to work with you on something.

With pleasure!

wow just looking at some of your posts on your other account too. I had never come across this account before, this is such a pity. but I am glad I have discovered you now. Steem on
