Machine learning and Steem #2: Multiclass classification (content creator vs scammer vs comment spammer vs bid-bot)

in #utopian-io7 years ago (edited)

Repository

https://github.com/keras-team/keras

What Will I Learn?

  • Collect data from beem and SteemSQL
  • Preprocess data to improve classification accuracy
  • Visualize features
  • Build neural network model for multiclass classification problem
  • Measure model performance
  • Use different neural network architectures
  • Use different neural network optimizers

Requirements

  • python
  • basic concepts of data analysis / machine learning

Tools:

It looks like a lot of libraries, but it's a standard python toolset for data analysis / machine learning.

Difficulty

  • Intermediate

Tutorial Contents

  • Problem description
  • Collecting account list for different classes
  • Collecting account features from beem
  • Collecting account features from SteemSQL
  • Visualization of the dataset
  • Preprocessing
  • Building model for multiclass classification problem
  • Measuring model performance
  • Comparing different neural network architectures
  • Comparing different neural network optimizers
  • Conclusions

Problem description

The purpose of this tutorial is to learn how to build a neural network model for multiclass classification problem.

We will use the following classes:

  • content creator
  • scammer (an account that writes comments with links to phishing sites)
  • comment spammer (an account that spams short with comments such as: nice, beautiful, upvoted)
  • bid-bots

Each class will have 100 records.

Collecting account list for different classes

Accounts creating the content will be collected from the SteemSQL database with the following script:

SELECT TOP 100 author
FROM Comments (NOLOCK)
WHERE depth = 0 AND
      category in ('utopian-io', 'dtube', 'dlive')
ORDER BY NEWID()

Scammers will be selected from the comments of the user guard, who warns about this type of accounts. For example:

SELECT TOP 100 account
FROM (
       SELECT SUBSTRING(body, CHARINDEX('@', body) + 1, CHARINDEX(' leads', body) - CHARINDEX('@', body) - 1) as account, *
       FROM Comments (NOLOCK)
       WHERE depth > 0 AND
             author = 'guard' AND CONTAINS(body, 'phishing')) C
GROUP BY account
ORDER BY COUNT(*) DESC

The users who have written the most comments such as: nice, beautiful, upvoted will be selected as comment spammers:

SELECT TOP 100 author
FROM Comments (NOLOCK)
WHERE depth = 1 AND
      created BETWEEN GETUTCDATE() - 90 AND GETUTCDATE() AND
      ((CONTAINS (body, 'nice') AND body LIKE 'nice') OR
       (CONTAINS (body, 'beautiful') AND body LIKE 'beautiful') OR
       (CONTAINS (body, 'upvoted') AND body LIKE 'upvoted'))
GROUP BY author
ORDER BY COUNT(*) DESC

The list of bid-bots has been collected manually from https://steembottracker.com/.

Collecting account features from beem

The features to be analysed are:

  • followers
  • followings
  • follow ratio
  • muters
  • reputation
  • effective sp
  • own sp
  • sp ratio
  • curation_rewards
  • posting_rewards
  • witnesses_voted_for
  • posts
  • average_post_len
  • comments
  • average_comment_len
  • comments_with_link_ratio
  • posts_to_comments_ratio

Some of the features will be collected with the beem script below.

full_list = content_creators + scammers + comment_spammers + bid_bots
d = defaultdict(lambda: defaultdict(int))
for name in full_list:
    account = Account(name)
    foll = account.get_follow_count()
    d[name]['name'] = name
    d[name]['followers'] = foll['follower_count']
    d[name]['followings'] = foll['following_count']
    d[name]['follow ratio'] = foll['following_count'] / foll['follower_count']
    d[name]['muters'] = len(account.get_muters())
    d[name]['reputation'] = account.get_reputation()
    d[name]['effective sp'] = account.get_steem_power()
    own_sp = account.get_steem_power(onlyOwnSP=True)
    d[name]['own sp'] = own_sp
    d[name]['sp ratio'] = account.get_steem_power() / own_sp if own_sp > 0 else 0

Collecting account features from SteemSQL

The remaining features will be downloaded from the SteemSQL database. This will be done with several separate queries, to keep them simple.

The following query will collect features such as:

  • curation_rewards
  • posting_rewards
  • witnesses_voted_for
query = """\
SELECT
  name,
  curation_rewards,
  posting_rewards,
  witnesses_voted_for
FROM Accounts (NOLOCK) a
WHERE name in """ + to_sql_list(full_list)

for row in cursor.execute(query):
    name = row[0]
    curation_rewards = row[1] / 1000.0
    posting_rewards = row[2] / 1000.0
    witnesses_voted_for = row[3]   
    d[name]['curation_rewards'] = curation_rewards
    d[name]['posting_rewards'] = posting_rewards
    d[name]['witnesses_voted_for'] = witnesses_voted_for

The following query will collect features such as:

  • posts
  • average_post_len
query = """\
SELECT
  author,
  COUNT(*),
  AVG(LEN(body))
FROM Comments (NOLOCK) a
WHERE depth = 0 AND
      created BETWEEN GETUTCDATE() - 90 AND GETUTCDATE()
      AND author in """ + to_sql_list(full_list) + """
GROUP BY author"""

for row in cursor.execute(query):
    name = row[0]
    posts = row[1]
    average_post_len = row[2]
    d[name]['posts'] = posts
    d[name]['average_post_len'] = average_post_len

The last query will collect features such as:

  • comments
  • average_comment_len
  • comments_with_link_ratio
  • posts_to_comments_ratio
query="""\
SELECT
  author,
  COUNT(*),
  AVG(LEN(body)),
  CAST(SUM(CASE WHEN body LIKE '%http%' THEN 1 ELSE 0 END) as DECIMAL(10, 3)) / COUNT(*)
FROM Comments (NOLOCK) a
WHERE depth > 0 AND
      created BETWEEN GETUTCDATE() - 90 AND GETUTCDATE()
      AND author in """ + to_sql_list(full_list) + """
GROUP BY author
"""

for row in cursor.execute(query):
    name = row[0]
    comments = row[1]
    average_comment_len = row[2]
    comments_with_link_ratio = row[3]
    d[name]['comments'] = comments
    d[name]['average_comment_len'] = average_comment_len
    d[name]['comments_with_link_ratio'] = comments_with_link_ratio
    d[name]['posts_to_comments_ratio'] = d[name]['posts'] / comments if comments > 0 else 0

You also need to add the appropriate class, depending on which set the account belongs to.

def to_class(name):
    if name in content_creators:
        return 0
    elif name in scammers:
        return 1
    elif name in comment_spammers:
        return 2
    else:
        return 3
    
for name in full_list:
    d[name]['class'] = to_class(name)

All data will be saved as file data.csv.

columns = ['name', 'followers', 'followings', 'follow ratio', 'muters',
           'reputation', 'effective sp', 'own sp', 'sp ratio', 'curation_rewards',
          'posting_rewards', 'witnesses_voted_for', 'posts', 'average_post_len', 'comments',
          'average_comment_len', 'comments_with_link_ratio', 'posts_to_comments_ratio', 'class']

with open ('data.csv' , 'w') as f:
    f.write(','.join(columns) + '\n')
    for name in full_list:
        row = [d[name][column] for column in columns]
        f.write(','.join(map(str, row)) + '\n')

Visualization of the dataset

The dataset is loaded with the following command.

df = pd.read_csv('data.csv', index_col=None, sep=",")

It looks as follows:

                    name  followers  followings  follow ratio  muters  reputation  effective sp        own sp    sp ratio  curation_rewards  posting_rewards  witnesses_voted_for  posts  average_post_len  comments  average_comment_len  comments_with_link_ratio  posts_to_comments_ratio  class
0            alucare        556         119      0.214029       1   58.640845  4.417161e+02    441.716135    1.000000            21.275          433.544                    4    111               630       387                   97                  0.074935                 0.286822      0
1           ayufitri        644         318      0.493789       4   46.679335  1.503135e+01      8.926123    1.683973             0.037           17.479                   30    301              1016       717                  125                  0.054393                 0.419805      0
2             imh3ll         87          26      0.298851       0   53.589527  5.795137e-01     67.930278    0.008531             0.545          132.534                    0      0                 0         2                   24                  0.000000                 0.000000      0
3       andeladenaro        205          75      0.365854       0   56.414916  9.938178e+01     99.381785    1.000000             0.158          195.882                    1      0                 0         0                    0                  0.000000                 0.000000      0
4             shenoy       1225        1576      1.286531       3   65.416593  2.577685e+03   2063.983517    1.248888            46.552         2581.548                    0     88               568        41                   34                  0.048780                 2.146341      0
5           tidylive        542          70      0.129151       2   64.583380  8.252583e+02    825.258289    1.000000            18.131         1891.259                    1     88               496        61                   99                  0.245902                 1.442623      0
6          arpita182         26           0      0.000000       0   27.253055  1.505696e+01      0.173166   86.950855             0.000            0.143                    0      0                 0         0                    0                  0.000000                 0.000000      0
7     world-of-music         45           1      0.022222       5   15.326558  5.008056e+00      0.100673   49.745673             0.000            0.000                    0      0                 0         0                    0                  0.000000                 0.000000      0
8        mirnasahara       1145        2038      1.779913       2   47.750057  1.563195e+01     15.631949    1.000000             0.128           29.303                    0     46                74        12                   29                  0.000000                 3.833333      0
9       haveaniceday         57           2      0.035088       0   42.550673  8.587180e-01      0.858718    1.000000             0.000            2.137                    0      0                 0         0                    0                  0.000000                 0.000000      0
10     blackhypnotik        513          70      0.136452       4   61.570600  4.559381e+02    455.938075    1.000000             5.423          893.037                    1    189               955        70                  372                  0.671429                 2.700000      0
11   openingiszwagra        207          42      0.202899       0   25.000000  1.505430e+01      0.100362  149.999441             0.001            0.000                    0     47               395         7                   28                  0.000000                 6.714286      0
12        swelker101       1610         131      0.081366      13   63.245558  1.681717e+03   5831.538042    0.288383            61.037         1845.328                   26     45               731        52                  106                  0.057692                 0.865385      0
13          phoneinf       1032          19      0.018411      14   65.778281  6.021037e+03   2418.063434    2.490024            50.373         1789.770                    7    451               659      1573                  184                  0.052765                 0.286713      0
14           spawn09        989          93      0.094034       6   62.882242  1.253178e+02    125.317849    1.000000            10.626         1736.275                    0    240               598        65                   32                  0.015385                 3.692308      0
15          dschense        112          34      0.303571       1   28.768973  5.017238e+00      0.637683    7.867919             0.012            0.233                    0      0                 0         0                    0                  0.000000                 0.000000      0
16          brainpod        310         187      0.603226       0   53.062904  3.407024e+01     34.070244    1.000000             0.047           77.218                    2     44              1219        64                  123                  0.031250                 0.687500      0
17       jezieshasan        350         216      0.617143       1   42.953291  6.709893e+00      5.705882    1.175961             0.447           12.838                    0      0                 0         0                    0                  0.000000                 0.000000      0
18         hmagellan        453         169      0.373068       0   45.283351  1.004239e+01    689.153713    0.014572             0.596           30.575                    0      0                 0         3                  224                  0.000000                 0.000000      0
19      marishkamoon        191           1      0.005236       0   35.302759  1.326083e+01     13.260832    1.000000             0.000            1.042                    0     24               311         0                    0                  0.000000                 0.000000      0
20          xee-shan        198           8      0.040404       0   52.254091  3.666877e+01     36.668770    1.000000             0.076           72.899                    0     27              1516         3                   44                  0.000000                 9.000000      0
21       marcmichael        300          78      0.260000       2   39.739864  1.504864e+01      1.925210    7.816625             0.001            2.900                    0     80               769         2                   91                  0.000000                40.000000      0
22       ridvanunver        425          53      0.124706       0   57.346462  1.128660e+02    112.865950    1.000000             0.211          265.501                    1     86               927        16                   62                  0.062500                 5.375000      0
23            cheaky       1131          48      0.042440       0   66.096330  1.512192e+03   1512.191914    1.000000            65.840         3511.447                    2     71               309        30                   27                  0.000000                 2.366667      0
24            dbroze       2385          76      0.031866      10   66.849140  2.073023e+03   2073.022665    1.000000            56.955         3972.949                    0     60              2492         4                   88                  0.250000                15.000000      0
25        jarendesta       1911          55      0.028781       6   70.460557  1.206929e+04  12069.286644    1.000000           235.837         9692.369                    0    101              1771        91                   68                  0.000000                 1.109890      0
26   bitcoinmarketss        628         606      0.964968       3   -1.445307  1.412191e+01      0.646501   21.843597             0.007            0.078                    3      2               137         1                    9                  0.000000                 2.000000      0
27            revehs        180           0      0.000000       0   25.000000  1.505181e+01      0.109308  137.700692             0.000            0.017                    0     33               531         0                    0                  0.000000                 0.000000      0
28          skyshine        504         847      1.680556       0   52.665926  4.753474e+01     47.534744    1.000000             0.118           94.111                    1     38               296         4                   29                  0.000000                 9.500000      0
29        adamkokesh      14144        8882      0.627969     100   73.792439  5.002053e+05   5791.887941   86.363087          4782.833        22908.451                    0    127              2114        23                  122                  0.043478                 5.521739      0
..               ...        ...         ...           ...     ...         ...           ...           ...         ...               ...              ...                  ...    ...               ...       ...                  ...                       ...                      ...    ...

Let's first look at the dataset. We can expect for example that:

  • scammers and comment spammers have low reputation
  • scammers and comment spammers add a lot of comments
  • most of the comments from scammers have link to some site
  • bid-bots have the highest effective STEEM POWER
  • bid-bots have most of the STEEM POWER from delegation
  • bid-bots observe fewer users

You can also add many more options here. Now let's look at the average values of the features of each class.

Featurecontent-creator averagescammer averagecomment-spammer averagebid-bot average
followers1056.673340.810577.5452328.080
followings535.990382.760456.3541753.760
follow ratio0.4480.7640.7570.420
muters5.6735.5202.30314.820
reputation49.28513.53841.88445.506
effective sp5353.143104.365226.147214445.739
own sp731.94696.3321224.50012243.641
sp ratio33167243.07322.16428.494121.184
curation_rewards155.4732.19613.3054703.424
posting_rewards1800.86178.733149.129938.840
witnesses_voted_for4.0203.5403.0102.270
posts55.98020.54064.67715.620
average_post_len876.465961.980939.0401930.990
comments101.495121.200360.5963390.350
average_comment_len71.99067.58027.818408.540
comments_with_link_ratio0.0450.1790.0250.481
posts_to_comments_ratio3.3952.6920.3070.159

We can see great differences in average values here. However, this is not enough. Let's look also at some charts.

followers + followings:



muters + rep:

follow ratio + sp ratio:



curation rewards + posting rewards:




posts + comments:




average_post_len + average_comment_len:




comments_with_link_ratio + posts_to_comments_ratio:




This gives us a much better picture of the situation. It is very important to have a good understanding of our data

Preprocessing

Preprocessing is a process of data preparation so that it is more valuable for the model being built. We will use QuantileTransformer.

X_cols = columns
y_cols = ['class']
X = pd.DataFrame(QuantileTransformer().fit_transform(df[X_cols]))
y = to_categorical(df[y_cols])

What is the difference between raw and preprocessed data is shown in the charts below.

Raw dataStandardScalerQuantileTransformer
  • StandardScaler standardizes features by removing the mean and scaling to unit variance.
  • QuantileTransformer makes the data not only in the range of [0,1] but also evenly distributed throughout the area.

For the neural network, the best input is standardized to the range of [0, 1].

Building model for multiclass classification problem

The code building the neural network model is relatively simple. There are only two layers:

  • input layer with 17 neurons
  • output layer with 4 neurons
  • (there is no hidden layer at the moment)
model = Sequential()
model.add(Dense(17, input_dim=17, activation='relu'))
model.add(Dense(4, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, y_train,epochs=40, batch_size=1, verbose=0)
score = model.evaluate(X_test, y_test,verbose=0)
y_pred = model.predict_classes(X_test)


A short glossary:

NameDescription
model.addadds new layer
Densefully connected layer
reluReLU activation function
sigmoidsigmoid activation function
categorical_crossentropyloss used for multiclass classification problem
adamAdaptive Moment Estimation optimizer, basically RMSProp with momentum
accuracypercentage of correctly classified inputs

Measuring model performance

After training the neural network on the training set, we want to check its effectiveness on the test set. To do this, you will display the accuracy and the confusions matrix, which counts the records assigned to the given class.

print('accuracy: %.2f' % score[1])
cm = confusion_matrix(np.argmax(y_test, axis=1), y_pred)
plot_confusion_matrix(cm)

accuracy: 0.81



The obtained accuracy is 81 %. It is acceptable, but it could have been better.

Comparing different neural network architectures

We will now try to build a larger neural network. We will increase the number of neurons in the first layer and add two hidden layers.

model = Sequential()
model.add(Dense(85, input_dim=17, activation='relu'))
model.add(Dense(40, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(4, activation='softmax'))

accuracy: 0.83



Accuracy is slightly better, but it is not a big difference.

Comparing different neural network optimizers

We will try to use other optimizers. The optimizer is an algorithm for updating neural network weights during the learning process.

for optimizer in ['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam']:
    model = Sequential()
    model.add(Dense(85, input_dim=17, activation='relu'))
    model.add(Dense(40, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(4, activation='softmax'))

    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])
           
    model.fit(X_train, y_train,epochs=40, batch_size=1, verbose=0)
    score = model.evaluate(X_test, y_test,verbose=0)
    y_pred = model.predict_classes(X_test)
    cm = confusion_matrix(np.argmax(y_test, axis=1), y_pred)
    plot_confusion_matrix(cm)
    print('%s|accuracy: %.2f' % (optimizer, score[1]))

OptimizerAccuracyConfusion matrix
sgd0.82
rmsprop0.78
adagrad0.80
adadelta0.80
adam0.80
adamax0.81
nadam0.81

The results are almost the same as for the Adam optimizer used earlier. In my opinion, changing the neural network itself will not help much here. In order to increase accuracy, a much larger data would need to be collected and possibly new attributes would need to be added.

Curriculum

  1. Machine learning (Keras) and Steem #1: User vs bid-bot binary classification

Conclusions

  • the bigger the dataset, the better we can train the neural network
  • we should have a good understanding of our data
  • preprocessing can make a huge difference
  • building the optimal neural network model requires experimenting
  • it is worth to experiment with different architecture of the neural network / different optimizers

Proof of Work Done

Collecting data
Building classifier

Sort:  

Nicely written. I have been teaching myself neural networks and have only reached till ADALINE, have not yet covered Adam. They also did not tell us in the book that pre processing is needed. It might be helpful though to show your weight matrix, which should be 4 columns and 17 rows? Also is there a good guide for ADALINE? I solved it with hand and cannot seem to match the weights. They do not seem to converge.

Hey @hispeedimagins
Here's a tip for your valuable feedback! @Utopian-io loves and incentivises informative comments.

Contributing on Utopian
Learn how to contribute on our website.

Want to chat? Join us on Discord https://discord.gg/h52nFrV.

Vote for Utopian Witness!

Thank you for your contribution.

Your contribution has been evaluated according to Utopian policies and guidelines, as well as a predefined set of questions pertaining to the category.

To view those questions and the relevant answers related to your post, click here.


Need help? Write a ticket on https://support.utopian.io/.
Chat with us on Discord.
[utopian-moderator]

Hey @jacekw.dev
Thanks for contributing on Utopian.
Congratulations! Your contribution was Staff Picked to receive a maximum vote for the tutorials category on Utopian for being of significant value to the project and the open source community.

We’re already looking forward to your next contribution!

Want to chat? Join us on Discord https://discord.gg/h52nFrV.

Vote for Utopian Witness!

Cool stuff! :-)

Congratulations @jacekw.dev! You have completed the following achievement on Steemit and have been rewarded with new badge(s) :

Award for the number of upvotes

Click on the badge to view your Board of Honor.
If you no longer want to receive notifications, reply to this comment with the word STOP

Do you like SteemitBoard's project? Then Vote for its witness and get one more award!

Excelent post, I want your permission to translate this to spanish and share in my blog, the information that are you sharing is very valuable!

Thanks @jacekw.dev

Coin Marketplace

STEEM 0.09
TRX 0.30
JST 0.033
BTC 111385.35
ETH 3933.36
USDT 1.00
SBD 0.57