Steem Tag Covariance

in #mathematics, 6 years ago (edited)

I looked at the Steem tags used in May 2018 using a covariance matrix. Covariance measures how strongly two random variables vary together: it is positive when they tend to be high or low at the same time, and negative when one tends to be high while the other is low. Here, the random variables are binary indicators (0 or 1) of whether a tag is present, and our observations are 1172186 top-level discussions. To keep the analysis manageable, I limited it to tags with 50 or more posts, which produced a set of 5180 tags.
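
For 0/1 indicators, the sample covariance reduces to simple counts. As a sketch, with notation introduced here rather than taken from the original analysis: let $n$ be the number of discussions, $n_a$ and $n_b$ the number using tag $a$ and tag $b$, and $n_{ab}$ the number using both.

```latex
\operatorname{Cov}(a,b)
  = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ia}-\bar{x}_a)(x_{ib}-\bar{x}_b)
  = \frac{1}{n-1}\left(n_{ab} - \frac{n_a\,n_b}{n}\right)
  \approx \hat{p}_{ab} - \hat{p}_a\,\hat{p}_b
```

Here $\hat{p}_a = n_a/n$, and so on. A positive value means the two tags appear together more often than independent use would predict, and a negative value means less often. By the Cauchy-Schwarz inequality the magnitude is at most $\sqrt{\hat{p}_a(1-\hat{p}_a)\,\hat{p}_b(1-\hat{p}_b)}$ (up to the factor $n/(n-1)$), so any pair involving a rare tag necessarily has a covariance close to zero.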

The sample covariance can be calculated from this set of observations with some linear algebra. The observations can be represented as a 1172186 × 5180 matrix, fortunately a very sparse one. Each row is one discussion, with 1s in the columns corresponding to the tags attached to that discussion. Using the built-in numpy covariance function would convert this to a dense matrix and thus run out of memory. But I found a calculation on Stack Overflow that keeps the input sparse, though the resulting covariance matrix is not itself sparse:

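    # Compute the covariance matrix without densifying A.
    # Here A must hold one row per tag and one column per discussion (the
    # transpose of the observation matrix described above), as a scipy.sparse matrix.
    n = A.shape[1]                                      # number of discussions
    rowsum = A.sum(1)                                   # per-tag counts, dense column vector
    centering = rowsum.dot(rowsum.T.conjugate()) / n    # outer product of counts / n
    C = (A.dot(A.T.conjugate()) - centering) / (n - 1)  # dense tags-by-tags result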

I thought I was going to be able to produce a nice heatmap of the results, but unfortunately most of the values are so close to zero that it just looks like a single color! The covariance tells us which tags are most likely to be used together; that is, which tags predict the presence of other tags. The i,j entry in the matrix calculated above indicates the covariance between tags i and j.
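
As a sanity check on the calculation above, here is a minimal sketch on made-up data. The toy `discussions` list and the helper `sparse_cov` are illustrative and not part of the original analysis. The matrix is oriented as described in the text (one row per discussion, one column per tag), and the result is compared with `numpy.cov` on the dense version, which is only feasible because the example is tiny.

```python
import numpy as np
from scipy import sparse

# Toy data (hypothetical): each discussion is a list of distinct tags.
discussions = [
    ["bitcoin", "cryptocurrency"],
    ["life", "photography"],
    ["bitcoin", "cryptocurrency", "blockchain"],
    ["life", "blog"],
    ["photography", "nature"],
]
tags = sorted({t for d in discussions for t in d})
col = {t: j for j, t in enumerate(tags)}

# Observation matrix X: one row per discussion, one column per tag.
rows = [i for i, d in enumerate(discussions) for _ in d]
cols = [col[t] for d in discussions for t in d]
X = sparse.csr_matrix(
    (np.ones(len(rows)), (rows, cols)),
    shape=(len(discussions), len(tags)),
)

def sparse_cov(X):
    """Sample covariance between the columns of a sparse 0/1 matrix X."""
    n = X.shape[0]
    colsum = np.asarray(X.sum(axis=0)).ravel()  # number of uses of each tag
    gram = (X.T @ X).toarray()                  # co-occurrence counts, tags x tags
    return (gram - np.outer(colsum, colsum) / n) / (n - 1)

C = sparse_cov(X)

# Agrees with numpy's dense implementation (columns are the variables).
assert np.allclose(C, np.cov(X.toarray(), rowvar=False))
```

In this toy set, `C[col["bitcoin"], col["cryptocurrency"]]` comes out at about 0.3 (the two tags always appear together), while `C[col["bitcoin"], col["life"]]` is about -0.2 (they never do).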

The highest covariances in this data set are the tags that are most often used together:

| a | b | Cov(a,b) |
|---|---|---|
| blog | life | 0.020875335264099508 |
| life | photography | 0.02387408675776753 |
| nature | photography | 0.02543577003133139 |
| cryptocurrency | bitcoin | 0.03138339076182716 |

But which tags have the lowest covariance? Here is a selection of the tag pairs with the most negative covariance:

| a | b | Cov(a,b) |
|---|---|---|
| cryptocurrency | life | -0.010134353519518548 |
| bitcoin | life | -0.00900484452329558 |
| busy | esteem | -0.008966313762135844 |
| photography | cryptocurrency | -0.008838014354554504 |
| blockchain | life | -0.006794955044753489 |
| news | photography | -0.005262038868601199 |
| crypto | life | -0.005215813170776201 |
| game | life | -0.0041995541146152395 |
| ico | life | -0.004184116319089055 |
| meme | photography | -0.0041386081179727655 |
| dlive | photography | -0.00406223529799713 |
| dtube | photography | -0.0038268607171862527 |
| spanish | esteem | -0.003809049708942849 |

That is, a discussion that uses #cryptocurrency is somewhat less likely to also use #life or #photography. Unsurprisingly, app tags are common here: nobody uses both #busy and #esteem at the same time.
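
Rankings like the ones above can be pulled out of a dense covariance matrix by looking only at the entries above the diagonal, so that each unordered pair is counted once. The following is a minimal sketch. The function `extreme_pairs` and the small matrix are hypothetical and for illustration only, not the code or data used for the tables.

```python
import numpy as np

def extreme_pairs(C, tags, k=5, largest=True):
    """Return the k tag pairs (a, b, cov) with the largest or smallest covariance."""
    i, j = np.triu_indices(len(tags), k=1)  # every pair i < j, diagonal excluded
    vals = np.asarray(C)[i, j]
    order = np.argsort(vals)                # ascending covariance
    if largest:
        order = order[::-1]
    return [(tags[i[m]], tags[j[m]], float(vals[m])) for m in order[:k]]

# Made-up 3 x 3 covariance matrix, purely illustrative.
tags = ["bitcoin", "cryptocurrency", "life"]
C = np.array([
    [ 0.24,  0.18, -0.10],
    [ 0.18,  0.24, -0.08],
    [-0.10, -0.08,  0.21],
])

top = extreme_pairs(C, tags, k=1)                    # most positive pair
bottom = extreme_pairs(C, tags, k=1, largest=False)  # most negative pair
```

A full sort is used for simplicity. With a few thousand tags there are on the order of ten million pairs, which should still be manageable in memory.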

What about our friend #steemstem?

| #steemstem | Most popular tags | Covariance |
|---|---|---|
| steemstem | science | 0.001116 |
| steemstem | stemng | 0.000448 |
| steemstem | steemiteducation | 0.000357 |
| steemstem | technology | 0.000330 |
| steemstem | stem-espanol | 0.000226 |
| steemstem | health | 0.000220 |
| steemstem | education | 0.000217 |
| steemstem | nigeria | 0.000211 |
| steemstem | wafrica | 0.000206 |
| steemstem | spanish | 0.000140 |
| steemstem | stach | 0.000124 |
| steemstem | adsactly | 0.000124 |
| steemstem | engineering | 0.000110 |
| steemstem | physics | 0.000107 |
| steemstem | air-clinic | 0.000098 |
| steemstem | mathematics | 0.000084 |
| steemstem | biology | 0.000080 |
| steemstem | psychology | 0.000074 |
| steemstem | ocd-resteem | 0.000066 |
| steemstem | astronomy | 0.000062 |

| #steemstem | Least popular tags | Covariance |
|---|---|---|
| steemstem | game | -0.000062 |
| steemstem | blog | -0.000062 |
| steemstem | steem | -0.000065 |
| steemstem | dlive | -0.000067 |
| steemstem | steepshot | -0.000071 |
| steemstem | crypto | -0.000072 |
| steemstem | blockchain | -0.000075 |
| steemstem | kr | -0.000078 |
| steemstem | meme | -0.000082 |
| steemstem | busy | -0.000083 |
| steemstem | photo | -0.000084 |
| steemstem | travel | -0.000094 |
| steemstem | funny | -0.000103 |
| steemstem | esteem | -0.000108 |
| steemstem | art | -0.000113 |
| steemstem | bitcoin | -0.000134 |
| steemstem | steemit | -0.000137 |
| steemstem | cryptocurrency | -0.000137 |
| steemstem | life | -0.000293 |
| steemstem | photography | -0.000300 |

With statistical tools like this, we can start to extract some semantic information, as well as better understand how people use Steem.

Tags are constrained in a way that is hard to model: a discussion can normally have at most five of them. (Using the API directly appears to allow more.) This means that the behavior of tags is not fully captured by the pairwise behavior we have been looking at here. So while I am trying to model them using a Bayesian approach, I have no idea what the underlying "expected" distribution is when five different tags are chosen according to their individual frequencies, independently of one another. My question asking for help on Quora has thus far gone unanswered: https://www.quora.com/I-have-a-deck-consisting-of-n_k-copies-of-the-card-k-How-do-I-calculate-the-probability-of-drawing-a-five-card-hand-containing-two-specified-cards-a-and-b-What-if-a-hand-may-have-no-duplicates
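
Without a closed form, one option is to estimate such a baseline by simulation. The sketch below assumes one particular model, which may not be the right one: five distinct tags are drawn without replacement, weighted by each tag's overall frequency, using numpy's `Generator.choice`. The function name `pair_probability` and its parameters are hypothetical.

```python
import numpy as np

def pair_probability(weights, a, b, hand_size=5, trials=20_000, seed=0):
    """Monte Carlo estimate of P(tags a and b both appear) when hand_size
    distinct tags are drawn without replacement, weighted by `weights`.
    This sampling model is an assumption, not a known property of Steem."""
    rng = np.random.default_rng(seed)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                      # normalize weights to probabilities
    hits = 0
    for _ in range(trials):
        hand = rng.choice(len(p), size=hand_size, replace=False, p=p)
        hits += (a in hand) and (b in hand)
    return hits / trials

# Check against a case with a known answer: 10 equally likely tags.
# Exact value is C(8,3) / C(10,5) = 56 / 252 = 2/9.
estimate = pair_probability([1.0] * 10, a=0, b=1)
```

Comparing such a simulated co-occurrence rate with the observed one would give a baseline that respects the five-tag limit, at the cost of depending on the assumed drawing model.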

Expected:

girlfriend | love | 0.000018

Not expected:

girlfriend | multiplayer | 0.000002

Concerning:

girlfriend | news | -0.000002
girlfriend | blockchain | -0.000003
girlfriend | art | -0.000003
girlfriend | bitcoin | -0.000004
girlfriend | life | -0.000004
girlfriend | photography | -0.000004
girlfriend | cryptocurrency | -0.000004
girlfriend | esteem | -0.000004

The "multiplayer" and "girlfriend" pairing seems to come from exactly two discussions:

48325996 2018-05-13T08:28:33 emsonic 74faccc0-5686-11e8-8cf9-1b93be2e52ab 1195260721 S dlive dlive-video game ocaleni gameplay multiplayer girlfriend dziewczyna
48336725 2018-05-13T10:17:18 emsonic 96952500-5696-11e8-8cf9-1b93be2e52ab 806375034 S dlive dlive-video game gaming gameplay girlfriend dziewczyna ocaleni multiplayer
