Steem Tag Covariance

markgritter (59) in #mathematics • 6 years ago (edited)

I looked at the Steem tags used in May 2018 using a covariance matrix. Covariance measures how strongly two random variables vary together: it is positive when they tend to be high or low at the same time, and negative when one tends to be high while the other is low. Here, the random variables are binary (0 or 1), indicating whether or not a tag is present, and our observations are 1172186 top-level discussions. To make the analysis manageable, I limited it to tags with 50 posts or more, which produced a set of 5180 tags.
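
For 0/1 variables like these, the sample covariance reduces to simple counting. The following is a minimal sketch rather than code from this analysis; the function name `binary_cov` and the toy vectors are made up for illustration. It computes the covariance of two tags from the number of discussions using each tag and the number using both, and checks the result against numpy.

```python
import numpy as np

def binary_cov(n_ab, n_a, n_b, n):
    """Sample covariance of two 0/1 indicators from counts.

    n_ab: discussions using both tags, n_a / n_b: discussions using each tag,
    n: total number of discussions.
    """
    return (n_ab - n_a * n_b / n) / (n - 1)

# Toy data (illustrative only): 6 discussions, two tags.
a = np.array([1, 1, 0, 1, 0, 0])
b = np.array([1, 0, 0, 1, 0, 1])

from_counts = binary_cov(int((a & b).sum()), int(a.sum()), int(b.sum()), len(a))
from_numpy = np.cov(a, b)[0, 1]  # np.cov uses the same n - 1 normalisation by default

# Variance of a tag that appears on 50 of 1172186 discussions.
rare_tag_variance = binary_cov(50, 50, 50, 1172186)
```

With these counts the rare tag's variance comes out at about 4.3e-5, so any covariance involving a tag near the 50-post cutoff is bound to be small in absolute terms.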

The sample covariance can be calculated from this set of observations with some linear algebra. The observations can be represented as a 1172186x5180 matrix, fortunately a very sparse one. Each row is one discussion, with '1's set in the columns corresponding to the tags attached to that discussion. Using the built-in numpy covariance function would convert this to a dense matrix and thus run out of memory. But I found a calculation on Stack Overflow that preserves sparseness, though the resulting covariance matrix is not itself sparse:

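    # Compute the covariance matrix without densifying A.
    # For the dimensions to work out, A must hold one row per tag and one
    # column per discussion (the transpose of the matrix described above).
    n = A.shape[1]                  # number of observations (discussions)
    rowsum = A.sum(1)               # per-tag counts, a dense column vector
    centering = rowsum.dot(rowsum.T.conjugate()) / n
    C = (A.dot(A.T.conjugate()) - centering) / (n - 1)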

I thought I was going to be able to produce a nice heatmap of the results, but unfortunately most of the values are so close to zero that it just looks like a single color! The covariance tells us which tags are most likely to be used together; that is, which tags predict the presence of other tags. The i,j entry in the matrix calculated above indicates the covariance between tags i and j.
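
The sparse calculation quoted above can be sanity-checked on a toy matrix before running it on the full data. This is a minimal sketch, not the code used for this post: the function name `sparse_cov`, the toy data, and the use of `scipy.sparse` are assumptions. It expects one row per tag and one column per discussion, and compares the result with numpy's dense `np.cov`.

```python
import numpy as np
from scipy import sparse

def sparse_cov(A):
    """Sample covariance between the rows of a sparse matrix A.

    A has one row per variable (tag) and one column per observation
    (discussion). Only k x k arrays are ever made dense.
    """
    n = A.shape[1]
    rowsum = np.asarray(A.sum(axis=1)).reshape(-1, 1)  # per-tag counts, shape (k, 1)
    centering = rowsum @ rowsum.T / n                  # n * mean_i * mean_j
    gram = (A @ A.T).toarray()                         # co-occurrence counts, k x k
    return (gram - centering) / (n - 1)

# Toy observations (illustrative only): 6 discussions x 3 tags.
X = sparse.csr_matrix(np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
], dtype=np.float64))

C = sparse_cov(X.T.tocsr())  # transpose so that rows are tags
print(np.allclose(C, np.cov(X.toarray(), rowvar=False)))  # True
```

The only dense arrays are the tag-by-tag co-occurrence matrix and the outer product of the tag counts, so memory grows with the square of the number of tags (about 215 MB in float64 for 5180 tags) rather than with the number of discussions.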

The highest covariances in this data set are the tags that are most often used together:

a	b	Cov(a,b)
blog	life	0.020875335264099508
life	photography	0.02387408675776753
nature	photography	0.02543577003133139
cryptocurrency	bitcoin	0.03138339076182716

But which tags have the lowest covariance? A selection of tags that are most negatively covariant against other tags:

a	b	Cov(a,b)
cryptocurrency	life	-0.010134353519518548
bitcoin	life	-0.00900484452329558
busy	esteem	-0.008966313762135844
photography	cryptocurrency	-0.008838014354554504
blockchain	life	-0.006794955044753489
news	photography	-0.005262038868601199
crypto	life	-0.005215813170776201
game	life	-0.0041995541146152395
ico	life	-0.004184116319089055
meme	photography	-0.0041386081179727655
dlive	photography	-0.00406223529799713
dtube	photography	-0.0038268607171862527
spanish	esteem	-0.003809049708942849

That is, a discussion that uses #cryptocurrency is somewhat less likely to also use #life or #photography. Unsurprisingly, app tags are common here: nobody uses both #busy and #esteem at the same time.
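
Rankings like the two tables above can be pulled out of a dense covariance matrix by sorting its off-diagonal entries. The sketch below is one way to do that, not necessarily how these tables were produced; `extreme_pairs` is a hypothetical helper. The three off-diagonal values are copied (rounded) from the tables above, while the diagonal is a placeholder zero, because the variances were not reported and are ignored by the ranking anyway.

```python
import numpy as np

def extreme_pairs(C, tags, k=3):
    """Return (highest, lowest) lists of (tag_a, tag_b, covariance).

    Only distinct pairs are considered: the upper triangle of C,
    excluding the diagonal (each tag's own variance).
    """
    i, j = np.triu_indices(C.shape[0], k=1)
    vals = np.asarray(C)[i, j]
    order = np.argsort(vals)  # ascending

    def rows(idx):
        return [(tags[i[p]], tags[j[p]], float(vals[p])) for p in idx]

    return rows(order[::-1][:k]), rows(order[:k])

tags = ["life", "photography", "cryptocurrency"]
C = np.array([
    [0.0,       0.023874, -0.010134],
    [0.023874,  0.0,      -0.008838],
    [-0.010134, -0.008838, 0.0],
])  # diagonal is a placeholder, not a real variance

highest, lowest = extreme_pairs(C, tags, k=1)
```

Here `highest` holds the life/photography pair and `lowest` holds the life/cryptocurrency pair, matching the ordering in the tables.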

What about our friend #steemstem?

Tag	Most positively covariant tags	Covariance
steemstem	science	0.001116
steemstem	stemng	0.000448
steemstem	steemiteducation	0.000357
steemstem	technology	0.000330
steemstem	stem-espanol	0.000226
steemstem	health	0.000220
steemstem	education	0.000217
steemstem	nigeria	0.000211
steemstem	wafrica	0.000206
steemstem	spanish	0.000140
steemstem	stach	0.000124
steemstem	adsactly	0.000124
steemstem	engineering	0.000110
steemstem	physics	0.000107
steemstem	air-clinic	0.000098
steemstem	mathematics	0.000084
steemstem	biology	0.000080
steemstem	psychology	0.000074
steemstem	ocd-resteem	0.000066
steemstem	astronomy	0.000062

Tag	Most negatively covariant tags	Covariance
steemstem	game	-0.000062
steemstem	blog	-0.000062
steemstem	steem	-0.000065
steemstem	dlive	-0.000067
steemstem	steepshot	-0.000071
steemstem	crypto	-0.000072
steemstem	blockchain	-0.000075
steemstem	kr	-0.000078
steemstem	meme	-0.000082
steemstem	busy	-0.000083
steemstem	photo	-0.000084
steemstem	travel	-0.000094
steemstem	funny	-0.000103
steemstem	esteem	-0.000108
steemstem	art	-0.000113
steemstem	bitcoin	-0.000134
steemstem	steemit	-0.000137
steemstem	cryptocurrency	-0.000137
steemstem	life	-0.000293
steemstem	photography	-0.000300

With statistical tools like this, we can start to extract some semantic information, as well as better understand how people use Steem.

Tags are constrained in a way that is hard to model: a discussion can normally have at most five of them. (Using the API directly appears to allow more.) This means that the behavior of tags is not fully captured by the pairwise behavior we have been looking at here. So while I am trying to model them using a Bayesian approach, I have no idea what the underlying "expected" distribution is when five different tags are chosen according to their individual relative frequencies. My question asking for help on Quora has thus far gone unanswered: https://www.quora.com/I-have-a-deck-consisting-of-n_k-copies-of-the-card-k-How-do-I-calculate-the-probability-of-drawing-a-five-card-hand-containing-two-specified-cards-a-and-b-What-if-a-hand-may-have-no-duplicates
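
Lacking a closed form, one way to get a feel for such a baseline is simulation. The sketch below is not an answer to the question above; it assumes one particular null model, in which the five tags are drawn one at a time with probability proportional to their weights and already-drawn tags are excluded. Other readings of "no duplicates" would give different numbers. The function name `cooccurrence_mc` and the uniform toy weights are made up for illustration.

```python
import numpy as np

def cooccurrence_mc(weights, a, b, hand=5, trials=20000, seed=0):
    """Monte Carlo estimate of P(tags a and b both appear in one hand).

    Assumed null model: `hand` distinct tags are drawn sequentially,
    each with probability proportional to its weight among the tags
    not yet drawn. a and b are integer tag indices.
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    hits = 0
    for _ in range(trials):
        drawn = rng.choice(len(p), size=hand, replace=False, p=p)
        if a in drawn and b in drawn:
            hits += 1
    return hits / trials

# Sanity check with 10 equally likely tags, where the exact answer is
# C(8, 3) / C(10, 5) = 56 / 252 = 2/9.
estimate = cooccurrence_mc(np.ones(10), a=0, b=1)
```

With real tag frequencies as weights, the estimate for a given pair could be compared with the observed co-occurrence rate, keeping in mind that the comparison is only as good as the assumed drawing model.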

#statistics #covariance #steem #tags

markgritter (59) 6 years ago

Expected:

girlfriend | love | 0.000018

Not expected:

girlfriend | multiplayer | 0.000002

Concerning:

markgritter (59) 6 years ago

"Multiplayer girlfriend" seems to refer to exactly two discussions:

48325996 2018-05-13T08:28:33 emsonic 74faccc0-5686-11e8-8cf9-1b93be2e52ab 1195260721 S dlive dlive-video game ocaleni gameplay multiplayer girlfriend dziewczyna
48336725 2018-05-13T10:17:18 emsonic 96952500-5696-11e8-8cf9-1b93be2e52ab 806375034 S dlive dlive-video game gaming gameplay girlfriend dziewczyna ocaleni multiplayer
