Steem Tag Covariance
I looked at the Steem tags used in May 2018 using a covariance matrix. Covariance measures how two random variables vary together: it is positive when they tend to be high together and negative when one tends to be high while the other is low. Here, the random variables are binary (0 or 1), indicating whether or not a tag is present, and the observations are 1172186 top-level discussions. To keep the analysis manageable, I limited it to tags with 50 posts or more, which produced a set of 5180 tags.
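For 0/1 indicators, the sample covariance reduces to simple counts: if tag a appears on n_a of the n posts, tag b on n_b, and both together on n_ab, then Cov(a,b) = (n_ab - n_a*n_b/n) / (n - 1). Here is a small sketch with made-up counts; the function names are my own and purely illustrative:

```python
def binary_covariance(n, n_a, n_b, n_ab):
    """Sample covariance of two 0/1 indicators, computed from counts alone."""
    return (n_ab - n_a * n_b / n) / (n - 1)


def covariance_from_rows(xs, ys):
    """Textbook sample covariance, used here only as a cross-check."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Made-up example: 10 posts, tag a on 4 of them, tag b on 3, both on 2.
a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
b = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

from_counts = binary_covariance(n=10, n_a=sum(a), n_b=sum(b),
                                n_ab=sum(x * y for x, y in zip(a, b)))
from_rows = covariance_from_rows(a, b)
```

Both routes give the same number, which is why the covariance of tags can be read as "how much more (or less) often two tags co-occur than their individual frequencies would suggest".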
The sample covariance can be calculated from this set of observations with some linear algebra. The observations can be represented as a 1172186x5180 matrix, fortunately a very sparse one. Each row is one discussion, with 1s set in the columns corresponding to the tags attached to that discussion. Using the built-in numpy covariance function (numpy.cov) would convert this to a dense matrix of about 6 billion entries and thus run out of memory. But I found a calculation on Stack Overflow that keeps the input sparse, though the resulting 5180x5180 covariance matrix is not itself sparse:
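# Compute the covariance matrix without densifying A.
# Note: this formula expects one row per variable, so A here is the
# sparse tag-by-discussion matrix (the transpose of the layout above).
n = A.shape[1]     # number of observations (discussions)
rowsum = A.sum(1)  # per-tag post counts, as a dense column
centering = rowsum.dot(rowsum.T.conjugate()) / n
C = (A.dot(A.T.conjugate()) - centering) / (n - 1)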
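To check that this calculation agrees with the ordinary dense one, here is a minimal sketch on toy data. The six posts and their tags are invented for illustration, and building the matrix with scipy.sparse.csr_matrix is my assumption about a reasonable way to do it, not necessarily how the real matrix was built:

```python
import numpy as np
from scipy import sparse

# Toy data: 6 discussions with hypothetical tags (one list per discussion).
posts = [["life", "photography"], ["bitcoin", "cryptocurrency"], ["life"],
         ["photography", "life"], ["bitcoin"], ["cryptocurrency", "bitcoin"]]
tags = sorted({t for p in posts for t in p})
col = {t: j for j, t in enumerate(tags)}

# Observation matrix X: one row per discussion, one column per tag.
rows = [i for i, p in enumerate(posts) for _ in p]
cols = [col[t] for p in posts for t in p]
X = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)),
                      shape=(len(posts), len(tags)))

# The sparse formula wants variables in rows, so work on the transpose.
A = X.T.tocsr()
n = A.shape[1]
rowsum = A.sum(1)
centering = rowsum.dot(rowsum.T) / n
C = np.asarray((A.dot(A.T) - centering) / (n - 1))

# Dense reference; only feasible because the toy matrix is tiny.
expected = np.cov(X.toarray(), rowvar=False)
```

On this toy input, `np.allclose(C, expected)` holds, and only the small tags-by-tags result is ever stored densely.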
I thought I would be able to produce a nice heatmap of the results, but unfortunately most of the values are so close to zero that it just looks like a single color! The covariance tells us which tags tend to be used together: a positive value means two tags co-occur more often than they would if they were used independently, and a negative value means they co-occur less often. Because covariance is not normalized, pairs of very popular tags dominate both ends of the ranking. The i,j entry in the matrix calculated above is the covariance between tags i and j.
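One way to pull the most extreme pairs out of a dense covariance matrix is to sort its upper triangle, so that each unordered pair is considered once and the diagonal (the variances) is skipped. This is only a sketch: `rank_pairs` is a hypothetical helper, and the 3x3 matrix is made up to keep the example small.

```python
import numpy as np


def rank_pairs(C, names, top=3):
    """Return the `top` highest and `top` lowest off-diagonal covariances."""
    i, j = np.triu_indices(C.shape[0], k=1)  # each unordered pair exactly once
    order = np.argsort(C[i, j])              # ascending by covariance

    def fmt(positions):
        return [(names[i[p]], names[j[p]], float(C[i[p], j[p]]))
                for p in positions]

    return fmt(order[::-1][:top]), fmt(order[:top])


# Hypothetical covariance matrix for three tags "a", "b", "c".
names = ["a", "b", "c"]
C_small = np.array([[0.20, 0.05, -0.03],
                    [0.05, 0.10, 0.01],
                    [-0.03, 0.01, 0.30]])
highest, lowest = rank_pairs(C_small, names, top=1)
```

For a 5180x5180 matrix the upper triangle has about 13.4 million entries, so a full sort is still cheap; `np.argpartition` would be an alternative if only a handful of extremes were needed.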
The highest covariances in this data set belong to the tag pairs that are most often used together:
a | b | Cov(a,b) |
---|---|---|
blog | life | 0.020875335264099508 |
life | photography | 0.02387408675776753 |
nature | photography | 0.02543577003133139 |
cryptocurrency | bitcoin | 0.03138339076182716 |
But which tags have the lowest covariance? Here is a selection of the most negatively covariant tag pairs:
a | b | Cov(a,b) |
---|---|---|
cryptocurrency | life | -0.010134353519518548 |
bitcoin | life | -0.00900484452329558 |
busy | esteem | -0.008966313762135844 |
photography | cryptocurrency | -0.008838014354554504 |
blockchain | life | -0.006794955044753489 |
news | photography | -0.005262038868601199 |
crypto | life | -0.005215813170776201 |
game | life | -0.0041995541146152395 |
ico | life | -0.004184116319089055 |
meme | photography | -0.0041386081179727655 |
dlive | photography | -0.00406223529799713 |
dtube | photography | -0.0038268607171862527 |
spanish | esteem | -0.003809049708942849 |
That is, a discussion that uses #cryptocurrency is somewhat less likely to also use #life or #photography. Unsurprisingly, app tags are common here: hardly anybody uses both #busy and #esteem at the same time.
What about our friend #steemstem?
#steemstem | Highest-covariance tags | Covariance |
---|---|---|
steemstem | science | 0.001116 |
steemstem | stemng | 0.000448 |
steemstem | steemiteducation | 0.000357 |
steemstem | technology | 0.000330 |
steemstem | stem-espanol | 0.000226 |
steemstem | health | 0.000220 |
steemstem | education | 0.000217 |
steemstem | nigeria | 0.000211 |
steemstem | wafrica | 0.000206 |
steemstem | spanish | 0.000140 |
steemstem | stach | 0.000124 |
steemstem | adsactly | 0.000124 |
steemstem | engineering | 0.000110 |
steemstem | physics | 0.000107 |
steemstem | air-clinic | 0.000098 |
steemstem | mathematics | 0.000084 |
steemstem | biology | 0.000080 |
steemstem | psychology | 0.000074 |
steemstem | ocd-resteem | 0.000066 |
steemstem | astronomy | 0.000062 |
And the tags that are least likely to accompany #steemstem:
#steemstem | Lowest-covariance tags | Covariance |
---|---|---|
steemstem | game | -0.000062 |
steemstem | blog | -0.000062 |
steemstem | steem | -0.000065 |
steemstem | dlive | -0.000067 |
steemstem | steepshot | -0.000071 |
steemstem | crypto | -0.000072 |
steemstem | blockchain | -0.000075 |
steemstem | kr | -0.000078 |
steemstem | meme | -0.000082 |
steemstem | busy | -0.000083 |
steemstem | photo | -0.000084 |
steemstem | travel | -0.000094 |
steemstem | funny | -0.000103 |
steemstem | esteem | -0.000108 |
steemstem | art | -0.000113 |
steemstem | bitcoin | -0.000134 |
steemstem | steemit | -0.000137 |
steemstem | cryptocurrency | -0.000137 |
steemstem | life | -0.000293 |
steemstem | photography | -0.000300 |
With statistical tools like this, we can start to extract some semantic information, as well as better understand how people use Steem.
Tags are constrained in a way that is hard to model: a discussion can normally have at most five of them. (Using the API directly appears to allow more.) This means that the behavior of tags is not fully captured by the pairwise behavior that we have been looking at here. So while I am trying to model them using a Bayesian approach, I do not know what the underlying "expected" distribution of tag pairs is when five distinct tags are chosen, each in proportion to its overall frequency. My question asking for help on Quora has thus far gone unanswered: https://www.quora.com/I-have-a-deck-consisting-of-n_k-copies-of-the-card-k-How-do-I-calculate-the-probability-of-drawing-a-five-card-hand-containing-two-specified-cards-a-and-b-What-if-a-hand-may-have-no-duplicates
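Lacking a closed-form answer, one option is to estimate the baseline by simulation. The sketch below assumes one particular model, which may not be the right one: tags are drawn one at a time in proportion to their weight, and a tag already chosen cannot be drawn again. The weights are hypothetical, not taken from the data set, and the function names are my own.

```python
import random
from collections import Counter
from itertools import combinations


def draw_tags(weights, k, rng):
    """Draw k distinct tags one at a time, each proportionally to its weight."""
    remaining = dict(weights)
    chosen = []
    for _ in range(k):
        candidates = list(remaining)
        pick = rng.choices(candidates,
                           weights=[remaining[t] for t in candidates])[0]
        chosen.append(pick)
        del remaining[pick]  # no duplicates within one discussion
    return chosen


def simulate_pair_frequency(weights, k=5, trials=20000, seed=0):
    """Estimate how often each tag pair lands in the same k-tag discussion."""
    rng = random.Random(seed)
    pair_counts = Counter()
    for _ in range(trials):
        hand = draw_tags(weights, k, rng)
        for pair in combinations(sorted(hand), 2):
            pair_counts[pair] += 1
    return {pair: count / trials for pair, count in pair_counts.items()}


# Hypothetical relative tag popularities (assumption, for illustration only).
weights = {"life": 50, "photography": 40, "bitcoin": 30, "art": 20,
           "news": 10, "travel": 10, "funny": 5, "science": 5}
freq = simulate_pair_frequency(weights)
```

The simulated pair frequencies could then serve as the "expected" co-occurrence under independence-with-a-cap, against which the observed counts are compared. Conditioning a whole five-card hand on having no duplicates, as in the Quora phrasing, is a slightly different model and would give different numbers.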
Expected:
girlfriend | love | 0.000018
Not expected:
girlfriend | multiplayer | 0.000002
Concerning:
girlfriend | news | -0.000002
girlfriend | blockchain | -0.000003
girlfriend | art | -0.000003
girlfriend | bitcoin | -0.000004
girlfriend | life | -0.000004
girlfriend | photography | -0.000004
girlfriend | cryptocurrency | -0.000004
girlfriend | esteem | -0.000004
The "multiplayer" and "girlfriend" pairing seems to come from exactly two discussions, both by the same author:
48325996 2018-05-13T08:28:33 emsonic 74faccc0-5686-11e8-8cf9-1b93be2e52ab 1195260721 S dlive dlive-video game ocaleni gameplay multiplayer girlfriend dziewczyna
48336725 2018-05-13T10:17:18 emsonic 96952500-5696-11e8-8cf9-1b93be2e52ab 806375034 S dlive dlive-video game gaming gameplay girlfriend dziewczyna ocaleni multiplayer