Analysis of non-English communities on Steem

in utopian-io •  2 months ago

Repository

https://github.com/steemit/steem

Introduction

There are many communities on the Steem network that use a language other than English. For example, I am actively involved in the #polish community using the @jacekw account. The purpose of this analysis is to find the most active communities of this type, compare them, and observe how they have changed over time.

Outline

  • Scope of the analysis
  • Tools
  • Verification of initial data
  • Different tags for the same languages
  • Posts
  • Author rewards
  • Average payout per post
  • Authors
  • Tags with prefixes
  • Conclusions
  • Proof of work

Scope of the analysis

The data was downloaded from the SteemSQL database (Comments table) and covers the first 6 months of 2018. The following query was used to download the posts for a given tag (here, spanish).

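-- Top-level posts (depth = 0) tagged "spanish", created in the first half of 2018
SELECT url, total_payout_value, active_votes, json_metadata, created, body_language
FROM Comments c WITH (NOLOCK)
WHERE depth = 0 AND
      -- CONTAINS narrows the rows via the full-text index; LIKE then requires the quoted tag
      (CONTAINS(json_metadata, 'spanish') AND json_metadata LIKE '%"spanish"%') AND
      created >= '20180101' AND
      created < '20180701'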
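
The LIKE filter in the query matches the quoted word anywhere in json_metadata, not only in the tag list. A stricter check can be done after download. The sketch below is not part of the original scripts; it assumes that json_metadata is a JSON object that stores the tags under a "tags" key, and has_tag is a hypothetical helper name.

```python
import json


def has_tag(json_metadata, tag):
    """Return True only if `tag` is listed in the "tags" array of the metadata."""
    try:
        meta = json.loads(json_metadata)
    except (TypeError, ValueError):
        return False  # missing or malformed metadata
    if not isinstance(meta, dict):
        return False
    tags = meta.get("tags", [])
    if isinstance(tags, str):
        # Assumption: a plain string holds space-separated tags
        tags = tags.split()
    if not isinstance(tags, (list, tuple)):
        return False
    return tag in tags
```

Applied to the downloaded rows, this would drop posts where the word only appears in some other metadata field.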

Tools

Verification of initial data

Candidate language-community tags were selected manually from the 1000 most popular tags (over a 14-day period).

lang_tags = [
    'indonesia', 'spanish', 'aceh', 'kr', 'cervantes', 'cn', 'deutsch', 'castellano', 'venezuela',
    'tr', 'polish', 'fr', 'myanmar', 'japanese', 'ru', 'pt', 'thai', 'ua',
    'morocco', 'arab', 'pilipinas', 'steemit-austria', 'mexico', 'vn', 'rusteemteam', 'cesky', 'bangladesh',
    'russian', 'hindi', 'br', 'arabic', 'teamserbia', 'steemromania', 'teamukraine', 'filipino', 'serbia'
]

There are different conventions here:

In my opinion, the name of a country is not the best choice for a language community tag, because it is the first tag that comes to mind when someone wants to publish a post about that country in English.
The tags have been selected manually, so additional verification is needed, as some of them may not be relevant to language communities at all.

For this purpose, I used the body_language column from the Comments table. Below you will find charts showing the share of individual languages within each tag.




We can see that some tags are dominated by English. Such tags will be omitted from further analysis. Since it is unclear to what extent the body_language column can be trusted, the threshold for the non-English share is set quite low, at 30%.

Rank  Tag              Ratio (%)  Lang
1     castellano       94.1       es
2     br               92.5       pt
3     cervantes        89.4       es
4     kr               87.4       ko
5     thai             87.3       th
6     spanish          87.0       es
7     teamukraine      85.5       uk
8     venezuela        84.7       es
9     pt               81.0       pt
10    polish           79.2       pl
11    myanmar          77.4       my
12    tr               68.7       tr
13    japanese         68.4       ja
14    deutsch          67.9       de
15    fr               65.1       fr
16    steemit-austria  63.1       de
17    rusteemteam      61.2       ru
18    cesky            59.7       cs
19    indonesia        53.2       id
20    cn               51.0       zh
21    arabic           50.6       ar
22    ru               48.3       ru
23    mexico           47.4       es
24    aceh             46.2       id
25    morocco          45.6       ar
26    arab             41.2       ar
27    ua               34.6       uk
28    pilipinas        33.1       tl
29    russian          32.9       ru
30    hindi            30.1       hi
----
31    filipino         19.7       tl
32    vn               17.8       vi
33    serbia           15.9       sr
34    bangladesh       15.4       bn
35    teamserbia       15.0       sr
36    steemromania     13.9       ro

The last 6 tags fall below the 30% threshold and are excluded from further analysis.
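
The filtering step in this section can be sketched roughly as follows. This is an illustration, not the original notebook code: it assumes that the ratio is the share of the most common non-English body_language value among a tag's posts, and the function names are hypothetical.

```python
from collections import Counter

THRESHOLD = 30.0  # percent, as chosen in the text


def dominant_language(languages):
    """Return (language, share in percent) of the most common non-English language."""
    counts = Counter(languages)
    total = sum(counts.values())
    non_english = {lang: n for lang, n in counts.items() if lang != "en"}
    if not non_english or total == 0:
        return None, 0.0
    lang = max(non_english, key=non_english.get)
    return lang, 100.0 * non_english[lang] / total


def keep_tag(languages, threshold=THRESHOLD):
    """Keep a tag when its dominant non-English language reaches the threshold."""
    _, share = dominant_language(languages)
    return share >= threshold
```

Here `languages` would be the list of body_language values of all posts downloaded for one tag.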

Different tags for the same languages

It also appears that some communities share a language. Let's look at how these tags relate to each other and how much their posts overlap.


es : castellano, cervantes, spanish, venezuela, mexico
pt : br, pt
uk : teamukraine, ua
de : deutsch, steemit-austria
ru : rusteemteam, ru, russian
id : indonesia, aceh
ar : arabic, morocco, arab
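
A grouping like the one above can be derived from the (tag, language) pairs of the ratio table. The sketch below is an assumption about how same_lang_dict might be built, using only a subset of the rows; group_by_language is a hypothetical helper name.

```python
from collections import defaultdict

# (tag, dominant language) pairs copied from the ratio table (subset)
tag_lang = [
    ("castellano", "es"), ("br", "pt"), ("cervantes", "es"),
    ("spanish", "es"), ("teamukraine", "uk"), ("venezuela", "es"),
    ("pt", "pt"), ("polish", "pl"), ("mexico", "es"), ("ua", "uk"),
]


def group_by_language(pairs):
    """Group tags by language and keep only languages shared by two or more tags."""
    groups = defaultdict(list)
    for tag, lang in pairs:
        groups[lang].append(tag)
    return {lang: tags for lang, tags in groups.items() if len(tags) >= 2}


same_lang_dict = group_by_language(tag_lang)
```

Languages with a single tag, such as pl, drop out because there is nothing to compare them with.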

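import matplotlib.pyplot as plt
from matplotlib_venn import venn2, venn3

def plot_venn(tags):
    """Draw a Venn diagram of the posts shared by two or three tags."""
    plt.figure(figsize=(6, 6))
    fn = venn2 if len(tags) == 2 else venn3
    # tag_urls_dict is assumed to map each tag to the set of its post URLs
    fn([tag_urls_dict[tag] for tag in tags], ['#' + t for t in tags])
    plt.show()

# same_lang_dict maps a language code to the tags that share that language
for lang_code, tags in same_lang_dict.items():
    if 2 <= len(tags) <= 3:
        plot_venn(tags)

# Spanish has five tags, so it is plotted as two groups of three
plot_venn(['spanish', 'cervantes', 'castellano'])
plot_venn(['spanish', 'mexico', 'venezuela'])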

Venn diagrams were plotted for the following tag groups:

pt : #br, #pt
uk : #teamukraine, #ua
de : #deutsch, #steemit-austria
ru : #rusteemteam, #ru, #russian
id : #indonesia, #aceh
ar : #arabic, #morocco, #arab
es : #spanish, #cervantes, #castellano
es : #spanish, #mexico, #venezuela

The #teamukraine tag is almost entirely contained in #ua, so it will be omitted from further analysis, especially since both tags serve the Ukrainian community. The situation is similar for #pt and #br, but #br appears to be a separate (Brazilian) community, so it is kept.
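
The statement that one tag is "contained" in another can be checked numerically as well as visually. The sketch below is illustrative only: the URL sets are made-up placeholders, and containment is a hypothetical helper that works on sets of post URLs such as those stored in tag_urls_dict.

```python
def containment(a, b):
    """Share of the posts in `a` that also appear in `b` (0.0 to 1.0)."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical post URLs, for illustration only
small_tag = {"/post-1", "/post-2", "/post-3"}
large_tag = {"/post-1", "/post-2", "/post-3", "/post-4", "/post-5"}

share_small_in_large = containment(small_tag, large_tag)
share_large_in_small = containment(large_tag, small_tag)
```

A value close to 1.0 in one direction and much lower in the other would correspond to one circle of the Venn diagram lying almost entirely inside the other.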

Posts

Let's see how many posts were added in each tag.

Rank  Tag              Posts
1     spanish          506603
2     indonesia        407362
3     aceh             364128
4     cervantes        336008
5     kr               249492
6     castellano       153505
7     venezuela        130094
8     cn               79019
9     deutsch          77069
10    tr               58423
11    polish           29788
12    ru               21432
13    japanese         20931
14    fr               20873
15    myanmar          20638
16    ua               18228
17    pt               18212
18    thai             17132
19    morocco          11397
20    steemit-austria  11280
21    arab             9373
22    mexico           6830
23    pilipinas        6441
24    rusteemteam      5873
25    teamukraine      4148
26    cesky            4057
27    russian          3266
28    br               2477
29    hindi            2001
30    arabic           1673



We can see great diversity here: the communities at the top of the list have several hundred times more posts than those at the bottom. The Spanish-language tags lead the list, which may indicate that Steem is well known in South America.

Author rewards

However, the number of posts is an insufficient indicator, because it is very easy to spam a given tag with low value posts, and this will not indicate the popularity of a given tag at all. Therefore, let's also look at the sum of rewards in individual tags.

Rank  Tag              Author rewards
1     kr               346032
2     spanish          286315
3     cervantes        194563
4     cn               159464
5     deutsch          97971
6     indonesia        97138
7     castellano       82621
8     tr               64475
9     aceh             57992
10    venezuela        38036
11    fr               37928
12    japanese         29508
13    pt               26778
14    steemit-austria  24372
15    ru               21037
16    myanmar          19279
17    polish           19049
18    ua               16448
19    morocco          12993
20    thai             12330
21    mexico           10204
22    arab             9940
23    br               8826
24    cesky            4375
25    pilipinas        3443
26    rusteemteam      3071
27    hindi            1326
28    russian          847
29    teamukraine      790
30    arabic           528



The table looks quite similar to the previous one. It is also worth looking at the average reward per post in each tag, which gives an idea of how rich each community is.

Average payout per post

Rank  Tag              Average author rewards per post
1     br               3.563
2     steemit-austria  2.161
3     cn               2.018
4     fr               1.817
5     mexico           1.494
6     pt               1.470
7     japanese         1.410
8     kr               1.387
9     deutsch          1.271
10    morocco          1.140
11    tr               1.104
12    cesky            1.078
13    arab             1.060
14    ru               0.982
15    myanmar          0.934
16    ua               0.902
17    thai             0.720
18    hindi            0.663
19    polish           0.639
20    cervantes        0.579
21    spanish          0.565
22    castellano       0.538
23    pilipinas        0.535
24    rusteemteam      0.523
25    arabic           0.316
26    venezuela        0.292
27    russian          0.259
28    indonesia        0.238
29    aceh             0.159



We can see that the differences are significant. It should also be taken into account, however, that tags with a small number of posts may not have reliable results.
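
The averages in this section appear to be the author rewards of a tag divided by its number of posts. A minimal sketch that reproduces three of the values from the figures in the two earlier tables (the dictionary and function names are illustrative, not taken from the original notebook):

```python
# Post counts and author rewards copied from the tables above (subset)
posts = {"br": 2477, "cn": 79019, "aceh": 364128}
rewards = {"br": 8826, "cn": 159464, "aceh": 57992}


def average_reward(tag):
    """Average author reward per post for a tag, rounded to three decimals."""
    return round(rewards[tag] / posts[tag], 3)


averages = {tag: average_reward(tag) for tag in posts}
```

For #br this gives 8826 / 2477, which rounds to 3.563. The high average rests on only 2477 posts, which is the reliability caveat mentioned above.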

Authors

We can also look at the popularity of a tag from a different angle: by the number of authors rather than posts. This indicator seems even better, because it is not inflated when a single user spams a very large number of posts.

Rank  Tag              Authors
1     spanish          22315
2     indonesia        20511
3     aceh             19852
4     cervantes        13775
5     kr               13761
6     venezuela        10520
7     castellano       7052
8     cn               5451
9     deutsch          4497
10    tr               3496
11    polish           2488
12    japanese         2020
13    mexico           1468
14    fr               1450
15    myanmar          1409
16    thai             1070
17    ru               1051
18    pt               799
19    russian          666
20    arab             646
21    morocco          590
22    pilipinas        475
23    hindi            469
24    steemit-austria  401
25    arabic           342
26    ua               330
27    rusteemteam      210
28    br               191
29    cesky            132
30    teamukraine      27

Now we see how small some communities are! And if a community does not exceed a certain threshold, its members may prefer to post in English, which makes the growth of such a community more difficult.
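
Counting authors instead of posts can be sketched as below. The (author, tag) records are made-up placeholders and authors_per_tag is a hypothetical helper; the point is only that repeated posts by the same account are counted once.

```python
from collections import defaultdict


def authors_per_tag(records):
    """Count unique authors per tag from (author, tag) pairs, one pair per post."""
    authors = defaultdict(set)
    for author, tag in records:
        authors[tag].add(author)
    return {tag: len(names) for tag, names in authors.items()}


# Hypothetical records: one account posts three times in the same tag
records = [
    ("alice", "polish"), ("alice", "polish"), ("alice", "polish"),
    ("bob", "polish"), ("carol", "kr"),
]
author_counts = authors_per_tag(records)
```

In this example #polish has four posts but only two authors.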



Tags with prefixes

A common problem in language communities (I say this as someone who is actively involved in the #polish community) is how to use the other tags. If I write a post in Polish and tag it #polish #bitcoin #cryptocurrency, the last two tags will also put the post in front of an English-speaking audience, so this is not a good solution.

Another idea is to use tags in the mother tongue, but some tags are valid words in both languages, e.g. #film, so this is not a good solution either. Therefore, the Polish community decided (following the convention adopted by the Korean community) to use tags with the prefix pl-. This makes it possible to keep posts in Polish separate from the English-speaking audience.

Let us therefore look at which other communities use such a convention. The table below shows how many unique tags use each prefix.

Rank  Prefix  Count
1     kr      2302
2     pl      815
3     cn      221
4     ru      168
5     jp      64
6     de      31
7     tr      24
8     pt      12
9     fr      8
10    ua      5
11    es      5

We can see that the Korean community has the largest number of such unique tags, followed by the Polish and Chinese communities. For the Polish community, a tree of these tags is available at https://steemweb.pl/categories (by @rafalski).
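
Counting unique tags per prefix can be sketched as follows. This is an assumption about the method, not the original notebook code: it treats any two lowercase letters followed by a hyphen as a prefix, which would also catch prefixes that are not language codes, so the result would still need manual review.

```python
import re
from collections import defaultdict

# Assumed pattern: two lowercase letters, a hyphen, then the rest of the tag
PREFIX_RE = re.compile(r"^([a-z]{2})-(.+)$")


def unique_tags_per_prefix(tags):
    """Count distinct tags for each two-letter prefix such as kr- or pl-."""
    by_prefix = defaultdict(set)
    for tag in tags:
        match = PREFIX_RE.match(tag)
        if match:
            by_prefix[match.group(1)].add(tag)
    return {prefix: len(names) for prefix, names in by_prefix.items()}


sample = ["kr-newbie", "kr-life", "kr-newbie", "pl-artykuly",
          "polish", "cn-reader", "bitcoin"]
prefix_counts = unique_tags_per_prefix(sample)
```

Repeated tags such as kr-newbie are counted once, and unprefixed tags such as polish are ignored.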

Let's see which tags of this type are the most popular.

Rank  Prefix-tag      Count
1     #kr-newbie      83291
2     #kr-life        23164
3     #kr-writing     19077
4     #kr-event       13164
5     #cn-reader      13133
6     #kr-daily       12263
7     #kr-art         8748
8     #kr-travel      7077
9     #kr-food        6037
10    #pl-artykuly    5987
11    #kr-pen         5678
12    #kr-coin        5494
13    #kr-join        5484
14    #kr-news        5011
15    #jp-newbie      4171
16    #kr-overseas    4055
17    #kr-diary       3812
18    #kr-gazua       3752
19    #kr-series      2980
20    #kr-funfun      2738
21    #kr-dev         2665
22    #kr-book        2601
23    #kr-youth       2563
24    #kr-story       2355
25    #cn-malaysia    1957
26    #kr-economy     1777
27    #kr-hobby       1708
28    #kr-steemit     1706
29    #kr-game        1693
30    #pl-fotografia  1612

As we can see, most of the tags are Korean.

Conclusions

We have managed to find 30 tags used by language communities that do not use English. There are probably more, but finding them is not simple because of the different tagging conventions that have been adopted. All these communities are probably waiting for the Communities feature, because it will make it easier for them to function.

It also appears that the size of a language community is not always correlated with the number of people who speak the language or with the size of the country. An example is #kr, which ranks near the top of the post, reward and author lists even though Korea has a relatively small population compared to, e.g., China.

Some communities use their mother tongue almost exclusively, while others have chosen to use English (e.g. #serbia). It is also worth remembering that the Russian community decided at some point to move to its own blockchain, https://golos.io, although this did not make it disappear from the Steem network completely. Interestingly, there are also some quite large countries that do not have a language community.

There are many other issues that could be explored here, such as what the community formation process looks like, who the pioneers are, and who the leaders are (if any).

Proof of work

Scripts used in this work (as Jupyter Notebook)

·

#polish is the bestest

·

#polish to the moon!

·

I have changed the way of calculating rewards so that the result is independent of the STEEM price and now this plot also looks interesting :)

I've just found this post of yours and was pleasantly surprised to see my native Ukrainian community mentioned.
You couldn't have known this, but there are several additional tags used in our community: #ukraine and #ua-by (because we have a few members from Belarus in our Discord community). I think this last tag is even more popular than #ua.

Just for info, in case you decide to make another report in the future ;-)
Good work, btw. I'm a bit late here, so I voted for your most recent post.

·

Thanks for the feedback!
Finding these tags was not easy, so I realize I could have missed some of them. And frankly, I just hoped that maybe there would be supplementary comments like yours :)

It is hard to analyze "payouts through time", because payouts are down in all communities... but mostly because of the drop in the STEEM price.

If you used STEEM as the unit for this chart, we could compare which national communities are growing and which are not.

·

I must admit that I was wondering how to present this. And converting to STEEM seems like a good idea, so I'll add it.

·

Hi @jacekw.dev, thanks for your contribution! Please keep in mind that analyses of social and behavioral aspects of an open source project are not directly in the scope of Utopian as per the guidelines. Nevertheless, you covered a broad range of tags in a remarkable piece of work to find the non-English community tags on Steem. I like the use of the Venn plots for the visualization of tag overlap! :)
The correlation of the payouts with the STEEM price was already mentioned in another comment. I think this was not changed to STEEM in the post yet? The corresponding STEEM value is available in the author_reward field as a multiple of 0.001 STEEM. But I think there may still be some variation due to changing reward pool size and vote activity. The most reliable metric in this regard would probably be to sum up the corresponding vote rshares...

[utopian-moderator]

·

Thanks for the valuable feedback! I have just changed the tables and charts so that the values are now independent of the STEEM price (using the author_rewards column). Now it looks much better :)

I would say this analysis reveals what's pretty obvious.
It's all about how many people speak your language.

If the audience you can hit with your native language is big enough, you will more likely prefer native language to comment & blog.

fantastic analysis - sorry I missed it when you posted it
