Build your own ChinaDaily-style ranking analyzer and find out where you rank in the CN community

in #cn, 4 years ago (edited)


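Motivation

I see ChinaDaily's data analysis report every day, but as a newcomer I never make the list, and I have no idea how long I would have to wait to learn where I stand in the CN (Chinese-language) community. I suspect plenty of others in the community want to know their rank too, even if it is somewhere in the hundreds.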

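Goal

ChinaDaily publishes three author lists, sorted by Rep, by estimated account value, and alphabetically.
By the end of this post you should be able to build a small ranking program that covers the same three.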

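Preparation

Data analysis needs data first. Because a blockchain is an open environment, getting network-wide data on Steem authors is easy, and there are plenty of ready-made options: SteemSQL, Steemd, SteemData, and the Steem API. For convenience we will use SteemData here. SteemData is a fully open, MongoDB-based database of the Steem blockchain. Besides author information it also holds all posts, votes, and more, and it can be queried through its own API and SDK, which makes it very convenient.

With the remote database chosen, we need a UI to browse the data so results are easy to compare and debug. We will use RoboMongo, which is free, open source, and offers a one-click install on all major platforms.

Once RoboMongo is installed, you can create a connection to SteemData directly: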

Host: mongo1.steemdata.com 
Port: 27017 
Database: SteemData 
Username: steemit 
Password: steemit 
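
The same settings can also be expressed as a MongoDB connection string for use from code. The sketch below is only an illustration: `buildMongoUrl` is a hypothetical helper name, not part of any library, and it simply assembles the values listed above.

```javascript
// Hypothetical helper: assembles a MongoDB connection string from the settings above.
function buildMongoUrl(settings) {
    // encodeURIComponent guards against reserved characters in the credentials
    var credentials = encodeURIComponent(settings.username) + ":" +
        encodeURIComponent(settings.password);
    return "mongodb://" + credentials + "@" + settings.host + ":" +
        settings.port + "/" + settings.database;
}

var steemDataUrl = buildMongoUrl({
    host: "mongo1.steemdata.com",
    port: 27017,
    database: "SteemData",
    username: "steemit",
    password: "steemit"
});

console.log(steemDataUrl);
```

Keeping the settings in one object makes it easy to point the program at a different database later without touching the rest of the code.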

The connection succeeds, and the records of all 331351 authors on the network are at your fingertips. Nice, isn't it?

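That completes the preparation. Now the fun begins.

Approach

Our goal is to find the users who were active in the CN community over the past 7 days, then sort them by rep, by estimated account value, and alphabetically. Let's take the problems one at a time.
Problem zero is finding the users active in the CN community over the past 7 days. This is easy: query the Posts collection for posts from the last 7 days that carry the cn tag, and define the set of their authors as the "CN users active in the past 7 days". Then query the Accounts collection with that set of user names to get all the data to process.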
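
As a rough sketch of problem zero, the snippet below builds the 7-day filter and counts posts per author on in-memory data. The helper names (`utcDayWindow`, `buildCnPostsQuery`, `countPostsByAuthor`) and the sample authors are made up for illustration, and the field names `tags`, `created`, and `author` are assumed to match the Posts collection.

```javascript
// Returns the window [start, end): the `days` full UTC days before the day containing `now`.
function utcDayWindow(now, days) {
    var end = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
    var start = new Date(end.getTime() - days * 24 * 60 * 60 * 1000);
    return { start: start, end: end };
}

// Builds the filter for posts tagged cn inside the last 7 full days.
function buildCnPostsQuery(now) {
    var range = utcDayWindow(now, 7);
    return { tags: "cn", created: { $gte: range.start, $lt: range.end } };
}

// Counts posts per author; the keys of the result are the active users.
function countPostsByAuthor(posts) {
    var counts = {};
    posts.forEach(function (post) {
        counts[post.author] = (counts[post.author] || 0) + 1;
    });
    return counts;
}

var cnQuery = buildCnPostsQuery(new Date("2017-08-30T11:53:00Z"));
console.log(cnQuery.created.$gte.toISOString()); // 2017-08-23T00:00:00.000Z
console.log(cnQuery.created.$lt.toISOString());  // 2017-08-30T00:00:00.000Z

// Made-up sample posts, for illustration only
var postCounts = countPostsByAuthor([
    { author: "alice" }, { author: "bob" }, { author: "alice" }
]);
console.log(Object.keys(postCounts).length); // 2
```

Computing the window from UTC midnight keeps the report boundaries stable no matter what time of day the program runs.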

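There are many small details worth discussing here. For example, if a post has the cn tag but not as its first tag, does its author count as an active CN user? Questions like this are easy to settle, so they are outside the scope of this post. If you are interested, leave a comment.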
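
To make the first-tag question concrete, here is a small sketch that compares the two definitions on in-memory data. It assumes each post document has a `tags` array whose first entry is the main tag. That shape is an assumption, and the helper names and sample posts are made up.

```javascript
// Assumption: each post document has a `tags` array and the first entry is its main tag.
function hasCnTag(post) {
    return Array.isArray(post.tags) && post.tags.indexOf("cn") !== -1;
}

function isPrimaryCnPost(post) {
    return Array.isArray(post.tags) && post.tags[0] === "cn";
}

// Made-up sample posts, for illustration only
var taggedPosts = [
    { author: "alice", tags: ["cn", "life"] },
    { author: "bob", tags: ["photography", "cn"] },
    { author: "carol", tags: ["travel"] }
];

var looseAuthors = taggedPosts.filter(hasCnTag).map(function (p) { return p.author; });
var strictAuthors = taggedPosts.filter(isPrimaryCnPost).map(function (p) { return p.author; });

console.log(looseAuthors);  // [ 'alice', 'bob' ]
console.log(strictAuthors); // [ 'alice' ]
```

If the tags really are stored as an array, the strict variant could probably also be pushed into the database query with MongoDB's array-index dot notation (`"tags.0"`), but that depends on the actual schema.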

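Problem one is sorting by rep. This is simple: the Accounts collection already has a rep field, so we just output it in descending order.
Problem two is computing the estimated account value. This is not hard either, as long as we know how it is calculated. After some googling and asking around, I found this formula: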

Value = (SP + Steem) * medianPrice + SBD

Here medianPrice can be obtained from the Steem API, or looked up by hand on steemd (search the page for the keyword: feed price). The other inputs, SP, Steem, and SBD, all come from the Accounts collection.
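
As a worked example of the formula, the sketch below plugs in the figures from the first row of the wealth ranking in the results further down. The helper names are made up, and the `{ base: "1.470 SBD", ... }` shape of the price feed result is an assumption based on how the code below reads it.

```javascript
// Assumed shape of the price feed result: { base: "1.470 SBD", quote: "1.000 STEEM" }
function parseMedianPrice(feed) {
    return parseFloat(feed.base.split(" ")[0]);
}

// Value = (SP + Steem) * medianPrice + SBD
function estimateAccountValue(account, medianPrice) {
    var total = account.balances.total;
    return (account.sp + total.STEEM) * medianPrice + total.SBD;
}

var price = parseMedianPrice({ base: "1.470 SBD", quote: "1.000 STEEM" });

// Figures from the first row of the wealth ranking in the results
var sampleAccount = {
    name: "davidding",
    sp: 103658.425,
    balances: { total: { STEEM: 1.31, SBD: 508.173 } }
};

console.log(estimateAccountValue(sampleAccount, price).toFixed(2)); // 152887.98
```
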
Problem three (alphabetical order) is not really a problem, so to save space I will not discuss it. Just see the code below.
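
For reference, the rep and alphabetical orderings boil down to two comparator functions passed to `Array.prototype.sort`. This is a minimal sketch on three sample records (names and rep values taken from the results further down); the comparator names are illustrative.

```javascript
// Higher rep first
function byRepDesc(a, b) {
    return b.rep - a.rep;
}

// Case-insensitive A to Z by account name
function byNameAsc(a, b) {
    var nameA = a.name.toLowerCase();
    var nameB = b.name.toLowerCase();
    if (nameA < nameB) { return -1; }
    if (nameA > nameB) { return 1; }
    return 0;
}

var sampleAuthors = [
    { name: "myfirst", rep: 72.23 },
    { name: "chinadaily", rep: 73.54 },
    { name: "ace108", rep: 71.51 }
];

function names(list) {
    return list.map(function (author) { return author.name; });
}

// slice() copies the array so each ordering starts from the same input
var namesByRep = names(sampleAuthors.slice().sort(byRepDesc));
var namesByAlphabet = names(sampleAuthors.slice().sort(byNameAsc));

console.log(namesByRep);      // [ 'chinadaily', 'myfirst', 'ace108' ]
console.log(namesByAlphabet); // [ 'ace108', 'chinadaily', 'myfirst' ]
```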

With the approach and the key ingredients in place, let's write the code.

Code

app.js

var MongoClient = require('mongodb').MongoClient, assert = require('assert');
var steem = require('steem');
var fs = require('fs');

// Connect DB
var url = 'mongodb://steemit:steemit@mongo1.steemdata.com:27017/SteemData';

var postsCount = 0;
var authorsCount = 0;
var authorObjects = {};
var medianPrice = 1;
var start = new Date();
var end = new Date();

var basepath = "/Users/bullda/Workspace/SteemBot/";

createOutputFile();

steem.api.getCurrentMedianHistoryPrice(function(err, result) {
    if (!err) {
        medianPrice = (result['base'].split(" "))[0];
        console.log("==== medianPrice ==== " + medianPrice);
        updateOutputFile("$medianPrice$", medianPrice);

        // Connect to the db
        MongoClient.connect(url, function(err, db) {
            assert.equal(null, err);
             console.log("Connected successfully to server");

            findAuthorByPosts(db, function(docs) {
                // get posts count
                postsCount = docs.length;
                console.log("==== postsCount ==== " + postsCount);
                updateOutputFile("$postsCount$", postsCount);

                // authors count & authors set
                for (var i = 0; i < docs.length; i++) {
                    authorObjects[docs[i]['author']] = 1 + (authorObjects[docs[i]['author']] || 0);
                }
                authorsCount = Object.keys(authorObjects).length;
                console.log("==== authorsCount ==== " + authorsCount);
                updateOutputFile("$authorsCount$", authorsCount);

                // authors details
                getAuthorsDetail(db, function(docs) {
                    // sort by value
                    var topValue = "";
                    var topValueLimit = 20;    // get top 20 value, you can change it as your wish

                    docs.sort(sortByValue);

                    // stop early if fewer accounts than the limit were returned
                    for (var i = 0; i < Math.min(topValueLimit, docs.length); i++) {
                        var name = docs[i]['name'];
                        var value = ((docs[i]['sp'] + docs[i]['balances']['total']['STEEM']) * medianPrice + docs[i]['balances']['total']['SBD']).toFixed(2);
                        var steemValue = docs[i]['balances']['total']['STEEM'];
                        var SPValue = docs[i]['sp'];
                        var SBDValue = docs[i]['balances']['total']['SBD'];
                        
                        topValue += ((i + 1) + "|@" + name + "|" + value + "|" + steemValue + "|" + SPValue + "|" + SBDValue + "\n");
                    }
                    console.log("==== topValue ==== " + topValue);
                    updateOutputFile("$topValue$", topValue);

                    // sort by rep
                    var topRep = "";
                    var topRepLimit = 20; // get top 20 rep, you can change it as your wish

                    docs.sort(sortByRep);

                    // stop early if fewer accounts than the limit were returned
                    for (var i = 0; i < Math.min(topRepLimit, docs.length); i++) {
                        var name = docs[i]['name'];
                        var rep = docs[i]['rep'];
                        var vp = (docs[i]['voting_power'] / 100).toFixed(2);
                        var online = daysBetween(new Date(), new Date(docs[i]['created']));
                        var value = ((docs[i]['sp'] + docs[i]['balances']['total']['STEEM']) * medianPrice + docs[i]['balances']['total']['SBD']).toFixed(2);
                        
                        topRep += ((i + 1) + "|@" + name + "|" + rep + "|" + vp + "|" + online + "|" + value + "\n");
                    }
                    console.log("==== topRep ==== " + topRep);
                    updateOutputFile("$topRep$", topRep);

                    db.close();    
                });
            });
        });
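    } else {
        // report the failure instead of exiting silently
        console.error("==== failed to get medianPrice ==== " + err);
    }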
});

function findAuthorByPosts(db, callback) {
    var collection = db.collection('Posts');

    start.setDate(start.getDate() - 7);
    start = new Date(start.toISOString().slice(0, 10));
    console.log("==== start ==== " + start.toISOString());
    updateOutputFile("$start$", start.toISOString().slice(0, 19));

    end = new Date(end.toISOString().slice(0, 10));
    console.log("==== end ==== " + end.toISOString());
    updateOutputFile("$end$", end.toISOString().slice(0, 19));
    updateOutputFile("$print$", end.toISOString().slice(0, 19));

     var query = { "tags":"cn" , "created":{$gte: start, $lt: end}};
     var fetchDataLimit = 5000;

     collection.find(query).skip(0).limit(fetchDataLimit).toArray(function(err, docs) {
        assert.equal(err, null);
        callback(docs);
     });
}

function getAuthorsDetail(db, callback) {
    var collection = db.collection('Accounts');

    var query = {"name": {$in: Object.keys(authorObjects)}};
    var fetchDataLimit = 5000;

    collection.find(query).skip(0).limit(fetchDataLimit).toArray(function(err, docs) {
        assert.equal(err, null);
        callback(docs);
     });
}

function daysBetween(date1, date2) {
    // The number of milliseconds in one day
    var ONE_DAY = 1000 * 60 * 60 * 24

    // Convert both dates to milliseconds
    var date1_ms = date1.getTime()
    var date2_ms = date2.getTime()

    // Calculate the difference in milliseconds
    var difference_ms = Math.abs(date1_ms - date2_ms)

    // Convert back to days and return
    return Math.round(difference_ms / ONE_DAY)
}

function createOutputFile() {
    var from = basepath + "cn_report_template.md";
    var to = basepath + "cn_" + end.toISOString().slice(0, 10);
    var content = fs.readFileSync(from, 'utf8');
    fs.writeFileSync(to, content);
}

function updateOutputFile(key, value) {
    var filepath = basepath + "cn_" + end.toISOString().slice(0, 10);
    var content = fs.readFileSync(filepath, 'utf8');
    content = content.replace(key, value);
    fs.writeFileSync(filepath, content);
    console.log("==== saved ==== " + key);
}

function sortByRep(authorA, authorB) {
    return (authorB['rep'] - authorA['rep']);
}

function sortByValue(authorA, authorB) {
    var valueA = (authorA['sp'] + authorA['balances']['total']['STEEM']) * medianPrice + authorA['balances']['total']['SBD'];
    var valueB = (authorB['sp'] + authorB['balances']['total']['STEEM']) * medianPrice + authorB['balances']['total']['SBD'];

    return (valueB - valueA);
}

function sortByAlphabeta(authorA, authorB) {
    var nameA = authorA['name'].toLowerCase();
    var nameB = authorB['name'].toLowerCase();

     if (nameA < nameB) {
         return -1;
     } else if (nameA > nameB) {
         return 1;
     } else {
         return 0;
     }
}

cn_report_template.md

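# Information

Data source: https://steemdata.com/
Generated at: `$print$(UTC)`
Time range: `$start$(UTC)` to `$end$(UTC)`
Users covered: users who posted in the CN community in the past 7 days (posts with the `cn` tag)
Statistics: `$postsCount$` posts, `$authorsCount$` posting users
Median price used: $`$medianPrice$` / Steem

# Wealth ranking (sorted by Estimated Account Value)
Rank.|ID|Value|STEEM|SP|SBD
----|----|----|----|----|----
$topValue$

# Reputation ranking (sorted by Reputation score)
Rank.|ID|Rep|VP(%)|Online|Value
----|----|----|----|----|----
$topRep$

# Notes
* **SP**: Steem Power. The more you hold, the larger your voting rewards and the more your votes affect other people's post payouts
* **Value**: Estimated Account Value
* **Rep**: Reputation, the score the Steemit site uses to measure a user's reputation
* **VP**: Voting Power, the user's remaining voting energy. The maximum is 100%; it drops with each vote and recovers slowly over time
* **Online**: Online days, the number of days since the account was registered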
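
The `$name$` placeholders in this template are filled in one at a time by `updateOutputFile` in app.js, which rereads and rewrites the output file for every key. One possible alternative, sketched here under the assumption that all values are known before writing, is to substitute everything in memory in a single pass. `renderTemplate` is a hypothetical helper name.

```javascript
// Replaces every $key$ placeholder that has an entry in `values`;
// placeholders without a value are left untouched.
function renderTemplate(template, values) {
    return template.replace(/\$([A-Za-z]+)\$/g, function (match, key) {
        return Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : match;
    });
}

var sampleTemplate = "Statistics: `$postsCount$` posts, `$authorsCount$` posting users\n" +
    "Median price used: $`$medianPrice$` / Steem";

var rendered = renderTemplate(sampleTemplate, {
    postsCount: 1525,
    authorsCount: 477,
    medianPrice: "1.470"
});

console.log(rendered);
// Statistics: `1525` posts, `477` posting users
// Median price used: $`1.470` / Steem
```

Using a replacer function also avoids surprises if a value ever contains a `$` sequence that `String.prototype.replace` would otherwise treat as a special pattern.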

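Results

Information

Data source: https://steemdata.com/
Generated at: 2017-08-30T00:00:00(UTC)
Time range: 2017-08-23T00:00:00(UTC) to 2017-08-30T00:00:00(UTC)
Users covered: users who posted in the CN community in the past 7 days (posts with the cn tag)
Statistics: 1525 posts, 477 posting users
Median price used: $1.470 / Steem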

Wealth ranking (sorted by Estimated Account Value)

Rank.|ID|Value|STEEM|SP|SBD
----|----|----|----|----|----
1|@davidding|152887.98|1.31|103658.425|508.173
2|@czechglobalhosts|139632.74|0|94976.316|17.559
3|@nextgen622|46068.22|1316.06|30020.918|2.861
4|@rea|43086.56|0|29129.515|266.175
5|@lawrenceho84|25689.53|0|17438.947|54.273
6|@ace108|23363.75|1147.748|12455.833|3366.483
7|@oflyhigh|23066.66|18.13|15375.857|437.503
8|@bullionstackers|21950.25|10.217|13861.403|1558.971
9|@deanliu|20345.52|199.286|12068.598|2311.729
10|@arcange|19033.81|0|12948.169|0
11|@helene|18683.94|21.145|11977.389|1046.098
12|@chhaylin|17609.60|0|11969.56|14.344
13|@isaaclab|17047.10|504.001|10780.412|459.014
14|@lemooljiang|12346.67|22.952|8298.25|114.504
15|@joythewanderer|12055.16|0|7441.168|1116.645
16|@skt1|11495.57|0|7681.671|203.517
17|@rivalhw|11392.27|0|6532.065|1790.136
18|@shieha|10773.01|0|7304.605|35.24
19|@myfirst|9460.14|0.45|6361.966|107.389
20|@chinadaily|9440.92|488.567|5805.647|188.422

Reputation ranking (sorted by Reputation score)

Rank.|ID|Rep|VP(%)|Online|Value
----|----|----|----|----|----
1|@chinadaily|73.54|89.71|396|9440.92
2|@myfirst|72.23|31.95|403|9460.14
3|@elfkitchen|72.21|90.98|401|4625.23
4|@oflyhigh|72.17|77.70|401|23066.66
5|@birds90|71.73|81.65|358|4708.97
6|@ace108|71.51|53.52|409|23363.75
7|@deanliu|70.93|27.34|412|20345.52
8|@helene|70.34|48.00|396|18683.94
9|@shieha|69.97|64.09|383|10773.01
10|@rivalhw|69.94|68.50|398|11392.27
11|@lemooljiang|69.91|73.14|410|12346.67
12|@bullionstackers|69.68|82.12|408|21950.25
13|@rea|68.36|96.06|412|43086.56
14|@cnfund|67.88|94.12|379|6427.19
15|@arcange|67.63|56.85|413|19033.81
16|@germanlifestyle|67.58|96.80|267|4691.74
17|@jademont|67.44|86.01|472|7781.51
18|@jubi|66.79|30.61|235|3552.68
19|@justyy|66.75|63.71|374|5151.56
20|@curiesea|66.72|38.84|345|1568.75

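Notes

  • SP: Steem Power. The more you hold, the larger your voting rewards and the more your votes affect other people's post payouts
  • Value: Estimated Account Value
  • Rep: Reputation, the score the Steemit site uses to measure a user's reputation
  • VP: Voting Power, the user's remaining voting energy. The maximum is 100%; it drops with each vote and recovers slowly over time
  • Online: Online days, the number of days since the account was registered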

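The layout was adjusted slightly: level-1 headings became level-2 headings.
To save space, the alphabetical ranking of all users was left out of the results.
If you want to find your own rank (assuming you are outside the top 20), raise the limit in the code. If you run into problems, leave a comment and I will reply.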
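
Instead of raising the limit and scanning the output by eye, one could also look up a single account's position directly. The sketch below is one possible way to do that; `findRank` and `compareByRep` are hypothetical names, and the `newcomer` account is made up.

```javascript
// Returns the 1-based rank of `name` under the given ordering, or -1 if absent.
function findRank(accounts, name, compare) {
    // slice() copies the array so the caller's order is not changed
    var sorted = accounts.slice().sort(compare);
    for (var i = 0; i < sorted.length; i++) {
        if (sorted[i].name === name) {
            return i + 1;
        }
    }
    return -1;
}

function compareByRep(a, b) {
    return b.rep - a.rep;
}

var sampleAccounts = [
    { name: "chinadaily", rep: 73.54 },
    { name: "newcomer", rep: 45.2 },  // made-up account for illustration
    { name: "myfirst", rep: 72.23 }
];

console.log(findRank(sampleAccounts, "newcomer", compareByRep)); // 3
console.log(findRank(sampleAccounts, "nobody", compareByRep));   // -1
```

In app.js the same call could be made on the `docs` array inside the `getAuthorsDetail` callback, with `sortByRep` or `sortByValue` as the comparator.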

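Wrapping up

The code has plenty of room for optimization. Since this was a practice exercise, I did not polish the algorithm or the IO and only covered the implementation process.
If this post gave you some ideas, please follow and upvote. Thanks :)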


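Nice. I built something similar, @dailystats.
May I say the code doesn't need to be this complicated? I'll share mine another day. I use Python.

You're welcome to post it so we can all learn from it.

Thanks for sharing.

Bookmarked this one. Thanks for sharing.

@bullda, so you turn out to be a tech expert!

I wouldn't say that. I'm new to Steem, just practicing and satisfying my curiosity along the way. There is a lot of interesting stuff in the database; you could dig into it too.

:( Why do I get an error when I type in your code?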


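Have a look at the steps in the Run section of my previous post about node. You are probably missing steps such as npm init and npm install steem.

I suggest getting a hello world node program running first, then replacing the hello code with the code above. That should go more smoothly. Keep in touch.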


Thanks for sharing 🙏

An expert. Nice!


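Your skills and data have helped a lot of Steemians. Grateful to have you!

Thanks. I hope it offers a bit of inspiration and prompts others to share better ideas.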