In my previous post I shared about the analysis of the group stage of world cup competition prediction hosted by @blocktrades. In this analysis, will be the follow-up competition for Round of 16 predictions.
The analysis extraction program is hosted on github which is a forked version of the previous analysis. https://github.com/superoo7/worldcup-2
- Data Extraction
- Data Visualization
- Possible errors
- Tools and Scripts
- Relevant Links and Resources
The data is extracted from SteemSQL on 3/7/2018, where the timeframe for the analysis is between 2018-06-28 20:26:03 (GMT +0) and 2018-06-30 13:59:00 (GMT +0). The first timestamp is when the post is made, whereas the second timestamp is the deadline given on the post.
There are a total of 1351 posts which fulfill the previous timeframe, but the finalize data is 1226 posts only. This means that there are 9.25% of invalid entries from all the participants which could be due to post without following the format, violation of participation rule, post that used both #blocktradesworldcup and #mypicks but not meant for entry of the competition.
Here are some rules that are considered in this analysis:
- Last edit and created should be before 2018-06-30 13:59:00 (GMT +0)
- Use of both tags #blocktradesworldcup and #mypicks
- Only submission in English for the choices made (e.g. Spain instead of Espanyol)
- Follow the markdown table templates given
- Reputation > 35
The data extraction process is as followed:
- Check reputation of the authors.
- Extract out data by using string replacement and regex
- Check whether an author make multiple posts
- Stored data in both JSON and CSV format
There are 1351 posts extracted from SteemSQL
Check reputation of the authors
In this section of filtering, there are 63 authors that violated the rule of a minimum reputation of 35.
Extract out data from the post body
In this section, there are 54 posts that violated this section. This could be caused by the post used both tags but not intend to participate in the competition, invalid table format, post entry that does not use English and etc.
Check whether an author make multiple posts
In this section of filtering, there are 4 authors, a total of 12 posts violated the rule of submitting duplicate entries.
After the filter, there are 1226 posts left, with a 9.25% of violated post rate from all the post extracted.
By using Tableau, a tool for doing data visualization, I am able to share a few plotted out data.
This is the choices made by the participants on the goal score between both teams at 90 minutes full-time mark.
Win, Lose or Tie based on choices made
The green color indicates the LHS as the winner, the blue color indicates a draw, red color indicates LHS as the loser and RHS as the winner, and orange indicates invalid data.
Based on the data previously made, I further investigate on which team are more favorable in terms of winning instead of goal scores.
Final Score result
Since the Round of 16 is over, these are the result of goal scores and condition.
|France vs Argentina||4:3||w|
|Uruguay vs Portugal||2:1||w|
|Spain vs Russia||1:1||t|
|Croatia vs Denmark||1:1||t|
|Brazil vs Mexico||2:0||w|
|Belgium vs Japan||3:2||w|
|Sweden vs Switzerland||1:0||w|
|Colombia vs England||1:1||t|
- The script also only accept the country being sorted in the order given in the template.
- The script is able to extract only markdown table and most of the HTML table. Without using the markdown table given, the script is not able to extract out the data.
I am glad that I was again being assigned to carry out the task of making analysis for this competition.
If you interested in joining, there is another round going on by @worldcup-russia: https://steemit.com/blocktradesworldcup/@worldcup-russia/quarter-finals-stage-submit-your-entry-to-keep-competing-for-the-2-000-sbd-prize-pool
Tools and Scripts
- SteemSQL (MSSQL) - Extracting data of all the posts
- TypeScript - To run data extraction with Regex and string replacement
- Jest - To test individual functions created for data extraction
- Tableau - Data visualize tools
Relevant Links and Resources
- SQL Query
SELECT 'https://steemit.com' + Comments.url AS link, Comments.author, Comments.title, Comments.created, Comments.last_update, Comments.body, Accounts.reputation FROM Comments LEFT JOIN Accounts ON Comments.author = Accounts.name WHERE depth = 0 AND Comments.created > '2018-06-28 20:26:03' AND Comments.created < '2018-06-30 13:59:00' AND Comments.last_update < '2018-06-30 13:59:00' AND CONTAINS(Comments.json_metadata, 'blocktradesworldcup') AND CONTAINS(Comments.json_metadata, 'mypicks')