Distrowatch classification script

sxiii (56) in #utopian-io • 6 years ago (edited)

I have always been excited about the DistroWatch Linux distribution database (up to 1000 distros at some points). The only problem was that the database didn't seem to be in a format that was comfortable to view or work with.

So I decided to write a crawler that would download the entire database quickly (in, let's say, a few hours) and then reformat it to suit my needs. With my own data stored locally, I could build some infographics, create a differently structured distribution database, and do other research.

Starting from that moment, I wrote a script for Ubuntu that used only default tools except, probably, the html2text converter, which turns HTML into text. While upgrading the script later on Manjaro, I found that html2text works differently there, so it was a good moment to rewrite the script to make it more flexible and modern. That's what I've done.

I've pushed my fresh work to GitHub so you can test the code and take part in development.

Current distrowatch database statistics

Number of all distributions in the database: 883
Number of active distributions in the database: 307
Number of dormant distributions: 52
Number of discontinued distributions: 524
Number of distributions on the waiting list: 177
Number of distributions waiting for evaluation: 40
(data from here)
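As a quick sanity check on the figures above, the active, dormant and discontinued counts should add up to the total. A minimal sketch (the variable names are just for illustration; the numbers are the ones listed above):

```shell
#!/usr/bin/env bash
# Figures copied from the statistics list above.
active=307
dormant=52
discontinued=524

total=$((active + dormant + discontinued))
echo "active + dormant + discontinued = $total"   # prints 883

# Share of active distributions, in whole percent (integer division rounds down).
active_share=$((active * 100 / total))
echo "active share: ${active_share}%"             # prints 34%
```

The waiting-list and evaluation counts are not part of the 883 total, so they are left out of the sum.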

Distrowatch scraper/crawler (spider)

Downloads the whole DistroWatch database, saving the information on each distribution to separate files

Why do you need this

You like to survey or find information about distributions
You're writing a diploma or analytical work
You're curious about statistics
You're studying how to write scripts and/or crawlers/scrapers

Requirements

Works with Ubuntu and Arch. The most recent version was developed on Arch (Manjaro)
html2text
wget
sed
grep
Bash/linux
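Before running the script, it can help to confirm that the tools from the list above are actually installed. This is a small sketch and not part of parse.sh; the function name missing_tools is made up for this example:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repository):
# print every tool from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in the requirements list.
missing="$(missing_tools html2text wget sed grep)"
if [ -n "$missing" ]; then
    echo "Missing tools:"
    echo "$missing"
else
    echo "All required tools are installed."
fi
```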

How to use the script in 6 steps

Install the requirements (arch: sudo pacman -S html2text wget git)
Clone this repository (git clone https://github.com/sxiii/distrowatch-scraper)
Enter the cloned folder (cd distr*)
Make the script executable (chmod +x parse.sh)
Run it (./parse.sh)
Review its console output or file output (files are created in a folder named after the current date!)

How to view the results

The results are laid out in a directory named after the current date (if today is 12.12.2012, the directory will be 12.12.2012). Inside this folder you'll find more than 800 files. Most of them have names ending in ".results" and ".desc". The ".desc" files are the downloaded web pages with the full HTML source of each distribution's description. The ".results" files contain the sorted results according to the following scheme:
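To locate today's output folder from another script, something like the sketch below could be used. It assumes the folder name has the form DD.MM.YYYY, which is only inferred from the 12.12.2012 example; check parse.sh for the exact date format it uses. The function name dated_dir_name is made up for this example.

```shell
#!/usr/bin/env bash
# Assumption: the folder is named DD.MM.YYYY (inferred from the example above).
dated_dir_name() {
    date +%d.%m.%Y
}

outdir="$(dated_dir_name)"
if [ -d "$outdir" ]; then
    # Count the .results files produced by today's run.
    count="$(find "$outdir" -name '*.results' | wc -l)"
    echo "Found $count .results files in $outdir/"
else
    echo "No folder named $outdir/ yet; run ./parse.sh first."
fi
```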

Results scheme

"Based On" - name of the distro that the current one is based on,
"Origin" - country of origin of the distribution,
"Architecture" - distribution architecture,
"Desktops" - desktops that the distro officially supports,
"Category" - the main use cases for this distribution,
"Status" - whether the distribution is active, dormant, discontinued, on the waiting list or awaiting evaluation (statuses according to DistroWatch),
"Description" - the description itself,
"Website" - official web portal of the distro,
"Latest version" - latest published version of the distro.

There will also be a linux-clean.list file, which lists the names of all current distributions.
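Once the files exist, single fields from the results scheme can be pulled out with sed. The sketch below assumes each field sits on its own line in the form "Field: value". That layout is a guess made for illustration, so adjust the pattern to whatever your ".results" files really look like. The helper name get_field is also made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Assumes lines like "Origin: Germany" (unverified layout).
# $1 = field name (must not contain "/"), $2 = path to a .results file
get_field() {
    sed -n "s/^$1:[[:space:]]*//p" "$2" | head -n 1
}

# Optional usage: pass a .results file as the first argument to this script.
results_file="${1:-}"
if [ -n "$results_file" ] && [ -r "$results_file" ]; then
    for field in "Based On" "Origin" "Status"; do
        printf '%s => %s\n' "$field" "$(get_field "$field" "$results_file")"
    done
fi
```

Looping over the names in linux-clean.list and calling get_field for each one would give a simple table to feed into other tools.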

Note: as this is the Linux world, you can port any distribution from a supported platform architecture to an unsupported one (rewrite, recheck and recompile it), and you can compile another desktop environment for it. Distribution statuses might be incorrect because of delayed information or simple human error. So to be sure, check all fields yourself and keep in mind that this data "is not a diagnosis".

Future plans

Make the script output data & generate some fancy infographics after downloading the database
Support different output formats
Port the script to support some other distribution websites
(maybe) get rid of html2text?
make it work faster (in parallel?)
make some sort of menu for this script
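For the "make it work faster" item, one possible direction is to let xargs run several downloads at once. This is only a sketch of the idea, not how parse.sh works today; the function name fetch_in_parallel is made up, and the URL pattern in the usage comment is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command once per name in a list, several at a time.
# $1 = file with one distribution name per line, $2 = number of parallel jobs,
# remaining arguments = command to run; each name is appended as its last argument.
fetch_in_parallel() {
    local list_file="$1" jobs="$2"
    shift 2
    # -P runs up to $jobs commands concurrently, -n 1 passes one name per command.
    xargs -P "$jobs" -n 1 "$@" < "$list_file"
}

# Possible usage (URL pattern assumed, not taken from the script):
#   fetch_in_parallel linux-clean.list 4 sh -c \
#     'wget -q -O "$0.desc" "https://distrowatch.com/table.php?distribution=$0"'
```

A small job count is the polite choice here, since every job sends requests to the same website.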

Bugs or errors

This script handles html2text output slightly differently depending on the system, because the html2text programs shipped with Arch Linux and Ubuntu differ: on Arch Linux html2text produces Markdown text from HTML, while on Ubuntu it produces plain text. That's why you might need to edit the script or use the older (Ubuntu) version on Debian/Ubuntu. The older Ubuntu version is on Pastebin (though it's not as improved): https://pastebin.com/nnuVAJdJ
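One way a single script could cope with both html2text flavours is to check which distribution family it is running on and branch on that. The sketch below reads the ID and ID_LIKE fields from /etc/os-release. It is an illustration only; the function name distro_family is made up, and the current script does not necessarily do this.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the system from an os-release file.
# $1 = path to the file (defaults to /etc/os-release)
distro_family() {
    local file="${1:-/etc/os-release}" id="" id_like=""
    [ -r "$file" ] || { echo unknown; return 0; }
    id="$(sed -n 's/^ID=//p' "$file" | tr -d '"' | head -n 1)"
    id_like="$(sed -n 's/^ID_LIKE=//p' "$file" | tr -d '"' | head -n 1)"
    case " $id $id_like " in
        *" arch "*)                 echo arch ;;
        *" debian "*|*" ubuntu "*)  echo debian ;;
        *)                          echo unknown ;;
    esac
}

case "$(distro_family)" in
    arch)   echo "Expecting Markdown-style html2text output" ;;
    debian) echo "Expecting plain-text html2text output" ;;
    *)      echo "Unknown system: check the html2text output by hand" ;;
esac
```

Manjaro reports ID_LIKE=arch and Ubuntu reports ID_LIKE=debian, which is why both fields are checked.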

If you notice any other bugs, please create an issue.

Help and development

You can help to improve this script. Read the "Future plans" section
It's a good idea to implement your own ideas and commit them to this repository
Contact me on Telegram (fakesnowden) to exchange ideas and knowledge

Source code of the project

https://github.com/sxiii/distrowatch-scraper/

May the source be with you.

Useful links on the topic

Yours, an independent Steemit and Golos author,

Den Ivanov aka @sxiii from Rostov-on-Don

Posted on Utopian.io - Rewarding Open Source Contributors

#linux #script #opensource #contribution



vladimir-simovic (67) 6 years ago

Thank you for the contribution. It has been approved.

Please add a license file to the project. That's one of the rules here on Utopian.

You can contact us on Discord.
[utopian-moderator]


sxiii (56) 6 years ago

Greetings @vladimir-simovic
Thank you very much for your approval!
I've added the LICENSE file to the repo. Thanks for noticing.
Have a nice day :)
Den
