Distrowatch classification script

in #utopian-io, 6 years ago (edited)

I have always been excited about the DistroWatch Linux distribution database (up to 1000 distros at some points). The only problem was that the database didn't seem to be in a format that was comfortable to view or work with.

So I decided to write a crawler that would download the entire database quickly (in, let's say, a few hours) and then reformat it to suit my needs. With my own copy of the data stored locally, I could build some infographics, create a differently structured distribution database, and do other research.

I started by writing a script for Ubuntu that used only default tools, except for the html2text converter, which turns HTML into text. While upgrading the script later on Manjaro, I found that html2text works differently there, so it was a good moment to rewrite the script to make it more flexible and modern. That's what I've done.

I've pushed my fresh work to GitHub so you can test the code and take part in development.

Current distrowatch database statistics

  • Number of all distributions in the database: 883
  • Number of active distributions in the database: 307
  • Number of dormant distributions: 52
  • Number of discontinued distributions: 524
  • Number of distributions on the waiting list: 177
  • Number of distributions waiting for evaluation: 40
    (data from here)

Distrowatch scraper/crawler (spider)

Downloads the whole DistroWatch database, saving the information on each distribution to separate files.


Why do you need this

  • You like to survey or find information about distributions
  • You're writing a diploma or analytical work
  • You're curious about statistics
  • You're studying how to write scripts and/or crawlers/scrapers

Requirements

  • Works on Ubuntu and Arch. The most recent version is developed on Arch (Manjaro)
  • html2text
  • wget
  • sed
  • grep
  • Bash/linux
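
Before running the script, it can help to confirm that the tools listed above are actually installed. Below is a minimal sketch of such a check. The function name missing_tools is made up for this example and is not part of the repository; git is included only because it is needed for cloning.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of parse.sh): print every command from the
# argument list that cannot be found on PATH, one name per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in the requirements list, plus git for cloning the repository.
missing=$(missing_tools html2text wget sed grep git)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
else
    echo "All required tools are installed."
fi
```

If anything is reported as missing, install it with your package manager before moving on.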

How to use the script in 6 steps

  1. Install the requirements (arch: sudo pacman -S html2text wget git)
  2. Clone this repository (git clone https://github.com/sxiii/distrowatch-scraper)
  3. Enter the cloned folder (cd distr*)
  4. Make the script executable (chmod +x parse.sh)
  5. Run it (./parse.sh)
  6. Review its console output or file output (files are created in a folder named after the current date!)

How to view the results

The results are laid out in the $(current.date) directory (if today is 12.12.2012, the directory will be 12.12.2012). Inside this folder you'll find more than 800 files. Most of the files are named ".results" and ".desc". The ".desc" files are the downloaded web pages with the full HTML source of each distribution's description. The ".results" files contain the sorted results according to the following scheme:
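
To get a quick overview of a finished run, something like the sketch below could be used. It rests on two assumptions that are not confirmed above: that the folder name follows a day.month.year pattern, and that ".desc" and ".results" are file extensions. The helper count_files is made up for this example.

```shell
#!/usr/bin/env bash
# Assumption: the output folder is named day.month.year, e.g. 12.12.2012.
out_dir=$(date +%d.%m.%Y)

# Hypothetical helper: count the files with a given extension that sit
# directly inside a directory (subdirectories are not searched).
count_files() {
    local dir=$1 ext=$2
    find "$dir" -maxdepth 1 -type f -name "*.$ext" | wc -l | tr -d ' '
}

if [ -d "$out_dir" ]; then
    echo "Description pages: $(count_files "$out_dir" desc)"
    echo "Parsed results:    $(count_files "$out_dir" results)"
else
    echo "No output folder named $out_dir here; run ./parse.sh first." >&2
fi
```

If your folder uses a different date pattern, adjust the format string passed to date.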

Results scheme

  • "Based On" - name of the distro that the current one is based on,
  • "Origin" - country of origin of the distribution,
  • "Architecture" - distribution architecture,
  • "Desktops" - desktops that the distro officially supports,
  • "Category" - the main use cases for this distribution,
  • "Status" - whether the distribution is active, dormant, discontinued, on the waiting list, or waiting for evaluation (statuses according to distrowatch),
  • "Description" - the description itself,
  • "Website" - official web portal of the distro,
  • "Latest version" - latest published version of the distro.

There will also be a linux-clean.list file, which is a list of all current distribution names.
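
Once the data is local, simple statistics are a few lines of shell away. The sketch below tallies how often each value of a field occurs across all ".results" files. It assumes that each file stores its fields as "Field: value" lines, which is a guess for illustration and may not match the real layout; the function names and the script name tally.sh are made up as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one field from a .results file.
# Assumes "Field: value" lines; adjust the pattern to the real file layout.
field_value() {
    local field=$1 file=$2
    sed -n "s/^$field:[[:space:]]*//p" "$file" | head -n 1
}

# Hypothetical helper: count how often each value of a field occurs across
# all .results files in a directory, most frequent value first.
tally_field() {
    local field=$1 dir=$2 file
    for file in "$dir"/*.results; do
        [ -e "$file" ] || continue   # skip the unexpanded glob if no files match
        field_value "$field" "$file"
    done | sort | uniq -c | sort -rn
}

# Usage: pass the date folder as the first argument, e.g. ./tally.sh 12.12.2012
if [ -d "${1:-}" ]; then
    tally_field "Based On" "$1"
fi
```

Swapping "Based On" for "Origin" or "Status" gives the same kind of table for the other fields.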

Note: this is the Linux world, so you can port any distribution from a supported platform architecture to an unsupported one (rewrite, recheck and recompile it), or compile another desktop environment for it. Distribution statuses might be incorrect because of delayed information or simple human error. So to be sure, check all fields yourself, and remember that this data "is not a diagnosis".

Future plans

  • Make the script output data and generate some fancy infographics after downloading the database
  • Support different output formats
  • Port the script to support some other distribution websites
  • (maybe) get rid of html2text?
  • Make it work faster (in parallel?)
  • Add some sort of menu for this script

Bugs or errors

The script handles html2text slightly differently on Arch Linux and Ubuntu, because the two systems ship different html2text programs: the Arch Linux one produces Markdown text from HTML, while the Ubuntu one produces plain text. That's why you might need to edit the script, or take the older (Ubuntu) version, to use it on Debian/Ubuntu. The older Ubuntu version is on Pastebin (though it is less polished): https://pastebin.com/nnuVAJdJ
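
If you are not sure which html2text variant your system has, a rough check like the sketch below might help. It is not part of parse.sh. It assumes that the Markdown-producing variant renders a top-level heading with a leading "# ", and it treats any other output as plain text. The function name classify_heading_output is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the first line html2text produced for <h1>.
# Assumption: Markdown output starts with "# "; anything else counts as plain.
classify_heading_output() {
    case $1 in
        '# '*) echo markdown ;;
        *)     echo plain ;;
    esac
}

if command -v html2text >/dev/null 2>&1; then
    # Convert a tiny HTML sample and keep the first non-empty output line.
    sample=$(printf '<h1>Test</h1>\n' | html2text | sed '/^[[:space:]]*$/d' | head -n 1)
    echo "html2text flavor: $(classify_heading_output "$sample")"
else
    echo "html2text is not installed." >&2
fi
```

A "plain" result suggests trying the older Ubuntu version of the script first.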

If you notice any other bugs, please create an issue.

Help and development

  • You can help improve this script. Read the "Future plans" section
  • It's a good idea to implement your own ideas and submit them to this repository
  • Contact me on Telegram (fakesnowden) to exchange ideas and knowledge

Source code of the project

https://github.com/sxiii/distrowatch-scraper/

may the source be with you.

Useful links on the topic

Yours, independent steemit and golos author,

Den Ivanov aka @sxiii from Rostov-on-Don



Posted on Utopian.io - Rewarding Open Source Contributors



Thank you for the contribution. It has been approved.

Please add a license file to the project. That's one of the rules here on Utopian.

You can contact us on Discord.
[utopian-moderator]

Greetings @vladimir-simovic
Thank you very much for your approval!
I've added the LICENSE file to the repo. Thanks for noticing.
Have a nice day :)
Den
