Data sets of any type and links on where to find them.
Here I collect and share some links to where you can find data of almost any type.
The types include spatial data, international statistical data, aggregated mobile activity data, georeferenced social network data, weather data, pollution data, and even electricity consumption data.
1- Spatial data
Spatial data in shapefile format (.shp) can be found here; it is the best free service I know of so far: Download data by country (http://www.diva-gis.org/gdata)
Another source of spatial data is Urban Atlas - European Environment Agency (http://www.eea.europa.eu/data-and-maps/data/urban-atlas#tab-methodology)
It provides land use and land cover data for Large Urban Zones with more than 100,000 inhabitants. The GIS data can be downloaded together with a map for each urban area covered and a report with the metadata.
Another source of spatial data is Google's Earth Engine Explorer: https://explorer.earthengine.google.com/#index
2- Mobility data as GPS traces
GeoLife: Building social networks using human location history (http://research.microsoft.com/en-us/projects/geolife/default.aspx)
**3- International statistics about countries** and their demographics, GDP, Gini index, facts, etc.
IndexMundi - Country Facts (http://www.indexmundi.com)
Data | The World Bank (http://data.worldbank.org)
The World Factbook (https://www.cia.gov/library/publications/the-world-factbook/)
http://www.palgraveconnect.com/pc/archives/ihs.html
3.1- For American data there is Pew Research: Download Datasets (http://www.pewresearch.org/data/download-datasets/)
3.2- For Italian data there is Istat: Istat.it (http://www.istat.it/it/)
4- Another distinctive dataset is GDELT: the Global Database of Events, Language, and Tone. It collects events on a global scale and has spatio-temporal and semantic dimensions.
The GDELT Project (http://www.gdeltproject.org/#downloading)
5- Geo-referenced tweets. This dataset contains tweets posted in the city of Milan, Italy.
Open Data Institute - node Trento (http://theodi.fbk.eu/openbigdata/#portfolioModal11)
6- Aggregated mobile phone data: Open Data Institute - node Trento (http://theodi.fbk.eu/openbigdata/#portfolioModal2)
6.1- This kind of data can also be found in the D4D (Data for Development) challenges organized by the telecom company Orange, which released data for Senegal: data for development (http://www.d4d.orange.com/en/Accueil) and Ivory Coast: presentation (http://www.d4d.orange.com/en/presentation)
These challenges are now over, but the company repeats the challenge almost every year, so stay tuned.
7- Weather data as temperature measurements: Open Data Institute - node Trento (http://theodi.fbk.eu/openbigdata/#portfolioModal16) and precipitation data: Open Data Institute - node Trento (http://theodi.fbk.eu/openbigdata/#portfolioModal17)
It contains measurements of temperature, precipitation, and wind speed/direction taken at 36 weather stations at 15-minute intervals.
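Readings taken every 15 minutes are often easier to work with once they are averaged per hour. Here is a minimal sketch of that step, assuming each record is a (station id, ISO timestamp, value) tuple. This layout, the station id, and the sample values are hypothetical and are not the dataset's actual schema:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_means(records):
    """Average 15-minute readings into one value per (station, hour)."""
    buckets = defaultdict(list)
    for station, timestamp, value in records:
        ts = datetime.fromisoformat(timestamp)
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(station, hour)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical sample: four 15-minute temperature readings from one station.
sample = [
    ("T0001", "2013-11-01T10:00:00", 8.0),
    ("T0001", "2013-11-01T10:15:00", 8.4),
    ("T0001", "2013-11-01T10:30:00", 8.8),
    ("T0001", "2013-11-01T10:45:00", 9.2),
]

for (station, hour), average in hourly_means(sample).items():
    print(station, hour.isoformat(), round(average, 2))
```

The same function works for precipitation or wind speed, as long as the records are reshaped into the assumed tuple form first.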
8- Electricity usage data : Open Data Institute - node Trento (http://theodi.fbk.eu/openbigdata/#portfolioModal18)
The dataset supplies information regarding the current flowing through the distribution lines and details about how the distribution lines are spread over the Trentino territory.
9- Geo-referenced news data, i.e. news with location
Open Data Institute - node Trento (http://theodi.fbk.eu/openbigdata/#portfolioModal19)
10- Air quality: Open Data Institute - node Trento (http://theodi.fbk.eu/openbigdata/#portfolioModal7)
The type and the intensity of the pollution are continuously measured by different sensors located within the city limits. Each sensor has a unique ID, a type and a location. Different sensors can share the same location.
11- Dataset for testing a recommendation system: MovieLens (http://grouplens.org/datasets/movielens/)
It has 100,000 ratings from 1000 users on 1700 movies.
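A first step with a ratings dataset like this is usually to compute the average rating per movie. The sketch below assumes tab-separated lines in the order user id, movie id, rating, timestamp, which is how I understand the ratings file of the 100K release to be laid out; check the README that ships with the download. The sample rows and the function name are made up for illustration:

```python
import csv
import io
from collections import defaultdict


def mean_rating_per_movie(lines, min_ratings=1):
    """Return {movie_id: mean rating} for movies with at least min_ratings ratings."""
    totals = defaultdict(lambda: [0, 0])  # movie id -> [sum of ratings, count]
    for _user_id, movie_id, rating, _timestamp in csv.reader(lines, delimiter="\t"):
        entry = totals[movie_id]
        entry[0] += int(rating)
        entry[1] += 1
    return {
        movie_id: total / count
        for movie_id, (total, count) in totals.items()
        if count >= min_ratings
    }


# Made-up rows in the assumed layout: user, movie, rating, timestamp.
sample = io.StringIO(
    "1\t10\t4\t881250949\n"
    "2\t10\t2\t891717742\n"
    "1\t20\t5\t878887116\n"
)
print(mean_rating_per_movie(sample))
```

To run it on the real file, pass an open file handle instead of the `io.StringIO` sample. Raising `min_ratings` filters out movies whose average rests on only a handful of ratings.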
**12- Some links to data sets containing images, video, and text:**
-12.1) 2,866 image-text pairs from Wikipedia: http://www.svcl.ucsd.edu/projects/crossmodal/
-12.2) One million images with captions: http://vision.cs.stonybrook.edu/~vicente/sbucaptions/
There is also Introducing the Open Images Dataset (https://research.googleblog.com/2016/09/introducing-open-images-dataset.html), where you can find a dataset of 9 million images already labeled across 6000 categories by Google Research.
-12.3) Flickr30K dataset: 31,783 images, 5 captions per image.
https://illinois.edu/fb/sec/229675
-12.4) Microsoft Research Video Description Corpus (http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/)
In the same domain, Google Research offers YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research (https://research.google.com/youtube8m/index.html), a dataset of 8 million videos classified into different categories.
-12.5) The Multimodal Dyadic Behavior Dataset (http://cbi.gatech.edu/mmdb/overview.php)
-12.6) NASA satellite images of Earth since 1999: http://asterweb.jpl.nasa.gov/gallerymap.asp
-12.7) Insight - BBC Datasets (http://mlg.ucd.ie/datasets/bbc.html): two news article datasets originating from BBC News.
There is also the Google Books n-gram dataset, available as an AWS Public Data Set (https://aws.amazon.com/it/datasets/google-books-ngrams/), which can help you determine when a new word started to be used widely.
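One simple way to read "started to be used widely" is the first year in which the word's share of all words printed that year passes a chosen threshold. Here is a rough sketch of that idea, assuming the n-gram files have already been reduced to two dictionaries keyed by year. The function name, the threshold, and all the numbers are hypothetical:

```python
def first_year_widely_used(yearly_counts, yearly_totals, min_share=1e-6):
    """Return the first year in which count / total reaches min_share, or None."""
    for year in sorted(yearly_counts):
        share = yearly_counts[year] / yearly_totals[year]
        if share >= min_share:
            return year
    return None


# Hypothetical numbers: occurrences of one word, and total words per year.
word_counts = {2003: 2, 2004: 40, 2005: 900, 2006: 15000}
total_words = {2003: 10**9, 2004: 10**9, 2005: 10**9, 2006: 10**9}

print(first_year_widely_used(word_counts, total_words))
```

Dividing by the yearly total matters because the number of scanned books differs a lot from year to year, so raw counts alone would be misleading.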
Last but not least, Data from Google Trends (http://googletrends.github.io/data/) can help you find new trends in online searching.
For more datasets by Google, you can also check Google Public Data Explorer (https://www.google.com/publicdata/directory), which is updated frequently.
13- Urban big data:
-13.1) Chicago: City of Chicago | Data Portal (https://data.cityofchicago.org/)
-13.2) New York City: NYC Open Data (https://data.cityofnewyork.us/)
-13.3) Different cities in China: Forecasting Fine-Grained Air Quality Based on Big Data (http://research.microsoft.com/apps/pubs/?id=246398)
-13.4) Quanturb : Data (http://www.quanturb.com/data.html)
-13.5) Open data of any kind: 30 Places to Find Open Data on the Web - ScribbleLive (http://www.scribblelive.com/blog/2012/03/30/data-sources/)
14- Panama Papers dataset: ICIJ Offshore Leaks Database (https://offshoreleaks.icij.org/about/download)
15- Fitness dataset from a user’s Fitbit device. The data were collected over one year, from May 2015 to May 2016.
It includes different kinds of activity, such as walking, intense training, and sleeping.
Walking and training activity: OneYearFitBitData.csv (https://drive.google.com/open?id=0Bx4yoK5aogTSbGJ2WlkwYjlHejQ)
Sleep activity: OneYearFitBitDataSleep.csv (https://drive.google.com/open?id=0Bx4yoK5aogTSMUFqRjVNcko5WlU)
16- **Quora dataset:** First Quora Dataset Release: Question Pairs by Kornél Csernai on Data @ Quora (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs)
It is the first in a series of public dataset releases from Data @ Quora (https://data.quora.com/). This one in particular is related to the problem of identifying duplicate questions.
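To get a feel for the duplicate-question problem, a very naive baseline is to compare the sets of words in two questions (Jaccard similarity) and flag pairs above a threshold. This is only an illustrative sketch, not the method used by Quora. The function names, the 0.5 threshold, and the example questions are my own assumptions:

```python
import re


def jaccard_similarity(question_a, question_b):
    """Share of distinct lowercase words that the two questions have in common."""
    tokens_a = set(re.findall(r"[a-z0-9]+", question_a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", question_b.lower()))
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def looks_duplicate(question_a, question_b, threshold=0.5):
    """Flag a pair as a likely duplicate when word overlap reaches the threshold."""
    return jaccard_similarity(question_a, question_b) >= threshold


# Made-up example questions.
print(looks_duplicate("How do I learn Python?", "How can I learn Python quickly?"))
print(looks_duplicate("How do I learn Python?", "What is the capital of Italy?"))
```

Word overlap misses paraphrases that share few words and can be fooled by questions that differ in a single important word, so it is mainly useful as a baseline to beat.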
17-….
—————————————————————————————————
There are, and will be, many other sources around. I will update this post every time I find a new data set that is worth mentioning.
Last Updated 15 February 2017
Have a pleasant data journey :-)