Appendix C — Data Wrangling: Data Sources

Data sources that provide CSV or Excel files

Government data

Diversity, Inclusion, and anti-Racism focused datasets

General Sources

Reality TV shows

Articles about finding data sources

  • https://blog.google/products/search/discovering-millions-datasets-web/

Recent news articles with datasets that might make for useful projects

  • TxDOT just released its response to the public comments on the I-35 expansion. It includes a PDF table of tags and responses that could be extracted and analyzed: https://my35capex.com/wp-content/uploads/2023/08/APPROVED-FEIS-ROD_Appendix-G-Comment-Response-Matrix-from-Public-Hearing-Notice-of-Availability-of-DEIS_2023-08-14.pdf

Other data sources:

  • HTML tables are relatively easy to convert to CSV using online tools. ConvertCSV has been useful for students (although be aware that you have to choose the right table on the page).

  • Wikipedia tables can provide useful data, and a range of tools can convert them. Students have used http://wikitable2csv.ggor.de/ and http://import.io. Another option is the Wikidata project, which provides CSV downloads of infoboxes (and perhaps other things).

  • PDF files can provide useful data, especially from tables, but they have to be converted. The conversion tool that I like the best is http://tabula.technology/

  • Data on websites in structured formats other than tables can be extracted through “scraping”, but that is outside the scope of this course. Students have had luck using http://import.io to set up scraping.
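
For readers who would rather convert an HTML table in Python than with an online tool, here is a minimal sketch that uses only the standard library. The names TableExtractor and table_to_csv and the SAMPLE page are made up for illustration, and the sketch assumes simple tables (no nested tables, no merged cells). As with the online tools, you still have to choose the right table on the page, here through the index argument.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_csv(html, index=0):
    """Return the table at position `index` on the page as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.tables[index])
    return out.getvalue()


# Made-up page: the first table is page furniture, the second holds the data.
SAMPLE = """
<table><tr><td>navigation</td></tr></table>
<table>
  <tr><th>Park</th><th>Visits</th></tr>
  <tr><td>Example A</td><td>1,200</td></tr>
</table>
"""

print(table_to_csv(SAMPLE, index=1))
```

Numbers that contain commas, such as 1,200, come out quoted in the CSV so that they stay in a single column.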

Data that can be hard to use:

Examples of data that are harder to deal with:

  • Images of tables (requires OCR)

  • Images of plots (very, very hard; I’ve heard of people doing this, but I never have).

  • “Record format” (requires a sophisticated scraper), e.g., Dog Breed Personalities

  • Sort of a combo of record format and images, e.g., a menu archive

  • Proprietary formats (requires an importer), e.g., https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/6QWX7Q/X1EKIG&version=2.0

  • Nested headers and merged cells (can be dealt with; common in Census data), e.g., https://www.cdc.gov/nchs/data/nvsr/nvsr68/nvsr68_13_tables-508.pdf (this example is also a PDF, so try tabula.technology first; see above).

  • Non-rectangular tabular data: https://www.nps.gov/aboutus/visitation-numbers.htm (bottom table). We can deal with this using Python, although there are some challenges.

  • Sort of a mix of lots of types: https://www.daytranslations.com/blog/popular-video-games-continent/
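
Two of the cases above, nested headers and rows of uneven length, can be handled with a few lines of Python. The following is a minimal sketch using only the standard library; flatten_header, read_nested, and the SAMPLE text are made up for illustration. It assumes that a blank cell in an upper header row is the continuation of a merged cell, which means a top-level cell that really is empty would wrongly inherit the label to its left.

```python
import csv
import io


def flatten_header(header_rows, sep=" / "):
    """Merge several header rows into one list of column names.

    In every row except the last, a blank cell is assumed to continue a
    merged cell, so it inherits the nearest non-blank label to its left.
    """
    width = max(len(row) for row in header_rows)
    levels = []
    for depth, row in enumerate(header_rows):
        cells = [c.strip() for c in row] + [""] * (width - len(row))
        if depth < len(header_rows) - 1:
            last = ""
            for i, cell in enumerate(cells):
                if cell:
                    last = cell
                cells[i] = last
        levels.append(cells)
    # Join the non-blank labels found in each column, top to bottom.
    return [sep.join(part for part in column if part) for column in zip(*levels)]


def read_nested(text, n_header_rows=2):
    """Read CSV text with nested headers into a list of dicts.

    Rows shorter than the header are padded with empty strings; cells
    beyond the header width are dropped.
    """
    rows = list(csv.reader(io.StringIO(text)))
    names = flatten_header(rows[:n_header_rows])
    records = []
    for row in rows[n_header_rows:]:
        row = row + [""] * (len(names) - len(row))  # pad short rows
        records.append(dict(zip(names, row)))
    return records


# Made-up data: a two-level header and a short final row.
SAMPLE = """\
Region,Group A,,Group B,
,2017,2018,2017,2018
North,1,2,3,4
South,5,6
"""

records = read_nested(SAMPLE)
print(list(records[0].keys()))
print(records[1])
```

Each column ends up with a single name such as "Group A / 2017", which is the shape most analysis tools expect. Real tables usually need a manual check that the forward-fill assumption holds.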