Data Wrangling: SQL, Excel, and beyond

INF 385T.9 - Data Wrangling

Author

James Howison

1 Details

  • Semester: Fall 2023
  • Unique Number:
  • Class Times: Monday 1:30-2:45 and Tuesday 2-3:15
  • Classroom: UTA 1.208 (Large classroom UTA first floor)
  • Office Hours: Book via Calendly booking link

Quick link to Schedule of Classes (with linked materials): Chapter 9

2 Introduction

This course, “Data Wrangling” will enable you to:

  • design database schemas
  • implement those designs using SQL
  • learn how to learn new technologies
  • learn the basics of programming in Python
  • import and export data to/from CSV and Excel, changing schemas as needed
  • conduct basic analyses in Excel
  • prepare a project workflow (or “pipeline”) that imports data from different sources, relates them to each other, and produce a report.

There are no prerequisites for this course; it is appropriate even if you’ve never done any programming or behind the scenes work with computers. I’ll address each topic “from scratch.”

However, if you do have significant, recent, experience, or training with programming or databases this is not the course for you; you will find it too introductory. In this case I require you to drop the course and welcome you to audit parts that you think will extend your knowledge.

Students who have taken (or are taking) Database Management in the Information School may enroll but should reach out to the professor to discuss.

3 Professor

This course is taught by James Howison.

4 Land Acknowledgement

I would like to acknowledge that we will be meeting on Indigenous land. Moreover, I would like to acknowledge and pay our respects to the Carrizo & Comecrudo, Coahuiltecan, Caddo, Tonkawa, Comanche, Lipan Apache, Alabama-Coushatta, Kickapoo, Tigua Pueblo, and all the American Indian and Indigenous Peoples and communities who have been or have become a part of these lands and territories in Texas, here on Turtle Island.

5 Computing resources

The software used in this class will be installed on your laptop. If you do not have a laptop, or yours stops working, the school and unversity has resources available. Please check these university resources. Check the “Before your classes” section; I believe that you reach out to the Texas One Stop).

6 Course Texts

There are no required texts for the course, but you will find these resources to be useful.

Recommended Texts:

We’ll be covering everything with materials in the course, but there are lots of useful online courses that complement this material. As a member of this class you will have free access to the DataCamp site, I believe that access extends for 6 months. I will establish the access a few weeks into the semester, causing an invitation email to come to the email address registered with the University. The most relevant courses are:

7 Assessments

7.1 Weekly Assignments

The bulk of your course grade (75%) comes from Weekly Assignments. There are assignments throughout each week for this course, covering the material addressed that week. The weekly assignments are due 11:59 pm on Sunday.

Late assignments will receive a grade of zero but you can drop your 3 lowest grades. It’s always worth turning in the assignment, even if late, because the assignments test and drive your learning and your performance helps guide me on material. So not completing an assignment is a sure way to fall behind. Students have used their drops in the past and then been very sad when an actual emergency meant they could not complete their homework.

The assignments, and grading rubric, will be released on Canvas before each Tuesday class, so we’ll go over the assignment and ensure everyone knows what’s required. Each assignment will be turned in through Canvas, usually by uploading a PDF, text, or notebook HTML export file.

If you’ve uploaded a PDF there will be comments left on the PDF, in addition to any text comments in Canvas. You can see the comments on the PDF via by using the “viewing feedback” button.

7.2 Project

A portion of the course (25%) is a project to produce a data wrangling workflow (sometimes also called a “data pipeline”). The project will be done in pairs. We will have an initial individual assignment to generate ideas, then opportunities to review each other’s ideas and form pairs.

Working in pairs for this course does not mean dividing up the work; I require each group member to understand everything about their project. If there is a part of your project that you don’t understand or couldn’t work with then you are missing a crucial learning opportunity. We work in pairs to work together (supporting each other’s learning) not to reduce the amount or diversity of the work that we do.

When I use the word “workflow” (or “pipeline”) I mean a process that moves from data you obtain through reading, cleaning, validating, analyzing, and providing results. The workflow includes the data, the scripts that read and query the data, and the final output. We will discuss example projects in class early in the semester.

The project builds up through the semester (e.g., after we’ve learned Database Design you will do a design for the database in your workflow), culminating in a full workflow that you demonstrate and describe through the report.

There are more details on the specific Assignment page on Canvas, but note that each project needs at least three datasets from at least two different organizations (i.e., 3 datasets in total, per idea, no more than 2 of those can come from the same organization). In addition your project must handle synonyms of some form (e.g., modeling a single US state that can be referred to as Texas, TX, State of Texas, Tejas, etc.). This page outlines advice on finding interesting data sources including tools to convert formats.

Example projects from previous years (note that whimsy is encouraged, funny projects keep the focus on wrangling, serious projects keep the focus on the topic. I think you want people to whom you describe this project not to quibble about your analysis, but to appreciate your data wrangling.):

  • Analysis of Major League Soccer Report
  • The correlation between population, college education, and salary by state in 2021 Report
  • State education, teachers salaries, public schools.
  • Visualizing impact of weather on border crossings.
  • How have “Game of Thrones” character names influenced baby names in the US?
  • Lightening strikes and baseball games
  • Average age of Oscar winners by gender
  • How do solar flares affects temperatures on Earth and on Mars

8 iSchool Open Day

Although not required for this class, you may want to present your workflow as a Student Project at the iSchool’s Open Day (typically in May). The Open House, while including Capstones, is also an opportunity to present student projects, including projects from this course (but also from other courses or semesters!).

9 Schedule of classes

The schedule of classes is shown below.

In addition there are tailored HTML pages for each week, linked from the table below.

Recordings of the sessions will be available (using the Panopto tab in Canvas).

Day Topic Weekly Assign Project Assign
Introduction
Mon Aug 21 Project Introduction
Tue Aug 22 Card Exercise Project ideas
Data Modeling
Mon Aug 28 Data Modeling - has_many
Tue Aug 29 Data Modeling - many-to-many ER Diagram of book store
Mon Sep 04 Labor Day - No Class
Tue Sep 05 Data Modeling - has_many through From ER to Database Project Description
SQL Queries
Mon Sep 11 SQL SELECT, operators
Tue Sep 12 SQL ORDER BY, LIMIT Queries 1 and Reverse Engineer Design
Mon Sep 18 SQL Join, Single table
Tue Sep 19 SQL Join, many tables Queries 2 Project Database And Workflow Design
Mon Sep 25 SQL Aggregate functions
Tue Sep 26 SQL Grouping Queries 3
Mon Oct 02 SQL Inner and Outer JOINs
Tue Oct 03 SQL INSERT/UPDATE/DELETE
Excel Queries
Mon Oct 09 Excel - Queries - Sorts, Filters, joins
Tue Oct 10 Excel - Pivot Tables and Graphs Excel Queries
Python
Mon Oct 16 Python - Hello World, Strings and Variables Python - Finegan replacement
Tue Oct 17 Python - if/else/while Guessing game
Mon Oct 23 Python - Lists and dicts "Solar Systems"
Tue Oct 24 INSERT logs INSERT log assignment
Mon Oct 30 Python - CSV Reading/Writing Reading and writing CSV
Tue Oct 31 INSERT log for Project
Mon Nov 06 Python - Python and SQL
Tue Nov 07 Python - Parameterized Queries SQL queries from Python
Mon Nov 13 Python - CSV to SQL via Python
Tue Nov 14 Python - Data transforms/Dates, Regex and splits Project Prototype Due
Mon Nov 20 Thanksgiving - No Classes
Tue Nov 21 Thanksgiving - No Classes
Wrapup
Mon Nov 27 Project Workshop meetings
Tue Nov 28 Project Workshop meetings
Mon Dec 04 Project Workshop meetings Project due
NA Presentations; What's Next?
NA What's Next? Databases and Python

10 Policies

10.1 Academic Integrity

Each student in the course is expected to abide by the University of Texas Honor Code: “As a student of The University of Texas at Austin, I shall abide by the core values of the University and uphold academic integrity.” Plagiarism is taken very seriously at UT. Therefore, if you use words or ideas that are not your own (or that you have used in previous class), you must cite your sources and use quote marks appropriately. Otherwise you will be guilty of plagiarism and subject to academic disciplinary action, including failure of the course. In particular, students are reminded that proper citation requires mentioning sources when you use them, not just in a general list of references at the end of a document. You are responsible for understanding UT’s Academic Honesty and the University Honor Code. If this is at all confusing, please take this Plagarism Tutorial.

In particular, any time you use the copy function from someone else’s writing (e.g., an article, blog post) you must have a plan about how you will use those words, how you will use quote marks (“), and how you will cite the work.

10.1.1 Collaboration policy

The weekly assignments are individual work. However, as long as you meet the condition below, I give you explicit permission to work together with other classmates on the assignments or on your projects. With the same condition, you are also welcome to seek input from people outside the class, such as friends and family.

The one condition is that you add a note to your homework (ideally through a comment in the Canvas submission) indicating how the work was done and identifying with whom you worked and how (thus ensuring that we are following the Academic Integrity policy above). For example you might say “Daria and I worked on this in the lab together, when we started out we were confused about X but I figured it out and shared that with Daria. Our code is very similar because we worked together”. Or perhaps “I was confused about how to pad a string with spaces, and after working at it for 30 minutes I chatted about it with my partner who suggested the xyz method. I was pleased when I got that working myself.” When you have worked together your code will have similarities, but you must not turn in identical code; rather you should take code you’ve worked on together and personalize it through comments that explain what is happening in the code. The comments must be your own, individual, work.

Neither “working together” nor “seeking input” means having others do the work for you; you should always be certain that you are learning and that you understand the code that you have submitted.

If you have questions on this policy please ask in the Assignment Discussion forum on Canvas and I will answer there. I have this policy because learning to program is both individual hard work and learning how to get help from others. Sometimes chatting through with another class member is just what is needed.

10.2 Working with AI

The use of artificial intelligence tools in this class is permitted for students who wish to use them, provided you:

  1. Provide proper attribution
  2. Provide a memo (with screenshots) of the sequence of prompts and responses together with written commentary of how you assessed the quality of the output, and how and why you developed your prompts to achieve your learning aims.

10.3 Sharing of Course Assignment Materials is Prohibited

No assignment materials used in this class, including, but not limited to quizzes, exams, papers, projects, homework assignments, review sheets, and additional problem sets, may be shared online or with anyone outside of the class unless you have my explicit, written permission. Unauthorized sharing of materials promotes cheating. It is a violation of the University’s Student Honor Code and an act of academic dishonesty. I am well aware of the sites used for sharing materials, and any materials found online that are associated with you, or any suspected unauthorized sharing of materials, will be reported to Student Conduct and Academic Integrity in the Office of the Dean of Students. These reports can result in sanctions, including failure in the course.

10.4 Class Recordings

Class recordings are reserved only for students in this class for educational purposes and are protected under FERPA. The recordings should not be shared outside the class in any form. Violation of this restriction by a student could lead to Student Misconduct proceedings.

10.5 COVID Caveats

To help keep everyone at UT and in our community safe, it is critical that students report COVID-19 symptoms and testing, regardless of test results, to University Health Services, and faculty and staff report to the HealthPoint Occupational Health Program (OHP) as soon as possible. Please see this link to understand what needs to be reported. In addition, to help understand what to do if a fellow student in the class (or the instructor or TA) tests positive for COVID, see this University Health Services link.

10.6 Student rights and responsibilities

  • You have a right to a learning environment that supports mental and physical wellness.
  • You have a right to respect.
  • You have a right to be assessed and graded fairly.
  • You have a right to freedom of opinion and expression.
  • You have a right to privacy and confidentiality.
  • You have a right to meaningful and equal participation, and to self-organize groups to improve your learning environment.
  • You have a right to learn in an environment that is welcoming to all people. No student shall be isolated, excluded or diminished in any way.

With these rights come responsibilities:

  • You are responsible for taking care of yourself, managing your time, and communicating with the teaching team and with others if things start to feel out of control or overwhelming.
  • You are responsible for acting in a way that is worthy of respect and always respectful of others.
  • Your experience with this course is directly related to the quality of the energy that you bring to it, and your energy shapes the quality of your peers’ experiences.
  • You are responsible for creating an inclusive environment and for speaking up when someone is excluded. In particular, you are responsible for ensuring that your participation does not exclude the participation of others. Office hours are available for in-depth further discussion of advanced topics or other interests that pursuing in depth during class would exclude others.
  • You are responsible for holding yourself accountable to these standards, holding each other to these standards, and holding the teaching team accountable as well.

10.7 Personal Pronoun Preference

Professional courtesy and sensitivity are especially important with respect to individuals and topics dealing with differences of race, culture, religion, politics, sexual orientation, gender, gender variance, and nationalities. Class rosters are provided to the instructor with the student’s legal name. I will gladly honor your request to address you by an alternate name or gender pronoun. Please advise me of this preference early in the semester so that I may make appropriate changes to my records. I usually add “he/his” to my Zoom name for online meetings and encourage others to do the same.

10.8 Accommodations

The university is committed to creating an accessible and inclusive learning environment consistent with university policy and federal and state law. Please let me know if you experience any barriers to learning so I can work with you to ensure you have equal opportunity to participate fully in this course. If you are a student with a disability, or think you may have a disability, and need accommodations please contact Disability and Access (D&A). Please refer to D&A’s website for contact and more information: http://diversity.utexas.edu/disability/. If you are already registered with D&A , please deliver your Accommodation Letter to me as early as possible in the semester so we can discuss your approved accommodations and needs in this course.

10.9 Drop Policy

If you want to drop a class after the 12th class day, you’ll need to execute a Q drop before the Q-drop deadline, which typically occurs near the middle of the semester. Under Texas law, you are only allowed six Q drops while you are in college at any public Texas institution. For more information, see: http://www.utexas.edu/ugs/csacc/academic/adddrop/qdrop

International students must meet with the international office before dropping a class that would put them below full-time status.

10.10 University Resources for Students

Your success in this class is important to me. We will all need accommodations because we all learn differently. If there are aspects of this course that prevent you from learning or exclude you, please let me know as soon as possible. Together we’ll develop strategies to meet both your needs and the requirements of the course. There are also a range of resources on campus, detailed below.

10.10.1 Services for Students with Disabilities

This class respects and welcomes students of all backgrounds, identities, and abilities. If there are circumstances that make our learning environment and activities difficult, if you have medical information that you need to share with me, or if you need specific arrangements in case the building needs to be evacuated, please let me know.

I am committed to creating an effective learning environment for all students, but I can only do so if you discuss your needs with me as early as possible. Requests for accomodations are quite normal and quite frequent and I promise to maintain the confidentiality of these discussions. If appropriate, also contact Services for Students with Disabilities.

10.10.2 Counseling and Mental Health Center

All of us benefit from support during times of struggle. You are not alone. There are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.

If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. http://www.cmhc.utexas.edu/individualcounseling.html

10.10.3 The Sanger Learning Center

All students, including graduate students, are welcome to take advantage of Sanger Center’s classes and workshops, private learning specialist appointments, peer academic coaching, and tutoring for more than 70 courses in 15 different subject areas. For more information, please visit https://ugs.utexas.edu/slc/grad or call 512-471-3614 (JES A332).

10.10.4 University Writing Center free programs for grad students

10.10.5 Libraries

10.10.6 IT services

10.10.7 Student Emergency Services

10.10.8 Important Safety Information

If you have concerns about the safety or behavior of fellow students, TAs or Professors, call BCAL (the Behavior Concerns Advice Line): 512-232-5050. Your call can be anonymous. If something doesn’t feel right – it probably isn’t. Trust your instincts and share your concerns.