Data Wrangling

INF 385T.9 - Data Wrangling

Author

James Howison

Details

  • Semester: Fall 2024
  • Unique Number: 27890
  • Class Times: Monday 1:30-2:45 and Tuesday 2-3:15
  • Classroom: UTA 1.208 (Large classroom UTA first floor)
  • Office Hours: Book via Calendly booking link

Quick link to Schedule of Classes (with linked materials): Schedule of classes

Previous versions of this course, and materials, are available: Fall 2023

Introduction

This course, “Data Wrangling” will enable you to:

  • understand data schemas (the way that data is organized within and across files)
  • learn a set of verbs used to reorganize data
  • Implement those verbs to transform data in four contexts:
    • R tidyverse
    • SQL
    • Python Pandas
    • Excel
  • prepare a project workflow (or “pipeline”) that imports data from different sources, relates them to each other, and produce a report.

Pre-requisites

There are no prerequisites for this course; it is appropriate even if you’ve never done any programming or behind the scenes work with computers. I’ll address each topic “from scratch.”

However, if you do have significant, recent, experience or training with the technologies we cover in the course you will find it too introductory. In this case I require you to drop the course and welcome you to audit parts that you think will extend your knowledge.

In Fall 2024 Data Wrangling was restructured to be complementary to the Information School courses Database Management and Introduction to Programming. Students will find these three courses can be taken in any order, including one or more taken simultaneously. Data Wrangling focuses on dataframe work (as found in Pandas) and declarative programming in chains/pipelines/workflows, whereas Introduction to Programming focuses on procedural and object oriented python. While Database Management focuses on multi-table SQL databases and modeling data to 3rd normal form and star schemas, Data Wrangling focuses on single table data organization in the tidydata tradition.

Professor

This course is taught by James Howison.

Land Acknowledgement

I would like to acknowledge that we will be meeting on Indigenous land. Moreover, I would like to acknowledge and pay our respects to the Carrizo & Comecrudo, Coahuiltecan, Caddo, Tonkawa, Comanche, Lipan Apache, Alabama-Coushatta, Kickapoo, Tigua Pueblo, and all the American Indian and Indigenous Peoples and communities who have been or have become a part of these lands and territories in Texas, here on Turtle Island.

SB17 statement

Texas Senate Bill 17, the 2024 law that outlaws diversity, equity, and inclusion programs at public colleges and universities in Texas, does not in any way affect content, instruction or discussion in a course at public colleges and universities in Texas. Expectations and academic freedom for teaching and class discussion have not been altered post-SB 17, and students should not feel the need to censor their speech pertaining to topics including race and racism, structural inequality, LGBTQ+ issues, or diversity, equity, and inclusion.

Computing resources

The software used in this class will be installed on your laptop. If you do not have a laptop, or yours stops working, the school and unversity has resources available. Please check these university resources. Check the “Before your classes” section; I believe that you reach out to the Texas One Stop).

Course Texts

There are no required texts for the course.

We’ll be covering everything with materials on this website, but there are lots of useful online courses that complement this material. As a member of this class you will have free access to the DataCamp site, I believe that access extends for 6 months. I will establish the access a few weeks into the semester, causing an invitation email to come to the email address registered with the University.

Communities for Data Careers

There are lots of professional communities of practice that those seeking careers around data should be participating in. A course page highlights those communites for data careers that I follow.

Assessments

Weekly Assignments (60%)

The bulk of your course grade (60%) comes from Weekly Assignments. There are assignments throughout each week for this course, covering the material addressed that week. The weekly assignments are due 11:59 pm on Sunday.

Late assignments will receive a grade of zero but you can drop your 3 lowest grades. It’s always worth turning in the assignment, even if late, because the assignments test and drive your learning and your performance helps guide me on material. So not completing an assignment is a sure way to fall behind. Students have used their drops in the past and then been very sad when an actual emergency meant they could not complete their homework.

The assignments, and grading rubric, will be released on Canvas before each Tuesday class, so we’ll go over the assignment and ensure everyone knows what’s required. Each assignment will be turned in through Canvas, usually by uploading a PDF, text, or notebook HTML export file.

If you’ve uploaded a PDF there will be comments left on the PDF, in addition to any text comments in Canvas. You can see the comments on the PDF via by using the “viewing feedback” button.

Project (30%)

A portion of the course (30%) is a project to produce a data wrangling workflow (sometimes also called a “data pipeline”). The project will be done in pairs. We will have an initial individual assignment to generate ideas, then opportunities to review each other’s ideas and form pairs.

Working in pairs for this course does not mean dividing up the work; I require each group member to understand everything about their project. If there is a part of your project that you don’t understand or couldn’t work with then you are missing a crucial learning opportunity. We work in pairs to work together (supporting each other’s learning) not to reduce the amount or diversity of the work that we do. Individual understanding will be checked through an individual graded post-project interveiew (see below).

When I use the word “workflow” (or “pipeline”) I mean a process that moves from data you obtain through reading, cleaning, validating, analyzing, and providing results. The workflow includes the data, the scripts that read and query the data, and the final output. Each project will be implemented in all four technical approaches (R tidyverse, Python Pandas, SQL, and Excel). We will discuss example projects in class early in the semester.

The project builds up through the semester culminating in a full workflow that you demonstrate and describe through the report.

There are more details on the specific Assignment page on Canvas, but note that each project needs at least three datasets from at least two different organizations (i.e., 3 datasets in total, per idea, no more than 2 of those can come from the same organization). In addition your project must handle synonyms of some form (e.g., modeling a single US state that can be referred to as Texas, TX, State of Texas, Tejas, etc.). This page outlines advice on finding interesting data sources including tools to convert formats.

Example projects from previous years (note that whimsy is encouraged, funny projects keep the focus on wrangling, serious projects keep the focus on the topic. I think you want people to whom you describe this project not to quibble about your analysis, but to appreciate your data wrangling.) In Fall 2024 we changed approaches to the data transforms, so these are primarily useful for their data sources and analyses, not the code used to process the data.

  • Analysis of Major League Soccer Report
  • The correlation between population, college education, and salary by state in 2021 Report
  • Rats and Broadway Report
  • State education, teachers salaries, public schools.
  • Visualizing impact of weather on border crossings.
  • How have “Game of Thrones” character names influenced baby names in the US?
  • Lightening strikes and baseball games
  • Average age of Oscar winners by gender
  • How do solar flares affects temperatures on Earth and on Mars

Individual graded interview (10%)

To ensure individual understanding of the project and class materials, there will be an individual graded interview for each student with the professor. These will occur via Zoom after the project is turned in and last about 10 minutes; we will use the class time from the last class day, but additional time that week will be necessary.

Schedule of classes

The schedule of classes is shown below.

In addition there are tailored HTML pages for each week, linked from the table below.

Recordings of the sessions will be available (using the Panopto tab in Canvas).

Day Topic Weekly Assign Project Assign
Introduction
Mon Aug 26 Project Introduction
Tue Aug 27 All the schemas Project ideas
Mon Sep 02 Labor Day - No Class
Data Transforms
Tue Sep 03 Excluding columns and rows; Data frames; Changing columns; Recode/Lookups
Mon Sep 09 Nest and Unnest lists
Tue Sep 10 No meeting - Instructor Research Travel - DataCamp Assignment
Mon Sep 16 Group and aggregate; Wider/Longer
Tue Sep 17 Worked Example: Survey Rank Question Project Description (pairs)
R Tidyverse
Mon Sep 23 select/filter, separate/extract, mutate
Tue Sep 24 Classroom Office Hours - Career Fair
Mon Sep 30 nest/unnest, recode/join R queries 1
Tue Oct 01 pivot wider/longer; group_by summarize
SQL
Mon Oct 07 SQL intro
Tue Oct 08 SQL intro; SELECT/WHERE
Mon Oct 14 JOIN; inner and outer SQL Queries 1
Tue Oct 15 SUM (aggregates); GROUP BY
Mon Oct 21 SQL and other transforms SQL Queries 2
Python Pandas
Tue Oct 22 query and loc
Mon Oct 28 recode/merge Python queries 1
Tue Oct 29 melt/pivot
Mon Nov 04 groupby/aggregate Python queries 2
Excel
Tue Nov 05 Excel - Queries - Sorts, Filters
Mon Nov 11 No meeting - Instructor Research Travel - Project Pair Work
Tue Nov 12 No meeting - Instructor Research Travel - Project Pair Work
Mon Nov 18 Formulas and Lookup Excel queries 1
Tue Nov 19 Excel - Pivot Tables and Graphs
Mon Nov 25 Thanksgiving - No Classes
Tue Nov 26 Thanksgiving - No Classes
Mon Dec 02 Excel transforms with Powerquery Excel queries 2
Wrapup
Tue Dec 03 Project Workshop meetings
Mon Dec 09 Project Workshop meetings Draft Project (pairs)
NA Project Workshop meetings
NA Individual Interviews (Zoom) Individual Interviews (all week)

Policies

Class Attendance

Students are expected to attend all classes, unless unavoidable circumstances occur. You do not need to inform the instructor when you have to miss class.

While instructor presentations in class are recorded and materials are all provided on the course website, class sessions include practical hands-on exercises (with support from the instructor and TA), so the recordings/materials are not a substitute for participation in class.

Office hours are not a replay of class, but are available to help catch up when life means that you need to miss class; you should have watched the recordings, read the class materials, and attempted the exercises before coming to office hours.

Academic Integrity

Each student in the course is expected to abide by the University of Texas Honor Code: “As a student of The University of Texas at Austin, I shall abide by the core values of the University and uphold academic integrity.” Plagiarism is taken very seriously at UT. Therefore, if you use words or ideas that are not your own (or that you have used in previous class), you must cite your sources and use quote marks appropriately. Otherwise you will be guilty of plagiarism and subject to academic disciplinary action, including failure of the course. In particular, students are reminded that proper citation requires mentioning sources when you use them, not just in a general list of references at the end of a document. You are responsible for understanding UT’s Academic Honesty and the University Honor Code. If this is at all confusing, please take this Plagarism Tutorial.

In particular, any time you use the copy function from someone else’s writing (e.g., an article, blog post) you must have a plan about how you will use those words, how you will use quote marks (“), and how you will cite the work.

Collaboration policy

The weekly assignments are individual work. However, as long as you meet the condition below, I give you explicit permission to work together with other classmates on the assignments or on your projects. With the same condition, you are also welcome to seek input from people outside the class, such as friends and family.

The one condition is that you add a note to your homework (ideally through a comment in the Canvas submission) indicating how the work was done and identifying with whom you worked and how (thus ensuring that we are following the Academic Integrity policy above). For example you might say “Daria and I worked on this in the lab together, when we started out we were confused about X but I figured it out and shared that with Daria. Our code is very similar because we worked together”. Or perhaps “I was confused about how to pad a string with spaces, and after working at it for 30 minutes I chatted about it with my partner who suggested the xyz method. I was pleased when I got that working myself.” When you have worked together your code will have similarities, but you must not turn in identical code; rather you should take code you’ve worked on together and personalize it through comments that explain what is happening in the code. The comments must be your own, individual, work.

Neither “working together” nor “seeking input” means having others do the work for you; you should always be certain that you are learning and that you understand the code that you have submitted.

If you have questions on this policy please ask in the Assignment Discussion forum on Canvas and I will answer there. I have this policy because learning to program is both individual hard work and learning how to get help from others. Sometimes chatting through with another class member is just what is needed.

Working with Gen AI

The use of generative artificial intelligence tools (e.g., ChatGPT or Microsoft CoPilot) in this class is permitted for students who wish to use them, provided you:

  1. Provide proper attribution
  2. Provide a memo (with screenshots) of the sequence of prompts and responses together with written commentary of how you assessed the quality of the output, and how and why you developed your prompts to achieve your learning aims.

Sharing of Course Assignment Materials is Prohibited

No assignment materials used in this class, including, but not limited to quizzes, exams, papers, projects, homework assignments, review sheets, and additional problem sets, may be shared online or with anyone outside of the class unless you have my explicit, written permission. Unauthorized sharing of materials promotes cheating. It is a violation of the University’s Student Honor Code and an act of academic dishonesty. I am well aware of the sites used for sharing materials, and any materials found online that are associated with you, or any suspected unauthorized sharing of materials, will be reported to Student Conduct and Academic Integrity in the Office of the Dean of Students. These reports can result in sanctions, including failure in the course.

Class Recordings

Class recordings are reserved only for students in this class for educational purposes and are protected under FERPA. The recordings should not be shared outside the class in any form. Violation of this restriction by a student could lead to Student Misconduct proceedings.

Student rights and responsibilities

  • You have a right to a learning environment that supports mental and physical wellness.
  • You have a right to respect.
  • You have a right to be assessed and graded fairly.
  • You have a right to freedom of opinion and expression.
  • You have a right to privacy and confidentiality.
  • You have a right to meaningful and equal participation, and to self-organize groups to improve your learning environment.
  • You have a right to learn in an environment that is welcoming to all people. No student shall be isolated, excluded or diminished in any way.

With these rights come responsibilities:

  • You are responsible for taking care of yourself, managing your time, and communicating with the teaching team and with others if things start to feel out of control or overwhelming.
  • You are responsible for acting in a way that is worthy of respect and always respectful of others.
  • Your experience with this course is directly related to the quality of the energy that you bring to it, and your energy shapes the quality of your peers’ experiences.
  • You are responsible for creating an inclusive environment and for speaking up when someone is excluded. In particular, you are responsible for ensuring that your participation does not exclude the participation of others. Office hours are available for in-depth further discussion of advanced topics or other interests that pursuing in depth during class would exclude others.
  • You are responsible for holding yourself accountable to these standards, holding each other to these standards, and holding the teaching team accountable as well.

Personal Pronoun Preference

Professional courtesy and sensitivity are especially important with respect to individuals and topics dealing with differences of race, culture, religion, politics, sexual orientation, gender, gender variance, and nationalities. Class rosters are provided to the instructor with the student’s legal name. I will gladly honor your request to address you by an alternate name or gender pronoun. Please advise me of this preference early in the semester so that I may make appropriate changes to my records. I usually add “he/his” to my Zoom name for online meetings and encourage others to do the same.

Accommodations

The university is committed to creating an accessible and inclusive learning environment consistent with university policy and federal and state law. Please let me know if you experience any barriers to learning so I can work with you to ensure you have equal opportunity to participate fully in this course. If you are a student with a disability, or think you may have a disability, and need accommodations please contact Disability and Access (D&A). Please refer to D&A’s website for contact and more information: http://diversity.utexas.edu/disability/. If you are already registered with D&A , please deliver your Accommodation Letter to me as early as possible in the semester so we can discuss your approved accommodations and needs in this course.

Drop Policy

If you want to drop a class after the 12th class day, you’ll need to execute a Q drop before the Q-drop deadline, which typically occurs near the middle of the semester. You must be conscious of the maximum number of Q-drops allowed under Texas law. For more information, see the University Catalog Drop page and consult the School of Information Graduate Coordinators.

International students must meet with the international office before dropping a class that would put them below full-time status. Contrary to some suggestions, there are circumstances in which international students can hold less than full-time status without penalties, but you must confirm these with the international office.

University Resources for Students

Your success in this class is important to me. We will all need accommodations because we all learn differently. If there are aspects of this course that prevent you from learning or exclude you, please let me know as soon as possible. Together we’ll develop strategies to meet both your needs and the requirements of the course. There are also a range of resources on campus, detailed below.

Services for Students with Disabilities

This class respects and welcomes students of all backgrounds, identities, and abilities. If there are circumstances that make our learning environment and activities difficult, if you have medical information that you need to share with me, or if you need specific arrangements in case the building needs to be evacuated, please let me know.

Requests for accommodations are quite normal and quite frequent and I promise to maintain the confidentiality of these discussions. I am committed to creating an effective learning environment for all students, but I can only do so if you discuss your needs with me as early as possible. Contact Services for Students with Disabilities.

Counseling and Mental Health Center

All of us benefit from support during times of struggle. You are not alone. There are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.

If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. http://www.cmhc.utexas.edu/individualcounseling.html

The Sanger Learning Center

All students, including graduate students, are welcome to take advantage of Sanger Center’s classes and workshops, private learning specialist appointments, peer academic coaching, and tutoring for more than 70 courses in 15 different subject areas. For more information, please visit https://ugs.utexas.edu/slc/grad or call 512-471-3614 (JES A332).

University Writing Center free programs for grad students

Libraries

IT services

Student Emergency Services

Important Safety Information

If you have concerns about the safety or behavior of fellow students, TAs or Professors, call BCAL (the Behavior Concerns Advice Line): 512-232-5050. Your call can be anonymous. If something doesn’t feel right – it probably isn’t. Trust your instincts and share your concerns.