This course, “Data Wrangling,” will enable you to:
There are no prerequisites for this course; it is appropriate even if you’ve never done any programming or behind-the-scenes work with computers. I’ll address each topic “from scratch.”
However, if you do have significant, recent experience or training with programming and databases, this is not the course for you; you will find it too introductory. In this case I require you to drop the course and welcome you to audit parts that you think will extend your knowledge. In particular, this course is not suitable as an interdisciplinary course for Computer Science students. From Spring 2017 onward, open registration in the course is restricted to Information School students; students from outside the school are welcome, but have to contact the instructor for permission to register.
Students who have taken (or are taking) Database Management in the Information School may enroll but should reach out to the professor to discuss.
This course is taught by James Howison.
For the online class, the small group instruction sessions via Zoom are a great opportunity to get to know each other. I am also happy to meet individually via Zoom for office hours.
I would like to acknowledge that we are meeting on Indigenous land. Moreover, I would like to acknowledge and pay our respects to the Carrizo & Comecrudo, Coahuiltecan, Caddo, Tonkawa, Comanche, Lipan Apache, Alabama-Coushatta, Kickapoo, Tigua Pueblo, and all the American Indian and Indigenous Peoples and communities who have been or have become a part of these lands and territories in Texas, here on Turtle Island.
The software used in this class will be installed on your laptop. If you do not have a laptop, or yours stops working, the school and university have resources available. Please check these university resources. Check the “Before your classes” section (I believe that you reach out to the Texas One Stop).
We will all install the relevant software a few weeks into the semester, using a system called Docker, with installs of MySQL (MariaDB), phpMyAdmin, and Jupyter (which provides Python). One advantage of this is that you will end the course with all the software you will use available for your continued use and exploration of data wrangling, data analysis, and data science. I intend to use small group instruction sessions to help with laptop setup.
Follow this link for Tech Setup materials.
There are no required texts for the course, but you will find these resources to be useful.
Recommended Texts:
An intro book for MySQL that’s available online at UT is: Learning MySQL.
We’ll be covering everything with materials in the course, but there are lots of useful online courses that complement this material. As a member of this class you will have free access to the DataCamp site; I believe that access extends for 6 months. I will establish the access a few weeks into the semester, causing an invitation email to come to the email address registered with the University. The most relevant courses are:
The bulk of your course grade (75%) comes from Weekly Assignments. There are assignments throughout each week for this course, covering the material addressed that week. The weekly assignments are due at 11:59 pm on Sunday (this is to ensure that we can grade them before Tuesday’s class). Grades for late assignments will be reduced by 20%, but you can drop your 2 lowest grades. It’s always worth turning in the assignment, even if late, because the assignments test and drive your learning and your performance helps guide me on the material. So not completing an assignment is a sure way to fall behind. Students have used their 2 drops in the past and then been very sad when an actual emergency meant they could not complete their homework.
The assignments, and grading rubric, will be released on Canvas before each Tues class, so we’ll go over the assignment and ensure everyone knows what’s required. Each assignment will be turned in through Canvas, usually by uploading a PDF, text, or notebook HTML export file.
If you’ve uploaded a PDF there will be comments left on the PDF, in addition to any text comments in Canvas. You can see the comments on the PDF by using the “viewing feedback” button.
A portion of the course (25%) is a project to produce a data wrangling workflow (sometimes also called a “data pipeline”). The project will be done in pairs. We will have an initial individual assignment to generate ideas, then opportunities to review each other’s ideas and form groups. Working in pairs for this course does not mean dividing up the work; I require each group member to understand everything about their project. If there is a part of your project that you don’t understand or couldn’t work with then you are missing a crucial learning opportunity. We work in pairs to work together (supporting each other’s learning) not to reduce the amount or diversity of the work that we do.
When I use the word “workflow” (or “pipeline”) I mean a process that moves from data you obtain through reading, cleaning, validating, analyzing, and providing results. The workflow includes the data, the scripts that read and query the data, and the final output. We will discuss example projects in class early in the semester.
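As a rough illustration of the stages named above (reading, cleaning, validating, analyzing, and providing results), here is a minimal sketch in Python. The data, column names, and function names are all made up for this example; a real project workflow will use your own datasets and will usually involve a database as well.

```python
import csv
import io

# Made-up sample data standing in for a file you obtained.
RAW = "animal,count\ncat, 3\ndog,unknown\nbird,5\n"

def read_rows(text):
    """Read: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Clean: strip stray whitespace from every value."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]

def validate(rows):
    """Validate: keep only rows whose count is a whole number."""
    return [row for row in rows if row["count"].isdigit()]

def analyze(rows):
    """Analyze: add up the counts."""
    return sum(int(row["count"]) for row in rows)

valid_rows = validate(clean(read_rows(RAW)))
total = analyze(valid_rows)

# Provide results.
print(f"{len(valid_rows)} valid rows, total count {total}")
```

Each stage is a small function so that you can test and explain it on its own, which also makes it easier for both project partners to understand every part.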
The project builds up through the semester (e.g., after we’ve learned Database Design you will do a design for the database in your workflow), culminating in a full workflow that you demonstrate and describe through the report.
There are more details on the specific Assignment page on Canvas, but note that each project needs at least three datasets from at least two different organizations (i.e., 3 datasets in total, per idea, and no more than 2 of those can come from the same organization). In addition, your project must handle synonyms of some form (e.g., modeling a single US state that can be referred to as Texas, TX, State of Texas, Tejas, etc.). This page outlines advice on finding interesting data sources including tools to convert formats.
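One possible way to think about the synonym requirement is a lookup that maps every known variant to a single canonical value. The sketch below is only an illustration under that assumption; the dictionary, the function name, and the sample values are hypothetical, and in your project the same idea would more likely live in a database table.

```python
# Hypothetical synonym table: every known variant maps to one canonical code.
STATE_SYNONYMS = {
    "texas": "TX",
    "tx": "TX",
    "state of texas": "TX",
    "tejas": "TX",
}

def canonical_state(raw):
    """Return the canonical code for a state name, or None if it is unknown."""
    # Lowercase, trim, and collapse repeated spaces before looking up.
    key = " ".join(raw.strip().lower().split())
    return STATE_SYNONYMS.get(key)

names = ["Texas", " TX ", "State of  Texas", "Tejas", "Atlantis"]
resolved = [canonical_state(name) for name in names]
print(resolved)
```

Returning None for an unknown name, rather than guessing, makes unmatched values easy to find and add to the synonym table later.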
Example projects from previous years (note that whimsy is entirely acceptable. Encouraged, in fact):
Although not required for this class, you may want to present your workflow as a Student Project at the iSchool’s Open House (typically in May). The Open House, while including Capstones, is also an opportunity to present student projects, including projects from this course (but also from other courses or semesters!).
Our first class session will be on Thursday August 27 at 10:30am central. Zoom instructions will be available in Canvas.
Following that first Thursday meeting, for the rest of semester, we will meet in two ways:
In addition, there are tailored HTML pages for each week, linked from the table below, and frequently there will be linked screencasts covering the materials (I am updating these screencasts for this semester). The table below divides things into Tues/Thurs but both will be covered in the Tuesday synchronous session and the Thursday small groups will address questions from both topics.
Given COVID circumstances, I understand that students may be in different time zones, or have unpredictable requirements to care for others. I myself will frequently have my four-year-old daughter at home with me. Neither the Tuesday synchronous session nor the small group sessions are compulsory. Recordings of the synchronous session will be available, and you are welcome to ask friends to request recordings of the small group sessions. In either session, it will be normal to have people, cats, dogs, barking, etc.!
Day | Topic | Weekly Assign | Project Assign |
---|---|---|---|
Introduction | |||
Thu Aug 27 | Introduction; Project Introduction | ||
Tue Sep 01 | Card Exercise | ||
Thu Sep 03 | Small Group Sessions (intro and project idea) | Project ideas | |
Data Modeling | |||
Tue Sep 08 | Data Modeling I - has_many | ||
Thu Sep 10 | Data Modeling I - has_many - discussion | ER Diagram of book store | |
Tue Sep 15 | Data Modeling II - habtm, project speed dating | ||
Thu Sep 17 | Data Modeling III - attributes | Database Design for Social Media | |
SQL Queries | |||
Tue Sep 22 | SQL SELECT, operators | ||
Thu Sep 24 | SQL ORDER BY, LIMIT | Queries 1 and Reverse Engineer Design | |
Tue Sep 29 | SQL Join, Single table | ||
Thu Oct 01 | SQL Join, many tables | Queries 2 | |
Tue Oct 06 | SQL Aggregate functions | ||
Thu Oct 08 | SQL Grouping | Queries 3 | |
Tue Oct 13 | SQL INSERT/UPDATE/DELETE | ||
Thu Oct 15 | SQL joining without keys | ||
Excel Queries | |||
Tue Oct 20 | Excel - Queries - Sorts, Filters, joins | ||
Thu Oct 22 | Excel - Pivot Tables and Graphs | Excel Queries | |
Python | |||
Tue Oct 27 | Python - Hello World, Strings and Variables | ||
Thu Oct 29 | Python - Lists and dicts | Python - Finegan replacement | |
Tue Nov 03 | Python - if/else/while | ||
Thu Nov 05 | Python - CSV Reading | Guessing game | |
Tue Nov 10 | Python - CSV Writing | ||
Thu Nov 12 | Python - Python and SQL | Reading and writing CSV | |
Tue Nov 17 | Python - Parameterized Queries | ||
Thu Nov 19 | Python - CSV to SQL via Python | Read CSVs, insert to Database | Project Prototype Due |
Tue Nov 24 | Thanksgiving Break | ||
Thu Nov 26 | Thanksgiving Break | ||
Tue Dec 01 | Python - Data transforms/Dates, Regex and splits | ||
Thu Dec 03 | Project Workshop meetings; Presentations; What’s Next? | Project due (Friday night) |
Each student in the course is expected to abide by the University of Texas Honor Code: “As a student of The University of Texas at Austin, I shall abide by the core values of the University and uphold academic integrity.” Plagiarism is taken very seriously at UT. Therefore, if you use words or ideas that are not your own (or that you have used in a previous class), you must cite your sources and use quote marks appropriately. Otherwise you will be guilty of plagiarism and subject to academic disciplinary action, including failure of the course. In particular, students are reminded that proper citation requires mentioning sources when you use them, not just in a general list of references at the end of a document. You are responsible for understanding UT’s Academic Honesty and the University Honor Code. If this is at all confusing, please take this Plagiarism Tutorial.
In particular, any time you use the copy function from someone else’s writing (e.g., an article, blog post) you must have a plan about how you will use those words, how you will use quote marks ("), and how you will cite the work.
The weekly assignments are individual work. However, on one condition, I give you explicit permission to work together with other classmates on the assignments or on your projects. With the same condition, you are also welcome to seek input from people outside the class, such as friends and family. Neither “working together” nor “seeking input” means having others do the work for you; you should always be certain that you are learning and that you understand the code that you have submitted.
The one condition is that you add a note to your homework (ideally through a comment in the Canvas submission) indicating how the work was done and identifying with whom you worked and how (thus ensuring that we are following the Academic Integrity policy above). For example you might say “Daria and I worked on this in the lab together, when we started out we were confused about X but I figured it out and shared that with Daria. Our code is very similar because we worked together”. Or perhaps “I was confused about how to pad a string with spaces, and after working at it for 30 minutes I chatted about it with my partner who suggested the xyz method. I was pleased when I got that working myself.” When you have worked together your code will have similarities, but you must not turn in identical code; rather you should take code you’ve worked on together and personalize it through comments that explain what is happening in the code. The comments must be your own, individual, work.
If you have questions on this policy please ask in the Assignment Discussion forum and I will answer there. I have this policy because learning to program is both individual hard work and learning how to get help from others. Sometimes chatting through with another class member is just what is needed.
Class recordings are reserved only for students in this class for educational purposes and are protected under FERPA. The recordings should not be shared outside the class in any form. Violation of this restriction by a student could lead to Student Misconduct proceedings.
To help keep everyone at UT and in our community safe, it is critical that students report COVID-19 symptoms and testing, regardless of test results, to University Health Services, and faculty and staff report to the HealthPoint Occupational Health Program (OHP) as soon as possible. Please see this link to understand what needs to be reported. In addition, to help understand what to do if a fellow student in the class (or the instructor or TA) tests positive for COVID, see this University Health Services link.
With these rights come responsibilities:
Professional courtesy and sensitivity are especially important with respect to individuals and topics dealing with differences of race, culture, religion, politics, sexual orientation, gender, gender variance, and nationalities. Class rosters are provided to the instructor with the student’s legal name. I will gladly honor your request to address you by an alternate name or gender pronoun. Please advise me of this preference early in the semester so that I may make appropriate changes to my records. I usually add “he/his” to my Zoom name for online meetings and encourage others to do the same.
If you want to drop a class after the 12th class day, you’ll need to execute a Q drop before the Q-drop deadline, which typically occurs near the middle of the semester. Under Texas law, you are only allowed six Q drops while you are in college at any public Texas institution. For more information, see: http://www.utexas.edu/ugs/csacc/academic/adddrop/qdrop
International students must meet with the international office before dropping a class that would put them below full-time status.
Your success in this class is important to me. We will all need accommodations because we all learn differently. If there are aspects of this course that prevent you from learning or exclude you, please let me know as soon as possible. Together we’ll develop strategies to meet both your needs and the requirements of the course. There are also a range of resources on campus, detailed below.
This class respects and welcomes students of all backgrounds, identities, and abilities. If there are circumstances that make our learning environment and activities difficult, if you have medical information that you need to share with me, or if you need specific arrangements in case the building needs to be evacuated, please let me know.
I am committed to creating an effective learning environment for all students, but I can only do so if you discuss your needs with me as early as possible. Requests for accommodations are quite normal and quite frequent, and I promise to maintain the confidentiality of these discussions. If appropriate, also contact Services for Students with Disabilities.
All of us benefit from support during times of struggle. You are not alone. There are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.
If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. http://www.cmhc.utexas.edu/individualcounseling.html
All students, including graduate students, are welcome to take advantage of Sanger Center’s classes and workshops, private learning specialist appointments, peer academic coaching, and tutoring for more than 70 courses in 15 different subject areas. For more information, please visit https://ugs.utexas.edu/slc/grad or call 512-471-3614 (JES A332).
If you have concerns about the safety or behavior of fellow students, TAs or Professors, call BCAL (the Behavior Concerns Advice Line): 512-232-5050. Your call can be anonymous. If something doesn’t feel right – it probably isn’t. Trust your instincts and share your concerns.