This course, “Data Wrangling,” will enable you to:
There are no prerequisites for this course; it is appropriate even if you’ve never done any programming or behind-the-scenes work with computers. I’ll address each topic “from scratch.”
However, if you do have significant, recent experience or training with programming and databases, this is not the course for you; you will find it too introductory. In this case I require you to drop the course, and I welcome you to audit parts that you think will extend your knowledge. In particular, this course is not suitable as an interdisciplinary course for Computer Science students. From Spring 2017 onward, enrollment in the course is restricted to Information School students. Students from outside the school have to contact the instructor for permission to register.
This course is taught by James Howison. My office is UTA 5.404. My office hours are by request (since any fixed time fails for as many people as it suits), but after class is usually a good time; otherwise please email. I really do want to hear from you.
The class is in the large classroom, rather than the computing lab. The software used in this class will be installed on your laptop, which you will need to bring to class. If you do not have a laptop, or yours stops working, the school has a small number available for loan. Power outlets are available for your laptop, but they can be limited; be prepared to share and, ideally, come to class with your laptop charged.
We will all install the relevant software together a few weeks into the semester, using a system called Docker, with installs of MySQL (MariaDB), phpMyAdmin, and Jupyter (which provides Python). One advantage of this is that you will end the course with all the software you will use available for your continued use and exploration of data wrangling, data analysis, and data science.
There are no required texts for the course, but you will find these resources to be useful.
An intro book for MySQL that’s available online at UT is: Learning MySQL.
We’ll be covering everything with materials in the course, but there are lots of useful online courses that complement this material. As a member of this class you will have free access to the DataCamp site; I believe that access extends for 6 months. The most relevant courses are:
The bulk of your course grade (75%) comes from Weekly Assignments. There are assignments throughout each week for this course, covering the material addressed that week. The weekly assignments are due at 11:59 pm on Sunday (this is to ensure that we can grade them before the Tuesday class). Late assignments will receive zero, but you can drop your 2 lowest grades. However, don’t simply skip an assignment once it is late; the assignments test and drive your learning, and your performance helps guide how I cover the material. So not completing an assignment is a sure way to fall behind. Students have used their 2 drops in the past and then been very sad when an actual emergency meant they could not complete their homework.
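The drop policy above can be illustrated with a small Python sketch. It is only an illustration, not the official grade calculation: it assumes every weekly assignment is scored as a percentage and weighted equally, and the function name `weekly_component` and the sample scores are hypothetical.

```python
def weekly_component(scores, drops=2):
    """Average the weekly scores after dropping the lowest `drops` scores.

    `scores` is a list of percentages (0-100). A late or missing
    assignment is recorded as 0, following the late policy above.
    Equal weighting of assignments is an assumption of this sketch.
    """
    if len(scores) <= drops:
        raise ValueError("need more scores than drops")
    kept = sorted(scores)[drops:]  # discard the lowest `drops` scores
    return sum(kept) / len(kept)


# Hypothetical scores: two assignments were late, so they count as 0.
scores = [90, 0, 85, 100, 0, 95]
print(weekly_component(scores))  # 92.5
```

Under these assumptions the two zeros are absorbed by the drops. A third late assignment would stay in the average as a zero, which is why using up the drops early is risky.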
The assignments, and grading rubric, will be released on Canvas during the Thursday class, so we’ll go over the assignment and ensure everyone knows what’s required. Each assignment will be turned in through Canvas, usually by uploading a PDF, text, or Python file.
If you’ve uploaded a PDF there will be comments left on the PDF, in addition to any text comments in Canvas. You can see the comments on the PDF by using the “viewing feedback” button.
A portion of the course (25%) is a project to produce a data wrangling workflow (sometimes also called a “data pipeline”). The project will be done in pairs. We will have an initial individual assignment to generate ideas, then opportunities to review each other’s ideas and form groups. Working in pairs for this course does not mean dividing up the work; I require each group member to understand everything about their project. If there is a part of your project that you don’t understand or couldn’t work with then you are missing a crucial learning opportunity. We work in pairs to work together (supporting each other’s learning), not to reduce the amount or diversity of the work that we do.
When I use the word “workflow” (or “pipeline”) I mean a process that moves from data you obtain through reading, cleaning, validating, analyzing, and providing results. The workflow includes the data, the scripts that read and query the data, and the final output. We will discuss example projects in class early in the semester.
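To make the stages concrete, here is a minimal sketch of such a workflow in Python, using only the standard library. Everything in it is hypothetical and chosen for illustration: the sample data, the column names, and the function names are assumptions, and a real project would read from files and a database rather than from a string.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data standing in for a file you obtained.
RAW = """city,year,visitors
Austin,2019, 120
austin ,2019,80
Dallas,2019,n/a
Dallas,2019,200
"""


def read_rows(text):
    """Read: parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: trim whitespace and standardize capitalization of city names."""
    return [
        {
            "city": r["city"].strip().title(),
            "year": r["year"].strip(),
            "visitors": r["visitors"].strip(),
        }
        for r in rows
    ]


def validate(rows):
    """Validate: keep only rows whose visitor count is a whole number."""
    return [r for r in rows if r["visitors"].isdigit()]


def analyze(rows):
    """Analyze: total visitors per city."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["city"]] += int(r["visitors"])
    return dict(totals)


def report(totals):
    """Provide results: one line per city, sorted by name."""
    return [f"{city}: {total}" for city, total in sorted(totals.items())]


results = report(analyze(validate(clean(read_rows(RAW)))))
print("\n".join(results))
```

Each stage is a separate function so that it can be checked on its own. Here the cleaning step merges “austin ” with “Austin”, and the validation step discards the row whose visitor count is “n/a”.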
The project builds up through the semester (e.g., after we’ve learned Database Design you will do a design for the database in your workflow), culminating in a full workflow that you demonstrate and describe through the report. In the past I required a screencast but that is not required now (so ignore any references to producing a screencast).
There are more details on the specific Assignment page: Project Workflow and Report, but note that each project needs at least three datasets from at least two different organizations (i.e., 3 datasets in total per idea, and no more than 2 of those can come from the same organization).
Example projects from previous years (note that whimsy is entirely acceptable. Encouraged, in fact):
Although not required for this class, you may want to present your workflow as a Student Project at the iSchool’s Open House (typically in May). The Open House, while including Capstones, is also an opportunity to present student projects, including projects from this course (but also from other courses or semesters!).
Unlike almost all other iSchool classes, our class meets twice a week: Tuesdays and Thursdays, 10:30-11:45 am, in the large classroom (UTA 1.208), unless the schedule below indicates otherwise.
If you miss a class it is your responsibility to catch up; I provide screencasts that cover much of the material (but not all). Please identify a classmate early on who will help you catch up on material if there is no screencast available for that week. Office hours are not for personal replays of missed classes, but neither do I want you to fall behind, so please watch the screencasts or meet with classmates, then come to office hours and we’ll get you back on track.
| Day | Topic | Weekly Assignment | Project Assignment |
|---|---|---|---|
| Tue Jan 21 | Intro/Card exercise | | |
| Thu Jan 23 | Project Introduction | | |
| Tue Jan 28 | Data Modeling I - has_many, datatypes | | Project ideas |
| Thu Jan 30 | Data Modeling II - habtm, project speed dating | ER Diagram of book store | |
| Tue Feb 04 | Data Modeling III - has_many through | | Project Description |
| Thu Feb 06 | Docker setup and phpMyAdmin | From ER to Database tables | |
| Tue Feb 11 | SQL SELECT, operators | | Project Database and Workflow Design |
| Thu Feb 13 | SQL ORDER BY, LIMIT | Queries 1 and Reverse Engineer Design | |
| Tue Feb 18 | SQL Join, Single table | | |
| Thu Feb 20 | SQL Join, many tables | Queries 2 | |
| Tue Feb 25 | No class: Professor traveling for research | | |
| Thu Feb 27 | No class: Professor traveling for research | | |
| Tue Mar 03 | SQL Aggregate functions | | |
| Thu Mar 05 | SQL Grouping | | |
| Tue Mar 10 | SQL INSERT/UPDATE/DELETE | Queries 3 | |
| Thu Mar 12 | SQL joining without keys | | |
| Tue Mar 17 | No class: Spring break | | |
| Thu Mar 19 | No class: Spring break | | |
| Tue Mar 24 | Excel - Queries - Sorts, Filters, joins | | |
| Thu Mar 26 | Excel - Pivot Tables and Graphs | Excel Queries | |
| Tue Mar 31 | Python - Hello World; Python - Strings and Variables | | |
| Thu Apr 02 | Python - Lists and dicts | | |
| Tue Apr 07 | Python - if/else/while | Guessing game | |
| Thu Apr 09 | Python - CSV Reading; Python - CSV Writing | | |
| Tue Apr 14 | Python - Python and SQL | Reading and writing CSV | |
| Thu Apr 16 | Python - Parameterized Queries | | |
| Tue Apr 21 | Python - CSV to SQL via Python | Read CSVs, insert to Database | Project Prototype Due |
| Thu Apr 23 | Python - Data transforms/Dates | | |
| Tue Apr 28 | Python - Regex and splits | School absence code transform | |
| Thu Apr 30 | Python - Working with large files/Workshop | | |
| Tue May 05 | Project Workshop | | |
| Thu May 07 | What’s Next? Databases and Python | | Project Due |
Each student in the course is expected to abide by the University of Texas Honor Code: “As a student of The University of Texas at Austin, I shall abide by the core values of the University and uphold academic integrity.” Plagiarism is taken very seriously at UT. Therefore, if you use words or ideas that are not your own (or that you have used in a previous class), you must cite your sources. Otherwise you will be guilty of plagiarism and subject to academic disciplinary action, including failure of the course. In particular, students are reminded that proper citation requires mentioning sources when you use them, not just in a general list of references at the end of a document. You are responsible for understanding UT’s Academic Honesty and the University Honor Code which can be found at the following web address.
The weekly assignments are individual work. However, on one condition, I give you explicit permission to work together with other classmates on the assignments or on your projects. With the same condition, you are also welcome to seek input from people outside the class, such as friends and family. Neither “working together” nor “seeking input” means having others do the work for you; you should always be certain that you are learning and that you understand the code that you have submitted.
The one condition is that you add a note to your homework (ideally through a comment in the Canvas submission) indicating how the work was done and identifying with whom you worked and how (thus ensuring that we are following the Academic Integrity policy above). For example you might say “Daria and I worked on this in the lab together, when we started out we were confused about X but I figured it out and shared that with Daria. Our code is very similar because we worked together”. Or perhaps “I was confused about how to pad a string with spaces, and after working at it for 30 minutes I chatted about it with my partner who suggested the xyz method. I was pleased when I got that working myself.” When you have worked together your code will have similarities, but you must not turn in identical code; rather you should take code you’ve worked on together and personalize it through comments that explain what is happening in the code. The comments must be your own, individual, work.
If you have questions on this policy please ask in the Discussion forum and I will answer there: Assignment Discussions. I have this policy because learning to program is both individual hard work and learning how to get help from others. Sometimes chatting through with another class member is just what is needed.
With these rights come responsibilities:
Professional courtesy and sensitivity are especially important with respect to individuals and topics dealing with differences of race, culture, religion, politics, sexual orientation, gender, gender variance, and nationalities. Class rosters are provided to the instructor with the student’s legal name. I will gladly honor your request to address you by an alternate name or gender pronoun. Please advise me of this preference early in the semester so that I may make appropriate changes to my records.
If you want to drop a class after the 12th class day, you’ll need to execute a Q drop before the Q-drop deadline, which typically occurs near the middle of the semester. Under Texas law, you are only allowed six Q drops while you are in college at any public Texas institution. For more information, see: http://www.utexas.edu/ugs/csacc/academic/adddrop/qdrop
International students must meet with the international office before dropping a class that would put them below full-time status.
Your success in this class is important to me. We will all need accommodations because we all learn differently. If there are aspects of this course that prevent you from learning or exclude you, please let me know as soon as possible. Together we’ll develop strategies to meet both your needs and the requirements of the course. There is also a range of resources on campus, detailed below.
This class respects and welcomes students of all backgrounds, identities, and abilities. If there are circumstances that make our learning environment and activities difficult, if you have medical information that you need to share with me, or if you need specific arrangements in case the building needs to be evacuated, please let me know. I am committed to creating an effective learning environment for all students, but I can only do so if you discuss your needs with me as early as possible. I promise to maintain the confidentiality of these discussions. If appropriate, also contact Services for Students with Disabilities, 512-471-6259 (voice) or 1-866-329-3986 (video phone). http://ddce.utexas.edu/disability/about/
All of us benefit from support during times of struggle. You are not alone. There are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.
If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. http://www.cmhc.utexas.edu/individualcounseling.html
All students, including graduate students, are welcome to take advantage of Sanger Center’s classes and workshops, private learning specialist appointments, peer academic coaching, and tutoring for more than 70 courses in 15 different subject areas. For more information, please visit https://ugs.utexas.edu/slc/grad or call 512-471-3614 (JES A332).
If you have concerns about the safety or behavior of fellow students, TAs or Professors, call BCAL (the Behavior Concerns Advice Line): 512-232-5050. Your call can be anonymous. If something doesn’t feel right – it probably isn’t. Trust your instincts and share your concerns.
The following recommendations regarding emergency evacuation come from the Office of Campus Safety and Security (512-471-5767, http://www.utexas.edu/safety/). Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside.