Introduction

This course, “Data Wrangling,” will enable you to:

  • design database schemas for efficient data representation
  • implement database schemas using MySQL
  • learn how to learn new technologies
  • learn the basics of programming in Python
  • import and export data to/from CSV and Excel, changing schemas as needed
  • conduct basic analyses in Excel
  • prepare a project workflow (or “pipeline”) that imports data from different sources and produces a report.

There are no prerequisites for this course; it is appropriate even if you’ve never done any programming or behind-the-scenes work with computers. I’ll address each topic “from scratch.”

However, if you have significant, recent experience or training with programming and databases, this is not the course for you; you will find it too introductory. In this case I require you to drop the course and welcome you to audit parts that you think will extend your knowledge. In particular, this course is not suitable as an interdisciplinary course for Computer Science students. From Spring 2017 onward, enrollment in the course is restricted to Information School students. Students from outside the school must contact the instructor for permission to register.

Professor

This course is taught by James Howison. My office is UTA 5.404. My office hours are 1-2pm Thursday.

Computing resources

The class is in the large classroom, rather than the computing lab. The software used in this class will be installed on your laptop, which you will need to bring to class. If you do not have a laptop, or yours stops working, the school has a small number available for loan. Power outlets are available for your laptop, but they can be limited; be prepared to share and, ideally, come to class with your laptop charged.

We will all install the relevant software together a few weeks into the semester, using a system called Docker, with installs of MySQL (MariaDB), phpMyAdmin, and Jupyter (which provides Python). One advantage of this is that you will end the course with all the software you will use available for your continued use and exploration of data wrangling, data analysis, and data science.

Course Texts

There are no required texts for the course, but you will find these resources to be useful.

Recommended Texts:

An intro book for MySQL that’s available online at UT is: Learning MySQL.

We’ll be covering everything with materials in the course, but there are lots of useful online courses that complement this material. As a member of this class you will have free access to the DataCamp site; I believe that access extends for 6 months. The most relevant courses are:

Assessments

Weekly Assignments

The bulk of your course grade (75%) comes from Weekly Assignments. There are assignments throughout each week for this course, covering the material addressed that week. The weekly assignments are due 11:59 pm on Sunday (this is to ensure that we can grade them before the Tuesday class). Late assignments will receive zero, but you can drop your 2 lowest grades. However, don’t simply skip an assignment once it is late; the assignments test and drive your learning, and your performance helps guide me on the material. Not completing an assignment is a sure way to fall behind. Students have used their 2 drops in the past and then been very sad when an actual emergency meant they could not complete their homework.
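As a rough illustration of how the drop rule plays out, the sketch below averages a set of weekly scores after discarding the two lowest. It assumes every assignment carries equal weight and is scored out of 100, neither of which is stated above; the function name and the sample scores are invented for the example.

```python
def weekly_component(scores, drops=2):
    """Average the weekly scores after dropping the lowest `drops` of them.

    Assumes equal weighting and a 0-100 scale (illustrative assumptions).
    A late assignment is recorded as 0, per the policy above.
    """
    if len(scores) <= drops:
        raise ValueError("need more scores than drops")
    kept = sorted(scores)[drops:]  # discard the lowest `drops` scores
    return sum(kept) / len(kept)


# Hypothetical semester: two late assignments recorded as zero.
scores = [90, 0, 85, 100, 0, 95]
print(weekly_component(scores))  # average of 85, 90, 95 and 100

# A third zero can no longer be dropped, so it pulls the average down.
print(weekly_component(scores + [0]))
```

With both drops already spent on late work, any further zero stays in the average, which is the situation the paragraph above warns about.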

The assignments, and grading rubric, will be released on Canvas during the Thursday class, so we’ll go over the assignment and ensure everyone knows what’s required. Each assignment will be turned in through Canvas, usually by uploading a PDF, text, or Python file.

If you’ve uploaded a PDF there will be comments left on the PDF, in addition to any text comments in Canvas. You can see the comments on the PDF by using the “viewing feedback” button.

Project

A portion of the course (25%) is a project to produce a data wrangling workflow (sometimes also called a “data pipeline”). The project will be done in pairs. We will have an initial individual assignment to generate ideas, then opportunities to review each other’s ideas and form groups. Working in pairs for this course does not mean dividing up the work; I require each group member to understand everything about their project. If there is a part of your project that you don’t understand or couldn’t work with then you are missing a crucial learning opportunity. We work in pairs to work together (supporting each other’s learning) not to reduce the amount or diversity of the work that we do.

When I use the word “workflow” (or “pipeline”) I mean a process that moves from data you obtain through reading, cleaning, validating, analyzing, and providing results. The workflow includes the data, the scripts that read and query the data, and the final output. We will discuss example projects in class early in the semester.
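To make the stages concrete, here is a minimal sketch of a workflow in Python. It is only an illustration: the sample data, column names, and function names are invented, and it uses Python’s standard `csv` module on an in-memory string rather than the MySQL database that a real project in this course would include.

```python
import csv
import io

# Invented sample data standing in for a file you obtained.
RAW_CSV = """city,temperature_f
Austin, 98
Dallas,95
Houston,
Austin,102
"""


def read_rows(text):
    """Read: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Clean: strip stray whitespace from every value."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def validate_rows(rows):
    """Validate: keep only rows whose temperature is a whole number."""
    return [row for row in rows if row["temperature_f"].isdigit()]


def analyze(rows):
    """Analyze: average temperature per city."""
    readings = {}
    for row in rows:
        readings.setdefault(row["city"], []).append(int(row["temperature_f"]))
    return {city: sum(temps) / len(temps) for city, temps in readings.items()}


def report(averages):
    """Provide results: one line of text per city."""
    return [f"{city}: {avg:.1f}" for city, avg in sorted(averages.items())]


rows = validate_rows(clean_rows(read_rows(RAW_CSV)))
for line in report(analyze(rows)):
    print(line)
```

Keeping each stage in its own function means you can check the output of one step before moving on to the next, which is how the project builds up over the semester.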

The project builds up through the semester (e.g., after we’ve learned Database Design you will do a design for the database in your workflow), culminating in a full workflow that you demonstrate and describe through the report. In the past I required a screencast but that is not required now (so ignore any references to producing a screencast).

There are more details on the specific Assignment page: Project Workflow and Report, but note that each project needs at least three datasets from at least two different organizations (i.e., 3 datasets in total, per idea, no more than 2 of those can come from the same organization).
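If it helps to see the dataset rule spelled out, the sketch below checks a project idea against it. The function name, the `(dataset, organization)` pair format, and the example datasets are all hypothetical; this is one reading of the rule, not an official checker.

```python
def meets_dataset_rule(datasets):
    """Check an idea: at least 3 datasets from at least 2 organizations.

    `datasets` is a list of (dataset_name, organization) pairs.
    """
    organizations = {org for _name, org in datasets}
    return len(datasets) >= 3 and len(organizations) >= 2


# Hypothetical idea: two datasets from one organization, one from another.
idea = [
    ("baby names", "Agency A"),
    ("birth counts", "Agency A"),
    ("character list", "Fan Wiki B"),
]
print(meets_dataset_rule(idea))

# All three from the same organization does not satisfy the rule.
print(meets_dataset_rule([(name, "Agency A") for name, _org in idea]))
```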

Example projects from previous years (note that whimsy is entirely acceptable. Encouraged, in fact):

  • State education, teachers’ salaries, public schools. Report.
  • Visualizing impact of weather on border crossings. Report.
  • How have “Game of Thrones” character names influenced baby names in the US? Report.
  • Lightning strikes and baseball games
  • Average age of Oscar winners by gender
  • How do solar flares affect temperatures on Earth and on Mars

iSchool Open Day

Although not required for this class, you may want to present your workflow as a Student Project at the iSchool’s Open Day (typically in May). The Open House, while including Capstones, is also an opportunity to present student projects, including projects from this course (but also from other courses or semesters!).

Schedule of class meetings

Unlike almost all other iSchool classes, our class meets twice a week: Tuesdays and Thursdays 10:30-11:45am in the large classroom UTA 1.208, unless indicated in red below.

If you miss a class it is your responsibility to catch up; I provide screencasts that cover much of the material (but not all). Please identify a classmate early on who will help you catch up on material if there is no screencast available for that week. Office hours are not for personal replays of missed classes, but neither do I want you to fall behind, so please watch the screencasts or meet with colleagues, then come to office hours and we’ll get you back on track.

Shading of rows below alternates to show weeks.

Day | Topic | Weekly Assign | Project Assign
Introduction
Tue Jan 22 | Intro/Card exercise | |
Thu Jan 24 | Professor traveling for research | |
Tue Jan 29 | Project Introduction | | Project ideas
Data Modeling
Thu Jan 31 | Data Modeling I - has_many, datatypes | |
Tue Feb 05 | Professor traveling for research | |
Thu Feb 07 | Data Modeling II - habtm, project speed dating | ER Diagram of book store | Project Description
Tue Feb 12 | Data Modeling III - has_many through | |
Thu Feb 14 | Docker setup and phpMyAdmin | From ER to Database tables | Project Database and Workflow Design
SQL Queries
Tue Feb 19 | SQL SELECT, operators | |
Thu Feb 21 | SQL ORDER BY, LIMIT | Queries 1 and Reverse Engineer Design |
Tue Feb 26 | SQL Join, Single table | |
Thu Feb 28 | SQL Join, many tables | Queries 2 |
Tue Mar 05 | SQL Aggregate functions; SQL Grouping | |
Thu Mar 07 | SQL INSERT/UPDATE/DELETE | Queries 3 |
Excel Queries
Tue Mar 12 | Excel - Queries - Sorts, Filters, joins | |
Thu Mar 14 | Excel - Pivot Tables and Graphs | Excel Queries |
Tue Mar 19 | Spring break | |
Thu Mar 21 | Spring break | |
Python I
Tue Mar 26 | Python - Hello World; Python - Strings and Variables | |
Thu Mar 28 | Python - Lists and dicts | |
Tue Apr 02 | Python - if/else/while | Guessing game |
Thu Apr 04 | Python - CSV Reading; Python - CSV Writing | |
Tue Apr 09 | Python - Python and SQL | Reading and writing CSV |
Thu Apr 11 | Python - Parameterized Queries | |
Tue Apr 16 | Python - CSV to SQL via Python | Read CSVs, insert to Database |
Thu Apr 18 | Project Workshop | | Project Prototype Due
Python II
Tue Apr 23 | Python - Data transforms/Dates | |
Thu Apr 25 | Python - Regex and splits | School absence code transform |
Tue Apr 30 | Python - Working with large files/Workshop | |
Thu May 02 | Project Workshop | | Project Due
Wrapup
Tue May 07 | What’s Next? Databases | |
Thu May 09 | What’s Next? Python | |

Policies

Academic Integrity

Each student in the course is expected to abide by the University of Texas Honor Code: “As a student of The University of Texas at Austin, I shall abide by the core values of the University and uphold academic integrity.” Plagiarism is taken very seriously at UT. Therefore, if you use words or ideas that are not your own (or that you have used in a previous class), you must cite your sources. Otherwise you will be guilty of plagiarism and subject to academic disciplinary action, including failure of the course. In particular, students are reminded that proper citation requires mentioning sources when you use them, not just in a general list of references at the end of a document. You are responsible for understanding UT’s Academic Honesty and the University Honor Code which can be found at the following web address.

Collaboration policy

The weekly assignments are individual work. However, on one condition, I give you explicit permission to work together with other classmates on the assignments or on your projects. With the same condition, you are also welcome to seek input from people outside the class, such as friends and family. Neither “working together” nor “seeking input” means having others do the work for you; you should always be certain that you are learning and that you understand the code that you have submitted.

The one condition is that you add a note to your homework (ideally through a comment in the Canvas submission) indicating how the work was done and identifying with whom you worked and how (thus ensuring that we are following the Academic Integrity policy above). For example you might say “Daria and I worked on this in the lab together, when we started out we were confused about X but I figured it out and shared that with Daria. Our code is very similar because we worked together”. Or perhaps “I was confused about how to pad a string with spaces, and after working at it for 30 minutes I chatted about it with my partner who suggested the xyz method. I was pleased when I got that working myself.” When you have worked together your code will have similarities, but you must not turn in identical code; rather you should take code you’ve worked on together and personalize it through comments that explain what is happening in the code. The comments must be your own, individual, work.

If you have questions on this policy please ask in the Discussion forum and I will answer there: Assignment Discussions. I have this policy because learning to program is both individual hard work and learning how to get help from others. Sometimes chatting through with another class member is just what is needed.

Student rights and responsibilities

  • You have a right to a learning environment that supports mental and physical wellness.
  • You have a right to respect.
  • You have a right to be assessed and graded fairly.
  • You have a right to freedom of opinion and expression.
  • You have a right to privacy and confidentiality.
  • You have a right to meaningful and equal participation, and to self-organize groups to improve your learning environment.
  • You have a right to learn in an environment that is welcoming to all people. No student shall be isolated, excluded or diminished in any way.

With these rights come responsibilities:

  • You are responsible for taking care of yourself, managing your time, and communicating with the teaching team and with others if things start to feel out of control or overwhelming.
  • You are responsible for acting in a way that is worthy of respect and always respectful of others.
  • Your experience with this course is directly related to the quality of the energy that you bring to it, and your energy shapes the quality of your peers’ experiences.
  • You are responsible for creating an inclusive environment and for speaking up when someone is excluded. In particular, you are responsible for ensuring that your participation does not exclude the participation of others. Office hours are available for in-depth discussion of advanced topics or other interests that, if pursued at length during class, would exclude others.
  • You are responsible for holding yourself accountable to these standards, holding each other to these standards, and holding the teaching team accountable as well.

Personal Pronoun Preference

Professional courtesy and sensitivity are especially important with respect to individuals and topics dealing with differences of race, culture, religion, politics, sexual orientation, gender, gender variance, and nationalities. Class rosters are provided to the instructor with the student’s legal name. I will gladly honor your request to address you by an alternate name or gender pronoun. Please advise me of this preference early in the semester so that I may make appropriate changes to my records.

Drop Policy

If you want to drop a class after the 12th class day, you’ll need to execute a Q drop before the Q-drop deadline, which typically occurs near the middle of the semester. Under Texas law, you are only allowed six Q drops while you are in college at any public Texas institution. For more information, see: http://www.utexas.edu/ugs/csacc/academic/adddrop/qdrop

International students must meet with the international office before dropping a class that would put them below full-time status.

University Resources for Students

Your success in this class is important to me. We will all need accommodations because we all learn differently. If there are aspects of this course that prevent you from learning or exclude you, please let me know as soon as possible. Together we’ll develop strategies to meet both your needs and the requirements of the course. There are also a range of resources on campus, detailed below.

Services for Students with Disabilities

This class respects and welcomes students of all backgrounds, identities, and abilities. If there are circumstances that make our learning environment and activities difficult, if you have medical information that you need to share with me, or if you need specific arrangements in case the building needs to be evacuated, please let me know. I am committed to creating an effective learning environment for all students, but I can only do so if you discuss your needs with me as early as possible. I promise to maintain the confidentiality of these discussions. If appropriate, also contact Services for Students with Disabilities, 512-471-6259 (voice) or 1-866-329-3986 (video phone). http://ddce.utexas.edu/disability/about/

Counseling and Mental Health Center

All of us benefit from support during times of struggle. You are not alone. There are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.

If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. http://www.cmhc.utexas.edu/individualcounseling.html

The Sanger Learning Center

All students, including graduate students, are welcome to take advantage of Sanger Center’s classes and workshops, private learning specialist appointments, peer academic coaching, and tutoring for more than 70 courses in 15 different subject areas. For more information, please visit https://ugs.utexas.edu/slc/grad or call 512-471-3614 (JES A332).

Important Safety Information

If you have concerns about the safety or behavior of fellow students, TAs or Professors, call BCAL (the Behavior Concerns Advice Line): 512-232-5050. Your call can be anonymous. If something doesn’t feel right – it probably isn’t. Trust your instincts and share your concerns.

The following recommendations regarding emergency evacuation come from the Office of Campus Safety and Security (512-471-5767, http://www.utexas.edu/safety/). Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside.

  • Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building.
  • Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class.
  • In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office.
  • Information regarding emergency evacuation routes and emergency procedures can be found at: http://www.utexas.edu/emergency