Handling nested data

Until now all our transforms have left us with the same number of rows as we had before.

There are some transforms that change the number of rows.

As we learn about this we have to think hard about what each row means. For example look at these data:

course_name students_enrolled
Divination Harry, Ron
Herbology Hermione, Harry
Flying Lessons Harry, Hermione, Ron

What does a single row represent? Here a single row represents a course. If we count the number of rows we could find out the number of different courses being taught.

Inside the second column, though, we see names of students that are comma-separated. These sort of fields are common, especially with data that has been hand entered. We call these structured “nested”.

We can “unnest” the data, changing the grain of the table so that we have one row per item in the list:

course_name students_enrolled
Divination Harry
Divination Ron
Herbology Hermione
Herbology Harry
Flying Lessons Harry
Flying Lessons Hermione
Flying Lessons Ron
Note

When we un-nest lists inside columns in this way, we get the same number of columns, but a different number of rows. The grain of the table has changed. If we count the rows now, we don’t get the count of the courses.

Nesting

We can also move the other way, moving a list into a cell. Again this changes the grain of the table and usually makes fewer rows.

We can take our longer table, with a row for each combination of course and student, and nest the courses per student. This time we have a row for each student, and a list of the courses they have taken.

student courses_enrolled
Harry Divination, Herbology, Flying Lessons
Hermione Herbology, Flying Lessons
Ron Divination, Flying Lessons

We took data nested in one way, stretched it out by unnesting, and then re-nested it another way. Now that’s some wranglin’.