Handling nested data
Until now all our transforms have left us with the same number of rows as we had before.
There are some transforms that change the number of rows.
As we learn about this we have to think hard about what each row means. For example look at these data:
course_name | students_enrolled |
---|---|
Divination | Harry, Ron |
Herbology | Hermione, Harry |
Flying Lessons | Harry, Hermione, Ron |
What does a single row represent? Here a single row represents a course. If we count the number of rows we could find out the number of different courses being taught.
Inside the second column, though, we see names of students that are comma-separated. These sort of fields are common, especially with data that has been hand entered. We call these structured “nested”.
We can “unnest” the data, changing the grain of the table so that we have one row per item in the list:
course_name | students_enrolled |
---|---|
Divination | Harry |
Divination | Ron |
Herbology | Hermione |
Herbology | Harry |
Flying Lessons | Harry |
Flying Lessons | Hermione |
Flying Lessons | Ron |
When we un-nest lists inside columns in this way, we get the same number of columns, but a different number of rows. The grain of the table has changed. If we count the rows now, we don’t get the count of the courses.
Nesting
We can also move the other way, moving a list into a cell. Again this changes the grain of the table and usually makes fewer rows.
We can take our longer table, with a row for each combination of course and student, and nest the courses per student. This time we have a row for each student, and a list of the courses they have taken.
student | courses_enrolled |
---|---|
Harry | Divination, Herbology, Flying Lessons |
Hermione | Herbology, Flying Lessons |
Ron | Divination, Flying Lessons |
We took data nested in one way, stretched it out by unnesting, and then re-nested it another way. Now that’s some wranglin’.