Worked example of data wrangling

Here is a worked example of some moderate data wrangling, taken from my own work.

I created a survey using Qualtrics and included a ranked choice question. In qualtrics the question looks like this:

Eventually I want a dataframe that looks like this:

response_id fruit_in_order
response1 apple, orange, pear, grape, banana
response2 banana, grape, pear, orange, apple

I filled out the survey three times, adding rankings between 1 and 5 for each fruit. I then downloaded a CSV export of the data (on the Data & Analysis tab) using these options (the radio button at the bottom did not seem to make a difference).

The downloaded file looks like this (scroll across to see all columns):

StartDate EndDate Status IPAddress Progress Duration (in seconds) Finished RecordedDate ResponseId RecipientLastName RecipientFirstName RecipientEmail ExternalReference LocationLatitude LocationLongitude DistributionChannel UserLanguage Q1_1 Q1_2 Q1_3 Q1_4 Q1_5
Start Date End Date Response Type IP Address Progress Duration (in seconds) Finished Recorded Date Response ID Recipient Last Name Recipient First Name Recipient Email External Data Reference Location Latitude Location Longitude Distribution Channel User Language Which fruit do you prefer? - apple Which fruit do you prefer? - orange Which fruit do you prefer? - pear Which fruit do you prefer? - grape Which fruit do you prefer? - banana
{"ImportId":"startDate","timeZone":"America/Chicago"} {"ImportId":"endDate","timeZone":"America/Chicago"} {"ImportId":"status"} {"ImportId":"ipAddress"} {"ImportId":"progress"} {"ImportId":"duration"} {"ImportId":"finished"} {"ImportId":"recordedDate","timeZone":"America/Chicago"} {"ImportId":"_recordId"} {"ImportId":"recipientLastName"} {"ImportId":"recipientFirstName"} {"ImportId":"recipientEmail"} {"ImportId":"externalDataReference"} {"ImportId":"locationLatitude"} {"ImportId":"locationLongitude"} {"ImportId":"distributionChannel"} {"ImportId":"userLanguage"} {"ImportId":"QID1_1"} {"ImportId":"QID1_2"} {"ImportId":"QID1_3"} {"ImportId":"QID1_4"} {"ImportId":"QID1_5"}
2024-09-10 04:53:09 2024-09-10 04:53:26 IP Address 77.241.226.16 100 16 True 2024-09-10 04:53:26 R_2Gv45TBz2xAZBgw NA NA NA NA 51.9172 4.5049 anonymous EN 1 2 3 4 5
2024-09-10 04:53:29 2024-09-10 04:53:38 IP Address 77.241.226.16 100 9 True 2024-09-10 04:53:39 R_2Dc4B4z51U9BbVu NA NA NA NA 51.9172 4.5049 anonymous EN 3 5 2 1 4
2024-09-10 04:53:42 2024-09-10 04:53:51 IP Address 77.241.226.16 100 8 True 2024-09-10 04:53:51 R_2VjavSpNBqeI26l NA NA NA NA 51.9172 4.5049 anonymous EN 5 4 3 2 1

This is a curious file … there are lots of headers. Most of them are meta-data about the survey response. Let’s exclude columns until we can see the data a little more easily. Here I’m just showing the ResponseId and the columns that start with “Q” which hold the actual answers.

ResponseId Q1_1 Q1_2 Q1_3 Q1_4 Q1_5
Response ID Which fruit do you prefer? - apple Which fruit do you prefer? - orange Which fruit do you prefer? - pear Which fruit do you prefer? - grape Which fruit do you prefer? - banana
{"ImportId":"_recordId"} {"ImportId":"QID1_1"} {"ImportId":"QID1_2"} {"ImportId":"QID1_3"} {"ImportId":"QID1_4"} {"ImportId":"QID1_5"}
R_2Gv45TBz2xAZBgw 1 2 3 4 5
R_2Dc4B4z51U9BbVu 3 5 2 1 4
R_2VjavSpNBqeI26l 5 4 3 2 1

Now we can see that our five options have become Q1_1 through Q1_5. The numbers in those columns seem like they are the rankings that were typed into the text boxes. At least for the bottom three rows, which correspond to the actual survey responses.

The first two rows, though, are meta-data of some sort. The second row is some sort of export meta-data that we won’t need to use. I’ll drop that row to make this a bit clearer still.

ResponseId Q1_1 Q1_2 Q1_3 Q1_4 Q1_5
Response ID Which fruit do you prefer? - apple Which fruit do you prefer? - orange Which fruit do you prefer? - pear Which fruit do you prefer? - grape Which fruit do you prefer? - banana
R_2Gv45TBz2xAZBgw 1 2 3 4 5
R_2Dc4B4z51U9BbVu 3 5 2 1 4
R_2VjavSpNBqeI26l 5 4 3 2 1

Now we can see that the first data row is still different than the others … looks like it holds the question text and then the actual label that was used. Since this is a different sort of data than the number responses, I’m going to split this into two tables. First I’ll exclude all the data rows (to get just the labels). Then I’ll exclude the labels row, to get just the data.

Labels
ResponseId Q1_1 Q1_2 Q1_3 Q1_4 Q1_5
Response ID Which fruit do you prefer? - apple Which fruit do you prefer? - orange Which fruit do you prefer? - pear Which fruit do you prefer? - grape Which fruit do you prefer? - banana

The first column isn’t relevant here, so we will exclude that. Then this is starting to look like a lookup table that could help us, just on its side. I’ll switch that to a longer format.

Labels
question_label question_fruit
Q1_1 Which fruit do you prefer? - apple
Q1_2 Which fruit do you prefer? - orange
Q1_3 Which fruit do you prefer? - pear
Q1_4 Which fruit do you prefer? - grape
Q1_5 Which fruit do you prefer? - banana

Now we have each of the fruit names in the fruit column. We can more easily work down that column to get rid of the question. It’s easier to work down a single column than to specify all the columns by name to work across. That is particularly true if we add more options in future (adding “dragon fruit” for example to make a Q1_6 option).

We can now separate the fruit name out using the pattern that the question is separated from the fruit name by the - character. We will throw away the question part, keeping just the name of the fruit.

Labels
question_label fruit
Q1_1 apple
Q1_2 orange
Q1_3 pear
Q1_4 grape
Q1_5 banana

Now we have a lookup table that will help us to translate between the column headers in the responses and the actual fruit being ranked.

Turning back to the responses table (dropping the labels row) we have:

ResponseId Q1_1 Q1_2 Q1_3 Q1_4 Q1_5
R_2Gv45TBz2xAZBgw 1 2 3 4 5
R_2Dc4B4z51U9BbVu 3 5 2 1 4
R_2VjavSpNBqeI26l 5 4 3 2 1

To use the lookup table it will be a lot easier to move the column headers into cells, rather than trying to work with the column headers. So we can transform this to a longer format, with a column for the question label and a column to hold the actual ranking.

Responses
ResponseId question ranking
R_2Gv45TBz2xAZBgw Q1_1 1
R_2Gv45TBz2xAZBgw Q1_2 2
R_2Gv45TBz2xAZBgw Q1_3 3
R_2Gv45TBz2xAZBgw Q1_4 4
R_2Gv45TBz2xAZBgw Q1_5 5
R_2Dc4B4z51U9BbVu Q1_1 3
R_2Dc4B4z51U9BbVu Q1_2 5
R_2Dc4B4z51U9BbVu Q1_3 2
R_2Dc4B4z51U9BbVu Q1_4 1
R_2Dc4B4z51U9BbVu Q1_5 4
R_2VjavSpNBqeI26l Q1_1 5
R_2VjavSpNBqeI26l Q1_2 4
R_2VjavSpNBqeI26l Q1_3 3
R_2VjavSpNBqeI26l Q1_4 2
R_2VjavSpNBqeI26l Q1_5 1

Now we can recode the question by looking it up in the labels table, adding a fruit column.

ResponseId question fruit ranking
R_2Gv45TBz2xAZBgw Q1_1 apple 1
R_2Gv45TBz2xAZBgw Q1_2 orange 2
R_2Gv45TBz2xAZBgw Q1_3 pear 3
R_2Gv45TBz2xAZBgw Q1_4 grape 4
R_2Gv45TBz2xAZBgw Q1_5 banana 5
R_2Dc4B4z51U9BbVu Q1_1 apple 3
R_2Dc4B4z51U9BbVu Q1_2 orange 5
R_2Dc4B4z51U9BbVu Q1_3 pear 2
R_2Dc4B4z51U9BbVu Q1_4 grape 1
R_2Dc4B4z51U9BbVu Q1_5 banana 4
R_2VjavSpNBqeI26l Q1_1 apple 5
R_2VjavSpNBqeI26l Q1_2 orange 4
R_2VjavSpNBqeI26l Q1_3 pear 3
R_2VjavSpNBqeI26l Q1_4 grape 2
R_2VjavSpNBqeI26l Q1_5 banana 1

Now we are really close! We data in a tidy long format. We can drop the question column; we don’t need it any more and it’s just an artifact of the qualtrics survey anyway.

ResponseId fruit ranking
R_2Gv45TBz2xAZBgw apple 1
R_2Gv45TBz2xAZBgw orange 2
R_2Gv45TBz2xAZBgw pear 3
R_2Gv45TBz2xAZBgw grape 4
R_2Gv45TBz2xAZBgw banana 5
R_2Dc4B4z51U9BbVu apple 3
R_2Dc4B4z51U9BbVu orange 5
R_2Dc4B4z51U9BbVu pear 2
R_2Dc4B4z51U9BbVu grape 1
R_2Dc4B4z51U9BbVu banana 4
R_2VjavSpNBqeI26l apple 5
R_2VjavSpNBqeI26l orange 4
R_2VjavSpNBqeI26l pear 3
R_2VjavSpNBqeI26l grape 2
R_2VjavSpNBqeI26l banana 1

Now, remember the output format that we eventually wanted?

response_id fruit_in_order
response1 apple, orange, pear, grape, banana
response2 banana, grape, pear, orange, apple

We can get there by nesting the fruit! We do need to make sure that we order the list by the ranking column. Here I show that using the approach used in grouping, with red lines to separate the groups.

ResponseId fruit ranking
R_2Dc4B4z51U9BbVu grape 1
R_2Dc4B4z51U9BbVu pear 2
R_2Dc4B4z51U9BbVu apple 3
R_2Dc4B4z51U9BbVu banana 4
R_2Dc4B4z51U9BbVu orange 5
R_2Gv45TBz2xAZBgw apple 1
R_2Gv45TBz2xAZBgw orange 2
R_2Gv45TBz2xAZBgw pear 3
R_2Gv45TBz2xAZBgw grape 4
R_2Gv45TBz2xAZBgw banana 5
R_2VjavSpNBqeI26l banana 1
R_2VjavSpNBqeI26l grape 2
R_2VjavSpNBqeI26l pear 3
R_2VjavSpNBqeI26l orange 4
R_2VjavSpNBqeI26l apple 5

Now we can see that we want the fruit for each group, joined together into a comma-separated list.

ResponseId fruit_in_order
R_2Dc4B4z51U9BbVu grape, pear, apple, banana, orange
R_2Gv45TBz2xAZBgw apple, orange, pear, grape, banana
R_2VjavSpNBqeI26l banana, grape, pear, orange, apple

Summary

So this shows how our basic data transforms can be combined:

  • We excluded rows and columns
  • We extracted parts of fields into new columns
  • We transformed from wide to long (a few times!)
  • We created then used a recode/lookup table
  • We grouped and sorted data within those groups
  • We nested data

I think it’s entirely reasonable to ask: why doesn’t Qualtrics just export these data in a more usable way. Well, I don’t know the particular answer to that, but Qualtrics export format has to cope with a lot of variety in question type, as well as including meta-data, call within a single CSV download. We could try calling qualtrics support, but I doubt they’ll change their export format just for us :)

The reality is that we have to work with the data as we find it, then combine our data transformation verbs together to get where we want to go.