Wider or Longer

A key idea in data wrangling is that the same data can be organized in different ways.

Sometimes data is presented in tables that are designed for humans to read, or even to spot patterns in. Here we have data on urban populations of countries over time.

Urban population measurements
country 2010 2011 2012 2013
Australia 18767085 19056040 19414834 19775013
Indonesia 383721793 393333604 403171286 413200994
United States 249849720 252208133 254614421 256953576

Notice that we don’t have the word year anywhere in this table … but as humans skilled with language we can recognize years moving across left to right. And this table layout does let us scan across the population numbers for each country and get some general sense that populations are increasing (and that Indonesia is more populous than the USA.)

Another way to represent these data is like this:

country year urban_population
Australia 2010 18767085
Australia 2011 19056040
Australia 2012 19414834
Australia 2013 19775013
Indonesia 2010 383721793
Indonesia 2011 393333604
Indonesia 2012 403171286
Indonesia 2013 413200994
United States 2010 249849720
United States 2011 252208133
United States 2012 254614421
United States 2013 256953576

See how the data become longer? Now we have only three columns, but the data is represented in 12 rows. The pink highlight cell matches up with the pink highlighted cell in the wider format above. This schema for the data is definitely less easy for people to read … but it is easier for some things we want to do in our data frameworks.

For example we now have a column year and we have a column title for population. These can be really useful for analysis, and can be useful as labels for figures and other presentations.

We also have a consistent arrangement of column headers that doesn’t change if we add more data. For example, we could have many columns for other years (such as 2000 through 2010) in the wide presentation … but no matter how many years we have in the data we will always have the same three columns. We will have many more rows, but those can be easier to deal with.

Note

There is nothing inherently wrong or right about either of these ways of organizing the data. The point is not to find one true schema … but in this class we will become very good at moving between different representations. This will really help when working with different data libraries … all of which seem to want the data in some slightly different structure!

Going longer to organize data and observations

One common use for going longer with data is to hold data about observations and different variables that were observed. Here the idea is to organize the data frame into two different sets of columns.

The first set of columns are known as “identifier” columns, and the second set are known as “measurement” columns. The idea is that the identifiers don’t change, but the measurements do. For example, consider these (made up) data on athletics.

competitor country marathon sprint cross_country
Hongzi China 3.4 hours 10.5 seconds 90 minutes
Sunil India 3.3 hours 11 seconds 92 minutes

Here I’ve shaded the identifier columns in light blue … these are things that don’t (can’t) change from observation to observation. The other three columns, though, these are measurements. In this case they are measurements of performance in different sporting events.

We can convert these data to a longer format. In this case, though, only the measurement columns are restructured, the identifier columns don’t change names (they do get more rows though).

competitor country event time
Hongzi China marathon 3.4 hours
Hongzi China sprint 10.5 seconds
Hongzi China cross_country 90 minutes
Sunil India marathon 3.3 hours
Sunil India sprint 11 seconds
Sunil India cross_country 92 minutes

Our two rows have become six rows. The grain of the table has changed from one row per competitor to one row per competitor in each event. The words that were column titles like marathon have moved into cells (become “marathon”). Our identifier columns remain the same competitor and country although the values in them have tripled, now being repeated for each of the three measures.

Here is another example, using data about plant measurements. Here the initial grain of the table is a single plant; there are 150 rows so we know that there are 150 different plants. Again the first two columns are the identifiers.

plant_id species petal_length petal_width
1 1 setosa 1.4 0.2
2 2 setosa 1.4 0.2
3 3 setosa 1.3 0.2
4 4 setosa 1.5 0.2
5 5 setosa 1.4 0.2
6..149
150 150 virginica 5.1 1.8

We can make this representation longer just as we did before.

plant_id species measurement value
1 1 setosa petal_length 1.4
2 1 setosa petal_width 0.2
3 2 setosa petal_length 1.4
4 2 setosa petal_width 0.2
5 3 setosa petal_length 1.3
6..299
300 150 virginica petal_width 1.8

We could then go on to use this representation … for example we could use grouping and aggregation

Making data wider

The reverse of making data longer is to make it wider. This can be useful in lots of circumstances, but is often used to make more readable data presentations. Some graphing libraries also prefer to see data in wider formats. Here we return to our data on cars.

manufacturer year_produced model miles_per_gallon_highway
1 Ford 2017 GT 18
2 Ferrari 2015 458 Speciale 17
3 Ferrari 2015 458 Spider 17
4 Ferrari 2014 458 Italia 17
5 Ferrari 2016 488 GTB 22
6..46
47 Rolls-Royce 2016 Wraith 21

We would like to show data on the average efficiency of cars produced in different years. So we would like to have a column for 2014, and 2015. The rows of the table we can use for manufacturers.

First we can group by manufacturer and year, collapsing each group by taking each row for a model and taking the avg of the mileage.

manufacturer year_produced avg_mpg_h
1 Acura 2017 22.0
2 Aston Martin 2016 20.3
3 Aston Martin 2017 21.0
4 Audi 2015 20.0
5 Audi 2016 26.0
6..25
26 Tesla 2017 -

Now we are ready to change the format to be wider. We want a column for each year, and a row for each manufacturer. So it will look like this the table below. Note that there are lots of empty values, because the data is limited, many manufacturers don’t have cars in some years (and Tesla never had a mileage per gallon number!).

manufacturer 2014 2015 2016 2017
1 Acura - - - 22.0
2 Aston Martin - - 20.3 21.0
3 Audi - 20.0 26.0 -
4 BMW - - 25.4 -
5 Bentley - - 25.0 -
6..18
19 Tesla - - - -

Exercise: To longer

Here are some data on gold medals by country in the Olympics.

country 2000 2004 2008 2012
Australia 16 17 14 8
Canada 2 3 3 2
USA 36 35 35 46

You will take these data and convert to a longer schema. You should end up with three columns.

You should follow these steps:

  1. Create a 3 x 12 row table
  2. Create a row for each of the numbers in the data rows (12 rows).
  3. Decide on the column headers for your new table (3 columns).
  4. Copy the needed data into the other columns.
___________ ___________ gold_medals
1 16
2 17
3 14
4 8
5 2
6 3
7 3
8 2
9 36
10 35
11 35
12 46

Adding types to the columns

This table shows data on both gold and silver medals.

country 2000_GOLD 2000_SILVER 2004_GOLD 2004_SILVER
Australia 16 25 17 16
USA 36 23 35 38

Convert these to a longer format.

There will be four columns and 8 rows (one for each number in the data).

___________ ___________ ___________ medal_counts
1 16
2 25
3 17
4 16
5 36
6 23
7 35
8 38