Pandas Wider and Aggregation

import pandas as pd
from io import StringIO
import numpy as np

Wider / Longer

df_pop = pd.read_csv("../datasets/world_pop_wide.csv")

df_pop
         country       2010       2011       2012       2013
0      Australia   18767085   19056040   19414834   19775013
1      Indonesia  383721793  393333604  403171286  413200994
2  United States  249849720  252208133  254614421  256953576

To make a wide format dataframe into a long format dataframe in Pandas, we will use a function called melt. As with pivot_wider in tidyverse we tell melt the id variables which are the columns that we don’t want to reshape.

(df_pop.
  melt(id_vars = ["country"], var_name = "year", value_name = "population")
)
          country  year  population
0       Australia  2010    18767085
1       Indonesia  2010   383721793
2   United States  2010   249849720
3       Australia  2011    19056040
4       Indonesia  2011   393333604
5   United States  2011   252208133
6       Australia  2012    19414834
7       Indonesia  2012   403171286
8   United States  2012   254614421
9       Australia  2013    19775013
10      Indonesia  2013   413200994
11  United States  2013   256953576

The reverse move (long to wide) in pandas can be done with pivot

# first store the long version.
df_pop_long = (df_pop.
  melt(id_vars = ["country"], var_name = "year", value_name = "population")
)

(df_pop_long
  .pivot(index = "country", columns = "year", values = "population")
)
year                2010       2011       2012       2013
country                                                  
Australia       18767085   19056040   19414834   19775013
Indonesia      383721793  393333604  403171286  413200994
United States  249849720  252208133  254614421  256953576

Notice, though, that pivot moves the “country” column out of the table into an index? So we will use reset_index to bring that column back into the table as a column.

(df_pop_long
  .pivot(index = "country", columns = "year", values = "population")
  .reset_index()
)
year        country       2010       2011       2012       2013
0         Australia   18767085   19056040   19414834   19775013
1         Indonesia  383721793  393333604  403171286  413200994
2     United States  249849720  252208133  254614421  256953576
Note

An index in Pandas is a special kind of name for a row. They are like the numbers identifying rows in an Excel sheet (1, 2 and so on, whereas columns are A, B). The difference for Pandas is that the index for a row can be a string (a word) rather than a number.

Above, after the pivot each row had a special name (Australia etc), but not a country column. Confusingly, though, indexes can have their own names (which is why country appears to the left).

Indexes aren’t columns in the data-frame, per se, but they can be moved in and out from columns (which is what reset_index is doing, shifting the word index into the table, reseting the index to straight numbers).

SQL does not have this concept at all. Everything is inside the table. In fact tables in SQL don’t even have a default order (technically a table is a bag of rows). So if you use SELECT * FROM table you don’t get any guarantee about the order of the rows that will be returned. IN a sadly typical naming partial overlap, SQL does have something called “an index” although that’s best thought of as a separate concept entirely, and is used for efficiency gains in filters on particular columns.

R does have a similar concept row_name although it is not widely used, especially in tidyverse.

Finally, pandas not only has special row name columns which are indexes, but the column names themselves are a horizontal index (notice the “year” label in the table output above). So there are both vertical/y (row name) indexes, and horizontal/x (column_name) indexes. This is why the pandas methods sometimes ask for indexes rather than column names.

Group/Aggregate

Pandas provides the group_by and the agg function, which we can use with a set of aggregation functions, such as sum and mean.

books_with_weights = pd.read_csv('../datasets/books_with_weights.csv')

books_with_weights
                title             series     format  weight_in_grams
0      Harry Potter 1       Harry Potter  paperback             1456
1          Two Towers  Lord of the Rings  paperback             1800
2      Harry Potter 3       Harry Potter  hardcover             2604
3      Harry Potter 2       Harry Potter  hardcover             2900
4  Return of the King  Lord of the Rings  hardcover             1784

For reasons that will be clearer below, we will look at aggregations with grouping first, then look at aggregations without grouping.

We provide a groupby first then specify the aggregation.

(books_with_weights
  .groupby('series')
  .agg(weight_for_group = ('weight_in_grams', 'sum'))
)
                   weight_for_group
series                             
Harry Potter                   6960
Lord of the Rings              3584

The aggregation here is specified as new_column_name = (column_to_aggregate, function). We can do more than one within the agg call.

(books_with_weights
  .groupby('series')
  .agg(total_weight_for_group = ('weight_in_grams', 'sum'),
       mean_weight_for_group = ('weight_in_grams', 'mean'))
)
                   total_weight_for_group  mean_weight_for_group
series                                                          
Harry Potter                         6960                 2320.0
Lord of the Rings                    3584                 1792.0

Note that the columns that we groupby become indexes, and if we group by more than one column we get what is known as a multi-index, which is two levels of indexes.

(books_with_weights
  .groupby(['series', 'format'])
  .agg(weight_for_group = ('weight_in_grams', 'sum'))
)
                             weight_for_group
series            format                     
Harry Potter      hardcover              5504
                  paperback              1456
Lord of the Rings hardcover              1784
                  paperback              1800

These can be useful if you have continued operations that use these groups (other pandas methods can use these indexes), but can be confusing if you are thinking strictly as DataFrames. Whether you have a single index or a multi-index, you can bring those back as familiar columns using reset_index

(books_with_weights
  .groupby(['series', 'format'])
  .agg(weight_for_group = ('weight_in_grams', 'sum'))
  .reset_index()
)
              series     format  weight_for_group
0       Harry Potter  hardcover              5504
1       Harry Potter  paperback              1456
2  Lord of the Rings  hardcover              1784
3  Lord of the Rings  paperback              1800

Aggregation without groupby

We can use agg without groupby to get the total weight of all books without grouping.

(books_with_weights
  .agg(total_weight = ('weight_in_grams', 'sum'),
       mean_weight = ('weight_in_grams', 'mean'))
)

We would expect this output (as though there was just a single group and therefore one row of data):

total_weight, mean_weight
10544,        2108

like:

single_group,    total_weight, mean_weight
(unnamed group), 10544,        2108

but … instead we will get:

(books_with_weights
  .agg(total_weight = ('weight_in_grams', 'sum'),
       mean_weight = ('weight_in_grams', 'mean'))
)
              weight_in_grams
total_weight          10544.0
mean_weight            2108.8

We get a single column with two rows (and a named index for each calculated value). There are probably good reasons for this but it is inconsistent with how agg works without groupby so it can be confusing.

We can switch this into a more familiar format using transpose (which switches the rows and columns). We also drop the named index (which is showing us which column that was originally used, not something we’d expect from SQL or tidyverse).

(books_with_weights
  .agg(total_weight = ('weight_in_grams', 'sum'),
       mean_weight = ('weight_in_grams', 'mean'))
  .transpose()
  .reset_index(drop=True)
)
   total_weight  mean_weight
0       10544.0       2108.8
Cautionagg vs sum

Above I’ve shown using the named column approach for agg. It is also very common in pandas to call the functions like sum directly. Done like that sum works on all columns, so we have to reduce the columns sent:

(books_with_weights
  .filter(items = ['weight_in_grams'])
  .sum()
)
weight_in_grams    10544
dtype: int64

When sum is used like this it does not return a DataFrame. Instead it returns a Series. This is a single column, but outside a DataFrame (i.e., it isn’t a DataFrame with 1 column, it is just a list of items).

So what we have here is an unnamed Series with a single item (10544). The weight_in_grams part is an index for that item.

The tip off that this change in type has happened is that the items show horizontally (like named rows), rather than in columns (and have the cryptic dtype at the bottom).

If sum is called with multiple columns, each of them ends up as an item in this unnamed Series. And text columns can also be sent to sum where it just concantenates the strings.

(books_with_weights
  .sum()
)
title              Harry Potter 1Two TowersHarry Potter 3Harry Po...
series             Harry PotterLord of the RingsHarry PotterHarry...
format                 paperbackpaperbackhardcoverhardcoverhardcover
weight_in_grams                                                10544
dtype: object

agg with named columns (as above) avoids these issues, and keeps everything as DataFrames, albeit with the quirk that agg without groupby produces a transposed dataframe.

Using agg to create a list (opposite of unnest)

As in our other frameworks, the opposite of unnest is implemented with an groupby and aggregation.

If we start with our list of students and courses, where the grain is “a row represents a single student enrolled in a single course”.

       course_name student
0   Data Wrangling   Alice
0   Data Wrangling     Bob
0   Data Wrangling   Carol
1  Database Design   Alice
1  Database Design   David
2     Ethics in AI   Carol
2     Ethics in AI   David
2     Ethics in AI     Eve

And we would like to produce a list of courses taken by each student:

(course_student
  .groupby('student')
  .agg(course_list = ('course_name', list))
  .assign(courses_for_student = lambda df_: df_['course_list'].str.join(', '))
)
                               course_list              courses_for_student
student                                                                    
Alice    [Data Wrangling, Database Design]  Data Wrangling, Database Design
Bob                       [Data Wrangling]                   Data Wrangling
Carol       [Data Wrangling, Ethics in AI]     Data Wrangling, Ethics in AI
David      [Database Design, Ethics in AI]    Database Design, Ethics in AI
Eve                         [Ethics in AI]                     Ethics in AI

Python pandas quick reference

transform primary_command another_option
Exclude rows query
Exclude columns drop filter
New columns assign (with lambda)
Combining columns assign + str_glue
Treating cases differently np.select inside assign
Recode/lookup replace merge
Un-nest explode
Nest groupby/agg/list
Aggregate agg
Group Aggregate groupby + agg
Longer melt
Wider pivot

Exercises

For these exercises, we’ll use a revenue dataset to practice pivot transformations, followed by Olympic medal data:

Exercise 6: Wider (Long to Wide)

Often when working with data, we have information in long format where each region appears on a separate row for each movie. To compare USA vs Global revenue easily, we want to reshape the data so that regions become columns. For example, we might want to know: “How does USA revenue compare to Global revenue for each movie?” This makes it easy to see the difference side-by-side.

In this exercise, load a new revenue dataset and convert it from long to wide format using the pivot() method.

Step 1: Load the revenue dataset:

revenue_csv = """movie_title,release_year,region,revenue_millions
Dune,2021,USA,402
Dune,2021,Global,402
Shang-Chi,2021,USA,224
Shang-Chi,2021,Global,432
No Time to Die,2021,USA,160
No Time to Die,2021,Global,774
Top Gun Maverick,2022,USA,1491
Top Gun Maverick,2022,Global,1491
Doctor Strange 2,2022,USA,411
Doctor Strange 2,2022,Global,955
Avatar 2,2022,USA,2321
Avatar 2,2022,Global,2320
Barbie,2023,USA,756
Barbie,2023,Global,1444
Oppenheimer,2023,USA,636
Oppenheimer,2023,Global,952
Killers of the Flower Moon,2023,USA,156
Killers of the Flower Moon,2023,Global,257"""

revenue_df = pd.read_csv(StringIO(revenue_csv))

revenue_df
                   movie_title  release_year  region  revenue_millions
0                         Dune          2021     USA               402
1                         Dune          2021  Global               402
2                    Shang-Chi          2021     USA               224
3                    Shang-Chi          2021  Global               432
4               No Time to Die          2021     USA               160
5               No Time to Die          2021  Global               774
6             Top Gun Maverick          2022     USA              1491
7             Top Gun Maverick          2022  Global              1491
8             Doctor Strange 2          2022     USA               411
9             Doctor Strange 2          2022  Global               955
10                    Avatar 2          2022     USA              2321
11                    Avatar 2          2022  Global              2320
12                      Barbie          2023     USA               756
13                      Barbie          2023  Global              1444
14                 Oppenheimer          2023     USA               636
15                 Oppenheimer          2023  Global               952
16  Killers of the Flower Moon          2023     USA               156
17  Killers of the Flower Moon          2023  Global               257

Step 2: Use the pivot() method to reshape this data wider so that:

  • Each movie is a row (index)
  • Each region becomes a column
  • The values are the revenue in millions

Use pivot() with index='movie_title', columns='region', and values='revenue_millions'. Remember that pivot() will move the index column out of the table, so you’ll need to use reset_index() to bring it back as a regular column. Follow the pattern shown in the “Wider / Longer” section above.

Expected Output:

Your final result should look like this (showing movies as rows, USA and Global as columns, with revenue in millions as values):

movie_title                  USA  Global
Avatar 2                     2321   2320
Barbie                       756   1444
Doctor Strange 2             411    955
Dune                         402    402
Killers of the Flower Moon   156    257
No Time to Die               160    774
Oppenheimer                  636    952
Shang-Chi                    224    432
Top Gun Maverick            1491   1491

Exercise 7: Wider with Years as Columns

In this exercise, we want to see how global movie revenue has changed over the years. To do this clearly, we reshape the data so that each movie is a row and each year appears as a column, showing the global revenue for that movie in each year.

Step 1: Filter the revenue dataset to only Global revenue:

global_revenue_df = (revenue_df
  .query("region == 'Global'")
)

global_revenue_df
                   movie_title  release_year  region  revenue_millions
1                         Dune          2021  Global               402
3                    Shang-Chi          2021  Global               432
5               No Time to Die          2021  Global               774
7             Top Gun Maverick          2022  Global              1491
9             Doctor Strange 2          2022  Global               955
11                    Avatar 2          2022  Global              2320
13                      Barbie          2023  Global              1444
15                 Oppenheimer          2023  Global               952
17  Killers of the Flower Moon          2023  Global               257

Step 2: Use the pivot() method to reshape this data wider so that:

  • Each movie is a row (index)
  • Each year becomes a column
  • The values are the global revenue in millions

Use pivot(). Remember that pivot() will move the index column out of the table, so you’ll need to use reset_index() to bring it back as a regular column. Follow the pattern shown in the “Wider / Longer” section above.

Expected Output:

Your final result should look like this (showing movies as rows, years as columns, with global revenue in millions as values):

movie_title                   2021   2022   2023
Avatar 2                       NaN   2320    NaN
Barbie                         NaN    NaN   1444
Doctor Strange 2               NaN    955    NaN
Dune                          402    NaN    NaN
Killers of the Flower Moon     NaN    NaN    257
No Time to Die                774    NaN    NaN
Oppenheimer                    NaN    NaN    952
Shang-Chi                     432    NaN    NaN
Top Gun Maverick               NaN   1491    NaN

Exercise 8: Olympics Medals Wider

In this exercise, we work with Olympic medal data. We have medal counts for different countries and medal types (Gold, Silver, Bronze). We want to reshape the data so that each medal type becomes a column, allowing us to easily compare how many of each medal type each country won.

Step 1: Load the Olympic medals dataset:

import pandas as pd
from io import StringIO
import numpy as np

medals_csv = """country,medal_type,medal_count
China,Gold,32
China,Silver,35
China,Bronze,24
Japan,Gold,17
Japan,Silver,8
Japan,Bronze,21
Great Britain,Gold,16
Great Britain,Silver,22
Great Britain,Bronze,20
South Korea,Gold,13
South Korea,Silver,8
South Korea,Bronze,10
Australia,Gold,17
Australia,Silver,7
Australia,Bronze,22"""

medals_df = pd.read_csv(StringIO(medals_csv))

medals_df
          country medal_type  medal_count
0           China       Gold           32
1           China     Silver           35
2           China     Bronze           24
3           Japan       Gold           17
4           Japan     Silver            8
5           Japan     Bronze           21
6   Great Britain       Gold           16
7   Great Britain     Silver           22
8   Great Britain     Bronze           20
9     South Korea       Gold           13
10    South Korea     Silver            8
11    South Korea     Bronze           10
12      Australia       Gold           17
13      Australia     Silver            7
14      Australia     Bronze           22

Step 2: Use the pivot() method to reshape this data wider so that:

  • Each country is a row (index)
  • Each medal type becomes a column (Gold, Silver, Bronze)
  • The values are the medal counts

Use pivot().

Expected Output:

Your final result should look like this (showing countries as rows, medal types as columns, with medal counts as values):

country        Bronze  Gold  Silver
Australia        22    17       7
China            24    32      35
Great Britain    20    16      22
Japan            21    17       8
South Korea      10    13       8

Exercise 9: Olympics Medals with Types as Rows

In this exercise, we work with the same Olympic medal data from Exercise 8, but reshape it differently. This time, we want medal types (Bronze, Silver, Gold) to be the rows, and countries to be the columns. This makes it easy to compare which country won the most of each medal type.

Step 1: Load the same Olympic medals dataset:

medals_df = pd.read_csv(StringIO(medals_csv))

medals_df
          country medal_type  medal_count
0           China       Gold           32
1           China     Silver           35
2           China     Bronze           24
3           Japan       Gold           17
4           Japan     Silver            8
5           Japan     Bronze           21
6   Great Britain       Gold           16
7   Great Britain     Silver           22
8   Great Britain     Bronze           20
9     South Korea       Gold           13
10    South Korea     Silver            8
11    South Korea     Bronze           10
12      Australia       Gold           17
13      Australia     Silver            7
14      Australia     Bronze           22

Step 2: Use the pivot() method to reshape this data wider so that:

  • Each medal type is a row (index)
  • Each country becomes a column
  • The values are the medal counts

Use pivot() with index='medal_type', columns='country', and values='medal_count'. Remember that pivot() will move the index column out of the table, so you’ll need to use reset_index() to bring it back as a regular column. Follow the pattern shown in the “Wider / Longer” section above.

Expected Output:

Your final result should look like this (showing medal types as rows, countries as columns, with medal counts as values):

medal_type   Australia  China  Great Britain  Japan  South Korea
Bronze              22     24              20     21           10
Gold                17     32              16     17           13
Silver               7     35              22      8            8

Exercise 9: Revenue aggregations

Use aggregation and group_by to show the total revenue Globally and for the USA. You should have two rows. Don’t forget to reset the index.

Exercise 10: aggregations without grouping

Step 1: Remove the rows that are revenue for the USA.

Step 2: Use aggregation without grouping to produce an overall sum.

Exercise 11: Same, but with sum directly

Step 1: Remove the rows that are revenue for the USA.

Step 2: This time produce an overall sum for revenue, but don’t use the agg method, use sum.