Pandas Transforms
In Python we will use pandas for data frame wrangling. Here we load a CSV-formatted string into a dataframe.
import pandas as pd
from io import StringIO
# Sample CSV string
= """customer_name,customer_age,customer_state
csv_string "Jihong Li",34,"California"
"Sarah Brown",66,"Idaho"
"Rupert Brown",89,"California"
"Ravi Shastri",62,"Maharashtra"
"""
# Read the CSV string into a DataFrame
= pd.read_csv(StringIO(csv_string))
df
df
customer_name customer_age customer_state
0 Jihong Li 34 California
1 Sarah Brown 66 Idaho
2 Rupert Brown 89 California
3 Ravi Shastri 62 Maharashtra
In pandas, perhaps more than in other frameworks, there are quite a few different ways to do things. In this course we will learn a particular flavor of pandas; DataCamp homework may introduce additional techniques. For work submitted for this course, though, you should use the techniques presented below as much as possible.
Excluding rows
Pandas provides a query method which uses strings quite similar to SQL. For example, we can filter down to just some customers using:
"customer_state == 'California' and customer_age > 80") df.query(
customer_name customer_age customer_state
2 Rupert Brown 89 California
Note that here I’m using double quotes (") for the string in Python, which means that I have to use single quotes (') to surround textual query pieces ('California').
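By the way, the quoting can also be flipped: a single-quoted Python string with double quotes inside the query works just the same (a quick sketch, not from the notes above):
df.query('customer_state == "California" and customer_age > 80')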
We can also use isin() within the query string and provide a list of options.
"customer_state.isin(['California', 'Maharashtra'])") df.query(
customer_name customer_age customer_state
0 Jihong Li 34 California
2 Rupert Brown 89 California
3 Ravi Shastri 62 Maharashtra
To break this syntax down a little it can help to balance out all the open/close characters. Here I’m using a triple-quoted string in Python for a multi-line string (allowing newlines within the string).
"""
df.query( customer_state.isin(
[
'California', 'Maharashtra'
]
)
""")
Excluding columns
We can exclude columns using drop, which takes a list of columns to exclude (a Python list is always inside square brackets):
df.drop(columns = ['customer_age'])
customer_name customer_state
0 Jihong Li California
1 Sarah Brown Idaho
2 Rupert Brown California
3 Ravi Shastri Maharashtra
If you want to specify just the columns to keep, this can be done with filter (yes, in the R tidyverse filter works on rows, but here filter is selecting columns).
df.filter(items = ['customer_age'])
customer_age
0 34
1 66
2 89
3 62
filter is nifty because it can select columns using partial patterns. There are two modes: like, which matches parts of words, and regex, which is full-on regular expressions.
df.filter(like = 'age')
customer_age
0 34
1 66
2 89
3 62
df.filter(regex = '_name')
customer_name
0 Jihong Li
1 Sarah Brown
2 Rupert Brown
3 Ravi Shastri
Renaming columns
df.rename(columns = {'customer_name': 'new_name'})
new_name customer_age customer_state
0 Jihong Li 34 California
1 Sarah Brown 66 Idaho
2 Rupert Brown 89 California
3 Ravi Shastri 62 Maharashtra
Sort dataframe
We can sort our dataframe by the values within a column.
"customer_age", ascending = False) df.sort_values(
customer_name customer_age customer_state
2 Rupert Brown 89 California
1 Sarah Brown 66 Idaho
3 Ravi Shastri 62 Maharashtra
0 Jihong Li 34 California
To sort it in the other direction, we change that to ascending = True.
"customer_age", ascending = True) df.sort_values(
customer_name customer_age customer_state
0 Jihong Li 34 California
3 Ravi Shastri 62 Maharashtra
1 Sarah Brown 66 Idaho
2 Rupert Brown 89 California
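sort_values can also take a list of columns, with a matching list of ascending flags; a quick sketch (not part of the original examples):
# sort by state A-Z, breaking ties by age, oldest first
df.sort_values(["customer_state", "customer_age"], ascending = [True, False])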
Chaining pandas methods together
We can string calls together, to end up with a pipeline of commands working with a dataframe. To make this work we have to surround the pipeline in ( and ).
(df"customer_age < 80")
.query("customer_state")
.sort_values(= ['customer_age'])
.drop(columns = {'customer_name': 'name',
.rename(columns 'customer_state': 'state'})
)
name state
0 Jihong Li California
1 Sarah Brown Idaho
3 Ravi Shastri Maharashtra
Creating new columns
We will create new columns using assign, which works well with chained pandas. First we will read in our other dataframe.
= """product_num,product_name,price_in_dollars
csv_string "SPR-004456","Skateboard",121
"SPR-003356","Basketball",34
"SPR-004499","Tennis balls",12
"FTW-098756","Tennis shoes",79"""
# pd means pandas, which is a shortcut name we set up when we imported pandas.
= pd.read_csv(StringIO(csv_string))
df
df
product_num product_name price_in_dollars
0 SPR-004456 Skateboard 121
1 SPR-003356 Basketball 34
2 SPR-004499 Tennis balls 12
3 FTW-098756 Tennis shoes 79
To create a discounted price column we can use assign. Note that unlike in R tidyverse we can’t just use the column name price_in_dollars inside the assign … we have to include the kinda weird code lambda df_: df_ before it. You can think of this as making a copy of the dataframe at that point in the pipeline and creating a local variable name df_ to refer to that copy.
(df
    .assign(discounted_price = lambda df_: df_.price_in_dollars * 0.9)
)
product_num product_name price_in_dollars discounted_price
0 SPR-004456 Skateboard 121 108.9
1 SPR-003356 Basketball 34 30.6
2 SPR-004499 Tennis balls 12 10.8
3 FTW-098756 Tennis shoes 79 71.1
We can join multiple columns as with str_glue in tidyverse and CONCAT in SQL, using what are called “f-strings” together with apply.
An f-string takes local variables and interpolates them into longer strings. This is very much like str_glue in R. For example:
= "James"
name = "Cricket"
sport
= f"The student {name} has a hobby of {sport}"
joined_up_string
joined_up_string
'The student James has a hobby of Cricket'
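A small extra (standard Python, beyond what the notes above use): f-strings also accept format specifiers after a colon, which is handy for prices.
price = 121
f"Only ${price:.2f} today"
'Only $121.00 today'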
To use this for every row in a dataframe we can use the apply method. This applies a piece of code to each row. The “lambda” is the small piece of code applied. This is quite like copying a formula down beside a table in Excel.
We can think of this as getting a copy of each row, then we can refer to the columns in that row using row['column_name']. The axis = 1 part means “apply the function to every row”.
new_column = df.apply(lambda row: f"{row['product_name']} just {row['price_in_dollars']} + .00. Buy today", axis = 1)

df.assign(product_label = new_column)
product_num ... product_label
0 SPR-004456 ... Skateboard just 121 + .00. Buy today
1 SPR-003356 ... Basketball just 34 + .00. Buy today
2 SPR-004499 ... Tennis balls just 12 + .00. Buy today
3 FTW-098756 ... Tennis shoes just 79 + .00. Buy today
[4 rows x 4 columns]
It is possible to use this f-string approach to interpolating rows to add a new column within a chained pandas pipeline, but it is a bit ugly, involving two lambda functions (one for a copy of the df, one for a copy of the row within the apply). I’d love to know how to do this better.
(df
    .assign(product_label = lambda df_:
        df_.apply(lambda row:
            f"{row['product_name']} just {row['price_in_dollars']} + .00. Buy today",
            axis = 1)
    )
)
product_num ... product_label
0 SPR-004456 ... Skateboard just 121 + .00. Buy today
1 SPR-003356 ... Basketball just 34 + .00. Buy today
2 SPR-004499 ... Tennis balls just 12 + .00. Buy today
3 FTW-098756 ... Tennis shoes just 79 + .00. Buy today
[4 rows x 4 columns]
Treating different cases differently
The numpy package provides the np.select function, which can help in providing different output for different conditions within a row. The np prefix means the function comes from the numpy package, so we also have to import that. For example, we can create a new column saying whether an item is expensive, or cheap, depending on its price. We can also specify a default value.
import numpy as np
(df
    .assign(product_label = lambda df_: np.select(
        condlist = [
            df_['price_in_dollars'] > 100,
            df_['price_in_dollars'].between(50, 100, inclusive = "both"),
            df_['price_in_dollars'] < 50
        ],
        choicelist = ['Expensive', 'Medium Price', 'Cheap'],
        default = 'Unknown'
    ))
)
product_num product_name price_in_dollars product_label
0 SPR-004456 Skateboard 121 Expensive
1 SPR-003356 Basketball 34 Cheap
2 SPR-004499 Tennis balls 12 Cheap
3 FTW-098756 Tennis shoes 79 Medium Price
Notice that the conditions and the outputs are a bit separated … we provide a list of conditions (condlist) and a list of outputs (choicelist) … they are matched up only by position in the lists (i.e. the first condition is given the first output), so beware of edits letting the lists slip out of alignment.
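One way to guard against that (a sketch of my own, not a pandas or numpy feature) is to write each condition next to its output as a pair, then unzip the pairs just before calling np.select:
# keep each condition and its label together, then split into the two lists
rules = [
    (df['price_in_dollars'] > 100, 'Expensive'),
    (df['price_in_dollars'].between(50, 100, inclusive = "both"), 'Medium Price'),
    (df['price_in_dollars'] < 50, 'Cheap'),
]
condlist, choicelist = zip(*rules)
df.assign(product_label = np.select(condlist = list(condlist),
                                    choicelist = list(choicelist),
                                    default = 'Unknown'))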
Extract/Separate
In pandas, you can extract content out of cells within a column using string functions. First we use str on a column to get access to the string functions. These work “for each row” without needing .apply.
df['product_num'].str.split('-')
0 [SPR, 004456]
1 [SPR, 003356]
2 [SPR, 004499]
3 [FTW, 098756]
Name: product_num, dtype: object
So we get a little two-piece list back. In Python we can refer to parts of lists using numbers, starting at 0, as the quick sketch below shows. Then we can store the first part in one new column, and the second part in a second column.
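For example, with a plain Python string (a quick illustration, not from the dataset):
parts = "SPR-004456".split('-')
parts[0]   # 'SPR'
parts[1]   # '004456'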
(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
)
product_num product_name price_in_dollars product_code product_num_no_code
0 SPR-004456 Skateboard 121 SPR 004456
1 SPR-003356 Basketball 34 SPR 003356
2 SPR-004499 Tennis balls 12 SPR 004499
3 FTW-098756 Tennis shoes 79 FTW 098756
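For more involved patterns, pandas also offers str.extract, which pulls regex capture groups out as columns; a sketch beyond what these notes use elsewhere:
# each parenthesised group becomes its own column
df['product_num'].str.extract(r'([A-Z]+)-(\d+)')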
Recode
Pandas can accomplish recoding values either across the whole dataframe, or restricted to just one column.
(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
    # dictionary to replace across entire table
    .replace({"SPR": "Sporting goods",
              "FTW": "Footwear",
              "ACW": "Active Wear"})
)
product_num product_name ... product_code product_num_no_code
0 SPR-004456 Skateboard ... Sporting goods 004456
1 SPR-003356 Basketball ... Sporting goods 003356
2 SPR-004499 Tennis balls ... Sporting goods 004499
3 FTW-098756 Tennis shoes ... Footwear 098756
[4 rows x 5 columns]
If you want to restrict replacements to a particular column, you can provide a nested dictionary. Here 'product_code' is a key of the outer dictionary, and the value is the set of old and new values.
(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
    # nested dictionary to replace only in the product_code column
    .replace({'product_code': {
        "SPR": "Sporting goods",
        "FTW": "Footwear",
        "ACW": "Active Wear"
    }})
)
product_num product_name ... product_code product_num_no_code
0 SPR-004456 Skateboard ... Sporting goods 004456
1 SPR-003356 Basketball ... Sporting goods 003356
2 SPR-004499 Tennis balls ... Sporting goods 004499
3 FTW-098756 Tennis shoes ... Footwear 098756
[4 rows x 5 columns]
Pandas can also recode using a join. In pandas, joins are conducted by the merge function.
lookup_values = pd.DataFrame({
    'short_categories': ['SPR', 'FTW', 'ACW'],
    'long_categories': ["Sporting goods", "Footwear", "Active Wear"]
})

(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
    # join to the lookup table to recode product_code
    .merge(lookup_values,
           left_on = 'product_code',
           right_on = 'short_categories',
           how = 'left')
)
product_num product_name ... short_categories long_categories
0 SPR-004456 Skateboard ... SPR Sporting goods
1 SPR-003356 Basketball ... SPR Sporting goods
2 SPR-004499 Tennis balls ... SPR Sporting goods
3 FTW-098756 Tennis shoes ... FTW Footwear
[4 rows x 7 columns]
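After a merge like this, both key columns stick around. If you only wanted the long labels, you could drop the redundant lookup key at the end of the pipeline (a sketch continuing the example above):
(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0])
    .merge(lookup_values,
           left_on = 'product_code',
           right_on = 'short_categories',
           how = 'left')
    # the lookup key duplicates product_code, so drop it
    .drop(columns = ['short_categories'])
)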
Remember that code executes sort of “inside out”. This means that the computer first looks at function calls in the middle, resolving them to a value, before executing the code outside them.
The principle of replacement in programming means that we can think through code by replacing a function call with the value it returns.
Honestly, it’s hard to put into words, but let’s see an example.
In this code we read a csv into a variable lookup_values_csv, then we use that variable name in the pipeline.
lookup_values_csv = pd.read_csv('../datasets/sporting_goods_lookup.csv')

(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
    # join to the lookup table read from the csv file
    .merge(lookup_values_csv,
           left_on = 'product_code',
           right_on = 'short_categories',
           how = 'left')
)
product_num product_name ... short_categories long_categories
0 SPR-004456 Skateboard ... SPR Sporting goods
1 SPR-003356 Basketball ... SPR Sporting goods
2 SPR-004499 Tennis balls ... SPR Sporting goods
3 FTW-098756 Tennis shoes ... FTW Footwear
[4 rows x 8 columns]
But we can avoid creating the local variable lookup_values_csv by shifting the read_csv call directly into the call to merge.
(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
    # read the lookup csv directly inside the merge call
    .merge(pd.read_csv('../datasets/sporting_goods_lookup.csv'),
           left_on = 'product_code',
           right_on = 'short_categories',
           how = 'left')
)
product_num product_name ... short_categories long_categories
0 SPR-004456 Skateboard ... SPR Sporting goods
1 SPR-003356 Basketball ... SPR Sporting goods
2 SPR-004499 Tennis balls ... SPR Sporting goods
3 FTW-098756 Tennis shoes ... FTW Footwear
[4 rows x 8 columns]
We can “replace” the variable name with anything that returns the value of the variable name. This has the advantage of reducing little-used variables … but if the operation is time-consuming, then storing the result in a variable can result in speed-ups.
Unnest
When we have little lists inside cells we can get separate rows for each piece using explode.
= """course_name,students_enrolled
csv_string "Divination","Harry, Ron"
"Herbology","Hermione, Harry"
"Flying Lessons","Harry, Hermione, Ron"
"""
# pd means pandas, which is a shortcut name we set up when we imported pandas.
= pd.read_csv(StringIO(csv_string))
df
df
course_name students_enrolled
0 Divination Harry, Ron
1 Herbology Hermione, Harry
2 Flying Lessons Harry, Hermione, Ron
First we use str.split to cut up the string, leaving a list of the parts inside the cell, rather than the single long string.
(df
    .assign(student_list = lambda df_: df_['students_enrolled'].str.split(', '))
)
course_name students_enrolled student_list
0 Divination Harry, Ron [Harry, Ron]
1 Herbology Hermione, Harry [Hermione, Harry]
2 Flying Lessons Harry, Hermione, Ron [Harry, Hermione, Ron]
Then explode creates a new row for each of the items in the list.
(df
    .assign(student_list = lambda df_: df_['students_enrolled'].str.split(', '))
    .drop(columns = ['students_enrolled'])
    .explode('student_list')
)
course_name student_list
0 Divination Harry
0 Divination Ron
1 Herbology Hermione
1 Herbology Harry
2 Flying Lessons Harry
2 Flying Lessons Hermione
2 Flying Lessons Ron
Reset index
In pandas each row has a sort of name, separate from the data itself. You can see that above, where the row numbers after explode are doubled.
We can adjust that using reset_index. By default reset_index will copy the old index into the table as a column called index, so we use drop = True to avoid that.
(df
    .assign(student = lambda df_: df_['students_enrolled'].str.split(', '))
    .drop(columns = ['students_enrolled'])
    .explode('student')
    .reset_index(drop = True)
)
course_name student
0 Divination Harry
1 Divination Ron
2 Herbology Hermione
3 Herbology Harry
4 Flying Lessons Harry
5 Flying Lessons Hermione
6 Flying Lessons Ron
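To see why drop = True matters, here is the same pipeline without it (a quick sketch; the original notes don’t show this output). The old index comes back as a new column called index.
(df
    .assign(student = lambda df_: df_['students_enrolled'].str.split(', '))
    .drop(columns = ['students_enrolled'])
    .explode('student')
    .reset_index()   # keeps the old index as a column called "index"
)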
Wider / Longer
= pd.read_csv("../datasets/world_pop_wide.csv")
df_pop
df_pop
country 2010 2011 2012 2013
0 Australia 18767085 19056040 19414834 19775013
1 Indonesia 383721793 393333604 403171286 413200994
2 United States 249849720 252208133 254614421 256953576
To make a wide format dataframe into a long format dataframe in pandas, we will use a function called melt. As with pivot_longer in tidyverse, we tell melt the id_vars, which are the columns that we don’t want to reshape.
(df_pop
    .melt(id_vars = ["country"], var_name = "year", value_name = "population")
)
country year population
0 Australia 2010 18767085
1 Indonesia 2010 383721793
2 United States 2010 249849720
3 Australia 2011 19056040
4 Indonesia 2011 393333604
5 United States 2011 252208133
6 Australia 2012 19414834
7 Indonesia 2012 403171286
8 United States 2012 254614421
9 Australia 2013 19775013
10 Indonesia 2013 413200994
11 United States 2013 256953576
The reverse move (long to wide) in pandas can be done with pivot.
# first store the long version.
df_pop_long = (df_pop
    .melt(id_vars = ["country"], var_name = "year", value_name = "population")
)

(df_pop_long
    .pivot(index = "country", columns = "year", values = "population")
)
year 2010 2011 2012 2013
country
Australia 18767085 19056040 19414834 19775013
Indonesia 383721793 393333604 403171286 413200994
United States 249849720 252208133 254614421 256953576
Notice, though, that pivot moves the “country” column out of the table into an index. So we will use reset_index to bring that column back into the table as a regular column.
(df_pop_long
    .pivot(index = "country", columns = "year", values = "population")
    .reset_index()
)
year country 2010 2011 2012 2013
0 Australia 18767085 19056040 19414834 19775013
1 Indonesia 383721793 393333604 403171286 413200994
2 United States 249849720 252208133 254614421 256953576
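One leftover: the columns still carry the name “year” (visible at the top left of the output above). If that bothers you, rename_axis can clear it; a sketch beyond the original notes:
(df_pop_long
    .pivot(index = "country", columns = "year", values = "population")
    .reset_index()
    .rename_axis(columns = None)   # remove the lingering "year" label on the columns
)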
Group/Aggregate
Pandas provides the groupby and agg methods, which we can use with a set of aggregation functions, such as sum and mean.
books_with_weights = pd.read_csv('../datasets/books_with_weights.csv')
books_with_weights
title series format weight_in_grams
0 Harry Potter 1 Harry Potter paperback 1456
1 Two Towers Lord of the Rings paperback 1800
2 Harry Potter 3 Harry Potter hardcover 2604
3 Harry Potter 2 Harry Potter hardcover 2900
4 Return of the King Lord of the Rings hardcover 1784
For reasons that will be clearer below, we will look at aggregations with grouping first, then look at aggregations without grouping.
We provide a groupby first, then specify the aggregation.
(books_with_weights
    .groupby('series')
    .agg(weight_for_group = ('weight_in_grams', 'sum'))
)
weight_for_group
series
Harry Potter 6960
Lord of the Rings 3584
The aggregation here is specified as new_column_name = (column_to_aggregate, function). We can do more than one within the agg call.
(books_with_weights
    .groupby('series')
    .agg(total_weight_for_group = ('weight_in_grams', 'sum'),
         mean_weight_for_group = ('weight_in_grams', 'mean'))
)
total_weight_for_group mean_weight_for_group
series
Harry Potter 6960 2320.0
Lord of the Rings 3584 1792.0
Note that the columns that we group by become indexes, and if we group by more than one column we get what is known as a multi-index, which is two levels of indexes.
(books_with_weights
    .groupby(['series', 'format'])
    .agg(weight_for_group = ('weight_in_grams', 'sum'))
)
weight_for_group
series format
Harry Potter hardcover 5504
paperback 1456
Lord of the Rings hardcover 1784
paperback 1800
These can be useful if you have further operations that use these groups (other pandas methods can use these indexes), but they can be confusing if you are thinking strictly in terms of DataFrames. Whether you have a single index or a multi-index, you can bring those back as familiar columns using reset_index.
(books_with_weights
    .groupby(['series', 'format'])
    .agg(weight_for_group = ('weight_in_grams', 'sum'))
    .reset_index()
)
series format weight_for_group
0 Harry Potter hardcover 5504
1 Harry Potter paperback 1456
2 Lord of the Rings hardcover 1784
3 Lord of the Rings paperback 1800
Aggregation without groupby
We can use agg without groupby to get the total weight of all books without grouping.
(books_with_weights
    .agg(total_weight = ('weight_in_grams', 'sum'),
         mean_weight = ('weight_in_grams', 'mean'))
)
weight_in_grams
total_weight 10544.0
mean_weight 2108.8
The output is a little different from the output above (and from what we might expect from SQL or tidyverse), which would be:
total_weight,mean_weight
10544,2108
Rather than a single row with two columns, we get a single column with two rows. There are probably good reasons for this, but it is inconsistent with how agg works with groupby, so it can be confusing. We can switch this into a more familiar format using transpose (which switches the rows and columns).
(books_with_weights
    .agg(total_weight = ('weight_in_grams', 'sum'),
         mean_weight = ('weight_in_grams', 'mean'))
    .transpose()
    # .reset_index(drop=True)
)
total_weight mean_weight
weight_in_grams 10544.0 2108.8
Above I’ve shown the named column approach for agg. It is also very common in pandas to call functions like sum directly. Done like that, sum works on all columns, so we have to reduce the columns sent:
(books_with_weights
    .filter(items = ['weight_in_grams'])
    .sum()
)
weight_in_grams 10544
dtype: int64
When sum is used like this it does not return a DataFrame. Instead it returns a Series. This is a single column outside a DataFrame (i.e. it isn’t a DataFrame with 1 column, it is just a list of items).
So what we have here is an unnamed Series with a single item (10544). The weight_in_grams part is an index for that item.
The tip-off that this change in type has happened is that the items display as labeled rows, rather than as columns (and have the cryptic dtype at the bottom).
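If you are ever unsure which you have, checking the type makes it obvious, and to_frame turns a Series back into a one-column DataFrame (a quick sketch):
result = books_with_weights.filter(items = ['weight_in_grams']).sum()
type(result)        # <class 'pandas.core.series.Series'>
result.to_frame()   # back to a DataFrame with one column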
If sum is called with multiple columns, each of them ends up as an item in this unnamed Series. And text columns can also be sent to sum, where it just concatenates the strings.
(books_with_weights
    .sum()
)
title Harry Potter 1Two TowersHarry Potter 3Harry Po...
series Harry PotterLord of the RingsHarry PotterHarry...
format paperbackpaperbackhardcoverhardcoverhardcover
weight_in_grams 10544
dtype: object
agg with named columns (as above) avoids these issues, and keeps everything as DataFrames, albeit with the quirk that agg without groupby produces a transposed dataframe.
Using agg to create a list (opposite of unnest)
As in our other frameworks, the opposite of unnest is implemented with a groupby and an aggregation.
If we start with our list of students and courses, where the grain is “a row represents a single student enrolled in a single course”.
course_name student
0 Divination Harry
0 Divination Ron
1 Herbology Hermione
1 Herbology Harry
2 Flying Lessons Harry
2 Flying Lessons Hermione
2 Flying Lessons Ron
And we would like to produce a list of courses taken by each student (here the table above is stored in a dataframe called course_student):
(course_student
    .groupby('student')
    .agg(course_list = ('course_name', list))
    .assign(courses_for_student = lambda df_: df_['course_list'].str.join(', '))
)
course_list courses_for_student
student
Harry [Divination, Herbology, Flying Lessons] Divination, Herbology, Flying Lessons
Hermione [Herbology, Flying Lessons] Herbology, Flying Lessons
Ron [Divination, Flying Lessons] Divination, Flying Lessons
Python pandas quick reference
transform | primary_command | another_option
---|---|---
Exclude rows | query |
Exclude columns | drop | filter
New columns | assign (with lambda) |
Combining columns | assign + f-strings with apply |
Treating cases differently | np.select inside assign |
Recode/lookup | replace | merge
Un-nest | explode |
Nest | groupby/agg/list |
Aggregate | agg |
Group Aggregate | groupby + agg |
Longer | melt |
Wider | pivot |