Pandas Transforms

In Python we will use pandas for data frame wrangling. pandas is available as a Python package.

If we are working in Quarto we first need to make sure that pandas is installed and available. To do this we have to run an R cell.

```{r}
#install.packages("reticulate")
library(reticulate)
py_install(packages = c("pandas", "pyjanitor"))
```

Here we load a CSV-formatted string into a DataFrame.

import pandas as pd
from io import StringIO

# Sample CSV string
csv_string = """customer_name,customer_age,customer_state
"Jihong Li",34,"California"
"Sarah Brown",66,"Idaho"
"Rupert Brown",89,"California"
"Ravi Shastri",62,"Maharashtra"
"""

# Read the CSV string into a DataFrame
df = pd.read_csv(StringIO(csv_string))

df
  customer_name  customer_age customer_state
0     Jihong Li            34     California
1   Sarah Brown            66          Idaho
2  Rupert Brown            89     California
3  Ravi Shastri            62    Maharashtra
Note

If this code gives an error we need to install packages. In RStudio we can do this using the following code, run in an R cell.

```{r}
install.packages("reticulate")
library(reticulate)
py_install("pandas")
```

In pandas, perhaps more than in other frameworks, there are quite a few different ways to do the same thing. In this course we will learn a particular flavor of pandas; DataCamp homework may introduce additional techniques. For work submitted in this course, though, you should use the techniques presented below as much as possible.

Excluding rows

Pandas provides a query method which uses strings quite similar to SQL. For example, we can filter down to just some customers using:

df.query("customer_state == 'California' and customer_age > 80")
  customer_name  customer_age customer_state
2  Rupert Brown            89     California

Note that here I’m using double quotes " for the string in Python, which means that I have to use single quotes ' to surround textual query pieces ('California').
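The quoting can equally be flipped: a single-quoted Python string with double quotes inside the query. A minimal sketch using the same df (the expected-output comment is mine, not captured output):

df.query('customer_state == "Idaho"')
# expected: just the Sarah Brown / Idaho row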

We can also use isin() within the query string and provide a list of options.

df.query("customer_state.isin(['California', 'Maharashtra'])")
  customer_name  customer_age customer_state
0     Jihong Li            34     California
2  Rupert Brown            89     California
3  Ravi Shastri            62    Maharashtra

To break this syntax down a little, it can help to balance out all the open/close characters. Here I’m using a triple-quoted string in Python for a multi-line string (allowing newlines within the string).

df.query("""
  customer_state.isin(
    [
     'California', 'Maharashtra'
    ]
  )
""")

Excluding columns

We can exclude columns using drop, which takes a list of columns to exclude (a Python list is always inside square brackets).

df.drop(columns = ['customer_age'])
  customer_name customer_state
0     Jihong Li     California
1   Sarah Brown          Idaho
2  Rupert Brown     California
3  Ravi Shastri    Maharashtra

If you want to specify just the columns to keep, this can be done with filter (yes, in the R tidyverse filter works on rows, but here filter selects columns).

df.filter(items = ['customer_age'])
   customer_age
0            34
1            66
2            89
3            62

filter is nifty because it can select columns using partial patterns. There are two modes: like, which matches a substring of the column name, and regex, which matches a full regular expression.

df.filter(like = 'age')
   customer_age
0            34
1            66
2            89
3            62
df.filter(regex = '_name')
  customer_name
0     Jihong Li
1   Sarah Brown
2  Rupert Brown
3  Ravi Shastri

Renaming columns

We can rename columns with rename, which takes a dictionary mapping old names to new names.

df.rename(columns = {'customer_name': 'new_name'})
       new_name  customer_age customer_state
0     Jihong Li            34     California
1   Sarah Brown            66          Idaho
2  Rupert Brown            89     California
3  Ravi Shastri            62    Maharashtra

Sort dataframe

We can sort our dataframe by the values within a column.

df.sort_values("customer_age", ascending = False)
  customer_name  customer_age customer_state
2  Rupert Brown            89     California
1   Sarah Brown            66          Idaho
3  Ravi Shastri            62    Maharashtra
0     Jihong Li            34     California

To sort the other way, smallest value first, we change that to ascending = True (which is actually the default).

df.sort_values("customer_age", ascending = True)
  customer_name  customer_age customer_state
0     Jihong Li            34     California
3  Ravi Shastri            62    Maharashtra
1   Sarah Brown            66          Idaho
2  Rupert Brown            89     California
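
A side note, in case you need it: sort_values also accepts a list of columns (with a matching list of ascending flags), sorting by the first column and breaking ties with the next. A minimal sketch on the same df; the comment describes the order I’d expect, not captured output:

df.sort_values(["customer_state", "customer_age"],
               ascending = [True, False])
# expected: California (Rupert 89, then Jihong 34), then Idaho, then Maharashtra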

Chaining pandas methods together

We can string calls together, to end up with a pipeline of commands working with a dataframe. To make this work we have to surround the pipeline in ( and ).

(df
  .query("customer_age < 80")
  .sort_values("customer_state")
  .drop(columns = ['customer_age'])
  .rename(columns = {'customer_name': 'name', 
                    'customer_state': 'state'})
)
           name        state
0     Jihong Li   California
1   Sarah Brown        Idaho
3  Ravi Shastri  Maharashtra

Creating new columns

We will create new columns using assign which works well with chained pandas. First we will read in our other dataframe.

csv_string =  """product_num,product_name,price_in_dollars
"SPR-004456","Skateboard",121
"SPR-003356","Basketball",34
"SPR-004499","Tennis balls",12
"FTW-098756","Tennis shoes",79"""

# pd means pandas, which is a shortcut name we set up when we imported pandas.
df = pd.read_csv(StringIO(csv_string))

df
  product_num  product_name  price_in_dollars
0  SPR-004456    Skateboard               121
1  SPR-003356    Basketball                34
2  SPR-004499  Tennis balls                12
3  FTW-098756  Tennis shoes                79

To create a discounted price column we can use assign. Note that unlike in R tidyverse we can’t just use the column name price_in_dollars inside the assign … we have to include the kinda weird code lambda df_: df_ before it. You can think of this as making a copy of the dataframe at that point in the pipeline and creating a local variable name df_ to refer to that copy.

(df
  .assign(discounted_price = lambda df_: df_.price_in_dollars * 0.9)
)
  product_num  product_name  price_in_dollars  discounted_price
0  SPR-004456    Skateboard               121             108.9
1  SPR-003356    Basketball                34              30.6
2  SPR-004499  Tennis balls                12              10.8
3  FTW-098756  Tennis shoes                79              71.1

We can combine multiple columns, as with str_glue in the tidyverse or CONCAT in SQL, by concatenating string columns together.

We can use the + operator to combine columns into a new string column within a chained pipeline:

(df
  .assign(product_label = lambda df_:
    df_['product_name'] 
      + ' just ' 
      + df_['price_in_dollars'].astype(str) 
      + '.00. Buy today'
  )
)
  product_num  ...                       product_label
0  SPR-004456  ...   Skateboard just 121.00. Buy today
1  SPR-003356  ...    Basketball just 34.00. Buy today
2  SPR-004499  ...  Tennis balls just 12.00. Buy today
3  FTW-098756  ...  Tennis shoes just 79.00. Buy today

[4 rows x 4 columns]

Note that we need to convert the numeric price_in_dollars column to a string using .astype(str) before we can concatenate it with the text columns.
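
To see why: adding a string column to a numeric column raises a TypeError instead of converting silently. A minimal sketch of the failure, wrapped in try/except so it still runs (the exact message depends on your pandas version):

try:
    df['product_name'] + ' just ' + df['price_in_dollars']
except TypeError as err:
    print('needs astype(str) first:', err)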

Treating different cases differently

NumPy provides the np.select function, which helps in producing different output for different conditions within a row. The np prefix means this function comes from the numpy package, so we also have to import that. For example, we can create a new column saying whether an item is expensive or cheap, depending on its price. We can also specify a default value.

import numpy as np
(
 df.assign(product_label = lambda df_: np.select(
              condlist=[
                  df_['price_in_dollars'] > 100,
                  df_['price_in_dollars'].between(50, 100, inclusive = "both"),
                  df_['price_in_dollars'] < 50
              ],
              choicelist=['Expensive', 'Medium Price', 'Cheap'],
              default='Unknown'
          )
 )
)
  product_num  product_name  price_in_dollars product_label
0  SPR-004456    Skateboard               121     Expensive
1  SPR-003356    Basketball                34         Cheap
2  SPR-004499  Tennis balls                12         Cheap
3  FTW-098756  Tennis shoes                79  Medium Price

Notice that the conditions and the outputs are a bit separated … we provide a list of conditions (condlist) and a list of outputs (choicelist) … they are matched up only by position in the lists (i.e. the first condition is given the first output), so beware of edits letting the two lists slip out of alignment.
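
None of the rows above actually hit the default, so here is a minimal standalone sketch of when it applies: comparisons against a missing value are all False, so no condition matches and np.select falls back to the default (the NaN price here is made up for illustration):

prices = pd.Series([121, np.nan])
np.select(
    condlist   = [prices > 100, prices <= 100],
    choicelist = ['Expensive', 'Cheap'],
    default    = 'Unknown'
)
# expected: array(['Expensive', 'Unknown'], ...) -- the NaN row gets the default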

Extract/Separate

In pandas, you can extract content out of cells within a column using string functions. First we use str on a column to get access to the string functions. These work “for each row” without using .apply.

df['product_num'].str.split('-')
0    [SPR, 004456]
1    [SPR, 003356]
2    [SPR, 004499]
3    [FTW, 098756]
Name: product_num, dtype: object

So we get a little two-piece list back. In Python we can refer to parts of lists by position, starting at 0. So we can store the first part in one new column, and the second part in a second new column.
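
A quick plain-Python sketch of that indexing, outside of any dataframe:

parts = 'SPR-004456'.split('-')   # ['SPR', '004456']
parts[0]                          # 'SPR' -- positions start at 0
parts[1]                          # '004456'

The .str[0] and .str[1] below do the same thing for every row at once.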

(df
  .assign(       product_code = lambda df_: df_['product_num'].str.split('-').str[0],
          product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
)
  product_num  product_name  price_in_dollars product_code product_num_no_code
0  SPR-004456    Skateboard               121          SPR              004456
1  SPR-003356    Basketball                34          SPR              003356
2  SPR-004499  Tennis balls                12          SPR              004499
3  FTW-098756  Tennis shoes                79          FTW              098756

Recode

Pandas can accomplish recoding values either across the whole dataframe, or restricted to just one column.

(df
  .assign(       product_code = lambda df_: df_['product_num'].str.split('-').str[0],
          product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
  # dictionary to replace across entire table
  .replace({"SPR": "Sporting goods",
            "FTW": "Footwear",
            "ACW": "Active Wear"}
  )
)
  product_num  product_name  ...    product_code product_num_no_code
0  SPR-004456    Skateboard  ...  Sporting goods              004456
1  SPR-003356    Basketball  ...  Sporting goods              003356
2  SPR-004499  Tennis balls  ...  Sporting goods              004499
3  FTW-098756  Tennis shoes  ...        Footwear              098756

[4 rows x 5 columns]

If you want to restrict replacements to a particular column, you can provide a nested dictionary. Here 'product_code' is a key of the outer dictionary, and its value is the inner dictionary of old and new values.

(df
  .assign(       product_code = lambda df_: df_['product_num'].str.split('-').str[0],
          product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
  # nested dictionary to replace only in the product_code column
  .replace({'product_code': {
      "SPR": "Sporting goods",
      "FTW": "Footwear",
      "ACW": "Active Wear"
      }
  })
)
  product_num  product_name  ...    product_code product_num_no_code
0  SPR-004456    Skateboard  ...  Sporting goods              004456
1  SPR-003356    Basketball  ...  Sporting goods              003356
2  SPR-004499  Tennis balls  ...  Sporting goods              004499
3  FTW-098756  Tennis shoes  ...        Footwear              098756

[4 rows x 5 columns]

Pandas can also recode using a join. In pandas, joins are performed with the merge method.

lookup_values = pd.DataFrame({
    'short_categories': ['SPR', 'FTW', 'ACW'],
    'long_categories': ["Sporting goods", "Footwear", "Active Wear"]
})

(df
  .assign(       product_code = lambda df_: df_['product_num'].str.split('-').str[0],
          product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
  # merge the lookup table to recode product_code
  .merge(lookup_values,
         left_on = 'product_code', 
         right_on = 'short_categories',
         how='left'
  )
)
  product_num  product_name  ...  short_categories long_categories
0  SPR-004456    Skateboard  ...               SPR  Sporting goods
1  SPR-003356    Basketball  ...               SPR  Sporting goods
2  SPR-004499  Tennis balls  ...               SPR  Sporting goods
3  FTW-098756  Tennis shoes  ...               FTW        Footwear

[4 rows x 7 columns]
Note

Remember that code executes sort of “inside out”. This means that the computer first evaluates the innermost function calls, resolving them to a value, before executing the code around them.

The principle of replacement in programming means that we can think through code by replacing a function call with the value it returns.

Honestly, it’s hard to put into words, so let’s see an example.

In this code we read a csv into a variable lookup_values, then we use that variable name in the pipeline.

lookup_values_csv = pd.read_csv('../datasets/sporting_goods_lookup.csv')

(df
  .assign(       product_code = lambda df_: df_['product_num'].str.split('-').str[0],
          product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
  # merge the lookup table (read from CSV above) to recode product_code
  .merge(lookup_values_csv,
         left_on = 'product_code', 
         right_on = 'short_categories',
         how='left'
  )
)
  product_num  product_name  ...  short_categories long_categories
0  SPR-004456    Skateboard  ...               SPR  Sporting goods
1  SPR-003356    Basketball  ...               SPR  Sporting goods
2  SPR-004499  Tennis balls  ...               SPR  Sporting goods
3  FTW-098756  Tennis shoes  ...               FTW        Footwear

[4 rows x 8 columns]

But we can avoid creating the local variable lookup_values_csv by shifting the read_csv call directly into the call to merge.

(df
  .assign(       product_code = lambda df_: df_['product_num'].str.split('-').str[0],
          product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
  # merge the lookup table, reading the CSV directly inline
  .merge(pd.read_csv('../datasets/sporting_goods_lookup.csv'),
         left_on = 'product_code', 
         right_on = 'short_categories',
         how='left'
  )
)
  product_num  product_name  ...  short_categories long_categories
0  SPR-004456    Skateboard  ...               SPR  Sporting goods
1  SPR-003356    Basketball  ...               SPR  Sporting goods
2  SPR-004499  Tennis balls  ...               SPR  Sporting goods
3  FTW-098756  Tennis shoes  ...               FTW        Footwear

[4 rows x 8 columns]

We can “replace” the variable name with any expression that returns the same value. This has the advantage of reducing little-used variables … but if the operation is time-consuming, storing the result in a variable means the work only runs once, which can be faster.

Unnest

When we have little lists inside cells we can get separate rows for each piece using explode.

csv_string =  """course_name,students_enrolled
"Divination","Harry, Ron"
"Herbology","Hermione, Harry"
"Flying Lessons","Harry, Hermione, Ron"
"""

# pd means pandas, which is a shortcut name we set up when we imported pandas.
df = pd.read_csv(StringIO(csv_string))

df
      course_name     students_enrolled
0      Divination            Harry, Ron
1       Herbology       Hermione, Harry
2  Flying Lessons  Harry, Hermione, Ron

First we use str.split to cut up the string, leaving a list of the parts inside the cell, rather than the single long string.

(df
  .assign(student_list = lambda df_: df_['students_enrolled'].str.split(', '))
)
      course_name     students_enrolled            student_list
0      Divination            Harry, Ron            [Harry, Ron]
1       Herbology       Hermione, Harry       [Hermione, Harry]
2  Flying Lessons  Harry, Hermione, Ron  [Harry, Hermione, Ron]

Then explode creates a new row for each of the items in the list.

(df
  .assign(student_list = lambda df_: df_['students_enrolled'].str.split(', '))
  .drop(columns = ['students_enrolled'])
  .explode('student_list') 
)
      course_name student_list
0      Divination        Harry
0      Divination          Ron
1       Herbology     Hermione
1       Herbology        Harry
2  Flying Lessons        Harry
2  Flying Lessons     Hermione
2  Flying Lessons          Ron

Reset index

In pandas each row has a sort of name (the index), separate from the data itself. You can see that above, where the row labels after explode are duplicated.

We can adjust that using reset_index. By default reset_index will copy the old index into the table as a column called index, so we use drop = True to avoid that.
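
To see that default, here is a sketch of the pipeline below with reset_index() left bare; the old labels should come back as a column named index holding 0, 0, 1, 1, 2, 2, 2 (output not captured here):

(df
  .assign(student = lambda df_: df_['students_enrolled'].str.split(', '))
  .drop(columns = ['students_enrolled'])
  .explode('student')
  .reset_index()
)

With drop = True we get a clean new index instead: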

(df
  .assign(student = lambda df_: df_['students_enrolled'].str.split(', '))
  .drop(columns = ['students_enrolled'])
  .explode('student') 
  .reset_index(drop = True)
)
      course_name   student
0      Divination     Harry
1      Divination       Ron
2       Herbology  Hermione
3       Herbology     Harry
4  Flying Lessons     Harry
5  Flying Lessons  Hermione
6  Flying Lessons       Ron

Chained pandas vs. Other Approaches

In this course, we’ve been using chained pandas, where multiple operations are strung together in a single pipeline. However, when you look at pandas code on the web, you’ll often see examples using many intermediate variables instead. Both approaches produce the same results—they’re just different styles.

Chained approach (what we’re using)

(movies_df
  .query("release_year >= 2000")
  .sort_values("imdb_rating", ascending=False)
  .filter(items=['movie_title', 'imdb_rating'])
)

Intermediate variables approach

# Step 1: Filter
movies_2000_plus = movies_df.query("release_year >= 2000")

# Step 2: Sort
movies_sorted = movies_2000_plus.sort_values("imdb_rating", ascending=False)

# Step 3: Select columns
result = movies_sorted.filter(items=['movie_title', 'imdb_rating'])

result

Both approaches do exactly the same thing. The chained approach is more concise and keeps the dataframe transformations organized in a clear pipeline. The intermediate variables approach creates temporary variables that linger in memory and can make the code harder to follow.

Some online examples might also use the bracket notation style:

# Another alternative style
result = movies_df[movies_df['release_year'] >= 2000]
result = result.sort_values('imdb_rating', ascending=False)
result = result[['movie_title', 'imdb_rating']]

result

All three approaches are valid pandas, but the chained approach we’ve learned in this course is considered more modern and is part of a broader programming philosophy that emphasizes readable, maintainable data workflows.

Exercises

For these exercises, we’ll use a dataset of movies with information about their release year, genre, rating, budget, studio, and runtime. Copy this CSV string into your code to get started:

movies_csv = """movie_title,release_year,genre,imdb_rating,budget_millions,studio,runtime_minutes
The Matrix,1999,Sci-Fi,8.7,63,Warner Bros,136
Inception,2010,Sci-Fi,8.8,160,Warner Bros,148
Forrest Gump,1994,Drama,8.8,55,Paramount,142
Jurassic Park,1993,Sci-Fi,8.2,63,Universal,127
Titanic,1997,Drama,7.9,200,Paramount,194
Avatar,2009,Sci-Fi,7.8,237,20th Century,162
The Dark Knight,2008,Action,9.0,185,Warner Bros,152
Pulp Fiction,1994,Crime,8.9,8,Miramax,154
Gladiator,2000,Action,8.5,103,Paramount,155
The Shawshank Redemption,1994,Drama,9.3,25,Columbia,142
The Godfather,1972,Crime,9.2,6,Paramount,175
Dune,2021,Sci-Fi,8.0,165,Warner Bros,156
Top Gun Maverick,2022,Action,8.4,170,Paramount,131
Barbie,2023,Comedy,7.0,145,Warner Bros,114
Oppenheimer,2023,Drama,8.4,100,Universal,180"""

movies_df = pd.read_csv(StringIO(movies_csv))
movies_df
                 movie_title  release_year  ...        studio  runtime_minutes
0                 The Matrix          1999  ...   Warner Bros              136
1                  Inception          2010  ...   Warner Bros              148
2               Forrest Gump          1994  ...     Paramount              142
3              Jurassic Park          1993  ...     Universal              127
4                    Titanic          1997  ...     Paramount              194
5                     Avatar          2009  ...  20th Century              162
6            The Dark Knight          2008  ...   Warner Bros              152
7               Pulp Fiction          1994  ...       Miramax              154
8                  Gladiator          2000  ...     Paramount              155
9   The Shawshank Redemption          1994  ...      Columbia              142
10             The Godfather          1972  ...     Paramount              175
11                      Dune          2021  ...   Warner Bros              156
12          Top Gun Maverick          2022  ...     Paramount              131
13                    Barbie          2023  ...   Warner Bros              114
14               Oppenheimer          2023  ...     Universal              180

[15 rows x 7 columns]

Exercise 1: Query and Filtering

Use the query() method to answer these questions:

  1. Find all Drama movies with an IMDb rating above 8.5
  2. Find all movies released after 2010 that are either Sci-Fi or Action genre
  3. Find all movies from Warner Bros or Paramount with a budget over 100 million
  4. Find all movies with a runtime longer than 150 minutes and a rating higher than 8.0

Example solution for: “Find all Action movies released in 2008 or later”

(movies_df
  .query("genre == 'Action' and release_year >= 2008")
)
         movie_title  release_year  ...       studio  runtime_minutes
6    The Dark Knight          2008  ...  Warner Bros              152
12  Top Gun Maverick          2022  ...    Paramount              131

[2 rows x 7 columns]

Exercise 2: Filtering and Chaining

Use chained methods to:

  1. Filter to movies with a budget under 50 million, then sort by IMDb rating in descending order
  2. Get all Sci-Fi movies, exclude the budget_millions column, then sort by movie_title alphabetically
  3. Find Comedy or Drama movies released after 2020, then keep only the title and rating columns using filter()

Example solution for: “Get all movies from 2000 or later, sort by budget (highest first), and keep only title, year, and budget”

(movies_df
  .query("release_year >= 2000")
  .filter(items=['movie_title', 'release_year', 'budget_millions'])
  .sort_values("budget_millions", ascending=False)
)
         movie_title  release_year  budget_millions
5             Avatar          2009              237
6    The Dark Knight          2008              185
12  Top Gun Maverick          2022              170
11              Dune          2021              165
1          Inception          2010              160
13            Barbie          2023              145
8          Gladiator          2000              103
14       Oppenheimer          2023              100

Exercise 3: Creating New Columns with Conditional Logic

Create a new column budget_category that assigns:

  • “Low Budget” for movies under $50 million
  • “Medium Budget” for movies $50-150 million
  • “High Budget” for movies over $150 million

Use np.select() inside an assign() call within a chained pipeline.

Hint: Follow the pattern shown in the “Treating different cases differently” section above.

Exercise 4: Recode with replace()

We are going to add a new column called studio_region that recodes the studio names into broader categories:

  • Warner Bros, 20th Century → “Major Studio”
  • Paramount, Universal → “Major Studio”
  • Columbia, Miramax → “Independent Studio”

Use the replace() method with a nested dictionary (restricted to one column) as shown in the Recode section. You’ll need to create the studio_region column first as a copy of studio, then use replace() to recode the studio names.

Incomplete example pattern to follow:

(movies_df
  .assign(studio_region = lambda df_: df_['studio'])
  .replace({'studio_region': {
      'Warner Bros': 'Major Studio',
      'Paramount': 'Major Studio'
      }
  })
)
                 movie_title  release_year  ... runtime_minutes  studio_region
0                 The Matrix          1999  ...             136   Major Studio
1                  Inception          2010  ...             148   Major Studio
2               Forrest Gump          1994  ...             142   Major Studio
3              Jurassic Park          1993  ...             127      Universal
4                    Titanic          1997  ...             194   Major Studio
5                     Avatar          2009  ...             162   20th Century
6            The Dark Knight          2008  ...             152   Major Studio
7               Pulp Fiction          1994  ...             154        Miramax
8                  Gladiator          2000  ...             155   Major Studio
9   The Shawshank Redemption          1994  ...             142       Columbia
10             The Godfather          1972  ...             175   Major Studio
11                      Dune          2021  ...             156   Major Studio
12          Top Gun Maverick          2022  ...             131   Major Studio
13                    Barbie          2023  ...             114   Major Studio
14               Oppenheimer          2023  ...             180      Universal

[15 rows x 8 columns]

Exercise 5: Recode with merge()

Create a lookup table that maps studios to company types, then use merge() to add a new company_type column to the movies dataframe.

Step 1: Create a lookup table that maps each studio to its company type:

studio_csv = """studio,company_type
Warner Bros,Major Conglomerate
Paramount,Major Conglomerate
Universal,Major Conglomerate
20th Century,Major Conglomerate
Columbia,Major Conglomerate
Miramax,Independent"""

studio_lookup = pd.read_csv(StringIO(studio_csv))

studio_lookup
         studio        company_type
0   Warner Bros  Major Conglomerate
1     Paramount  Major Conglomerate
2     Universal  Major Conglomerate
3  20th Century  Major Conglomerate
4      Columbia  Major Conglomerate
5       Miramax         Independent

Step 2: Use merge() to add the company_type column to the movies dataframe, joining on the studio column with a left join. Follow the pattern shown in the Recode section.