In pandas, perhaps more than in other frameworks, there are quite a few different ways to do the same thing. In this course we will learn a particular flavor of pandas; DataCamp homework may introduce additional techniques. For work submitted for this course, though, you will have to use the techniques presented below as much as possible.
Excluding rows
Pandas provides a query method which uses strings quite similar to SQL. For example, we can filter down to just some customers using:
df.query("customer_state == 'California' and customer_age > 80")
customer_name customer_age customer_state
2 Rupert Brown 89 California
Note that here I’m using double quotes " for the string in Python, which means that I have to use single quotes ' to surround textual query pieces ('California').
We can also use isin() within the query string and provide a list of options.
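A sketch of such a query, using a triple-quoted string (the list of states here is an assumption inferred from the rows in the output below):

df.query(
    """
    customer_state.isin(['California', 'Maharashtra'])
    """
)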
customer_name customer_age customer_state
0 Jihong Li 34 California
2 Rupert Brown 89 California
3 Ravi Shastri 62 Maharashtra
To break this syntax down a little, it can help to balance out all the open/close characters. Here I’m using a triple-quoted string in Python for a multi-line string (allowing newlines within the string).
We can exclude columns using drop, which takes a list of columns to exclude (the Python list is always inside square brackets).
df.drop(columns=['customer_age'])
customer_name customer_state
0 Jihong Li California
1 Sarah Brown Idaho
2 Rupert Brown California
3 Ravi Shastri Maharashtra
If you want to specify just the columns to keep, this can be done with filter (yes, in R tidyverse filter operates on rows, but here filter operates on columns).
df.filter(items=['customer_age'])
customer_age
0 34
1 66
2 89
3 62
filter is nifty because it can select columns using partial patterns. There are two modes: like, which matches substrings, and regex, for full-on regular expressions.
df.filter(like='age')
customer_age
0 34
1 66
2 89
3 62
df.filter(regex='_name')
customer_name
0 Jihong Li
1 Sarah Brown
2 Rupert Brown
3 Ravi Shastri
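We can also rename columns using rename, which takes a dictionary mapping old names to new names. A minimal sketch that matches the output below:

df.rename(columns={'customer_name': 'new_name'})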
new_name customer_age customer_state
0 Jihong Li 34 California
1 Sarah Brown 66 Idaho
2 Rupert Brown 89 California
3 Ravi Shastri 62 Maharashtra
Sort dataframe
We can sort our dataframe by the values within a column.
df.sort_values("customer_age", ascending =False)
customer_name customer_age customer_state
2 Rupert Brown 89 California
1 Sarah Brown 66 Idaho
3 Ravi Shastri 62 Maharashtra
0 Jihong Li 34 California
To sort it in reverse, we change that to ascending=True.
df.sort_values("customer_age", ascending=True)
customer_name customer_age customer_state
0 Jihong Li 34 California
3 Ravi Shastri 62 Maharashtra
1 Sarah Brown 66 Idaho
2 Rupert Brown 89 California
Chaining pandas methods together
We can string calls together to end up with a pipeline of commands working on a dataframe. To make this work, we have to surround the pipeline in ( and ).
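A sketch of such a pipeline that matches the output below (the age cutoff is an assumption inferred from which rows survive):

(df
 .query("customer_age < 80")
 .rename(columns={'customer_name': 'name', 'customer_state': 'state'})
 .drop(columns=['customer_age'])
)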
name state
0 Jihong Li California
1 Sarah Brown Idaho
3 Ravi Shastri Maharashtra
Creating new columns
We will create new columns using assign, which works well with chained pandas. First we will read in our other dataframe.
csv_string ="""product_num,product_name,price_in_dollars"SPR-004456","Skateboard",121"SPR-003356","Basketball",34"SPR-004499","Tennis balls",12"FTW-098756","Tennis shoes",79"""# pd means pandas, which is a shortcut name we set up when we imported pandas.df = pd.read_csv(StringIO(csv_string))df
To create a discounted price column we can use assign. Note that unlike in R tidyverse we can’t just use the column name price_in_dollars inside the assign … we have to include the kinda weird code lambda df_: df_ before it. You can think of this as making a copy of the dataframe at that point in the pipeline and creating a local variable name df_ to refer to that copy.
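A sketch, assuming a 10% discount for the discounted-price example; the output below comes from the second, label-building assign (the original output displays prices with two decimals, so the prices there may have been floats):

# a discounted price column, using assign with the lambda df_ pattern
(df
 .assign(discounted_price=lambda df_: df_['price_in_dollars'] * 0.9))

# building a text label by concatenating strings;
# .astype(str) converts the numeric price so it can be concatenated
(df
 .assign(product_label=lambda df_: df_['product_name']
         + ' just ' + df_['price_in_dollars'].astype(str)
         + '. Buy today'))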
product_num ... product_label
0 SPR-004456 ... Skateboard just 121.00. Buy today
1 SPR-003356 ... Basketball just 34.00. Buy today
2 SPR-004499 ... Tennis balls just 12.00. Buy today
3 FTW-098756 ... Tennis shoes just 79.00. Buy today
[4 rows x 4 columns]
Note that we need to convert the numeric price_in_dollars column to a string using .astype(str) before we can concatenate it with the text columns.
Treating different cases differently
Numpy provides the np.select function, which can help in providing different output for different conditions within a row. The np means this function comes from the numpy package, so we also have to load that. For example, we can create a new column saying whether an item is expensive, or cheap, depending on its price. We can also specify a default value.
Notice that the conditions and the outputs are a bit separated: we provide a list of conditions (condList) and a list of outputs (choiceList). They are matched up only by position in the lists (i.e. the first condition is given the first output), so beware of edits slipping the lists out of sync.
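A sketch, where the price thresholds and the default label are assumptions:

import numpy as np   # np.select comes from the numpy package

condList = [
    df['price_in_dollars'] >= 100,   # first condition ...
    df['price_in_dollars'] < 50,     # second condition ...
]
choiceList = [
    'expensive',                     # ... gets the first output
    'cheap',                         # ... gets the second output
]

(df
 .assign(price_category=np.select(condList, choiceList, default='mid-priced'))
)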
Extract/Separate
In pandas, you can extract content out of cells within a column using string functions. First we use .str on a column to get access to the string functions. These work “for each row” without using .apply.
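For example, splitting the product numbers on the dash (a sketch; the commented lines show the kind of result to expect):

df['product_num'].str.split('-')
# 0    [SPR, 004456]
# 1    [SPR, 003356]
# 2    [SPR, 004499]
# 3    [FTW, 098756]
# Name: product_num, dtype: object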
So we get a little 2-piece list back. In Python we can refer to parts of lists using numbers, starting at 0. So we can store the first part in one new column, and the second part in a second column.
If you want to restrict replacements to a particular column, you can provide a nested dictionary. Here 'product_code' is a key of the outer dictionary, and its value is an inner dictionary of old and new values.
(df
 .assign(
     product_code=lambda df_: df_['product_num'].str.split('-').str[0],
     product_num_no_code=lambda df_: df_['product_num'].str.split('-').str[1])
 # nested dictionary to replace only in the product_code column
 .replace({'product_code': {"SPR": "Sporting goods",
                            "FTW": "Footwear",
                            "ACW": "Active Wear"}})
)
Remember that code executes sort of “inside out”. This means that the computer first looks at function calls in the middle, resolving them to a value, before executing code outside it.
The principle of replacement in programming means that we can think through code by replacing a function call with the value it returns.
Honestly, it’s hard to put into words, but let’s see an example.
In this code we read a csv into a variable lookup_values_csv, then we use that variable name in the pipeline.
lookup_values_csv = pd.read_csv('../datasets/sporting_goods_lookup.csv')

(df
 .assign(
     product_code=lambda df_: df_['product_num'].str.split('-').str[0],
     product_num_no_code=lambda df_: df_['product_num'].str.split('-').str[1])
 # merge with the lookup table, matching product_code to short_categories
 .merge(lookup_values_csv,
        left_on='product_code',
        right_on='short_categories',
        how='left')
)
We can “replace” the variable name with the expression that produced its value. This has the advantage of reducing little-used variables, but if the operation is time-consuming, then storing the result in a variable can result in speed-ups.
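Applying that to the pipeline above, we could inline the read_csv call where the variable name appeared:

(df
 .assign(
     product_code=lambda df_: df_['product_num'].str.split('-').str[0],
     product_num_no_code=lambda df_: df_['product_num'].str.split('-').str[1])
 # the variable is "replaced" by the expression that produced its value
 .merge(pd.read_csv('../datasets/sporting_goods_lookup.csv'),
        left_on='product_code',
        right_on='short_categories',
        how='left')
)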
Unnest
When we have little lists inside cells we can get separate rows for each piece using explode.
csv_string ="""course_name,students_enrolled"Divination","Harry, Ron""Herbology","Hermione, Harry""Flying Lessons","Harry, Hermione, Ron""""# pd means pandas, which is a shortcut name we set up when we imported pandas.df = pd.read_csv(StringIO(csv_string))df
course_name students_enrolled
0 Divination Harry, Ron
1 Herbology Hermione, Harry
2 Flying Lessons Harry, Hermione, Ron
First we use str.split to cut up the string, leaving a list of the parts inside the cell, rather than the single long string.
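A sketch that matches the output below (the column name student_list is taken from that output):

(df
 .assign(student_list=lambda df_: df_['students_enrolled'].str.split(', '))
 .drop(columns=['students_enrolled'])
 .explode('student_list')
)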
course_name student_list
0 Divination Harry
0 Divination Ron
1 Herbology Hermione
1 Herbology Harry
2 Flying Lessons Harry
2 Flying Lessons Hermione
2 Flying Lessons Ron
Reset index
In pandas each row has a sort of name, separate from the data itself. You can see that above, where some row numbers after explode are repeated.
We can adjust that using reset_index. By default reset_index will copy the old index into the table as a column called index, so we use drop=True to avoid that.
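A sketch that matches the output below (note the exploded column is named student there):

(df
 .assign(student=lambda df_: df_['students_enrolled'].str.split(', '))
 .drop(columns=['students_enrolled'])
 .explode('student')
 .reset_index(drop=True)
)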
course_name student
0 Divination Harry
1 Divination Ron
2 Herbology Hermione
3 Herbology Harry
4 Flying Lessons Harry
5 Flying Lessons Hermione
6 Flying Lessons Ron
Chained pandas vs. Other Approaches
In this course, we’ve been using chained pandas, where multiple operations are strung together in a single pipeline. However, when you look at pandas code on the web, you’ll often see examples using many intermediate variables instead. Both approaches produce the same results; they’re just different styles.
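A sketch of the two styles doing the same job as the bracket-notation example further below:

# chained style, as used in this course
(movies_df
 .query("release_year >= 2000")
 .sort_values('imdb_rating', ascending=False)
 .filter(items=['movie_title', 'imdb_rating']))

# intermediate variables style, common in examples on the web
recent_movies = movies_df.query("release_year >= 2000")
sorted_movies = recent_movies.sort_values('imdb_rating', ascending=False)
result = sorted_movies[['movie_title', 'imdb_rating']]
result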
The chained approach is more concise and keeps the dataframe transformations organized in a clear pipeline. The intermediate-variables approach creates many temporary variables, which uses more memory and can make the code harder to follow.
Some online examples might also use the bracket notation style:
# Another alternative style
result = movies_df[movies_df['release_year'] >= 2000]
result = result.sort_values('imdb_rating', ascending=False)
result = result[['movie_title', 'imdb_rating']]
result
All three approaches are valid pandas, but the chained approach we’ve learned in this course is considered more modern and is part of a broader programming philosophy that emphasizes readable, maintainable data workflows.
Exercises
For these exercises, we’ll use a dataset of movies with information about their release year, genre, rating, budget, studio, and runtime. Copy this CSV string into your code to get started:
movie_title release_year ... studio runtime_minutes
0 The Matrix 1999 ... Warner Bros 136
1 Inception 2010 ... Warner Bros 148
2 Forrest Gump 1994 ... Paramount 142
3 Jurassic Park 1993 ... Universal 127
4 Titanic 1997 ... Paramount 194
5 Avatar 2009 ... 20th Century 162
6 The Dark Knight 2008 ... Warner Bros 152
7 Pulp Fiction 1994 ... Miramax 154
8 Gladiator 2000 ... Paramount 155
9 The Shawshank Redemption 1994 ... Columbia 142
10 The Godfather 1972 ... Paramount 175
11 Dune 2021 ... Warner Bros 156
12 Top Gun Maverick 2022 ... Paramount 131
13 Barbie 2023 ... Warner Bros 114
14 Oppenheimer 2023 ... Universal 180
[15 rows x 7 columns]
Exercise 1: Query and Filtering
Use the query() method to answer these questions:
Find all Drama movies with an IMDb rating above 8.5
Find all movies released after 2010 that are either Sci-Fi or Action genre
Find all movies from Warner Bros or Paramount with a budget over 100 million
Find all movies with a runtime longer than 150 minutes and a rating higher than 8.0
Example solution for: “Find all Action movies released in 2008 or later”
(movies_df .query("genre == 'Action' and release_year >= 2008"))
movie_title release_year ... studio runtime_minutes
6 The Dark Knight 2008 ... Warner Bros 152
12 Top Gun Maverick 2022 ... Paramount 131
[2 rows x 7 columns]
Exercise 2: Filtering and Chaining
Use chained methods to:
Filter to movies with a budget under 50 million, then sort by IMDb rating in descending order
Get all Sci-Fi movies, exclude the budget_millions column, then sort by movie_title alphabetically
Find Comedy or Drama movies released after 2020, then keep only the title and rating columns using filter()
Example solution for: “Get all movies from 2000 or later, sort by budget (highest first), and keep only title, year, and budget”
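A sketch of one possible solution (assuming the budget column is named budget_millions, as referenced in the tasks above):

(movies_df
 .query("release_year >= 2000")
 .sort_values('budget_millions', ascending=False)
 .filter(items=['movie_title', 'release_year', 'budget_millions']))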
Exercise 3: Creating New Columns with Conditional Logic
Create a new column budget_category that assigns:
- “Low Budget” for movies under $50 million
- “Medium Budget” for movies $50-150 million
- “High Budget” for movies over $150 million
Use np.select() inside an assign() call within a chained pipeline.
Hint: Follow the pattern shown in the “Treating different cases differently” section above.
Exercise 4: Recode with replace()
We are going to add a new column called studio_region that recodes the studio names into regions:
Warner Bros, 20th Century → “Major Studio”
Paramount, Universal → “Major Studio”
Columbia, Miramax → “Independent Studio”
Use the replace() method with a nested dictionary (restricted to one column) as shown in the replace() example above. You’ll need to create the studio_region column first as a copy of studio, then use replace() to recode the studio names.
movie_title release_year ... runtime_minutes studio_region
0 The Matrix 1999 ... 136 Major Studio
1 Inception 2010 ... 148 Major Studio
2 Forrest Gump 1994 ... 142 Major Studio
3 Jurassic Park 1993 ... 127 Major Studio
4 Titanic 1997 ... 194 Major Studio
5 Avatar 2009 ... 162 Major Studio
6 The Dark Knight 2008 ... 152 Major Studio
7 Pulp Fiction 1994 ... 154 Independent Studio
8 Gladiator 2000 ... 155 Major Studio
9 The Shawshank Redemption 1994 ... 142 Independent Studio
10 The Godfather 1972 ... 175 Major Studio
11 Dune 2021 ... 156 Major Studio
12 Top Gun Maverick 2022 ... 131 Major Studio
13 Barbie 2023 ... 114 Major Studio
14 Oppenheimer 2023 ... 180 Major Studio
[15 rows x 8 columns]
Exercise 5: Recode with merge()
Create a lookup table that maps studios to company types, then use merge() to add a new company_type column to the movies dataframe.
Step 1: Create a lookup table that maps each studio to its company type:
studio company_type
0 Warner Bros Major Conglomerate
1 Paramount Major Conglomerate
2 Universal Major Conglomerate
3 20th Century Major Conglomerate
4 Columbia Major Conglomerate
5 Miramax Independent
Step 2: Use merge() to add the company_type column to the movies dataframe, joining on the studio column with a left join. Follow the pattern shown in the merge example above.