Pandas Transforms
In Python we will use pandas for data frame wrangling. Here we load a CSV-formatted string into a dataframe.
import pandas as pd
from io import StringIO
# Sample CSV string
= """customer_name,customer_age,customer_state
csv_string "Jihong Li",34,"California"
"Sarah Brown",66,"Idaho"
"Rupert Brown",89,"California"
"Ravi Shastri",62,"Maharashtra"
"""
# Read the CSV string into a DataFrame
= pd.read_csv(StringIO(csv_string))
df
df
customer_name customer_age customer_state
0 Jihong Li 34 California
1 Sarah Brown 66 Idaho
2 Rupert Brown 89 California
3 Ravi Shastri 62 Maharashtra
In pandas, perhaps more than in other frameworks, there are quite a few different ways to do things. In this course we will learn a particular flavor of pandas; DataCamp homework may introduce additional techniques. For work submitted for this course, though, you should use the techniques presented below as much as possible.
Excluding rows
Pandas provides a query method which uses strings quite similar to SQL. For example, we can filter down to just some customers using:
"customer_state == 'California' and customer_age > 80") df.query(
customer_name customer_age customer_state
2 Rupert Brown 89 California
Note that here I’m using double quotes (") for the string in Python, which means that I have to use single quotes (') to surround textual query pieces ('California').
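By the way, the quoting can also be flipped: a single-quoted Python string with double quotes inside the query works just the same (a quick sketch, not from the notes above):
df.query('customer_state == "California" and customer_age > 80')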
We can also use isin() within the query string and provide a list of options.
"customer_state.isin(['California', 'Maharashtra'])") df.query(
customer_name customer_age customer_state
0 Jihong Li 34 California
2 Rupert Brown 89 California
3 Ravi Shastri 62 Maharashtra
To break this syntax down a little it can help to balance out all the open/close characters. Here I’m using a triple-quoted string in Python for a multi-line string (allowing newlines within the string).
"""
df.query( customer_state.isin(
[
'California', 'Maharashtra'
]
)
""")
Excluding columns
We can exclude columns using drop, which takes a list of columns to exclude (a Python list is always inside square brackets):
df.drop(columns = ['customer_age'])
customer_name customer_state
0 Jihong Li California
1 Sarah Brown Idaho
2 Rupert Brown California
3 Ravi Shastri Maharashtra
If you want to specify just the columns to keep, this can be done with filter (yes, in the R tidyverse filter works on rows, but here filter is selecting columns).
df.filter(items = ['customer_age'])
customer_age
0 34
1 66
2 89
3 62
filter is nifty because it can select columns using partial patterns. There are two modes: like, which matches parts of words, and regex, which is full-on regular expressions.
df.filter(like = 'age')
customer_age
0 34
1 66
2 89
3 62
df.filter(regex = '_name')
customer_name
0 Jihong Li
1 Sarah Brown
2 Rupert Brown
3 Ravi Shastri
Renaming columns
df.rename(columns = {'customer_name': 'new_name'})
new_name customer_age customer_state
0 Jihong Li 34 California
1 Sarah Brown 66 Idaho
2 Rupert Brown 89 California
3 Ravi Shastri 62 Maharashtra
Sort dataframe
We can sort our dataframe by the values within a column.
"customer_age", ascending = False) df.sort_values(
customer_name customer_age customer_state
2 Rupert Brown 89 California
1 Sarah Brown 66 Idaho
3 Ravi Shastri 62 Maharashtra
0 Jihong Li 34 California
To sort it in the other direction, we change that to ascending = True.
"customer_age", ascending = True) df.sort_values(
customer_name customer_age customer_state
0 Jihong Li 34 California
3 Ravi Shastri 62 Maharashtra
1 Sarah Brown 66 Idaho
2 Rupert Brown 89 California
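sort_values can also take a list of columns, with a matching list of ascending flags; a quick sketch (not part of the original examples):
# sort by state A-Z, breaking ties by age, oldest first
df.sort_values(["customer_state", "customer_age"], ascending = [True, False])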
Chaining pandas methods together
We can string calls together, to end up with a pipeline of commands working with a dataframe. To make this work we have to surround the pipeline in ( and ).
(df"customer_age < 80")
.query("customer_state")
.sort_values(= ['customer_age'])
.drop(columns = {'customer_name': 'name',
.rename(columns 'customer_state': 'state'})
)
name state
0 Jihong Li California
1 Sarah Brown Idaho
3 Ravi Shastri Maharashtra
Creating new columns
We will create new columns using assign, which works well with chained pandas. First we will read in our other dataframe.
= """product_num,product_name,price_in_dollars
csv_string "SPR-004456","Skateboard",121
"SPR-003356","Basketball",34
"SPR-004499","Tennis balls",12
"FTW-098756","Tennis shoes",79"""
# pd means pandas, which is a shortcut name we set up when we imported pandas.
= pd.read_csv(StringIO(csv_string))
df
df
product_num product_name price_in_dollars
0 SPR-004456 Skateboard 121
1 SPR-003356 Basketball 34
2 SPR-004499 Tennis balls 12
3 FTW-098756 Tennis shoes 79
To create a discounted price column we can use assign. Note that unlike in R tidyverse we can’t just use the column name price_in_dollars inside the assign … we have to include the kinda weird code lambda df_: df_ before it. You can think of this as making a copy of the dataframe at that point in the pipeline and creating a local variable name df_ to refer to that copy.
(df
    .assign(discounted_price = lambda df_: df_.price_in_dollars * 0.9)
)
product_num product_name price_in_dollars discounted_price
0 SPR-004456 Skateboard 121 108.9
1 SPR-003356 Basketball 34 30.6
2 SPR-004499 Tennis balls 12 10.8
3 FTW-098756 Tennis shoes 79 71.1
We can join multiple columns as with str_glue in tidyverse and CONCAT in SQL, using what are called “f-strings” together with apply.
An f-string takes local variables and interpolates them into longer strings. This is very much like str_glue in R. For example:
= "James"
name = "Cricket"
sport
= f"The student {name} has a hobby of {sport}"
joined_up_string
joined_up_string
'The student James has a hobby of Cricket'
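A small extra (standard Python, beyond what the notes above use): f-strings also accept format specifiers after a colon, which is handy for prices.
price = 121
f"Only ${price:.2f} today"
'Only $121.00 today'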
To use this for every row in a dataframe we can use the apply method. This applies a piece of code to each row. The “lambda” is the small piece of code applied. This is quite like copying a formula down beside a table in Excel.
We can think of this as getting a copy of each row, then we can refer to the columns in that row using row['column_name']. The axis = 1 part means “apply the function to every row”.
new_column = df.apply(lambda row: f"{row['product_name']} just {row['price_in_dollars']} + .00. Buy today", axis = 1)

df.assign(product_label = new_column)
product_num ... product_label
0 SPR-004456 ... Skateboard just 121 + .00. Buy today
1 SPR-003356 ... Basketball just 34 + .00. Buy today
2 SPR-004499 ... Tennis balls just 12 + .00. Buy today
3 FTW-098756 ... Tennis shoes just 79 + .00. Buy today
[4 rows x 4 columns]
It is possible to use this f-string approach to interpolating rows to add a new column within a chained pandas pipeline, but it is a bit ugly, involving two lambda functions (one for a copy of the df, one for a copy of the row within the apply). I’d love to know how to do this better.
(df
    .assign(product_label = lambda df_:
        df_.apply(lambda row:
            f"{row['product_name']} just {row['price_in_dollars']} + .00. Buy today",
            axis = 1)
    )
)
product_num ... product_label
0 SPR-004456 ... Skateboard just 121 + .00. Buy today
1 SPR-003356 ... Basketball just 34 + .00. Buy today
2 SPR-004499 ... Tennis balls just 12 + .00. Buy today
3 FTW-098756 ... Tennis shoes just 79 + .00. Buy today
[4 rows x 4 columns]
Treating different cases differently
The numpy package provides the np.select function, which can help in providing different output for different conditions within a row. The np prefix means the function comes from the numpy package, so we also have to import that. For example, we can create a new column saying whether an item is expensive, or cheap, depending on its price. We can also specify a default value.
import numpy as np
(df
    .assign(product_label = lambda df_: np.select(
        condlist = [
            df_['price_in_dollars'] > 100,
            df_['price_in_dollars'].between(50, 100, inclusive = "both"),
            df_['price_in_dollars'] < 50
        ],
        choicelist = ['Expensive', 'Medium Price', 'Cheap'],
        default = 'Unknown'
    ))
)
product_num product_name price_in_dollars product_label
0 SPR-004456 Skateboard 121 Expensive
1 SPR-003356 Basketball 34 Cheap
2 SPR-004499 Tennis balls 12 Cheap
3 FTW-098756 Tennis shoes 79 Medium Price
Notice that the conditions and the outputs are a bit separated … we provide a list of conditions (condlist) and a list of outputs (choicelist) … they are matched up only by position in the lists (i.e. the first condition is given the first output), so beware of edits letting the lists slip out of alignment.
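One way to guard against that (a sketch of my own, not a pandas or numpy feature) is to write each condition next to its output as a pair, then unzip the pairs just before calling np.select:
# keep each condition and its label together, then split into the two lists
rules = [
    (df['price_in_dollars'] > 100, 'Expensive'),
    (df['price_in_dollars'].between(50, 100, inclusive = "both"), 'Medium Price'),
    (df['price_in_dollars'] < 50, 'Cheap'),
]
condlist, choicelist = zip(*rules)
df.assign(product_label = np.select(condlist = list(condlist),
                                    choicelist = list(choicelist),
                                    default = 'Unknown'))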
Extract/Separate
In pandas, you can extract content out of cells within a column using string functions. First we use str on a column to get access to the string functions. These work “for each row” without needing .apply.
df['product_num'].str.split('-')
0 [SPR, 004456]
1 [SPR, 003356]
2 [SPR, 004499]
3 [FTW, 098756]
Name: product_num, dtype: object
So we get a little two-piece list back. In Python we can refer to parts of lists using numbers, starting at 0, as the quick sketch below shows. Then we can store the first part in one new column, and the second part in a second column.
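For example, with a plain Python string (a quick illustration, not from the dataset):
parts = "SPR-004456".split('-')
parts[0]   # 'SPR'
parts[1]   # '004456'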
(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
)
product_num product_name price_in_dollars product_code product_num_no_code
0 SPR-004456 Skateboard 121 SPR 004456
1 SPR-003356 Basketball 34 SPR 003356
2 SPR-004499 Tennis balls 12 SPR 004499
3 FTW-098756 Tennis shoes 79 FTW 098756
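For more involved patterns, pandas also offers str.extract, which pulls regex capture groups out as columns; a sketch beyond what these notes use elsewhere:
# each parenthesised group becomes its own column
df['product_num'].str.extract(r'([A-Z]+)-(\d+)')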
Recode
Pandas can accomplish recoding values either across the whole dataframe, or restricted to just one column.
(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
    # dictionary to replace across entire table
    .replace({"SPR": "Sporting goods",
              "FTW": "Footwear",
              "ACW": "Active Wear"})
)
product_num product_name ... product_code product_num_no_code
0 SPR-004456 Skateboard ... Sporting goods 004456
1 SPR-003356 Basketball ... Sporting goods 003356
2 SPR-004499 Tennis balls ... Sporting goods 004499
3 FTW-098756 Tennis shoes ... Footwear 098756
[4 rows x 5 columns]
If you want to restrict replacements to a particular column, you can provide a nested dictionary. Here 'product_code' is a key of the outer dictionary, and the value is the set of old and new values.
(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
    # nested dictionary to replace only in the product_code column
    .replace({'product_code': {
        "SPR": "Sporting goods",
        "FTW": "Footwear",
        "ACW": "Active Wear"
    }})
)
product_num product_name ... product_code product_num_no_code
0 SPR-004456 Skateboard ... Sporting goods 004456
1 SPR-003356 Basketball ... Sporting goods 003356
2 SPR-004499 Tennis balls ... Sporting goods 004499
3 FTW-098756 Tennis shoes ... Footwear 098756
[4 rows x 5 columns]
Pandas can also recode using a join. In pandas, joins are conducted by the merge function.
lookup_values = pd.DataFrame({
    'short_categories': ['SPR', 'FTW', 'ACW'],
    'long_categories': ["Sporting goods", "Footwear", "Active Wear"]
})

(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
    # join to the lookup table to recode product_code
    .merge(lookup_values,
           left_on = 'product_code',
           right_on = 'short_categories',
           how = 'left')
)
product_num product_name ... short_categories long_categories
0 SPR-004456 Skateboard ... SPR Sporting goods
1 SPR-003356 Basketball ... SPR Sporting goods
2 SPR-004499 Tennis balls ... SPR Sporting goods
3 FTW-098756 Tennis shoes ... FTW Footwear
[4 rows x 7 columns]
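After a merge like this, both key columns stick around. If you only wanted the long labels, you could drop the redundant lookup key at the end of the pipeline (a sketch continuing the example above):
(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0])
    .merge(lookup_values,
           left_on = 'product_code',
           right_on = 'short_categories',
           how = 'left')
    # the lookup key duplicates product_code, so drop it
    .drop(columns = ['short_categories'])
)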
Remember that code executes sort of “inside out”. This means that the computer first looks at function calls in the middle, resolving them to a value, before executing the code outside them.
The principle of replacement in programming means that we can think through code by replacing a function call with the value it returns.
Honestly, it’s hard to put into words, but let’s see an example.
In this code we read a csv into a variable lookup_values_csv, then we use that variable name in the pipeline.
lookup_values_csv = pd.read_csv('../datasets/sporting_goods_lookup.csv')

(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
    # join to the lookup table read from the csv file
    .merge(lookup_values_csv,
           left_on = 'product_code',
           right_on = 'short_categories',
           how = 'left')
)
product_num product_name ... short_categories long_categories
0 SPR-004456 Skateboard ... SPR Sporting goods
1 SPR-003356 Basketball ... SPR Sporting goods
2 SPR-004499 Tennis balls ... SPR Sporting goods
3 FTW-098756 Tennis shoes ... FTW Footwear
[4 rows x 8 columns]
But we can avoid creating the local variable lookup_values_csv by shifting the read_csv call directly into the call to merge.
(df
    .assign(product_code = lambda df_: df_['product_num'].str.split('-').str[0],
            product_num_no_code = lambda df_: df_['product_num'].str.split('-').str[1])
    # read the lookup csv directly inside the merge call
    .merge(pd.read_csv('../datasets/sporting_goods_lookup.csv'),
           left_on = 'product_code',
           right_on = 'short_categories',
           how = 'left')
)
product_num product_name ... short_categories long_categories
0 SPR-004456 Skateboard ... SPR Sporting goods
1 SPR-003356 Basketball ... SPR Sporting goods
2 SPR-004499 Tennis balls ... SPR Sporting goods
3 FTW-098756 Tennis shoes ... FTW Footwear
[4 rows x 8 columns]
We can “replace” the variable name with anything that returns the value of the variable name. This has the advantage of reducing little-used variables … but if the operation is time-consuming, then storing the result in a variable can result in speed-ups.
Unnest
When we have little lists inside cells we can get separate rows for each piece using explode.
= """course_name,students_enrolled
csv_string "Divination","Harry, Ron"
"Herbology","Hermione, Harry"
"Flying Lessons","Harry, Hermione, Ron"
"""
# pd means pandas, which is a shortcut name we set up when we imported pandas.
= pd.read_csv(StringIO(csv_string))
df
df
course_name students_enrolled
0 Divination Harry, Ron
1 Herbology Hermione, Harry
2 Flying Lessons Harry, Hermione, Ron
First we use str.split to cut up the string, leaving a list of the parts inside the cell, rather than the single long string.
(df
    .assign(student_list = lambda df_: df_['students_enrolled'].str.split(', '))
)
course_name students_enrolled student_list
0 Divination Harry, Ron [Harry, Ron]
1 Herbology Hermione, Harry [Hermione, Harry]
2 Flying Lessons Harry, Hermione, Ron [Harry, Hermione, Ron]
Then explode creates a new row for each of the items in the list.
(df
    .assign(student_list = lambda df_: df_['students_enrolled'].str.split(', '))
    .drop(columns = ['students_enrolled'])
    .explode('student_list')
)
course_name student_list
0 Divination Harry
0 Divination Ron
1 Herbology Hermione
1 Herbology Harry
2 Flying Lessons Harry
2 Flying Lessons Hermione
2 Flying Lessons Ron
Reset index
In pandas each row has a sort of name, separate from the data itself. You can see that above, where the row numbers after explode are doubled.
We can adjust that using reset_index. By default reset_index will copy the old index into the table as a column called index, so we use drop = True to avoid that.
(df
    .assign(student = lambda df_: df_['students_enrolled'].str.split(', '))
    .drop(columns = ['students_enrolled'])
    .explode('student')
    .reset_index(drop = True)
)
course_name student
0 Divination Harry
1 Divination Ron
2 Herbology Hermione
3 Herbology Harry
4 Flying Lessons Harry
5 Flying Lessons Hermione
6 Flying Lessons Ron
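To see why drop = True matters, here is the same pipeline without it (a quick sketch; the original notes don’t show this output). The old index comes back as a new column called index.
(df
    .assign(student = lambda df_: df_['students_enrolled'].str.split(', '))
    .drop(columns = ['students_enrolled'])
    .explode('student')
    .reset_index()   # keeps the old index as a column called "index"
)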
Wider / Longer
= pd.read_csv("../datasets/world_pop_wide.csv")
df_pop
df_pop
country 2010 2011 2012 2013
0 Australia 18767085 19056040 19414834 19775013
1 Indonesia 383721793 393333604 403171286 413200994
2 United States 249849720 252208133 254614421 256953576
To make a wide format dataframe into a long format dataframe in pandas, we will use a function called melt. As with pivot_longer in tidyverse, we tell melt the id_vars, which are the columns that we don’t want to reshape.
(df_pop
    .melt(id_vars = ["country"], var_name = "year", value_name = "population")
)
country year population
0 Australia 2010 18767085
1 Indonesia 2010 383721793
2 United States 2010 249849720
3 Australia 2011 19056040
4 Indonesia 2011 393333604
5 United States 2011 252208133
6 Australia 2012 19414834
7 Indonesia 2012 403171286
8 United States 2012 254614421
9 Australia 2013 19775013
10 Indonesia 2013 413200994
11 United States 2013 256953576
The reverse move (long to wide) in pandas can be done with pivot.
# first store the long version.
df_pop_long = (df_pop
    .melt(id_vars = ["country"], var_name = "year", value_name = "population")
)

(df_pop_long
    .pivot(index = "country", columns = "year", values = "population")
)
year 2010 2011 2012 2013
country
Australia 18767085 19056040 19414834 19775013
Indonesia 383721793 393333604 403171286 413200994
United States 249849720 252208133 254614421 256953576
Notice, though, that pivot moves the “country” column out of the table into an index. So we will use reset_index to bring that column back into the table as a regular column.
(df_pop_long
    .pivot(index = "country", columns = "year", values = "population")
    .reset_index()
)
year country 2010 2011 2012 2013
0 Australia 18767085 19056040 19414834 19775013
1 Indonesia 383721793 393333604 403171286 413200994
2 United States 249849720 252208133 254614421 256953576
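One leftover: the columns still carry the name “year” (visible at the top left of the output above). If that bothers you, rename_axis can clear it; a sketch beyond the original notes:
(df_pop_long
    .pivot(index = "country", columns = "year", values = "population")
    .reset_index()
    .rename_axis(columns = None)   # remove the lingering "year" label on the columns
)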
Group/Aggregate
Pandas provides the groupby and agg methods, which we can use with a set of aggregation functions, such as sum and mean.
books_with_weights = pd.read_csv('../datasets/books_with_weights.csv')
books_with_weights
title series format weight_in_grams
0 Harry Potter 1 Harry Potter paperback 1456
1 Two Towers Lord of the Rings paperback 1800
2 Harry Potter 3 Harry Potter hardcover 2604
3 Harry Potter 2 Harry Potter hardcover 2900
4 Return of the King Lord of the Rings hardcover 1784
For reasons that will be clearer below, we will look at aggregations with grouping first, then look at aggregations without grouping.
We provide a groupby first, then specify the aggregation.
(books_with_weights
    .groupby('series')
    .agg(weight_for_group = ('weight_in_grams', 'sum'))
)
weight_for_group
series
Harry Potter 6960
Lord of the Rings 3584
The aggregation here is specified as new_column_name = (column_to_aggregate, function). We can do more than one within the agg call.
(books_with_weights
    .groupby('series')
    .agg(total_weight_for_group = ('weight_in_grams', 'sum'),
         mean_weight_for_group = ('weight_in_grams', 'mean'))
)
total_weight_for_group mean_weight_for_group
series
Harry Potter 6960 2320.0
Lord of the Rings 3584 1792.0
Note that the columns that we group by become indexes, and if we group by more than one column we get what is known as a multi-index, which is two levels of indexes.
(books_with_weights
    .groupby(['series', 'format'])
    .agg(weight_for_group = ('weight_in_grams', 'sum'))
)
weight_for_group
series format
Harry Potter hardcover 5504
paperback 1456
Lord of the Rings hardcover 1784
paperback 1800
These can be useful if you have further operations that use these groups (other pandas methods can use these indexes), but they can be confusing if you are thinking strictly in terms of DataFrames. Whether you have a single index or a multi-index, you can bring those back as familiar columns using reset_index.
(books_with_weights
    .groupby(['series', 'format'])
    .agg(weight_for_group = ('weight_in_grams', 'sum'))
    .reset_index()
)
series format weight_for_group
0 Harry Potter hardcover 5504
1 Harry Potter paperback 1456
2 Lord of the Rings hardcover 1784
3 Lord of the Rings paperback 1800
Aggregation without groupby
We can use agg without groupby to get the total weight of all books without grouping.
(books_with_weights
    .agg(total_weight = ('weight_in_grams', 'sum'),
         mean_weight = ('weight_in_grams', 'mean'))
)
weight_in_grams
total_weight 10544.0
mean_weight 2108.8
The output is a little different from the output above (and from what we might expect from SQL or tidyverse), which would be:
total_weight,mean_weight
10544,2108
Rather than a single row with two columns, we get a single column with two rows. There are probably good reasons for this, but it is inconsistent with how agg works with groupby, so it can be confusing. We can switch this into a more familiar format using transpose (which switches the rows and columns).
(books_with_weights
    .agg(total_weight = ('weight_in_grams', 'sum'),
         mean_weight = ('weight_in_grams', 'mean'))
    .transpose()
    # .reset_index(drop=True)
)
total_weight mean_weight
weight_in_grams 10544.0 2108.8
Above I’ve shown the named column approach for agg. It is also very common in pandas to call functions like sum directly. Done like that, sum works on all columns, so we have to reduce the columns sent:
(books_with_weights
    .filter(items = ['weight_in_grams'])
    .sum()
)
weight_in_grams 10544
dtype: int64
When sum is used like this it does not return a DataFrame. Instead it returns a Series. This is a single column outside a DataFrame (i.e. it isn’t a DataFrame with 1 column, it is just a list of items).
So what we have here is an unnamed Series with a single item (10544). The weight_in_grams part is an index for that item.
The tip-off that this change in type has happened is that the items display as labeled rows, rather than as columns (and have the cryptic dtype at the bottom).
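If you are ever unsure which you have, checking the type makes it obvious, and to_frame turns a Series back into a one-column DataFrame (a quick sketch):
result = books_with_weights.filter(items = ['weight_in_grams']).sum()
type(result)        # <class 'pandas.core.series.Series'>
result.to_frame()   # back to a DataFrame with one column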
If sum is called with multiple columns, each of them ends up as an item in this unnamed Series. And text columns can also be sent to sum, where it just concatenates the strings.
(books_with_weights
    .sum()
)
title Harry Potter 1Two TowersHarry Potter 3Harry Po...
series Harry PotterLord of the RingsHarry PotterHarry...
format paperbackpaperbackhardcoverhardcoverhardcover
weight_in_grams 10544
dtype: object
agg with named columns (as above) avoids these issues, and keeps everything as DataFrames, albeit with the quirk that agg without groupby produces a transposed dataframe.
Using agg to create a list (opposite of unnest)
As in our other frameworks, the opposite of unnest is implemented with a groupby and an aggregation.
If we start with our list of students and courses, where the grain is “a row represents a single student enrolled in a single course”.
course_name student
0 Divination Harry
0 Divination Ron
1 Herbology Hermione
1 Herbology Harry
2 Flying Lessons Harry
2 Flying Lessons Hermione
2 Flying Lessons Ron
And we would like to produce a list of courses taken by each student (here the table above is stored in a dataframe called course_student):
(course_student
    .groupby('student')
    .agg(course_list = ('course_name', list))
    .assign(courses_for_student = lambda df_: df_['course_list'].str.join(', '))
)
course_list courses_for_student
student
Harry [Divination, Herbology, Flying Lessons] Divination, Herbology, Flying Lessons
Hermione [Herbology, Flying Lessons] Herbology, Flying Lessons
Ron [Divination, Flying Lessons] Divination, Flying Lessons
Python pandas quick reference
transform | primary_command | another_option
---|---|---
Exclude rows | query |
Exclude columns | drop | filter
New columns | assign (with lambda) |
Combining columns | assign + f-strings with apply |
Treating cases differently | np.select inside assign |
Recode/lookup | replace | merge
Un-nest | explode |
Nest | groupby/agg/list |
Aggregate | agg |
Group Aggregate | groupby + agg |
Longer | melt |
Wider | pivot |