Changing Data
Another key data transform is to change the data.
Changing or adding Columns
When we operate on data frames we can make changes to the data within a column. This means that we are going to make a change to each and every item in that column, piece by piece. We can either make a whole new column to hold the results, or we can replace the column.
For example, let’s say that we want to take this table and create a nicely formatted version of the number, adding a $
and decimal places.
product_name | price_in_dollars |
---|---|
Skateboard | 121 |
Basketball | 34 |
Tennis balls | 12 |
Tennis shoes | 79 |
We would then be doing the same thing with each of the prices: taking what’s in the column for that row, and putting that value between a $
and the .00
. So for the first row that would give us the output $121.00
. So then the output column would hold those values.
product_name | price_in_dollars | formatted_price |
---|---|---|
Skateboard | 121 | $121.00 |
Basketball | 34 | $34.00 |
Tennis balls | 12 | $12.00 |
Tennis shoes | 79 | $79.00 |
Then, if we wanted to, we could drop the price_in_dollars
input column, just as we excluded regular columns.
The dataframe frameworks we will be using do not require us to write any code to handle “do this for each value”. In regular programming this is called “iteration” and involves things like loops, foreach, and so on. In each of our data wrangling frameworks we will have a straight forward way to say “do this for every item in a column”.
Extracting data inside cells
Another common situation is when each cell contains data that should properly be separated into different columns. For example, product numbers sometimes contain special codes at the start of numbers. We want those in a separate column.
product_num | product_name | price_in_dollars |
---|---|---|
SPR-004456 | Skateboard | 121 |
SPR-003356 | Basketball | 34 |
SPR-004499 | Tennis balls | 12 |
FTW-098756 | Tennis shoes | 79 |
We need to look at examples of the data in the cells in that column. Here we are in luck, the data consistently has a dash character -
between the two bits we want. So the pattern in each cell is <product_code>-<product_num_no_code>
.
product_code | product_num_no_code | product_name | price_in_dollars |
---|---|---|---|
SPR | 004456 | Skateboard | 121 |
SPR | 003356 | Basketball | 34 |
SPR | 004499 | Tennis balls | 12 |
FTW | 098756 | Tennis shoes | 79 |
We will learn to do these extractions during this course, including ones where the pattern is more complicated than just consistent delimeters, such as “regular expressions” which are template strings with special characters (e.g., \d+
matches “any number of digits).
Combining columns
Just as we can separate columns, we can also join them together, making 2 (or more) columns into a single column. As we do that we can also add in additiona text. For example, we could create marketing labels be stitching together the product_name
and price_in_dollars
.
product_num | product_name | price_in_dollars | product_label |
---|---|---|---|
SPR-004456 | Skateboard | 121 | Skateboard just $121.00. Buy today! |
SPR-003356 | Basketball | 34 | Basketball just $34.00. Buy today! |
SPR-004499 | Tennis balls | 12 | Tennis balls just $12.00. Buy today! |
FTW-098756 | Tennis shoes | 79 | Tennis shoes just $79.00. Buy today! |
Treating different cases differently
As we apply code to each cell in a column we sometimes want to do a different thing for some rows/content. For example, we might want a different product label for more expensive items: our current label is fine for items that cost less than $100, but if the price is over $100 we want a different label.
Here we ask the framework to look at the value in the price_in_dollars
column, and ask if it is above $100. We can then supply a different value for the product_label
column.
product_num | product_name | price_in_dollars | product_label |
---|---|---|---|
SPR-004456 | Skateboard | 121 | Treat yourself. Invest $121 for a Skateboard today! |
SPR-003356 | Basketball | 34 | Basketball just $34.00. Buy today! |
SPR-004499 | Tennis balls | 12 | Tennis balls just $12.00. Buy today! |
FTW-098756 | Tennis shoes | 79 | Tennis shoes just $79.00. Buy today! |
Recoding data (aka lookup)
Sometimes we have lots of different cases.For example, our little product_code
that we extracted (e.g., SPR
and FTW
) are not very readable for humans. They would not be good labels for figures.
We can “recode” these values by refering to another data frame that has one row for each of the codes. This is sometimes called a “lookup”.
product_code | human_readable_code |
---|---|
SPR | Sporting goods |
FTW | Footwear |
ACW | Active Wear |
We can recode the label by picking the one that we want based on the value of the product_code
column. We can think of this as a case just like when we created the product_label
but what we want for each case is contained in another table.
Note that we have three rows with “SPR” so those rows end up with “Sporting Goods”, but only one row with “FTW”. We have no rows with “ACW” so none of our rows end up with “Active Wear”.
product_code | human_readable_code | product_name | price_in_dollars |
---|---|---|---|
SPR | Sporting goods | Skateboard | 121 |
SPR | Sporting goods | Basketball | 34 |
SPR | Sporting goods | Tennis balls | 12 |
FTW | Footwear | Tennis shoes | 79 |