Changing Data

Another key data transform is to change the data.

Changing or adding Columns

When we operate on data frames we can make changes to the data within a column. This means that we are going to make a change to each and every item in that column, piece by piece. We can either make a whole new column to hold the results, or we can replace the column.

For example, let’s say that we want to take this table and create a nicely formatted version of the number, adding a $ and decimal places.

product_name price_in_dollars
Skateboard 121
Basketball 34
Tennis balls 12
Tennis shoes 79

We would then be doing the same thing with each of the prices: taking what’s in the column for that row, and putting that value between a $ and the .00. So for the first row that would give us the output $121.00. So then the output column would hold those values.

product_name price_in_dollars formatted_price
Skateboard 121 $121.00
Basketball 34 $34.00
Tennis balls 12 $12.00
Tennis shoes 79 $79.00

Then, if we wanted to, we could drop the price_in_dollars input column, just as we excluded regular columns.

Tip

The dataframe frameworks we will be using do not require us to write any code to handle “do this for each value”. In regular programming this is called “iteration” and involves things like loops, foreach, and so on. In each of our data wrangling frameworks we will have a straight forward way to say “do this for every item in a column”.

Extracting data inside cells

Another common situation is when each cell contains data that should properly be separated into different columns. For example, product numbers sometimes contain special codes at the start of numbers. We want those in a separate column.

product_num product_name price_in_dollars
SPR-004456 Skateboard 121
SPR-003356 Basketball 34
SPR-004499 Tennis balls 12
FTW-098756 Tennis shoes 79

We need to look at examples of the data in the cells in that column. Here we are in luck, the data consistently has a dash character - between the two bits we want. So the pattern in each cell is <product_code>-<product_num_no_code>.

product_code product_num_no_code product_name price_in_dollars
SPR 004456 Skateboard 121
SPR 003356 Basketball 34
SPR 004499 Tennis balls 12
FTW 098756 Tennis shoes 79

We will learn to do these extractions during this course, including ones where the pattern is more complicated than just consistent delimeters, such as “regular expressions” which are template strings with special characters (e.g., \d+ matches “any number of digits).

Combining columns

Just as we can separate columns, we can also join them together, making 2 (or more) columns into a single column. As we do that we can also add in additiona text. For example, we could create marketing labels be stitching together the product_name and price_in_dollars.

product_num product_name price_in_dollars product_label
SPR-004456 Skateboard 121 Skateboard just $121.00. Buy today!
SPR-003356 Basketball 34 Basketball just $34.00. Buy today!
SPR-004499 Tennis balls 12 Tennis balls just $12.00. Buy today!
FTW-098756 Tennis shoes 79 Tennis shoes just $79.00. Buy today!

Treating different cases differently

As we apply code to each cell in a column we sometimes want to do a different thing for some rows/content. For example, we might want a different product label for more expensive items: our current label is fine for items that cost less than $100, but if the price is over $100 we want a different label.

Here we ask the framework to look at the value in the price_in_dollars column, and ask if it is above $100. We can then supply a different value for the product_label column.

product_num product_name price_in_dollars product_label
SPR-004456 Skateboard 121 Treat yourself. Invest $121 for a Skateboard today!
SPR-003356 Basketball 34 Basketball just $34.00. Buy today!
SPR-004499 Tennis balls 12 Tennis balls just $12.00. Buy today!
FTW-098756 Tennis shoes 79 Tennis shoes just $79.00. Buy today!

Recoding data (aka lookup)

Sometimes we have lots of different cases.For example, our little product_code that we extracted (e.g., SPR and FTW) are not very readable for humans. They would not be good labels for figures.

We can “recode” these values by refering to another data frame that has one row for each of the codes. This is sometimes called a “lookup”.

product_code human_readable_code
SPR Sporting goods
FTW Footwear
ACW Active Wear

We can recode the label by picking the one that we want based on the value of the product_code column. We can think of this as a case just like when we created the product_label but what we want for each case is contained in another table.

Note that we have three rows with “SPR” so those rows end up with “Sporting Goods”, but only one row with “FTW”. We have no rows with “ACW” so none of our rows end up with “Active Wear”.

product_code human_readable_code product_name price_in_dollars
SPR Sporting goods Skateboard 121
SPR Sporting goods Basketball 34
SPR Sporting goods Tennis balls 12
FTW Footwear Tennis shoes 79