Changing Data

Another key data transform is to change the data.

Changing or adding Columns

When we operate on data frames we can make changes to the data within a column. This means that we are going to make a change to each and every item in that column, piece by piece. We can either make a whole new column to hold the results, or we can replace the column.

For example, let’s say that we want to take this table and create a nicely formatted version of the number, adding a $ and decimal places.

product_name	price_in_dollars
Skateboard	121
Basketball	34
Tennis balls	12
Tennis shoes	79

We would then be doing the same thing with each of the prices: taking what’s in the column for that row, and putting that value between a $ and the .00. So for the first row that would give us the output $121.00. So then the output column would hold those values.

product_name	price_in_dollars	formatted_price
Skateboard	121	$121.00
Basketball	34	$34.00
Tennis balls	12	$12.00
Tennis shoes	79	$79.00

Then, if we wanted to, we could drop the price_in_dollars input column, just as we excluded regular columns.

Tip

The dataframe frameworks we will be using do not require us to write any code to handle “do this for each value”. In regular programming this is called “iteration” and involves things like loops, foreach, and so on. In each of our data wrangling frameworks we will have a straight forward way to say “do this for every item in a column”.

Extracting data inside cells

Another common situation is when each cell contains data that should properly be separated into different columns. For example, product numbers sometimes contain special codes at the start of numbers. We want those in a separate column.

product_num	product_name	price_in_dollars
SPR-004456	Skateboard	121
SPR-003356	Basketball	34
SPR-004499	Tennis balls	12
FTW-098756	Tennis shoes	79

We need to look at examples of the data in the cells in that column. Here we are in luck, the data consistently has a dash character - between the two bits we want. So the pattern in each cell is <product_code>-<product_num_no_code>.

product_code	product_num_no_code	product_name	price_in_dollars
SPR	004456	Skateboard	121
SPR	003356	Basketball	34
SPR	004499	Tennis balls	12
FTW	098756	Tennis shoes	79

We will learn to do these extractions during this course, including ones where the pattern is more complicated than just consistent delimeters, such as “regular expressions” which are template strings with special characters (e.g., \d+ matches “any number of digits).

Combining columns

Just as we can separate columns, we can also join them together, making 2 (or more) columns into a single column. As we do that we can also add in additiona text. For example, we could create marketing labels be stitching together the product_name and price_in_dollars.

product_num	product_name	price_in_dollars	product_label
SPR-004456	Skateboard	121	Skateboard just $121.00. Buy today!
SPR-003356	Basketball	34	Basketball just $34.00. Buy today!
SPR-004499	Tennis balls	12	Tennis balls just $12.00. Buy today!
FTW-098756	Tennis shoes	79	Tennis shoes just $79.00. Buy today!

Treating different cases differently

As we apply code to each cell in a column we sometimes want to do a different thing for some rows/content. For example, we might want a different product label for more expensive items: our current label is fine for items that cost less than $100, but if the price is over $100 we want a different label.

Here we ask the framework to look at the value in the price_in_dollars column, and ask if it is above $100. We can then supply a different value for the product_label column.

product_num	product_name	price_in_dollars	product_label
SPR-004456	Skateboard	121	Treat yourself. Invest $121 for a Skateboard today!
SPR-003356	Basketball	34	Basketball just $34.00. Buy today!
SPR-004499	Tennis balls	12	Tennis balls just $12.00. Buy today!
FTW-098756	Tennis shoes	79	Tennis shoes just $79.00. Buy today!

Recoding data (aka lookup)

Sometimes we have lots of different cases.For example, our little product_code that we extracted (e.g., SPR and FTW) are not very readable for humans. They would not be good labels for figures.

We can “recode” these values by refering to another data frame that has one row for each of the codes. This is sometimes called a “lookup”.

product_code	human_readable_code
SPR	Sporting goods
FTW	Footwear
ACW	Active Wear

We can recode the label by picking the one that we want based on the value of the product_code column. We can think of this as a case just like when we created the product_label but what we want for each case is contained in another table.

Note that we have three rows with “SPR” so those rows end up with “Sporting Goods”, but only one row with “FTW”. We have no rows with “ACW” so none of our rows end up with “Active Wear”.

product_code	human_readable_code	product_name	price_in_dollars
SPR	Sporting goods	Skateboard	121
SPR	Sporting goods	Basketball	34
SPR	Sporting goods	Tennis balls	12
FTW	Footwear	Tennis shoes	79