Excluding data
Excluding rows
A key thing that we do with data is to exclude some entries based on some set of conditions. This enables us to answer questions like these:
“Show me only the green things”
“How many books were published before the year 1640?”
“Do we have any customers who are older than 60 and live in California?”
After we exclude the entries that don’t meet the condition, the remainder are just those that do meet the condition.
When our data is organized into rows and columns, we can understand this as asking YES/NO questions about each row.
Exercise: post-it tables and shifting template.
We can also multiple questions about each row. For example,
“Do we have any customers who are older than 60 and live in California?”
How many questions would we need to ask of each row of data in order to implement this?
customer_name | customer_age | customer_state |
---|---|---|
Jihong Li | 34 | California |
Sarah Brown | 66 | Idaho |
Rupert Brown | 89 | California |
Ravi Shastri | 62 | Maharashtra |
It is natural to think that we are asking two questions of each row:
- Is the
customer_age
value on this row over 60? - Is the
customer_state
value on this row equal to California?
Yet there is a third question hiding here.
- Is the
customer_age
value on this row over 60? - Is the
customer_state
value on this row equal to California? - Are the answers to q1 and q2 on this row both YES?
Thus when we ask these questions of our data, we enter the world of boolean logic. Boolean logic enables us to ask questions through a combination of AND
and OR
conditions. To resolve these questions we have to solve in a sequence. For consistency here I’m going to change from YES/NO
to TRUE/FALSE
.
Looking just at the row for Jihong Li
- Is the
customer_age
value on this row over 60? –> FALSE - Is the
customer_state
value on this row equal to California? –> TRUE - Are the answers to q1 and q2 both TRUE?
It is best to think of this like this:
over 60? AND customer_state equal to "California"?
customer_age FALSE AND TRUE
That second row is a boolean question, involving only TRUE
, FALSE
together with operators like AND
. Since AND
requires TRUE
on both sides, the single FALSE
means that the boolean resolves to FALSE
FALSE AND TRUE
FALSE
Exercise: Resolve the boolean table (using multiple lines as above) for each of the other three rows.
As we add more conditions to check and introduce more operators (like OR
) these tables get longer.
manufacturer | year | transmission | city_mpg | hwy_mpg |
---|---|---|---|---|
audi | 1999 | auto(l5) | 18 | 29 |
audi | 1999 | manual(m5) | 21 | 29 |
audi | 2008 | manual(m6) | 20 | 31 |
audi | 2008 | auto(av) | 21 | 30 |
ford | 1999 | auto(l4) | 11 | 17 |
ford | 1999 | auto(l4) | 11 | 17 |
ford | 2008 | auto(l6) | 12 | 18 |
ford | 2008 | auto(l5) | 13 | 19 |
If we ask this question:
Get the fords manufactured in 2008 with highway mileage over 17
- Is the manufacturer equal to
ford
? - Is the year equal to 2008?
- Is the highway mileage over 17?
Then we end up (using the first and then last rows as examples), with this three line boolean structure:
1999 29
audi FALSE AND FALSE AND TRUE
FALSE AND TRUE
FALSE
But for the last row:
2008 19
ford TRUE AND TRUE AND TRUE
TRUE AND TRUE
TRUE
When we introduce OR
things work differently. OR
needs only one side to have a TRUE
.
TRUE OR FALSE
TRUE
or a more complex one:
TRUE OR FALSE OR TRUE
TRUE OR TRUE
TRUE
When we start mixing AND
and OR
we have to start worrying about the order (like the order of operations in basic arithmatic); we will return to that later in the course.
Excluding columns
Excluding columns can be a bit simpler. We can think of this as “hiding” columns we don’t want to see, or we can think of this as keeping columns we do want to see.
“Show me just the manufacturers and the year”
manufacturer | year |
---|---|
audi | 1999 |
audi | 1999 |
audi | 2008 |
audi | 2008 |
ford | 1999 |
ford | 1999 |
ford | 2008 |
ford | 2008 |
“Pass through all the columns other than city_mpg”
or
“Drop the city_mpg column”
manufacturer | year | transmission | hwy_mpg |
---|---|---|---|
audi | 1999 | auto(l5) | 29 |
audi | 1999 | manual(m5) | 29 |
audi | 2008 | manual(m6) | 31 |
audi | 2008 | auto(av) | 30 |
ford | 1999 | auto(l4) | 17 |
ford | 1999 | auto(l4) | 17 |
ford | 2008 | auto(l6) | 18 |
ford | 2008 | auto(l5) | 19 |
We can, though, exclude columns based on more complex logic, including things like
Pass on only the columns whose name starts with mpg_
or using a boolean combination of conditions
Show me the manufacturer column and any column whose name is less than four characters long
Or we could ask about the sort of data in the column
Show me only the columns whose values are numbers or dates
In this course we will learn how to do this in all four of our frameworks. In the course I’ll call these “data transforms” because it has changed the data, even if only by excluding some of it. I’ll also refer to these as “verbs” since it is something we are doing to the data. Informally I might call this a data “move” akin to a martial arts or dance “move”.