Excluding data

Excluding rows

A key thing that we do with data is to exclude some entries based on some set of conditions. This enables us to answer questions like these:

“Show me only the green things”

“How many books were published before the year 1640?”

“Do we have any customers who are older than 60 and live in California?”

After we exclude the entries that don’t meet the condition, the remainder are just those that do meet the condition.

When our data is organized into rows and columns, we can understand this as asking YES/NO questions about each row.

Exercise: post-it tables and shifting template.

We can also multiple questions about each row. For example,

“Do we have any customers who are older than 60 and live in California?”

How many questions would we need to ask of each row of data in order to implement this?

customer_name customer_age customer_state
Jihong Li 34 California
Sarah Brown 66 Idaho
Rupert Brown 89 California
Ravi Shastri 62 Maharashtra

It is natural to think that we are asking two questions of each row:

  1. Is the customer_age value on this row over 60?
  2. Is the customer_state value on this row equal to California?

Yet there is a third question hiding here.

  1. Is the customer_age value on this row over 60?
  2. Is the customer_state value on this row equal to California?
  3. Are the answers to q1 and q2 on this row both YES?

Thus when we ask these questions of our data, we enter the world of boolean logic. Boolean logic enables us to ask questions through a combination of AND and OR conditions. To resolve these questions we have to solve in a sequence. For consistency here I’m going to change from YES/NO to TRUE/FALSE.

Looking just at the row for Jihong Li

  1. Is the customer_age value on this row over 60? –> FALSE
  2. Is the customer_state value on this row equal to California? –> TRUE
  3. Are the answers to q1 and q2 both TRUE?

It is best to think of this like this:

customer_age over 60? AND customer_state equal to "California"?
     FALSE          AND      TRUE

That second row is a boolean question, involving only TRUE, FALSE together with operators like AND. Since AND requires TRUE on both sides, the single FALSE means that the boolean resolves to FALSE

     FALSE          AND      TRUE
                FALSE

Exercise: Resolve the boolean table (using multiple lines as above) for each of the other three rows.

As we add more conditions to check and introduce more operators (like OR) these tables get longer.

manufacturer year transmission city_mpg hwy_mpg
audi 1999 auto(l5) 18 29
audi 1999 manual(m5) 21 29
audi 2008 manual(m6) 20 31
audi 2008 auto(av) 21 30
ford 1999 auto(l4) 11 17
ford 1999 auto(l4) 11 17
ford 2008 auto(l6) 12 18
ford 2008 auto(l5) 13 19

If we ask this question:

Get the fords manufactured in 2008 with highway mileage over 17

  1. Is the manufacturer equal to ford?
  2. Is the year equal to 2008?
  3. Is the highway mileage over 17?

Then we end up (using the first and then last rows as examples), with this three line boolean structure:

audi          1999           29
FALSE   AND   FALSE    AND  TRUE
    
       FALSE           AND  TRUE
    
                      FALSE

But for the last row:

ford          2008          19
TRUE   AND   TRUE     AND   TRUE
    
       TRUE           AND   TRUE
    
                      TRUE

When we introduce OR things work differently. OR needs only one side to have a TRUE.

TRUE OR FALSE
    TRUE

or a more complex one:

TRUE OR FALSE OR TRUE
    TRUE OR TRUE
        TRUE
Caution

When we start mixing AND and OR we have to start worrying about the order (like the order of operations in basic arithmatic); we will return to that later in the course.

Excluding columns

Excluding columns can be a bit simpler. We can think of this as “hiding” columns we don’t want to see, or we can think of this as keeping columns we do want to see.

“Show me just the manufacturers and the year”

manufacturer year
audi 1999
audi 1999
audi 2008
audi 2008
ford 1999
ford 1999
ford 2008
ford 2008

“Pass through all the columns other than city_mpg”

or

“Drop the city_mpg column”

manufacturer year transmission hwy_mpg
audi 1999 auto(l5) 29
audi 1999 manual(m5) 29
audi 2008 manual(m6) 31
audi 2008 auto(av) 30
ford 1999 auto(l4) 17
ford 1999 auto(l4) 17
ford 2008 auto(l6) 18
ford 2008 auto(l5) 19

We can, though, exclude columns based on more complex logic, including things like

Pass on only the columns whose name starts with mpg_

or using a boolean combination of conditions

Show me the manufacturer column and any column whose name is less than four characters long

Or we could ask about the sort of data in the column

Show me only the columns whose values are numbers or dates

Note

In this course we will learn how to do this in all four of our frameworks. In the course I’ll call these “data transforms” because it has changed the data, even if only by excluding some of it. I’ll also refer to these as “verbs” since it is something we are doing to the data. Informally I might call this a data “move” akin to a martial arts or dance “move”.