Data Frames

A data frame is a special arrangement of rows, columns and headers.

A data frame has one key condition:

  1. Rectangular: Each row has the same number of items, in the same order.
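As a minimal sketch of the rectangular condition, the check below only compares row lengths. It cannot tell whether the items are in the same order, since that depends on what the values mean. The function name is_rectangular and the sample rows are made up for illustration.

```python
def is_rectangular(rows):
    """Return True if every row has the same number of items."""
    lengths = {len(row) for row in rows}
    # Zero or one distinct length means no row is longer or shorter than another.
    return len(lengths) <= 1


# Hypothetical example data, not taken from any real dataset.
good = [["audi", "a4", 1999], ["ford", "mustang", 2008]]
ragged = [["audi", "a4", 1999], ["ford", "mustang"]]

print(is_rectangular(good))
print(is_rectangular(ragged))
```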

A spreadsheet has rows and columns, but it isn’t always a data frame.

Exercise: Talk with your neighbor about three ways that a spreadsheet might not be a data frame.

Because each row has the same number of items in the same order, we can treat each column as a unit and refer to it by name, even if that name is just its position:

The fourth column

or, if we have textual column names

The “manufacturer” column
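The two ways of naming a column can be sketched with plain Python lists. The header, the rows, and the helper names column_by_position and column_by_name are all hypothetical. A real framework such as Pandas has its own syntax for the same idea.

```python
# Hypothetical header and rows, used only to illustrate column lookup.
header = ["manufacturer", "model", "year", "cyl"]
rows = [
    ["audi", "a4", 1999, 4],
    ["ford", "mustang", 2008, 8],
]


def column_by_position(rows, position):
    """Return the column at a 1-based position (4 means 'the fourth column')."""
    return [row[position - 1] for row in rows]


def column_by_name(header, rows, name):
    """Return the column whose header matches the given name."""
    index = header.index(name)
    return [row[index] for row in rows]


fourth = column_by_position(rows, 4)
manufacturers = column_by_name(header, rows, "manufacturer")
print(fourth)
print(manufacturers)
```

Both lookups only make sense because the rows are rectangular. In a ragged row, position 4 might be missing or might hold a different kind of value.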

Each of the frameworks we use has its own name for a data frame, but all of them meet this condition.

framework        local_name
R tidyverse      tibble
SQL              table or relation
Python Pandas    DataFrame
Excel            Sheet with Filter Headers

Data Frames and file formats

A data frame is an abstract concept. We can store a data frame in lots of different file formats, but the relationship is tricky.

A csv (comma separated values) file, for instance, can store a data frame, but not every csv file is a data frame. Here is a perfectly valid (and useful) csv file:

australia,Australia,Commonwealth of Australia
south korea,Republic of Korea,South Korea,"Korea, South",KOR
usa,United States

This is valid csv, but it is not a data frame. Why not?
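One way to investigate is to parse the file and count the fields in each row. The sketch below uses Python's standard csv module on a copy of the three example rows. The variable names are illustrative only.

```python
import csv
import io

# A copy of the three example rows; the quoted field contains a comma.
csv_text = (
    "australia,Australia,Commonwealth of Australia\n"
    'south korea,Republic of Korea,South Korea,"Korea, South",KOR\n'
    "usa,United States\n"
)

rows = list(csv.reader(io.StringIO(csv_text)))
field_counts = [len(row) for row in rows]
print(field_counts)  # [3, 5, 2]

# A data frame needs every row to have the same number of fields.
is_data_frame = len(set(field_counts)) == 1
print(is_data_frame)  # False
```

The parser accepts every line, so the file is valid csv, yet the rows hold different numbers of fields.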

Conversely, even file formats that look nothing like rows and columns can express a data frame. Here is an example in json:

[
  {
    "customer_name": "Jihong Li",
    "customer_age": 34,
    "customer_state": "California"
  },
  {
    "customer_name": "Sarah Brown",
    "customer_age": 66,
    "customer_state": "Idaho"
  },
  {
    "customer_name": "Rupert Brown",
    "customer_age": 89,
    "customer_state": "California"
  },
  {
    "customer_name": "Ravi Shastri",
    "customer_age": 62,
    "customer_state": "Maharashtra"
  }
] 

This json is a data frame because every object has the same set of keys, so each object becomes a row and each key becomes a column. We can easily convert it into a rectangular table:

customer_name    customer_age    customer_state
Jihong Li        34              California
Sarah Brown      66              Idaho
Rupert Brown     89              California
Ravi Shastri     62              Maharashtra
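The conversion can be sketched with Python's standard json module. This is a minimal illustration, assuming the column order is taken from the first object. The variable names are not part of any framework.

```python
import json

json_text = """
[
  {"customer_name": "Jihong Li", "customer_age": 34, "customer_state": "California"},
  {"customer_name": "Sarah Brown", "customer_age": 66, "customer_state": "Idaho"},
  {"customer_name": "Rupert Brown", "customer_age": 89, "customer_state": "California"},
  {"customer_name": "Ravi Shastri", "customer_age": 62, "customer_state": "Maharashtra"}
]
"""

records = json.loads(json_text)

# Assumption: use the first object's keys as the column names and order.
columns = list(records[0].keys())

# The data is rectangular only if every object has exactly these keys.
same_keys = all(set(record.keys()) == set(columns) for record in records)

# One row per object, one cell per column.
table = [[record[column] for column in columns] for record in records]

print(columns)
for row in table:
    print(row)
```

If same_keys were False, some object would have a missing or extra key, and the json would not be a data frame without further cleanup.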

The techniques we will learn in this class work when data is a data frame. We will also learn a few techniques for fixing up data that is nearly a data frame (e.g., filling in missing data caused by merged cells in spreadsheets), and we may get to extracting data from well-structured but non-rectangular formats such as json and xml. We will not learn things like free-form web scraping.
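As a rough preview of the merged-cell fix, the sketch below fills each empty cell with the last value seen above it. It assumes a merged cell was exported with its value in the first row only and None in the rows below. The helper name fill_down and the sample rows are hypothetical, and the course frameworks may offer their own tools for this.

```python
def fill_down(rows, column_index):
    """Return a copy of rows with None in one column replaced by the value above."""
    filled = []
    last_seen = None
    for row in rows:
        new_row = list(row)  # copy so the input rows are left unchanged
        if new_row[column_index] is None:
            new_row[column_index] = last_seen
        else:
            last_seen = new_row[column_index]
        filled.append(new_row)
    return filled


# Hypothetical export of a sheet where "California" was a merged cell over two rows.
sheet_rows = [
    ["California", "Jihong Li"],
    [None, "Rupert Brown"],
    ["Idaho", "Sarah Brown"],
]

fixed_rows = fill_down(sheet_rows, 0)
for row in fixed_rows:
    print(row)
```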