Data Frames
A data frame is a special arrangement of rows, columns and headers.
A data frame has one key condition:
- Rectangular: Each row has the same number of items, in the same order.
A spreadsheet has rows and columns, but it isn’t always a data frame.
Exercise: Talk with your neighbor about three ways that a spreadsheet might not be a data frame.
Because each row has the same number of items in the same order, we can treat a column as a thing in its own right. This means that every column has a name, even if that name is just its position:
The fourth column
or, if we have textual column names:
The “manufacturer” column
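A minimal sketch of the two ways of naming a column, in plain Python. The table contents and the helper names `column_by_position` and `column_by_name` are hypothetical, made up for illustration; only the "manufacturer" column name comes from the text above.

```python
# Hypothetical table: one header row plus data rows of equal length.
header = ["model", "year", "cylinders", "manufacturer"]
rows = [
    ["a4", 1999, 4, "audi"],
    ["civic", 2008, 4, "honda"],
    ["mustang", 1999, 6, "ford"],
]


def column_by_position(rows, position):
    """Return the column at a 1-based position (4 means 'the fourth column')."""
    return [row[position - 1] for row in rows]


def column_by_name(header, rows, name):
    """Return the column whose header equals `name`."""
    return column_by_position(rows, header.index(name) + 1)


fourth = column_by_position(rows, 4)
manufacturer = column_by_name(header, rows, "manufacturer")

print(fourth)
print(manufacturer)
```

Both lookups only make sense because every row has the same items in the same order; in a ragged table, "the fourth item" could mean something different in each row.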
Each of the frameworks we use has its own name for a data frame, but all of them meet this condition.
framework | local_name |
---|---|
R tidyverse | tibble |
SQL | table or relation |
Python Pandas | DataFrame |
Excel | Sheet with Filter Headers |
Data Frames and file formats
A data frame is an abstract concept. We can store a data frame in many different file formats, but a file in one of those formats does not always contain a data frame.
For example, a csv (comma-separated values) file can store a data frame, but not every csv file is a data frame. Here is a perfectly valid (and useful) csv file:
australia,Australia,Commonwealth of Australia
south korea,Republic of Korea,South Korea,"Korea, South",KOR
usa,United States
This is valid csv, but it is not a data frame. Why not?
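One way to test a csv file against the rectangular condition is to count the fields in each row. The sketch below uses Python's standard `csv` module on a copy of the sample lines; the variable names are hypothetical, and `skipinitialspace=True` is an assumption that any space after a comma is not part of the value.

```python
import csv
import io

# Copy of the sample file, held in a string for illustration.
raw = '''australia,Australia,Commonwealth of Australia
south korea,Republic of Korea,South Korea,"Korea, South",KOR
usa,United States
'''

# Parse the text; the quoted "Korea, South" stays a single field.
rows = list(csv.reader(io.StringIO(raw), skipinitialspace=True))

# A data frame needs every row to have the same number of fields.
row_lengths = [len(row) for row in rows]
is_rectangular = len(set(row_lengths)) == 1

print(row_lengths)
print(is_rectangular)
```

The parser reads the file without complaint, which is what makes it valid csv; whether the row lengths agree is a separate question.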
Conversely, even file formats that look nothing like rows and columns can express a data frame. Here is an example in json:
[
  {
    "customer_name": "Jihong Li",
    "customer_age": 34,
    "customer_state": "California"
  },
  {
    "customer_name": "Sarah Brown",
    "customer_age": 66,
    "customer_state": "Idaho"
  },
  {
    "customer_name": "Rupert Brown",
    "customer_age": 89,
    "customer_state": "California"
  },
  {
    "customer_name": "Ravi Shastri",
    "customer_age": 62,
    "customer_state": "Maharashtra"
  }
]
This json is a data frame because every item has the same fields in the same order, so we can easily convert it into a rectangular table:
customer_name | customer_age | customer_state |
---|---|---|
Jihong Li | 34 | California |
Sarah Brown | 66 | Idaho |
Rupert Brown | 89 | California |
Ravi Shastri | 62 | Maharashtra |
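The conversion can be sketched with Python's standard `json` module. This is an illustration rather than a general-purpose converter: it assumes the top level is a list of flat objects, and the variable names are hypothetical.

```python
import json

raw = '''
[
  {"customer_name": "Jihong Li", "customer_age": 34, "customer_state": "California"},
  {"customer_name": "Sarah Brown", "customer_age": 66, "customer_state": "Idaho"},
  {"customer_name": "Rupert Brown", "customer_age": 89, "customer_state": "California"},
  {"customer_name": "Ravi Shastri", "customer_age": 62, "customer_state": "Maharashtra"}
]
'''

records = json.loads(raw)

# Take the column names from the first record.
columns = list(records[0].keys())

# Rectangular check: every record has the same fields in the same order.
same_fields = all(list(record.keys()) == columns for record in records)

# Build the table: one header row, then one row per record.
table = [columns] + [[record[name] for name in columns] for record in records]

for row in table:
    print(" | ".join(str(value) for value in row))
```

If `same_fields` came out `False`, the list would still be valid json, but it would need fixing up before it could be treated as a data frame.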
The techniques we will learn in this class work when data is a data frame. We will learn a few techniques for fixing up data that is nearly a data frame (e.g., filling in missing data due to merged cells in spreadsheets), and we may get to extracting data from well-structured but non-rectangular formats such as json and xml, but we will not learn things like free-form web scraping.
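As a preview of the merged-cell fix, here is a minimal sketch in plain Python. It assumes that a merged cell was exported with its value in the first row only and `None` in the rows below it; the data and the function name `fill_down` are hypothetical.

```python
# Hypothetical export of a sheet in which the "state" cell was merged
# across the first two rows, so the second row lost its value.
rows = [
    ["California", "Jihong Li"],
    [None, "Rupert Brown"],
    ["Idaho", "Sarah Brown"],
]


def fill_down(rows, column_index):
    """Copy the last seen value into empty (None) cells of one column."""
    filled = []
    last_seen = None
    for row in rows:
        new_row = list(row)  # copy, so the input rows stay unchanged
        if new_row[column_index] is None:
            new_row[column_index] = last_seen
        else:
            last_seen = new_row[column_index]
        filled.append(new_row)
    return filled


fixed = fill_down(rows, 0)
for row in fixed:
    print(row)
```

This only gives the right answer when an empty cell really does mean "same as above"; a cell that is empty because the value is unknown should be left alone.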