Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

Data Modeling

In this module we learn how to model data. This means understanding the sort of things we have in our dataset and the relationships between them. Thus, we will be learning “relational data modeling.” You will find that this is closely related to our next module, on SQL queries, but data modeling is more abstract and useful regardless of the specific technologies one handles data with.

In this section of the course we’ll be learning abstract modeling approaches and a concrete way of representing them in SQL. The concrete way we’ll learn is a set of conventions. As conventions they are consistent and useful, but we should remember that they aren’t followed by everyone and that there are other ways of representing similar relationships.

When we are data modeling we will always start by writing down some concrete examples of the phenomena that we want to model, just as we began the exercise in the first class with real objects in front of us. I call this a “free-form description” of your data. For example, when we had the objects on the table in front of the class, you might have written something like:

On the table are four things, a red pen, a red mug, a green highlighter, and a green vase. You can write with the pen and the highlighter. The vase might hold some flowers, while the mug might hold some coffee.

If you were modeling Olympic swimming events you might write:

Katie Ledeky swam for the US team at the 2016 Olympics. She swam a number of different events, the 200m, 400m and the 800m freestyle and she won all three in record time. She was joined on the podium by …. She didn’t swim any backstroke events.

So, by “free form description” I mean just jotting down some examples in free text that you are going to use as concrete examples for your modeling.

Three kinds of relationships

Think of the exercise we did with the cards in the first class, where we described our four objects with two attributes: color and function. Each of the individual objects had a single color and function. Conversely each color or function could have many objects; we could have many green objects or many objects that write. Thus we can informally say that there is a “one to many” relationship between color and objects, or function and objects. More formally, as we’ll discuss further below, we will call this a has_many / belongs_to relationship.

On the other hand, we could easily imagine a situation where each object was described with more than one color, so that the relationship was two-way: a single color could have many objects, but each object could have many colors. In that situation we might say that we had a “many-to-many” relationship between objects and colors. More formally, we will call this a has_and_belongs_to_many / has_and_belongs_to_many relationship.

Finally (at least for as far as we go in this course) we can imagine a third situation, in which objects can not only have multiple colors but can be re-painted from time to time. So the relationship between objects and colors is still “many to many” but we also want to model the events which caused the relationship between colors and objects. We’ll call this has_many through.

We’ll cover all three of these types of relationships over three classes, starting today with has_many / belongs_to.

has_many / belongs_to

We use has_many / belongs_to to model “one to many” relationships. The following picture shows a data model for Object, Function, and Color that uses three elements:

  1. An diagram with boxes and lines (“ER diagram”)
  2. Relational Vocabulary (has_many/belongs_to)
  3. A sketch of relational tables with examples (bottom of the page with concrete data “pen”, “yellow” etc.)

Three things to note at this point:

  1. In this course whenever I speak of a “Database design” I mean all three of these elements. It is easy to forget one of them (usually the table sketches or the sample data in them.)
  2. Each of the three parts has to be consistent with each other; they are all different ways of showing the same thing. This means that each step functions as a check against the others.
  3. The database design is a communicative document; our goal is to communicate with other people.

The boxes with the Capitalized words represent the types of things in the data model (called “entities”), and the lines represent the relationships between them. Thus Object, Color, and Function are all entities that have relationships to each other. The numbers at the ends of the lines show the kind of relationship (the * here represents “many”). I describe the numbers in more detail below. The words on the line describe the relationship; the words often come from the free-form description of the data (the concrete examples that you wrote down and are modeling from). Those labels are different than the formal has_many / belongs_to.

To read the lines between the entities, you always begin with “A single [entity]…” Following this, you use the words that describe the relationship (e.g. “has a” or “is a color of”) and then the numbers, which go at the far end of the line. You can read the numbers as “between” [first number] and [second number].” Therefore, 0..* would read as “between zero and many” and 1..1 would be read as “between one and one” or simplified to “a single.”

With those elements we can read the line from Object to Color in both directions. From left to right: “A single object has a single color” and, going right to left: “A single color is the color of between 0 and many objects.”

Reading the line from Object to Function, we would say “A single object performs a single function” and “A single function is the function of between 0 and many objects.”

From diagram to relational vocabulary

When we have the pattern of a line with a maximum number 1 at one end and * at the other, then we know that we have a “one to many” relationship. Then we know that we will use the has_many / belongs_to pair. These pairs of words are called “relationship vocabulary” and they are a good short cut for talking about database structures. Thus we look at the * at Object and the 1 (max) at Color and we know that we will write:

Color has_many Object
Object belongs_to Color

We always write the has_many first and the belongs_to second (and on the following line). These are a pair, so there is no such thing as has_many / has_many nor is there belongs_to / belongs_to. There is also no such thing as has_one / has_many.

From relational vocabulary to table sketch

Recall from our exercise with the cards that we are using id labels to represent the ends of our relationships (rather than the strings we first used). So each table has a column called id which is the primary key for that table. The primary key is one end of the string, while the foreign key is the other, linking the item identified by a primary key to items in another table. The relational vocabulary tells us where the foreign key goes. For has_many / belongs_to the foreign key always (and only) goes in the table for the entity that is to the left of belongs_to.

In our example Object is to the left of belongs_to for both Color and Function. Thus we know that the foreign keys will be in that table. We also know, because of the conventions we are using, that the foreign key column for Color will be titled color_id and the column for Function will be function_id. Because we always use id as the name of the primary key column, when we see a column with the format <table_name>_id we know that is a foreign key. These conventions became very useful down the road when we are writing SQL queries.

We underline the column titles (id, name etc.) and put the foreign keys as the rightmost columns in the table. Like our Diagrams, our table sketches are communicative: they should show the reader how we’ve modeled things. Thus they should include concrete data, but they don’t need to include every column and we can elide extra columns with an underlined ellipsis:

A note on CamelCase vs lowercase and Singular vs Plural

We use specific conventions about how we write entities and how we name tables and columns. Entities (the boxes in the diagram and the words in the relational vocabularies) represent the abstract concept of a thing (the Concept of a Color, the Concept of a Function). To make that elevated status clear we Capitalize Entities (and to cope with multiword entities we use CamelCase). Given that we’re abstracting there is only one Concept of a Color, and thus we also write Entities in the singular. We say Color and not Colors. That is slightly awkward with the nearly english form of the relational vocabulary, causing us to say “Object has_many Color” rather than the more natural “Object has_many Colors” or even “An Object has_many Colors”. You’ll get used to it.

Thus the words in the boxes in the diagram and to the left and right of the relational vocabulary are Capitalized and singular.

For the table sketches, however, we are descending into the concrete world, away from Platonic ideal of Concepts and into specific concrete instantiations of the concepts. Here we deal with “green” and “red,” not the lofty Color. Thus we use lowercase and plural for the name of the table: “colors.” Given that computers do not handle spaces easily we simplify things by using the underscore _ to separate words. We also name columns with lowercase. Foreign keys refer to a single row in their tables and so columns for foreign keys are always singular: color_id not colors_id, even though the table is named colors.

Naming things in these ways is conceptually clear. It is also immensely helpful when it comes to writing queries because we don’t have to think about whether things are capitalized or not when writing queries: we simply lowercase everything. This is also important because database systems are very inconsistent about whether and when they recognize Capitals as distinct from lowercase. We should remember, though, that our use here is convention: databases in the real world will use Capitals and singular in table and column names and even use spaces in these names. The real world is messy enough, for this class we’ll be consistent: when grading I will enforce your use of CamelCase for Entities and lowercase plurals for table names. It’s a learning process, though, so I’ll be reasonable and please let me know if you spot errors in the teaching materials!

One wrinkle of using plurals to name tables, though, is that English has irregular plurals. Thus when modeling people we could have an entity Person but a table people rather than persons. On behalf of english-speaking people everywhere I can only say “Sorry about that!” You should use the correct plural, even when it is irregular.

Quick reference for Database Design syntax.

Below I show a figure that reiterates the format that we will use for our database designs. Right-click to download and print as a “cheat sheet” for quick reference.

Modeling Exercise (in-class)

Form groups of four and you will be assigned one of the relationships below. Each should be modeled with just two entities (two boxes) and one relationship between them (one line).

You should:

  1. Brainstorm some concrete instances. Feel free to start with sentences or jotted notes, but you must have some real world examples to begin your modeling.
  2. Begin to develop your diagram. Work hard to find great illustrative names for your relationship (the labels on the lines). Don’t sweat the minimum numbers too much, it’s the maximum numbers that matter.
  3. Write out the relational vocabulary (check that you are getting the CamelCasing correct)
  4. Sketch tables and insert your concrete example data from step 1. (check that you are correctly using lowercase and underscores).

Examples (all has_many / belongs_to and only two boxes and one line.)

  1. Players on football teams (at one time, assume no changes)
  2. Customers making transactions at a grocery store
  3. Presidents and their terms in office
  4. Species and their genus (e.g., Felis Catus and Felis silvestris)
  5. State/Province and their Country
  6. Cities and states (use some cities with the same name in different states)