In this module we learn how to model data. This means understanding the sort of things we have in our dataset and the relationships between them. Thus, we will be learning “relational data modeling.” You will find that this is closely related to our next module, on SQL queries, but data modeling is more abstract and useful regardless of the specific technologies one handles data with.
In this section of the course we’ll be learning abstract modeling approaches and a concrete way of representing them in SQL. The concrete way we’ll learn is a set of conventions. As conventions they are consistent and useful, but we should remember that they aren’t followed by everyone and that there are other ways of representing similar relationships.
When we are data modeling we will always start by writing down some concrete examples of the phenomena that we want to model, just as we began the exercise in the first class with real objects in front of us. I call this a “free-form description” of your data. For example, when we had the objects on the table in front of the class, you might have written something like:
On the table are four things, a red pen, a red mug, a green highlighter, and a green vase. You can write with the pen and the highlighter. The vase might hold some flowers, while the mug might hold some coffee.
If you were modeling Olympic swimming events you might write:
Katie Ledeky swam for the US team at the 2016 Olympics. She swam a number of different events, the 200m, 400m and the 800m freestyle and she won all three in record time. She was joined on the podium by …. She didn’t swim any backstroke events.
So, by “free form description” I mean just jotting down some examples in free text that you are going to use as concrete examples for your modeling.
Think of the exercise we did with the cards in the first class, where we described our four objects with two attributes: color and function. Each of the individual objects had a single color and function. Conversely each color or function could have many objects; we could have many green objects or many objects that write. Thus we can informally say that there is a “one to many” relationship between color and objects, or function and objects. More formally, as we’ll discuss further below, we will call this a
has_many / belongs_to relationship.
On the other hand, we could easily imagine a situation where each object was described with more than one color, so that the relationship was two-way: a single color could have many objects, but each object could have many colors. In that situation we might say that we had a “many-to-many” relationship between objects and colors. More formally, we will call this a
has_and_belongs_to_many / has_and_belongs_to_many relationship.
Finally (at least for as far as we go in this course) we can imagine a third situation, in which objects can not only have multiple colors but can be re-painted from time to time. So the relationship between objects and colors is still “many to many” but we also want to model the events which caused the relationship between colors and objects. We’ll call this
We’ll cover all three of these types of relationships over three classes, starting today with
has_many / belongs_to.
has_many / belongs_to to model “one to many” relationships. The following picture shows a data model for Object, Function, and Color that uses three elements:
Three things to note at this point:
The boxes with the Capitalized words represent the types of things in the data model (called “entities”), and the lines represent the relationships between them. Thus Object, Color, and Function are all entities that have relationships to each other. The numbers at the ends of the lines show the kind of relationship (the
* here represents “many”). I describe the numbers in more detail below. The words on the line describe the relationship; the words often come from the free-form description of the data (the concrete examples that you wrote down and are modeling from). Those labels are different than the formal
has_many / belongs_to.
To read the lines between the entities, you always begin with “A single [entity]…” Following this, you use the words that describe the relationship (e.g. “has a” or “is a color of”) and then the numbers, which go at the far end of the line. You can read the numbers as “between” [first number] and [second number].” Therefore,
0..* would read as “between zero and many” and
1..1 would be read as “between one and one” or simplified to “a single.”
With those elements we can read the line from Object to Color in both directions. From left to right: “A single object has a single color” and, going right to left: “A single color is the color of between 0 and many objects.”
Reading the line from Object to Function, we would say “A single object performs a single function” and “A single function is the function of between 0 and many objects.”
When we have the pattern of a line with a maximum number
1 at one end and
* at the other, then we know that we have a “one to many” relationship. Then we know that we will use the
has_many / belongs_to pair. These pairs of words are called “relationship vocabulary” and they are a good short cut for talking about database structures. Thus we look at the
* at Object and the
1 (max) at Color and we know that we will write:
Color has_many Object Object belongs_to Color
We always write the
has_many first and the
belongs_to second (and on the following line). These are a pair, so there is no such thing as
has_many / has_many nor is there
belongs_to / belongs_to. There is also no such thing as
has_one / has_many.
Recall from our exercise with the cards that we are using
id labels to represent the ends of our relationships (rather than the strings we first used). So each table has a column called
id which is the
primary key for that table. The primary key is one end of the string, while the foreign key is the other, linking the item identified by a primary key to items in another table. The relational vocabulary tells us where the foreign key goes. For
has_many / belongs_to the foreign key always (and only) goes in the table for the entity that is to the left of
In our example
Object is to the left of
belongs_to for both Color and Function. Thus we know that the foreign keys will be in that table. We also know, because of the conventions we are using, that the foreign key column for Color will be titled
color_id and the column for Function will be
function_id. Because we always use
id as the name of the primary key column, when we see a column with the format
<table_name>_id we know that is a foreign key. These conventions became very useful down the road when we are writing SQL queries.
We underline the column titles (id, name etc.) and put the foreign keys as the rightmost columns in the table. Like our Diagrams, our table sketches are communicative: they should show the reader how we’ve modeled things. Thus they should include concrete data, but they don’t need to include every column and we can elide extra columns with an underlined ellipsis: …
We use specific conventions about how we write entities and how we name tables and columns. Entities (the boxes in the diagram and the words in the relational vocabularies) represent the abstract concept of a thing (the Concept of a Color, the Concept of a Function). To make that elevated status clear we Capitalize Entities (and to cope with multiword entities we use CamelCase). Given that we’re abstracting there is only one Concept of a Color, and thus we also write Entities in the singular. We say Color and not Colors. That is slightly awkward with the nearly english form of the relational vocabulary, causing us to say “Object has_many Color” rather than the more natural “Object has_many Colors” or even “An Object has_many Colors”. You’ll get used to it.
Thus the words in the boxes in the diagram and to the left and right of the relational vocabulary are Capitalized and singular.
For the table sketches, however, we are descending into the concrete world, away from Platonic ideal of Concepts and into specific concrete instantiations of the concepts. Here we deal with “green” and “red,” not the lofty Color. Thus we use lowercase and plural for the name of the table: “colors.” Given that computers do not handle spaces easily we simplify things by using the underscore
_ to separate words. We also name columns with lowercase. Foreign keys refer to a single row in their tables and so columns for foreign keys are always singular:
colors_id, even though the table is named
Naming things in these ways is conceptually clear. It is also immensely helpful when it comes to writing queries because we don’t have to think about whether things are capitalized or not when writing queries: we simply lowercase everything. This is also important because database systems are very inconsistent about whether and when they recognize Capitals as distinct from lowercase. We should remember, though, that our use here is convention: databases in the real world will use Capitals and singular in table and column names and even use spaces in these names. The real world is messy enough, for this class we’ll be consistent: when grading I will enforce your use of CamelCase for Entities and lowercase plurals for table names. It’s a learning process, though, so I’ll be reasonable and please let me know if you spot errors in the teaching materials!
One wrinkle of using plurals to name tables, though, is that English has irregular plurals. Thus when modeling people we could have an entity
Person but a table
people rather than
persons. On behalf of english-speaking people everywhere I can only say “Sorry about that!” You should use the correct plural, even when it is irregular.
Below I show a figure that reiterates the format that we will use for our database designs. Right-click to download and print as a “cheat sheet” for quick reference.
Form groups of four and you will be assigned one of the relationships below. Each should be modeled with just two entities (two boxes) and one relationship between them (one line).
has_many / belongs_to and only two boxes and one line.)