Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

Modeling many to many relationships with has_and_belongs_to_many

The second of our three relationship types is used when entities have a “many to many” relationship. In our running example, if our objects can be multi-colored (have more than one color), then we move from a “one to many” to a “many to many”. A cup could be both red and green, while red could still be the color of many different objects.

Another good example of a many to many relationship is a tagging system, such as Book and Genre. A genre like “mystery” or “horror” clearly includes many different books, and each book can be from more than one genre (although many will have only one.) Another good example are people and their hobbies: many people like to paint, and some of those people also like to dance.

One way of recognizing the has_and_belongs_to_many relationship is to draw out some concrete examples and use lines to join up those that are associated with each other.

If you see objects on both sides with multiple lines coming from them, then you know that you have a has_and_belongs_to_many. If only one side has lines with more than one line coming or going from it you have a has_many / belongs_to. Above, for instance, “paint” and “running” each point to two people. On the other side, “Sarah” and “Kyle” each point to two hobbies. Because both the hobbies side and the people side both have data points that send out multiple lines, we have a has_and_belongs_to_many relationship.

When we draw our ER diagram we get a * (for many as our maximum cardinality) on both ends of the line:

We can read this as “A single class enrolls between 0 and many students” and “A single student takes between 0 and many classes”.

Previously we saw that the pattern of 1 ------- * or * ------ 1 meant we had a has_many / belongs_to relationship, as in Object and Color which below has a * ------ 1 because the maximums are * and 1.

In our student and class example, though, the pattern of the maximums is * ------- * which means that we have a has_and_belongs_to_many.

From relational vocab to table sketch.

Representing a has_and_belongs_to_many relationship in tables still uses primary keys and foreign keys. But it isn’t enough to put a foreign key in one table (as we did for has_many). If we just put class_id into the students table, then that would limit each student to taking only 1 class (or we could end up with multiple rows for the student, duplicating primary keys, something we can’t do). Similarly if we just put student_id into the classes table, because each class could only have 1 student! And it’s no help having foreign keys in both tables.

We solve this problem by introducing a new table that sits between the two tables for the entities: we call it a Simple Association Table and we show it between the two tables for the entity.

Each entry in the entity tables has only one row and those tables don’t have foreign keys for this relationship. The table in the middle, though, has both foreign keys. In fact it only has both foreign keys. Simple Association Tables are special because they don’t have an id column of their own, they only have foreign key columns. As before we name a foreign key column with the name of the table it refers to, making it singular, and adding _id. Thus the foreign key referring to the classes table is class_id and the foreign key referring to the students table is student_id.

Now we can have a row for each of the lines that we drew above. Each student_id can show up on multiple rows, and each class_id can show up on multiple rows. You might be wondering, though, why we can double rows in here but not in the entity tables and how we can get away with not having an id column. The answer is that the combination of the foreign keys is unique: we can only have one row linking a specific class and a specific student. This we call a composite primary key.

Naming conventions for has_and_belongs_to_many

The simple association table is named in lowercase but with a name made up by joining the two tables names in alphabetical order. We always work to show it sitting between the entity tables that it joins. Thus when we have Class has_and_belongs_to_many Student and Student has_and_belongs_to_many Class we know that the simple association table will be called classes_students (and not students_classes because c comes before s in the alphabet.) We use the pluralized form because the table names are pluralized. Same rule if we have multiword Entities (which have their own underscores in their table names already.)

Note that the simple association table does not show up in the ER diagram, so there can be more tables than there are boxes in the ER diagram. Tables and foreign keys are a way of modeling entities and relationships but they aren’t the same thing.

habtm as a short cut.

has_and_belongs_to_many is pretty long, especially when we have to write it twice, so I often shorten it to habtm and you can too.

Modeling Exercise (in-class)

Form groups of four and you will be assigned one of the relationships below. Each should be modeled with just two entities (two boxes) and one relationship between them (one line). Hint: they are all has_and_belongs_to_many :)

You should:

  1. Brainstorm some concrete instances. Feel free to start with sentences or jotted notes, but you must have some real world examples to begin your modeling.
  2. Begin to develop your diagram. Work hard to find great illustrative names for your relationship (the labels on the lines). Don’t sweat the minimum numbers too much, it’s the maximum numbers that matter.
  3. Write out the relational vocabulary (check that you are getting the CamelCasing correct)
  4. Sketch tables and insert your concrete example data from step 1. (check that you are correctly using lowercase and underscores).

Examples (all habtm and only two boxes and one line.)

  1. Books and genres
    e.g., The Lord of the Rings is a Fantasy book, but The Hobbit is Young-Adult as is The Hunger Games.
  2. Photos and tags e.g., Photo21 was tagged “cute” “adorable” and “dog”. Photo23 was tagged “dog” and “large”.
  3. Shows that actors have been in (over their career) e.g., Hugh Laurie acted in both House, M.D. and Blackadder. Blackadder also starred Rowan Atkinson and Stephen Fry
  4. Buses and routes they can be used on.
    e.g., Route 37 was served by three busses this morning: 5PC*L23, U89*S32, and U28*628 (The specific vehicles identified by license plates).
  5. Races and competitors (no results) e.g., Shuyen, Sarah, and Swati all competed in the half-marathon. Only Swati competed in the 10,000m.

Don’t forget to brainstorm some more concrete instances and include them as sample data in your table sketches!