The Quirky Quest for Sameness: Normalization Unmasked

 · 9 min read
 · Konstantinos Tsoumas

Recently, I wrote an introductory article on Data Modelling. In this post I’ll talk about normalization, another fundamental concept in the data world.

Definition

Normalisation is the process of decomposing large tables into smaller, more manageable ones, both to eliminate redundant and duplicate data and to avoid problems with DML (Data Manipulation Language) statements such as load/insert, update, and delete.
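To see why those DML statements become painful on a big, redundant table, here is a minimal SQL sketch. The table, columns, and data are made up for this post, purely to show the classic update and delete anomalies:

```sql
-- Hypothetical denormalized table: every order row repeats the customer's city.
CREATE TABLE orders_flat (
    order_id      INTEGER PRIMARY KEY,
    customer_name TEXT    NOT NULL,
    customer_city TEXT    NOT NULL,  -- repeated on every order for the same customer
    album         TEXT    NOT NULL,
    price         NUMERIC NOT NULL
);

INSERT INTO orders_flat VALUES
    (1, 'Alice', 'Berlin', 'Abbey Road', 25),
    (2, 'Alice', 'Berlin', 'Thriller',   20);

-- Update anomaly: if Alice moves, every one of her rows must be touched;
-- missing even one row leaves the data inconsistent.
UPDATE orders_flat SET customer_city = 'Munich' WHERE customer_name = 'Alice';

-- Delete anomaly: removing Alice's orders also erases the only record of where she lives.
DELETE FROM orders_flat WHERE customer_name = 'Alice';
```

Normalisation splits the customer facts out into their own table so that each fact is stored exactly once.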

Normalisation is mostly applied to relational databases and was first defined by Edgar Codd in 1970, shortly after he introduced the relational model in June of that year (a bit of history never harmed anybody). Codd also introduced the idea of normal forms along with the term. There are seven main sequential forms (from Unnormalized Form to Sixth Normal Form), but in industry you will probably not experience normalisation beyond 4NF. Often not even 4NF (I was trying to be optimistic, okay?). Ideally, this procedure is applied during the design phase (double optimistic!). Check the Wikipedia hyperlink for the details of all the forms, if needed. Each form consists of a set of rules that can be used to test a table structure and ensure it is free of problems. So, what about them?

Advantages

Why should we normalize? Because it makes your database as cool as a cucumber.

  • Consistent (All data is consistent and accurate as each piece of data is only stored in one place).
  • Flexible (Smaller and more specific tables are easily modifiable, thus they are more flexible and adaptable).
  • Secure (Access to sensitive data is easier to restrict because tables are smaller and more specific).
  • Ensures data integrity (reduces data redundancy, so data stays accurate and complete).

Reminds you of something? No? I think of it as the Don’t Repeat Yourself (DRY) principle applied to a database, with all of its advantages.

One can think of this in a very simple way. Choose a table and test it against the normal forms. Well, that’s a start.

Let's oversimplify it with a drawing, which we can call 'Pictionary gone wrong'.

(Figure: Normalization DB)

Normal forms

Below, I will briefly describe the first three normal forms (plus the unnormalized form) and a special one, the Boyce-Codd Normal Form (BCNF), which sits just above the third. Then, because we hooomans love examples, I will work through an example and apply the normal forms to test my table.

  • Unnormalized Form (UN): No normalization. Data is stored in a single table, nested and redundant data is allowed as well. It's like a party where everything goes wild and no one cares about order.

  • First Normal Form (1NF): Engines are starting. Every column holds a single, atomic value (nothing that can be further subdivided) and there are no repeating groups. The table must have a unique primary key.1

  • Second Normal Form (2NF): Builds on 1NF and ensures that all partial dependencies are removed.2

  • Third Normal Form (3NF): Builds on 2NF and ensures the table contains only fields related to its primary key, with no transitive dependencies.3

  • Boyce-Codd Normal Form (BCNF): A “stricter” version of 3NF. It requires that every determinant in the table (anything that functionally determines other attributes) is a candidate key, i.e. something that could act as a unique identifier for the row on its own. 3NF tolerates a few edge cases where the determined attribute is itself part of a candidate key; BCNF does not (see the sketch after this list).
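To make the 3NF-versus-BCNF distinction more tangible, here is a hedged sketch. The tutoring scenario and table names are hypothetical (they are not part of the music-store example that follows), and the business rule "each tutor teaches exactly one course" is assumed purely for illustration:

```sql
-- Hypothetical booking table: a student books a course with a tutor,
-- and (by assumed business rule) each tutor teaches exactly one course.
CREATE TABLE bookings (
    student TEXT NOT NULL,
    course  TEXT NOT NULL,
    tutor   TEXT NOT NULL,
    PRIMARY KEY (student, course)   -- (student, tutor) is also a candidate key
);

-- Functional dependencies:
--   (student, course) -> tutor
--   tutor -> course                -- the problematic one
--
-- 3NF is satisfied: 'course' is part of a candidate key (a "prime" attribute),
-- so the tutor -> course dependency slips through.
-- BCNF is violated: 'tutor' determines 'course', yet 'tutor' alone is not a key.
-- The usual BCNF fix is to split the table so that every determinant is a key:
CREATE TABLE tutor_course (
    tutor  TEXT PRIMARY KEY,
    course TEXT NOT NULL
);

CREATE TABLE student_tutor (
    student TEXT NOT NULL,
    tutor   TEXT NOT NULL REFERENCES tutor_course (tutor),
    PRIMARY KEY (student, tutor)
);
```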

Theory put into practice.

Imagine a database for a music store that tracks albums, artists, and customers' orders.

Unnormalized Form (UNF):

| OrderID | Customer | Album | Artist | Genre | Quantity | Price |
|---------|----------|-------|--------|-------|----------|-------|
| 1 | Alice | Abbey Road | The Beatles | Rock | 2 | $25 |
| 2 | Bob | Thriller | Michael Jackson | Pop | 3 | $20 |
| 3 | Carol | Abbey Road | The Beatles | Rock | 1 | $25 |

Spot check: wide table, multiple columns, all sorts of information. The primary key is OrderID. To reach 1NF, we will create a unique primary key made up of two columns (called a composite key): the combination of OrderID and CustomerID.

The new table may look like:

| OrderID | CustomerID | Customer | Album | Artist | Genre | Quantity | Price |
|---------|------------|----------|-------|--------|-------|----------|-------|
| 1 | 1 | Alice | Abbey Road | The Beatles | Rock | 2 | $25 |
| 2 | 2 | Bob | Thriller | Michael Jackson | Pop | 3 | $20 |
| 3 | 3 | Carol | Abbey Road | The Beatles | Rock | 1 | $25 |

Check the 1NF conditions. Have we reached 1NF? Have we passed the atomicity test? Why?

We did because the "Album" and "Artist" columns are considered to contain atomic values. What would NOT be considered as atomic values?

Let me put it simply.

Consider a database table for storing information about people, with a column called "Full Name." If your application only ever needs the full name as a single piece of information, a value like "Paul Smith" can be treated as atomic. If, however, you need to work with first and last names separately (sorting by surname, greeting by first name), then "Paul Smith" in one column is really two distinct pieces of data squeezed together, and the column is no longer atomic for your purposes.
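If we were to sketch this 1NF version of the music-store table in SQL (a hypothetical schema I'm using just to make the composite key concrete, with column types as my assumptions), it might look like this:

```sql
-- 1NF sketch: every column holds a single atomic value and the
-- composite (OrderID, CustomerID) pair acts as the unique primary key.
CREATE TABLE orders_1nf (
    OrderID    INTEGER NOT NULL,
    CustomerID INTEGER NOT NULL,
    Customer   TEXT    NOT NULL,
    Album      TEXT    NOT NULL,
    Artist     TEXT    NOT NULL,
    Genre      TEXT    NOT NULL,
    Quantity   INTEGER NOT NULL,
    Price      NUMERIC NOT NULL,
    PRIMARY KEY (OrderID, CustomerID)
);

INSERT INTO orders_1nf VALUES
    (1, 1, 'Alice', 'Abbey Road', 'The Beatles',     'Rock', 2, 25),
    (2, 2, 'Bob',   'Thriller',   'Michael Jackson', 'Pop',  3, 20),
    (3, 3, 'Carol', 'Abbey Road', 'The Beatles',     'Rock', 1, 25);
```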

Now, to reach 2NF we need to make sure that no partial dependencies exist (check footnote).

Let's separate the tables for Orders, OrderDetails, Customers, Albums, and Artists.

Orders:

| OrderID | CustomerID |
|---------|------------|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |

OrderDetails:

| OrderID | Quantity | Price |
|---------|----------|-------|
| 1 | 2 | $25 |
| 2 | 3 | $20 |
| 3 | 1 | $25 |

Customers:

| CustomerID | Customer |
|------------|----------|
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |

Albums:

| AlbumID | Album | Genre |
|---------|-------|-------|
| 1 | Abbey Road | Rock |
| 2 | Thriller | Pop |

Artists:

| ArtistID | Artist |
|----------|--------|
| 1 | The Beatles |
| 2 | Michael Jackson |

The composite key (OrderID, CustomerID) is now the unique primary key of the Orders table, while the Customers table has CustomerID as its primary key, the Albums table has AlbumID, and the Artists table has ArtistID.
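Expressed as SQL, a sketch of the decomposed schema might look like the following (it simply mirrors the simplified tables above; column types and constraints are my assumptions):

```sql
CREATE TABLE Customers (
    CustomerID INTEGER PRIMARY KEY,
    Customer   TEXT NOT NULL
);

CREATE TABLE Artists (
    ArtistID INTEGER PRIMARY KEY,
    Artist   TEXT NOT NULL
);

CREATE TABLE Albums (
    AlbumID INTEGER PRIMARY KEY,
    Album   TEXT NOT NULL,
    Genre   TEXT NOT NULL
);

CREATE TABLE Orders (
    OrderID    INTEGER NOT NULL,
    CustomerID INTEGER NOT NULL REFERENCES Customers (CustomerID),
    PRIMARY KEY (OrderID, CustomerID)
);

CREATE TABLE OrderDetails (
    OrderID  INTEGER PRIMARY KEY,   -- matches Orders.OrderID
    Quantity INTEGER NOT NULL,
    Price    NUMERIC NOT NULL
);
```

(A fuller design would also link OrderDetails to Albums via an AlbumID and Albums to Artists via an ArtistID, but I'm keeping the tables minimal here to stay focused on the dependency rules.)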

Moving on to 3NF. Do we pass the test? Think for a moment.

"Orders" table, there's no transitive dependency because "CustomerID" depends directly on the primary key "OrderID." The "OrderDetails" table also has no transitive dependencies since both "Quantity" and "Price" depend directly on the primary key "OrderID."

We have eliminated all transitive dependencies, thus achieving 3NF compliance.
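To see what a violation would have looked like, here is a hedged counter-example (the table names are hypothetical and not part of the schema above): if the album's genre had stayed in the order rows, a transitive chain would appear.

```sql
-- Hypothetical 3NF violation: Genre is not a fact about the order,
-- it is a fact about the album, which in turn is determined by the order:
--   OrderID -> AlbumID -> Genre   (a transitive dependency)
CREATE TABLE order_albums_bad (
    OrderID INTEGER PRIMARY KEY,
    AlbumID INTEGER NOT NULL,
    Genre   TEXT    NOT NULL       -- belongs with the album, not the order
);

-- The 3NF-friendly shape keeps Genre next to the key that actually determines it,
-- exactly as the Albums table above does:
CREATE TABLE albums_3nf (
    AlbumID INTEGER PRIMARY KEY,
    Genre   TEXT NOT NULL
);

CREATE TABLE order_albums_3nf (
    OrderID INTEGER PRIMARY KEY,
    AlbumID INTEGER NOT NULL REFERENCES albums_3nf (AlbumID)
);
```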

Now, as an exercise, test your understanding and try to answer whether these tables satisfy BCNF.

Takeaways

There is a trend in modern modelling towards one huge, wide table rather than many smaller ones. Ralph Kimball, back in 1996, did not have to contend with modern obstacles such as the cost of joining tables back together in Spark (or of performing many such operations), or the data volumes we see today. A mix (wide tables plus dimensional modelling) may be the best way forward, but it genuinely depends on the use case.

Returning to normalisation. The whole purpose of normalising data (wherever it is performed) is to reduce size and redundancy, at some cost in speed, and to eliminate the redundant storage of duplicate data. In the hierarchy of priorities, however, speed comes second: data integrity must be in the front seat. Who would want inconsistent historical data that can be queried quickly? I hope nobody. I hope. There is also the notion that denormalising specific parts can be beneficial. To add a bit on that, there are now (more than ever) databases that are performant and can be tuned to run even faster (e.g., indexes, better queries), which can make denormalisation unnecessary. To be sure about your moves, you may need to measure both options (normalised and not) against your use case; otherwise, sticking to a (mostly) normalised schema is usually the safest default. It is better and safer NOT to assume that everything that works for others will work for us, and this is just one example.
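If you want to make that measurement concrete, a minimal sketch could be as simple as asking the same business question against both shapes and comparing plans or timings. The queries below reuse the hypothetical orders_1nf (wide) and Orders/OrderDetails/Customers (normalised) tables sketched earlier, and the exact EXPLAIN syntax varies per engine:

```sql
-- Normalised shape: join the pieces back together at query time.
SELECT c.Customer,
       SUM(od.Quantity * od.Price) AS total_spent
FROM   Orders o
JOIN   Customers    c  ON c.CustomerID = o.CustomerID
JOIN   OrderDetails od ON od.OrderID   = o.OrderID
GROUP BY c.Customer;

-- Denormalised (wide) shape: no joins, but customer data repeats on every row.
SELECT Customer,
       SUM(Quantity * Price) AS total_spent
FROM   orders_1nf
GROUP BY Customer;

-- Most engines let you prefix either query with EXPLAIN (or EXPLAIN ANALYZE)
-- to compare the plans and costs before deciding whether denormalising pays off.
```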

Remember, you also have to persuade your managers and colleagues that the changes would benefit the organisation you're part of, and that can take twice as long as all the other steps combined. Measurements can be your best friend here, especially when presenting to others.

In conclusion, I would like to remind you that no one-size-fits-all approach exists. Study, discuss, revise to find the appropriate strategy based on your needs.

Footnotes


    • Primary key: A single field, or a set of fields, that uniquely identifies each row in a table.

    • Partial dependencies: A partial dependency occurs when a non-key attribute is dependent on only a subset of the primary key.
      This means the non-key attribute can be determined by part of the primary key rather than by the entire key, which is why partial dependencies can only occur when the primary key is composite.
      Fictional example: The primary key of the [EmployeeProjects] table is the composite (employee_id, project_id).
      The “department_id” attribute is functionally dependent on the “employee_id” attribute alone, not on the entire primary key.
      Because part of the key is enough to determine it, “department_id” is partially dependent on the primary key (see the sketch after these footnotes).

    • Transitive dependencies: A transitive dependency in a database is a type of dependency where one non-key attribute is dependent on another non-key attribute, which is in turn dependent on the primary key.
      Fictional example: The primary key of the [Employees table] is the “employee_id” attribute.
      The “department_id” attribute is functionally dependent on the “employee_id” attribute, and the “job_title” attribute is functionally dependent on the “department_id” attribute.
      This means that there is a transitive dependency from the “job_title” attribute to the “employee_id” attribute.
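To make these two dependency types concrete, here is a hedged SQL sketch echoing the fictional employee examples above (the table and column names are hypothetical):

```sql
-- Partial dependency (violates 2NF): the primary key is the composite
-- (employee_id, project_id), but department_id is determined by employee_id alone.
CREATE TABLE employee_projects (
    employee_id   INTEGER NOT NULL,
    project_id    INTEGER NOT NULL,
    department_id INTEGER NOT NULL,  -- depends only on employee_id: partial dependency
    hours_worked  INTEGER NOT NULL,  -- depends on the full key: fine
    PRIMARY KEY (employee_id, project_id)
);

-- Transitive dependency (violates 3NF): job_title depends on department_id,
-- which depends on employee_id (the key): employee_id -> department_id -> job_title.
CREATE TABLE employees (
    employee_id   INTEGER PRIMARY KEY,
    department_id INTEGER NOT NULL,
    job_title     TEXT    NOT NULL   -- determined via department_id: transitive dependency
);
```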