# Database Normalization

### Introduction

What are the best choices when designing the schema for a relational database and what is the rationale to decide for one and against some other alternative? Given the amount of vendor-specific recommendations, it is all too easy to overlook basic relational database fundamentals. In this mini-series, I would like to discuss some general Best Practices that I have found to be particularly helpful. Nothing in it is specific to any one vendor’s product and everything should therefore be applicable, regardless which database implementation is being used.

In this article, I attempt to give an approachable introduction to the topic of database normalization and to the five Normal Forms.

#### Normalization

No discussion of relational database design is complete without a section on normalization. A normalized DB schema avoids certain anomalies when inserting, updating, or deleting data and therefore helps to keep the data in the database consistent. However, the absence of anomalies is only the tangible result of a deeper benefit of normalization: correct identification and modeling of entities. The insert, update, and delete anomalies referred to above are the consequences of the redundancy, which is introduced by improper or lacking separation between distinct entities. The normalization procedure is therefore not just a technical chore to be done out of principle, but can actively help to improve the understanding of the business domain.

Unfortunately, the treatment of normalization is often prohibitively formal, and suffers from a special, rather non-intuitive terminology. This is unfortunate, since the outcome of a normalization procedure often evokes the reaction that it all is nothing more than “Common Sense”. I will try to give explanations of expressions that the reader is likely to encounter in the literature as they come up in the discussion below.

#### Overview

Normalization is a process in which an initial DB design is transformed (or decomposed) into a a different, but equivalent design. The resulting schema is equivalent to the original one in the sense that no information is lost when going from one to the other. The normalization procedure consists of a sequence of “projections”, i.e. tables are split up “vertically”: some attributes are extracted from one table to form an new one. The decomposition is lossless only if we can restore the original table by joining its projections.

Through such non-loss decompositions it is possible to transform an
original schema into a resulting one satisfying certain conditions,
known as *Normal Forms*.
The Normal Forms span a hierarchy, in such a way, that a schema
in a higher normal form automatically fulfills all the criteria
for all of the lower normal forms.

- The First Normal Form (1NF) addresses the structure of an isolated table.
- The Second (2NF), Third (3NF), and Boyce-Codd (BCNF) Normal Forms address one-to-one and one-to-many relationships.
- The Fourth (4NF) and Fifth (5NF) Normal Forms deal with many-to-many relationships.

The Fifth Normal Form is the ultimate normal form with respect to projections and joins, i.e. it is guaranteed to be free of anomalies that can be eliminated by taking projections.

In the following discussion, any mention of keys refers to the conceptual keys formed from business data, not to any plainly technical surrogate keys, which might have been defined!

#### First Normal Form

A table is said to be in First Normal Form, if all entries in it
are scalar-valued. Relational database tables are 1NF by construction,
since vector-valued entries are forbidden. Vector-valued data
(i.e. entries which have more than one value in each row) are referred
to as *repeating groups*.

The following relation violates 1NF, since the `SupplierID`

forms a repeating group (here and in the following, primary key fields
are bold face):

{PartID, Supplier1ID, Supplier2ID, Supplier3ID }

Repeating groups indicate a one-to-many relationship — in other words a type of relationship that in relational databases is treated using foreign keys. Note that the problem of repeating groups cannot be solved by adding any number of fields to a record: even if the number of elements of the vector-valued data was fixed, finite, and predetermined, searching for a value in all these parallel fields would be prohibititively cumbersome.

To achieve 1NF, eliminate repeating groups by creating separate tables for each set of related data.

#### Second and Third Normal Form

To demonstrate the typical anomalies that occur in tables that are only 1NF, consider the following example:

{CustomerID,OrderID, CustomerAddress, OrderDate }

Note the following problems:

- Insert: It is not possible to add a record for a customer who has never placed an order.
- Update: To change the address for a customer, this change has to be repeated for all of the customer’s existing orders.
- Delete: Deleting the last order for a customer loses all information about the customer.

##### Functional Dependency

Second and Third Normal Form address dependencies among attributes, specifically between key and non-key fields.

By definition, a key uniquely determines a record: knowing the key determines the values of all the other attributes in the table row, i.e. given a key, the values of all the other attributes in the row are fixed.

This kind of relationship can be formalized as follows: Let X and Y be
attributes (or sets of attributes) of a given relation. Then Y is
*functionally dependent* on X, if, whenever two records agree on
their X-values, they must also agree on their Y-values. In this case,
X is called the *determinant* and Y is called the *dependent*.
Since for any X there must be a *single* Y, this relationship
represents a *single-valued* functional dependency.
If the set of attributes in the determinant is the smallest possible
(in the sense that after dropping one or more of the attributes from X,
the remaining set of attributes does no longer uniquely determine Y)
the dependency is called *irreducible*.

Note that functional dependency is a semantic relationship: It is the business logic of the problem domain, represented by the relation, which determines whether a certain X determines Y.

##### Second Normal Form

A table is 2NF if every non-key field is a fact about the *entire* key.
Equivalently: A table is 2NF if it is 1NF and all non-key attributes are
functionally dependent on the entire primary key (i.e. the dependency is
irreducible).

Clearly, 2NF is only relevant when the key is composite, i.e. consists of
several fields. The following example describes a table which is not 2NF,
since the `WarehouseAddress`

attribute depends only on
`WarehouseID`

but not on `PartID`

:

{PartID,WarehouseID, Quantity, WarehouseAddress }

To achieve 2NF, create separate tables for sets of values that apply to multiple records and relate these tables through foreign keys. The determinants of the initial table become the PKs of the resulting tables.

##### Third Normal Form

A relation is 3NF, if it is 2NF and none of its attributes is a fact
about another non-key field. In other words, no non-key field
functionally depends on any other non-key field. (Such indirect
dependencies are known as *transitive dependencies*.)

The following example violates 3NF, since the `Location`

is functionally dependent on the `DepartmentID.

{EmployeeID, DepartmentID, Location}

To achieve 3NF, eliminate fields that do not depend on the key from the original table and add them to the table whose primary key is their determinant.

To summarize the normalization procedure up to and including Third
Normal Form:
**Every field in a record must depend on The Key (1NF), the Whole
Key (2NF), and Nothing But The Key (3NF).**

#### Boyce-Codd Normal Form

Boyce-Codd Normal Form is an extension of 3NF to the case that there are two or more candidate keys, which are composite and overlapping (i.e. have at least one field in common). If these conditions are not fulfilled, 3NF and BCNF are equivalent.

A table is BCNF, if and only if its only determinants are candidate keys.

In the following table, both `{SupplierID, PartID}`

as well as
`{SupplierName, PartID}`

are candidate keys. The table is not
BCNF, since it contains two determinants (`SupplierID`

and
`SupplierName`

) which are not candidate keys. `SupplierID`

and `SupplierName`

are determinants, since they determine each other).

{ SupplierID, PartID, SupplierName, Quantity }

However, either of the following decompositions is BCNF:

{SupplierID, SupplierName } {SupplierID,PartID, Quantity }

or

{SupplierName, SupplierID } {SupplierName,PartID, Quantity }

To achieve BCNF, remove the determinants which are not candidate keys.

#### Fourth and Fifth Normal Form

Fourth and Fifth Normal Form apply to situations involving many-to-many relationships. In relational databases, many-to-many relationships are expressed through cross-reference tables.

As example, consider a case of class enrollment. Each student can be enrolled in one or more classes, and each class can contain one or more students. Clearly, there is a many-to-many relationship between classes and students. This relationship can be represented by a Student/Class cross-reference table:

{StudentID,ClassID}

The key for this table is the combination of `StudentID`

and
`ClassID`

. To avoid violation of 2NF, all other information
about each student and each class is stored in separate Student- and
Class-tables, respectively.

Note that each `StudentID`

determines not a unique
`ClassID`

, but a well-defined, finite *set* of values.
This kind of behaviour is referred to as *multi-valued dependency*
of `ClassID`

on `StudentID`

.

##### Fourth Normal Form

A table is in Fourth Normal Form if it is 3NF and it does not represent two or more independent many-to-many relationships.

Consider an example where there are two many-to-many relationships:
between students and classes, and between classes and teachers. There
is also an implied many-to-many relationship between students and
teachers. However, the business rules do not constrain this relationship
in any way — the combination of `StudentID`

and
`TeacherID`

does not contain any additional information,
beyond the information implied by the student/class and class/teacher
relationships. Consequentially, the student/class and class/teacher
relationships are *independent* of each other — there are no
additional constraints for these relationships. The following table
is then in violation of 4NF:

{StudentID,ClassID,TeacherID}

As an example of the anomalies that can occur, realize that it is not possible to add a new class taught by some teacher, without adding at least one student who is enrolled in this class.

To achieve 4NF, represent each independent many-to-many relationship through its own cross-reference table.

##### Fifth Normal Form

A table is in Fifth Normal Form if it is 4NF and its information content cannot be reconstructed from several tables containing fewer attributes.

Consider again the student/class/teacher example, but now assume that there is an additional relationship between students and teachers. The example table above is now 4NF, since all the relationships it describes are interrelated. However, it is not 5NF, since it can be reconstructed from three cross-reference tables, each representing one of the three many-to-many relationships:

{StudentID,ClassID} {ClassID,TeacherID} {TeacherID,StudentID}

To achieve 5NF, isolate interrelated many-to-many relationships, introducing the required number of new tables to represent all business domain constraints.

#### Normalization in Context

In practice, many databases contain entities that are at least partially
violating some of the normalization conditions. The reason is often
performance or querying convenience: a de-normalized
database may require fewer joins, and can therefore be faster for retrievals.
While this may be true, the usual caveats against premature optimization
apply here as well as everywhere else: determine sufficiently that a
performance problem exists, *and* that the proposed de-normalization
improves it, before introducing a conceptually suboptimal design.

Furthermore, a de-normalized schema can be harder to update. The additional integrity checks that are necessary in this case may offset the performance gains for queries obtained through denormalization.

*Originally published on IBM developerWorks in 2003.*