Database normalization is a series of steps followed to obtain a database design that allows for consistent storage and efficient access of data in a relational database. These steps reduce data redundancy and the chances of data becoming inconsistent.
However, many relational DBMSs lack sufficient separation between the logical database design and the physical implementation of the data store, such that queries against a fully normalized database often perform poorly. In this case denormalization is sometimes used to improve performance, at the cost of reduced consistency guarantees.
A table in a relational database is said to be in a certain normal form if it satisfies certain constraints. Edgar F. Codd's original work defined three such forms but there are now other generally accepted normal forms. We give here a short informal overview of the most common ones. Each normal form below represents a stronger condition than the previous one (in the order below). For most practical purposes, databases are considered normalized if they adhere to third normal form.
Before we can talk about normalization we first need to fix some terms from the relational model and define them in set theory. These definitions will sometimes be simplifications of their proper definitions in this model because normalization only concerns certain aspects of the relational model.
Basic notions in the relational model are relation names and attribute names. We will represent these as strings such as "Person" and "name" and we will usually use the variables r, s, t, ... and a, b, c to range over them. Another basic notion is the set of atomic values that contains values such as numbers and strings.
Our first definition concerns the notion of tuple which formalizes the notion of row or record in a table:
The next definition defines relation which formalizes the contents of a table as it is defined in the relational model.
Such a relation closely corresponds to what is usually called the extension of a predicate in first-order logic except that here we identify the places in the predicate with attribute names. Usually in the relational model a database schema is said to consist of a set of relation names, the headers that are associated with these names and the constraints that should hold for every instance of the database schema. For normalization we will concentrate on the constraints that hold for individual relations, i.e., the relation constraints. The purpose of these constraints is to describe the relation universe, i.e., the set of all relations that are allowed to be associated with a certain relation name.
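The notions of tuple and relation described above can be sketched in code. The following is a minimal illustration and not part of the formal treatment: it assumes that a tuple is modeled as a finite mapping from attribute names to atomic values, and that a relation is a header (a set of attribute names) together with a set of tuples defined on exactly that header. The names `make_tuple`, `header_of` and `make_relation` are hypothetical helpers introduced only for this example.

```python
from typing import FrozenSet, Iterable, Tuple, Union

# Assumption: atomic values are numbers and strings.
Atom = Union[int, float, str]
# Assumption: a tuple is stored as a frozen set of (attribute name, value) pairs,
# so the order in which attributes are written does not matter.
Row = FrozenSet[Tuple[str, Atom]]


def make_tuple(**values: Atom) -> Row:
    """Build a tuple from attribute-name/value pairs."""
    return frozenset(values.items())


def header_of(row: Row) -> FrozenSet[str]:
    """Return the set of attribute names on which the tuple is defined."""
    return frozenset(name for name, _ in row)


def make_relation(header: Iterable[str], rows: Iterable[Row]):
    """Pair a header with a set of tuples, all defined on exactly that header."""
    header = frozenset(header)
    body = frozenset(rows)  # a set: duplicate tuples collapse into one
    for row in body:
        if header_of(row) != header:
            raise ValueError("tuple does not match the relation header")
    return header, body


person = make_relation(
    {"name", "age"},
    [
        make_tuple(name="Ann", age=30),
        make_tuple(age=30, name="Ann"),  # same tuple, attributes in another order
        make_tuple(name="Bob", age=25),
    ],
)
```

Because the body is a set, the two ways of writing the tuple for "Ann" count as one element, which mirrors the set-theoretic view of a relation taken here.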
One of the simplest and most important types of relation constraint is the key constraint. It tells us that in every instance of a certain relational schema the tuples can be identified by their values for certain attributes.
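As an informal illustration of the key constraint, the sketch below checks whether the tuples of a single relation instance can be told apart by their values for a given set of attributes. It assumes that tuples are represented as Python dictionaries; the function name `identifies_tuples` is hypothetical. Note that a key constraint is a statement about every instance of the schema, so a check on one instance can refute the constraint but cannot establish it.

```python
def identifies_tuples(rows, attrs):
    """Return True if no two rows in this instance agree on all of attrs."""
    attrs = sorted(attrs)  # fix an order so projections are comparable
    seen = set()
    for row in rows:
        projection = tuple(row[a] for a in attrs)
        if projection in seen:
            return False  # two rows share the same values for attrs
        seen.add(projection)
    return True


# Hypothetical sample instance of a "Person" relation.
people = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Oslo"},
    {"id": 3, "name": "Ann", "city": "Rome"},
]
```

In this sample instance the attribute `id` identifies every tuple, `name` alone does not, and `name` together with `city` does.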
INPUT: a set S of FDs that contain only subsets of a header H
OUTPUT: the set C of superkeys that hold as candidate keys in all relation universes over H in which all FDs in S hold
begin
  C := ∅;          // found candidate keys
  Q := { H };      // superkeys that contain candidate keys
  while Q <> ∅ do
    let K be some element from Q;
    Q := Q - { K };
    minimal := true;
    for each X->Y in S do
      K' := (K - Y) ∪ X;   // derive new superkey
      if K' ⊂ K then
        minimal := false;
        Q := Q ∪ { K' };
      fi
    od
    if minimal and there is not a subset of K in C then
      remove all supersets of K from C;
      C := C ∪ { K };
    fi
  od
end
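The algorithm above can be transcribed into Python roughly as follows. This is a sketch under stated assumptions, not a reference implementation: a functional dependency X->Y is assumed to be given as a pair of attribute collections, attribute sets are represented as `frozenset` values, and the function name `candidate_keys` is chosen only for this example.

```python
def candidate_keys(header, fds):
    """Follow the pseudocode: shrink superkeys until no FD reduces them."""
    header = frozenset(header)
    # Assumption: each FD X->Y is passed as a pair (X, Y) of attribute collections.
    fds = [(frozenset(x), frozenset(y)) for x, y in fds]

    found = set()     # C: candidate keys found so far
    queue = {header}  # Q: superkeys that contain candidate keys

    while queue:
        k = queue.pop()  # "let K be some element from Q" and remove it
        minimal = True
        for x, y in fds:
            smaller = (k - y) | x  # K' := (K - Y) ∪ X
            if smaller < k:        # proper subset test, K' ⊂ K
                minimal = False
                queue.add(smaller)
        if minimal and not any(c <= k for c in found):
            # Remove all supersets of K from C, then add K.
            found = {c for c in found if not k < c}
            found.add(k)
    return found


# Header {A, B, C} with the FDs A->B and B->C.
chain_keys = candidate_keys("ABC", [("A", "B"), ("B", "C")])

# Header {A, B, C} with the FDs A->B and B->A.
cycle_keys = candidate_keys("ABC", [("A", "B"), ("B", "A")])
```

For the first example the sketch yields the single candidate key {A}; for the second it yields the two candidate keys {A, C} and {B, C}. The loop terminates because every set added to the queue is a proper subset of the set being processed.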