Chapter 8 Normalization and Other Data Modeling Methods

There are many paths to the top of the mountain, but the view is always the same.

Chinese Proverb

Learning objectives

Students completing this chapter will

  • understand the process of normalization;

  • be able to distinguish between different normal forms;

  • recognize that different data modeling approaches, while they differ in representation methods, are essentially identical.

Multiple paths

There are often many ways to solve a problem, and different methods frequently produce essentially the same solution. When this is the case, the goal is to find the most efficient path. We believe that data modeling, as you have learned in the preceding chapters, is an efficient and easily learned approach to database design. There are, however, other paths to consider. One of these methods, normalization, was the initial approach to database design. It was developed as part of the theoretical foundation of the relational data model. It is useful to understand the key concepts of normalization because they advance your understanding of the relational model. Experience and research indicate, however, that data modeling is easier to master than normalization.

Data modeling first emerged as entity-relationship (E-R) modeling in a paper by Peter Pin-Shan Chen. He introduced two key database design concepts:

  • Identify entities and the relationships between them.

  • A graphical representation improves design communication.

Chen’s core concepts spawned many species of data modeling. We briefly review, later in this chapter, Chen’s E-R approach.

Normalization

Normalization is a method for increasing the quality of database design. It is also a theoretical base for defining the properties of relations. The theory gradually developed to create an understanding of the desirable properties of a relation. The goal of normalization is identical to that of data modeling—a high-fidelity design. The need for normalization seems to have arisen from the conversion of prior file systems into database format. Often, analysts started with the old file design and used normalization to design the new database. Now, designers are more likely to start with a clean slate and use data modeling.

Normal forms can be arrived at in several ways. The recommended approach is data modeling, as experience strongly indicates people find it is an easier path to high quality database design. If the principles of data modeling are followed faithfully, then the outcome should be a high-fidelity model and a normalized database. In other words, if you model data correctly, you create a normalized design. Nevertheless, modeling mistakes can occur, and normalization is a useful crosscheck for ensuring the soundness of a data model. Normalization also provides a theoretical underpinning to data modeling.

Normalization gradually converts a file design into normal form by the successive application of rules to move the design from first to fifth normal form. Before we look at these steps, it is useful to learn about functional dependency.

Functional dependency

A functional dependency is a relationship between attributes in an entity. It simply means that one or more attributes determine the value of another. For example, given a stock’s code, you can determine its current PE ratio. In other words, PE ratio is functionally dependent on stock code. In addition, stock name, stock price, stock quantity, and stock dividend are functionally dependent on stock code. The notation for indicating that stock code functionally determines stock name is

stock code → stock name

An identifier functionally determines all the attributes in an entity. That is, if we know the value of stock code, then we can determine the value of stock name, stock price, and so on.

Formulae, such as yield = stock dividend/stock price*100, are a form of functional dependency. In this case, we have

(stock dividend, stock price) → yield

This is an example of full functional dependency because yield can be determined only from both attributes.

An attribute, or set of attributes, that fully functionally determines another attribute is called a determinant. Thus, stock code is a determinant because it fully functionally determines stock PE. An identifier, usually called a key when discussing normalization, is a determinant. Unlike a key, a determinant need not be unique. For example, a university could have a simple fee structure where undergraduate courses are $5000 and graduate courses are $7500. Thus, course type → fee. Since there are many undergraduate and graduate courses, course type is not unique for all records.

There are situations where a given value determines multiple values. This multidetermination property is denoted as A ↠ B and reads “A multidetermines B.” For instance, a department multidetermines a course. If you know the department, you can determine the set of courses it offers. Multivalued dependency means that functional dependencies are multivalued.

Functional dependency is a property of a relation’s data. We cannot determine functional dependency from the names of attributes or the current values. Sometimes, examination of a relation’s data will indicate that a functional dependency does not exist, but it is by understanding the relationships between data elements we can determine functional dependency.

Functional dependency is a theoretical avenue for understanding relationships between attributes. If we have two attributes, say A and B, then three relations are possible, as shown in the following table.

Functional dependencies of two attributes

Relationship Functional dependency Relationship
They determine each other A → B and B → A 1:1
One determines the other A → B 1:m
They do not determine each other A not → B and B not → A m:m

One-to-one attribute relationship

Consider two attributes that determine each other (A → B and B → A), for instance a country’s code and its name. Using the example of Switzerland, there is a one-to-one (1:1) relationship between CH and Switzerland: CH → Switzerland and Switzerland → CH. When two attributes have a 1:1 relationship, they must occur together in at least one table in a database so that their equivalence is a recorded fact.

One-to-many attribute relationship

Examine the situation where one attribute determines another (i.e., A → B), but the reverse is not true (i.e., B not → A), as is the case with country name and its currency unit. If you know a country’s name, you can determine its currency, but if you know the currency unit (e.g., the euro), you cannot always determine the country (e.g., both Italy and Portugal use the euro). As a result, if A and B occur in the same table, then A must be the key. In our example, country name would be the key, and currency unit would be a nonkey column.

Many-to-many attribute relationship

The final case to investigate is when neither attribute determines the other (i.e., A not → B and B not → A). The relationship between country name and language is many-to-many (m:m). For example, Belgium has two languages (French and Flemish), and French is spoken in many countries. Thus, an associative entity between country and language is required to represent the m:m relationship.

As you can understand from the preceding discussion, functional dependency is an explicit form of presenting some of the ideas you gained implicitly in earlier chapters on data modeling. The next step is to delve into another theoretical concept underlying the relational model: normal forms.

Normal forms

Normal forms describe a classification of relations. Initial work by Codd identified first (1NF), second (2NF), and third (3NF) normal forms. Later researchers added Boyce-Codd (BCNF), fourth (4NF), and fifth (5NF) normal forms. Normal forms are nested like a set of Russian dolls, with the innermost doll, 1NF, contained within all other normal forms. The hierarchy of normal forms is 5NF, 4NF, BCNF, 3NF, 2NF, and 1NF. Thus, 5NF is the outermost doll.

A new normal form, domain-key normal form (DK/NF), appeared in 1981. When a relation is in DK/NF, there are no modification anomalies. Conversely, any relation that is free of anomalies must be in DK/NF. The difficult part is discovering how to convert a relation to DK/NF.

First normal form

A relation is in first normal form if and only if all columns are single-valued. In other words, 1NF states that all occurrences of a row must have the same number of columns. In data modeling terms, this means that an attribute must have a single value. An attribute that can have multiple values must be represented as a one-to-many (1:m) relationship. At a minimum, a data model will be in 1NF, because all attributes of an entity are required to be single-valued.

Second normal form

Second normal form is violated when a nonkey column is dependent on only a component of the primary key. This can also be stated as a relation is in second normal form if and only if it is in first normal form, and all nonkey columns are dependent on the key.

Consider the following table. The primary key of order is a composite of itemno and customerid. The problem is that customer-credit is a fact about customerid (part of the composite key) rather than the full key (itemno+customerid), or in other words, it is not fully functionally dependent on the primary key. An insert anomaly arises when you try to add a new customer to order. You cannot add a customer until that person places an order, because until then you have no value for item number, and part of the primary key will be null. Clearly, this is neither an acceptable business practice nor an acceptable data management procedure.

Analyzing this problem, you realize an item can be in many orders, and a customer can order many items—an m:m relationship. By drawing the data model in the following figure, you realize that customer-credit is an attribute of customer, and you get the correct relational mapping.

Second normal form violation

*itemno *customerid quantity customer-credit
12 57 25 OK
34 679 3 POOR

Resolving second normal form violation

Third normal form

Third normal form is violated when a nonkey column is a fact about another nonkey column. Alternatively, a relation is in third normal form if and only if it is in second normal form and has no transitive dependencies.

The problem in the following table is that exchange rate is a fact about nation, a nonkey field. In the language of functional dependency, exchange rate is not fully functionally dependent on stockcode, the primary key.

The functional dependencies are stockcodenationexchange rate. In other words, exchange rate is transitively dependent on STOCK, since exchange rate is dependent on nation, and nation is dependent on stockcode. The fundamental problem becomes very apparent when you try to add a new nation to the stock table. Until you buy at least one stock for that nation, you cannot insert the nation, because you do not have a primary key. Similarly, if you delete MG from the stock table, details of the USA exchange rate are lost. There are modification anomalies.

When you think about the data relationships, you realize that a nation has many stocks and a stock belongs to only one nation. Now the data model and relational map can be created readily .

Third normal form violation

*stockcode nation exchange rate
MG USA 0.67
IR AUS 0.46

Resolving third normal form violation

Skill builder

You have been given a spreadsheet that contains details of invoices. The
column headers for the spreadsheet are date, invoice number, invoice amount, invoice tax, invoice total, cust number, custname, cust street, cust city, cust state, cust postal code, cust nation, product code, product price, product quantity, salesrep number, salesrep first name, salesrep last name, salesrep district, district name, and district size.

Normalize this spreadsheet so that it can be converted to a high-fidelity relational database.

Boyce-Codd normal form

The original definition of 3NF did not cover a situation that, although rare, can occur. So Boyce-Codd normal form, a stronger version of 3NF, was developed. BCNF is necessary because 3NF does not cover the cases when

  • A relation has multiple candidate keys.

  • Those candidate keys are composite.

  • The candidate keys overlap because they have at least one column in common.

Before considering an example, candidate key needs to be defined. Earlier we introduced the idea that an entity could have more than one potential unique identifier. These identifiers become candidate keys when the data model is mapped to a relational database. One of these candidates is selected as the primary key.

Consider the following case from a management consulting firm. A client can have many types of problems (e.g., finance, personnel), and the same problem type can be an issue for many clients. Consultants specialize and advise on only one problem type, but several consultants can advise on one problem type. A consultant advises a client. Furthermore, for each problem type, the client is advised by only one consultant. If you did not use data modeling, you might be tempted to create the following table.

Boyce-Codd normal form violation

*client *probtype consultant
Alpha Marketing Gomez
Alpha Production Raginiski

The column client cannot be the primary key because a client can have several problem types; however, a client is advised by only one consultant for a specific problem type, so the composite key client+probtype determines consultant. Also, because a consultant handles only one type of problem, the composite key client+consultant determines probtype. So, both of these composites are candidate keys. Either one can be selected as the primary key, and in this case client+probtype was selected. Notice that all the previously stated conditions are satisfied—there are multiple, composite candidate keys that overlap. This means the table is 3NF, but not BCNF. This can be easily verified by considering what happens if the firm adds a new consultant. A new consultant cannot be added until there is a client—an insertion anomaly. The problem is that consultant is a determinant, consultantprobtype, but is not a candidate key. In terms of the phrasing used earlier, the problem is that part of the key column is a fact about a nonkey column. The precise definition is a relation is in Boyce-Codd normal form if and only if every determinant is a candidate key.

This problem is avoided by creating the correct data model and then mapping to a relational model.

Resolving Boyce-Codd normal form violation

Fourth normal form

Fourth normal form requires that a row should not contain two or more independent multivalued facts about an entity. This requirement is more readily understood after investigating an example.

Consider students who play sports and study subjects. One way of representing this information is shown in the following table. Consider the consequence of trying to insert a student who did not play a sport. Sport would be null, and this is not permissible because part of the composite primary key would then be null—a violation of the entity integrity rule. You cannot add a new student until you know her sport and her subject. Modification anomalies are very apparent.

Fourth normal form violation

*studentid *sport *subject
50 Football English
50 Football Music
50 Tennis Botany
50 Karate Botany

This table is not in 4NF because sport and subject are independent multivalued facts about a student. There is no relationship between sport and subject. There is an indirect connection because sport and subject are associated with a student. In other words, a student can play many sports, and the same sport can be played by many students—a many-to-many (m:m) relationship between student and sport. Similarly, there is an m:m relationship between student and subject. It makes no sense to store information about a student’s sports and subjects in the same table because sport and subject are not related. The problem arises because sport and subject are multivalued dependencies of student. The solution is to convert multivalued dependencies to functional dependencies. More formally, a relation is in fourth normal form if it is in Boyce-Codd normal form and all multivalued dependencies on the relation are functional dependencies. A data model sorts out this problem, and note that the correct relational mapping requires five tables.

Resolving fourth normal form violation

Fifth normal form

Fifth normal form deals with the case where a table can be reconstructed from other tables. The reconstruction approach is preferred, because it means less redundancy and fewer maintenance problems.

A consultants, firms, and skills problem is used to illustrate the concept of 5NF. The problem is that consultants provide skills to one or more firms and firms can use many consultants; a consultant has many skills and a skill can be used by many firms; and a firm can have a need for many skills and the same skill can be required by many firms. The data model for this problem has the ménage-à-trois structure which was introduced in Chapter 7.

The CONSULTANT-FIRM-SKILL data model without a rule

The relational mapping of the three-way relationships results in four tables, with the associative entity ASSIGNMENT mapping shown in the following table.

Relation table assignment

*consultid *firmid *skilldesc
Tan IBM Database
Tan Apple Data comm

The table is in 5NF because the combination of all three columns is required to identify which consultants supply which firms with which skills. For example, we see that Tan advises IBM on database and Apple on data communications. The data in this table cannot be reconstructed from other tables.

The preceding table is not in 5NF if there is a rule of the following form: If a consultant has a certain skill (e.g., database) and has a contract with a firm that requires that skill (e.g., IBM), then the consultant advises that firm on that skill (i.e., he advises IBM on database). Notice that this rule means we can infer a relationship, and we no longer need the combination of the three columns. As a result, we break the single three-entity m:m relationship into three two-entity m:m relationships. The revised data model after the introduction of the rule is:

Further understanding of 5NF is gained by examining the relational tables resulting from these three associative entities. Notice the names given to each of the associative entities. The table contract records data about the firms a consultant advises; advise describes the skills a consultant has; and require stores data about the types of skills each firm requires. To help understand this problem, examine the three tables below.

Relational table contract

*consultid *firmid
Gonzales Apple
Gonzales IBM
Gonzales NEC
Tan IBM
Tan NEC
Wood Apple

Relational table advise

*consultid *skilldesc
Gonzales Database
Gonzales Data comm
Gonzales Groupware
Tan Database
Tan Data comm
Wood Data comm

Relational table require

*firmid *skilldesc
IBM Data comm
IBM Groupware
NEC Data comm
NEC Database
NEC Groupware
Apple Data comm

Consider joining contract and advise, the result of which we call could advise because it lists skills the consultant could provide if the firm required them. For example, Tan has skills in database and data communications, and Tan has a contract with IBM. If IBM requires database skills, Tan can handle it. We need to look at require to determine whether IBM requires advice on database; it does not.

Relational table could advise

*consultid *firmid *skilldesc
Gonzales Apple Database
Gonzales Apple Data comm
Gonzales Apple Groupware
Gonzales IBM Database
Gonzales IBM Data comm
Gonzales IBM Groupware
Gonzales NEC Database
Gonzales NEC Data comm
Gonzales NEC Groupware
Tan IBM Database
Tan IBM Data comm
Tan NEC Database
Tan NEC Data comm
Wood Apple Data comm

Relational table can advise

The join of could advise with require gives details of a firm’s skill needs that a consultant can provide. The table can advise is constructed by directly joining contract, advise, and require. Because we can construct can advise from three other tables, the data are in 5NF.

*consultid *firmid *skilldesc
Gonzales IBM Data comm
Gonzales IBM Groupware
Gonzales NEC Database
Gonzales NEC Data comm
Gonzales NEC Groupware
Tan IBM Data comm
Tan NEC Database
Tan NEC Data comm
Wood Apple Data comm

Since data are stored in three separate tables, updating is easier. Consider the case where IBM requires database skills. We only need to add one row to require to record this fact. In the case of can advise, we would have to add two rows, one for Gonzales and one for Tan.

Now we can give 5NF a more precise definition: A relation is in fifth normal form if and only if every join dependency of the relation is a consequence of the candidate keys of the relation.

Up to this point, data modeling has enabled you to easily avoid normalization problems. Fifth normal form introduces a complication, however. How can you tell when to use a single three-way associative entity or three two-way associative entities? If there is a constraint or rule that is applied to the relationships between entities, consider the possibility of three two-way associative entities. Question the client carefully whenever you find a ménage-à-trois. Check to see that there are no rules or special conditions.

Domain-key normal form

The definition of DK/NF builds on three terms: key, constraint, and domain. You already know a key is a unique identifier. A constraint is a rule governing attribute values. It must be sufficiently well-defined so that its truth can be readily evaluated. Referential integrity constraints, functional dependencies, and data validation rules are examples of constraints. A domain is a set of all values of the same data type. With the help of these terms, the concept of DK/NF is easily stated. A relation is in domain-key normal form if and only if every constraint on the relation is a logical consequence of the domain constraints and the key constraints that apply to the relation.

Note that DK/NF does not involve ideas of dependency; it just relies on the concepts of key, constraint, and domain. The problem with DK/NF is that while it is conceptually simple, no algorithm has been developed for converting relations to DK/NF. Hence, database designers must rely on their skills to create relations that are in DK/NF.

Conclusion

Normalization provides designers with a theoretical basis for understanding what they are doing when modeling data. It also alerts them to be aware of problems that they might not ordinarily detect when modeling. For example, 5NF cautions designers to investigate situations carefully (i.e., look for special rules), when their data model contains a ménage-à-trois.

Other data modeling methods

As mentioned previously, there are many species of data modeling. Here we consider one of the most widely known data modeling methods, the the E-R model developed by Chen. There is no standard for the E-R model, and it has been extended and modified in a number of ways.

As the following figure shows, an E-R model looks like the data models with which you are now familiar. Entities are shown by rectangles, and relationships are depicted by diamonds within which the cardinality of the relationship is indicated (e.g., there is a 1:m relationship between NATION and STOCK EXCHANGE). Cardinality can be 1:1, 1:m (conventionally shown as 1:N in E-R diagrams), and m:m (shown as M:N). Recursive relationships are also readily modeled. One important difference is that an m:m relationship is not shown as an associative entity; thus the database designer must convert this relationship to a table.

An E-R diagram

Attributes are shown as ellipses connected to the entity or relationship to which they belong (e.g., stock name is an attribute of STOCK). Relationships are named, and the name is shown above or below the relationship symbol. For example, LISTING is the name of the m:m relationship between STOCK and STOCK EXCHANGE. The method of recording attributes consumes space, particularly for large data models. It is more efficient to record attributes within an entity rectangle, as in the data modeling method you have learned.

Skill builder

Ken is a tour guide for trips that can last several weeks. He has requested a database that will enable him to keep track of the people who have been on the various tours he has led. He wants to be able to record his comments about some of his clients to remind him of extra attention he might need to pay to them on a future trip (e.g., Bill tends to get lost, Daisy has problems getting along with others). Some times people travel with a partner, and Ken wants to record this in his database.

Design a database using MySQL Workbench. Experiment with Object Notation and Relationship Notation (options of the Model tab) to get a feel for different forms of representing data models.

Conclusion

The underlying concepts of different data modeling methods are very similar. They all have the same goal: improving the quality of database design. They share common concepts, such as entity and relationship mapping. We have adopted an approach that is simple and effective. The method can be learned rapidly, novice users can create high-quality models, and it is similar to the representation style of MySQL Workbench. Also, if you have learned one data modeling method, you can quickly adapt to the nuances of other methods.

Summary

Normalization gradually converts a file design into normal form by successive application of rules to move the design from first to fifth normal form. Functional dependency means that one or more attributes determine the value of another. An attribute, or set of attributes, that fully functionally determines another attribute is called a determinant. First normal form states that all occurrences of a row must have the same number of columns. Second normal form is violated when a nonkey column is a fact about a component of the prime key. Third normal form is violated when a nonkey column is a fact about another nonkey column. Boyce-Codd normal form is a stronger version of third normal form. Fourth normal form requires that a row should not contain two or more independent multivalued facts about an entity. Fifth normal form deals with the case where a table can be reconstructed from data in other tables. Fifth normal form arises when there are special rules or conditions. A high-fidelity data model will be of high normal form.

One of the most widely known methods of data modeling is the E-R model. The basic concepts of most data modeling methods are very similar. All aim to improve database design.

Key terms and concepts

Alternate key Functional dependency
Boyce-Codd normal form (BCNF) Generalization hierarchy
Constraint Inversion entry
Data model Many-to-many attribute relationship
Determinant Multidetermination
Domain Multivalued dependency
Domain-key normal form (DK/NF) Normalization
Entity One-to-many attribute relationship
Entity-relationship (E-R) model One-to-one attribute relationship
Fifth normal form (5NF) Relationship
First normal form (1NF) Second normal form (2NF)
Fourth normal form (1NF) Third normal form (3NF)
Full functional dependency

References and additional readings

Chen, P. 1976. The entity-relationship model—toward a unified view of data. ACM Transactions on Database Systems 1 (1):9–36.

Kent, W. 1983. A simple guide to five normal forms in relational database theory. Communications of the ACM 26 (2): 120-125.

Exercises

Short answers

  1. What is normalization, and what is its goal?

  2. How does DK/NF differ from earlier normal forms?

  3. How does E-R differ from the data modeling method of this text?

Normalization and data modeling

Using normalization or E-R create data models from the following narratives, which are sometimes intentionally incomplete. You will have to make some assumptions. Make certain you state these assumptions alongside your data model. Define the identifier(s) and attributes of each entity.

  1. The president of a book wholesaler has told you that she wants information about publishers, authors, and books.

  2. A university has many subject areas (e.g., MIS, Romance languages). Professors teach in only one subject area, but the same subject area can have many professors. Professors can teach many different courses in their subject area. An offering of a course (e.g., Data Management 457, French 101) is taught by only one professor at a particular time.

  3. Kids’n’Vans retails minivans for a number of manufacturers. Each manufacturer offers several models of its minivan (e.g., SE, LE, GT). Each model comes with a standard set of equipment (e.g., the Acme SE comes with wheels, seats, and an engine). Minivans can have a variety of additional equipment or accessories (radio, air conditioning, automatic transmission, airbag, etc.), but not all accessories are available for all minivans (e.g., not all manufacturers offer a driver’s side airbag). Some sets of accessories are sold as packages (e.g., the luxury package might include stereo, six speakers, cocktail bar, and twin overhead foxtails).

  4. Steve operates a cinema chain and has given you the following information:

    “I have many cinemas. Each cinema can have multiple theaters. Movies are shown throughout the day starting at 11 a.m. and finishing at 1 a.m. Each movie is given a two-hour time slot. We never show a movie in more than one theater at a time, but we do shift movies among theaters because seating capacity varies. I am interested in knowing how many people, classified by adults and children, attended each showing of a movie. I vary ticket prices by movie and time slot. For instance, Lassie Get Lost is 50 cents for everyone at 11 a.m. but is 75 cents at 11 p.m.”

  5. A telephone company offers a 10 percent discount to any customer who phones another person who is also a customer of the company. To be eligible for the discount, the pairing of the two phone numbers must be registered with the telephone company. Furthermore, for billing purposes, the company records both phone numbers, start time, end time, and date of call.