Understanding Normalization in DBMS: Anomalies and Normal Forms
Learn about database normalization, including insertion, update, and deletion anomalies, as well as different normal forms like 1NF, 2NF, 3NF, and BCNF.
Anomalies in Databases
Databases often face issues known as anomalies, which can lead to data inconsistencies and integrity problems. Normalization helps resolve these anomalies by organizing data efficiently.
Types of Anomalies
-
Insertion Anomaly: Difficulty in adding new data due to missing other related information.
-
Update Anomaly: Changes in one part of the database need to be updated in multiple places, leading to potential inconsistencies.
-
Deletion Anomaly: Removing data unintentionally deletes other valuable data.
Let's illustrate these anomalies using an example:
Unnormalized Table
| StudentID | StudentName | Major | CourseNo | CourseName | InstructorName | Grade |
|---|---|---|---|---|---|---|
| 101 | Alice | CS | 1 | Math | Mr. Smith | A |
| 102 | Bob | EE | 1 | Math | Mr. Smith | B |
| 103 | Charlie | CS | 2 | English | Mrs. Johnson | C |
| 104 | David | ME | 3 | Physics | Mr. Brown | A |
| 105 | Eve | EE | 3 | Physics | Mr. Brown | B |
-
Insertion Anomaly: Adding a new course without enrolled students is not possible without adding student data.
-
Update Anomaly: Changing Mr. Brown's name requires updates in multiple rows, which can lead to inconsistencies if not done correctly.
-
Deletion Anomaly: If Charlie drops English, deleting his row would also remove information about the English course and Mrs. Johnson.
Normalization
Normalization is the process of structuring a relational database to minimize redundancy and improve data integrity. It involves organizing data into multiple related tables.
First Normal Form (1NF)
Condition: Each cell should contain only atomic (indivisible) values.
Below is an example of a table not in 1NF:
| StudentID | StudentName | PhoneNo |
|---|---|---|
| 101 | Alice | 9090909090, 8080808080 |
To convert it to 1NF:
| StudentID | StudentName | PhoneNo |
|---|---|---|
| 101 | Alice | 9090909090 |
| 101 | Alice | 8080808080 |
Second Normal Form (2NF)
Condition: The table should be in 1NF, and all non-key attributes must be fully dependent on the primary key.
In our original example, the InstructorName depends only on CourseNo, not on both StudentID and CourseNo, indicating a partial dependency.
To resolve this:
Student Table
| StudentID | StudentName | Major |
|---|---|---|
| 101 | Alice | CS |
| 102 | Bob | EE |
| 103 | Charlie | CS |
| 104 | David | ME |
| 105 | Eve | EE |
Course Table
| CourseNo | CourseName | InstructorNo |
|---|---|---|
| 1 | Math | 1 |
| 2 | English | 2 |
| 3 | Physics | 3 |
Instructor Table
| InstructorNo | InstructorName |
|---|---|
| 1 | Mr. Smith |
| 2 | Mrs. Johnson |
| 3 | Mr. Brown |
Grade Table
| StudentID | CourseNo | Grade |
|---|---|---|
| 101 | 1 | A |
| 102 | 1 | B |
| 103 | 2 | C |
| 104 | 3 | A |
| 105 | 3 | B |
Third Normal Form (3NF)
Condition: The table should be in 2NF, and there should be no transitive dependency (non-key attributes should not depend on other non-key attributes).
Our tables are already in 3NF. But to demonstrate 3NF, let's say, we had a table like below.
Grade Table
| StudentID | CourseNo | Marks | Total Marks | Exam Type |
|---|---|---|---|---|
| 101 | 1 | 80 | 100 | Theory |
| 102 | 1 | 40 | 50 | Practical |
| 103 | 2 | 40 | 50 | Practical |
| 104 | 3 | 85 | 100 | Theory |
| 105 | 3 | 90 | 100 | Theory |
Here, a non-prime column Total Marks is dependent on another non-prime column Exam Type. In such cases, we would take both of them and make another table (Exam) and use their primary key ExamId wherever required.
Boyce-Codd Normal Form (BCNF)
Condition: A table is in BCNF if it is in 3NF, and every determinant is a candidate key.
Our tables already meet BCNF standards. However, consider an Employee table:
Employee Table
| EmployeeID | Name | DeskNo | |
|---|---|---|---|
| 101 | Tom | [email protected] | 3 |
| 102 | Bob | [email protected] | 4 |
Here, DeskNo is a non-key attribute but uniquely identifies EmployeeID, which is not allowed in BCNF.
To correct this: Split into two tables: Employee (EmployeeID, Name, Email) and DeskAssignment (EmployeeID, DeskNo).
Denormalization
Denormalization involves combining normalized tables to improve read performance at the expense of write performance and storage efficiency. This technique is often used in data warehousing and reporting systems.
Normalized Tables
Customer Table
| CustomerId | Name |
|---|---|
| 101 | Tom |
| 102 | Bob |
Order Table
| OrderId | Date | Amount | Item | CustomerId |
|---|---|---|---|---|
| 1 | 2021-01-01 | 70000 | Iphone | 101 |
| 2 | 2021-02-02 | 50000 | Ipad | 102 |
If we need some data from Customer Table and some from Order Table, we need to perform join operation like below.
Denormalized
Order Table
| OrderId | CustomerId | Name | Date | Amount | Item |
|---|---|---|---|---|---|
| 1 | 101 | Tom | 2021-01-01 | 70000 | Iphone |
| 2 | 102 | Bob | 2021-02-02 | 50000 | Ipad |
Benefits
Improved Read Performance: By combining tables, the need for joins is eliminated, which can significantly speed up read operations, especially for complex queries or large datasets.
Simplified Queries: Queries become simpler because they no longer need to join multiple tables, making it easier to write and understand SQL statements.
Reduced Latency: With fewer joins, there's less processing required to fetch the data, reducing the overall query latency.
Trade-offs
Increased Redundancy: Data is duplicated, as customer information is repeated in every order row. This increases storage requirements.
Higher Risk of Inconsistency: Updates to customer information must be propagated to all rows where that customer appears. This can lead to anomalies if updates are not handled correctly.
More Complex Updates: Any change to the customer data (e.g., address change) must be updated in multiple rows, making update operations more complex and time-consuming.
Denormalization is a strategic decision based on specific use cases, especially where read performance is more critical than write performance and storage efficiency. It’s commonly used in data warehousing, reporting systems, and applications with heavy read operations.