# Normalization in DBMS
## Anomalies in Databases
Databases often face issues known as anomalies, which can lead to data inconsistencies and integrity problems. Normalization helps resolve these anomalies by organizing data efficiently.
### Types of Anomalies

- **Insertion Anomaly:** New data cannot be added without also supplying other, unrelated information.
- **Update Anomaly:** A change to one fact must be applied in multiple places, risking inconsistencies.
- **Deletion Anomaly:** Removing one piece of data unintentionally deletes other valuable data.
Let's illustrate these anomalies using an example:
### Unnormalized Table

| StudentID | StudentName | Major | CourseNo | CourseName | InstructorName | Grade |
|---|---|---|---|---|---|---|
| 101 | Alice | CS | 1 | Math | Mr. Smith | A |
| 102 | Bob | EE | 1 | Math | Mr. Smith | B |
| 103 | Charlie | CS | 2 | English | Mrs. Johnson | C |
| 104 | David | ME | 3 | Physics | Mr. Brown | A |
| 105 | Eve | EE | 3 | Physics | Mr. Brown | B |
- **Insertion Anomaly:** A new course cannot be added until at least one student enrolls in it.
- **Update Anomaly:** Changing Mr. Brown's name requires updates in multiple rows; missing any of them leaves the data inconsistent.
- **Deletion Anomaly:** If Charlie drops English, deleting his row also removes the only record of the English course and Mrs. Johnson.
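These anomalies can be seen directly in SQL. Below is a sketch against a hypothetical `Enrollment` table holding all the columns of the unnormalized table (the table and column names are assumptions for illustration):

```sql
-- Hypothetical single-table schema matching the unnormalized example.
CREATE TABLE Enrollment (
    StudentID      INT,
    StudentName    VARCHAR(50),
    Major          VARCHAR(10),
    CourseNo       INT,
    CourseName     VARCHAR(50),
    InstructorName VARCHAR(50),
    Grade          CHAR(1)
);

-- Update anomaly: the instructor's name is stored once per enrollment,
-- so a rename must touch every matching row. Missing one row leaves
-- the database inconsistent.
UPDATE Enrollment
SET InstructorName = 'Dr. Brown'
WHERE InstructorName = 'Mr. Brown';

-- Deletion anomaly: dropping Charlie's only enrollment also erases the
-- fact that Course 2 (English) is taught by Mrs. Johnson.
DELETE FROM Enrollment
WHERE StudentID = 103 AND CourseNo = 2;
```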
## Normalization
Normalization is the process of structuring a relational database to minimize redundancy and improve data integrity. It involves organizing data into multiple related tables.
### First Normal Form (1NF)

**Condition:** Each cell contains only atomic (indivisible) values.

Below is an example of a table that is not in 1NF:
| StudentID | StudentName | PhoneNo |
|---|---|---|
| 101 | Alice | 9090909090, 8080808080 |
To convert it to 1NF:
| StudentID | StudentName | PhoneNo |
|---|---|---|
| 101 | Alice | 9090909090 |
| 101 | Alice | 8080808080 |
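In SQL terms, the 1NF version can be enforced with a composite primary key so each (student, phone) pair gets its own row. A sketch, with assumed column types and an assumed table name:

```sql
-- One row per phone number; the composite key rules out storing a
-- comma-separated list in a single cell.
CREATE TABLE StudentPhone (
    StudentID   INT         NOT NULL,
    StudentName VARCHAR(50) NOT NULL,
    PhoneNo     VARCHAR(15) NOT NULL,
    PRIMARY KEY (StudentID, PhoneNo)
);

INSERT INTO StudentPhone VALUES
    (101, 'Alice', '9090909090'),
    (101, 'Alice', '8080808080');
```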
### Second Normal Form (2NF)

**Condition:** The table is in 1NF, and every non-key attribute is fully dependent on the whole primary key.

In the original example, the primary key is the composite (StudentID, CourseNo), but CourseName and InstructorName depend only on CourseNo — a partial dependency.
To resolve this:
**Student Table**

| StudentID | StudentName | Major |
|---|---|---|
| 101 | Alice | CS |
| 102 | Bob | EE |
| 103 | Charlie | CS |
| 104 | David | ME |
| 105 | Eve | EE |

**Course Table**

| CourseNo | CourseName | InstructorNo |
|---|---|---|
| 1 | Math | 1 |
| 2 | English | 2 |
| 3 | Physics | 3 |

**Instructor Table**

| InstructorNo | InstructorName |
|---|---|
| 1 | Mr. Smith |
| 2 | Mrs. Johnson |
| 3 | Mr. Brown |

**Grade Table**

| StudentID | CourseNo | Grade |
|---|---|---|
| 101 | 1 | A |
| 102 | 1 | B |
| 103 | 2 | C |
| 104 | 3 | A |
| 105 | 3 | B |
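The decomposed schema can be sketched in SQL as follows (column types and the use of inline foreign-key references are assumptions):

```sql
CREATE TABLE Student (
    StudentID   INT PRIMARY KEY,
    StudentName VARCHAR(50),
    Major       VARCHAR(10)
);

CREATE TABLE Instructor (
    InstructorNo   INT PRIMARY KEY,
    InstructorName VARCHAR(50)
);

CREATE TABLE Course (
    CourseNo     INT PRIMARY KEY,
    CourseName   VARCHAR(50),
    InstructorNo INT REFERENCES Instructor (InstructorNo)
);

-- Grade depends on the full composite key (StudentID, CourseNo),
-- so it is the only non-key attribute left in this table.
CREATE TABLE Grade (
    StudentID INT REFERENCES Student (StudentID),
    CourseNo  INT REFERENCES Course (CourseNo),
    Grade     CHAR(1),
    PRIMARY KEY (StudentID, CourseNo)
);
```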
### Third Normal Form (3NF)

**Condition:** The table is in 2NF, and there are no transitive dependencies (non-key attributes must not depend on other non-key attributes).

Our tables are already in 3NF, but to demonstrate the idea, suppose the Grade table looked like this:
**Grade Table**

| StudentID | CourseNo | Marks | Total Marks | Exam Type |
|---|---|---|---|---|
| 101 | 1 | 80 | 100 | Theory |
| 102 | 1 | 40 | 50 | Practical |
| 103 | 2 | 40 | 50 | Practical |
| 104 | 3 | 85 | 100 | Theory |
| 105 | 3 | 90 | 100 | Theory |
Here, the non-prime attribute `Total Marks` depends on another non-prime attribute, `Exam Type` — a transitive dependency. To remove it, move both columns into a separate table (Exam) and reference its primary key `ExamId` wherever required.
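A sketch of that split in SQL, with an assumed `Exam` table keyed by the hypothetical `ExamId`:

```sql
-- Exam type and its total marks get their own table, removing the
-- transitive dependency ExamType -> TotalMarks from the Grade table.
CREATE TABLE Exam (
    ExamId     INT PRIMARY KEY,
    ExamType   VARCHAR(20),  -- e.g. 'Theory' or 'Practical'
    TotalMarks INT
);

-- Revised Grade table: every non-key attribute now depends only on
-- the composite key (StudentID, CourseNo).
CREATE TABLE GradeRecord (
    StudentID INT,
    CourseNo  INT,
    Marks     INT,
    ExamId    INT REFERENCES Exam (ExamId),
    PRIMARY KEY (StudentID, CourseNo)
);
```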
### Boyce-Codd Normal Form (BCNF)

**Condition:** A table is in BCNF if it is in 3NF and every determinant is a candidate key.

Our tables already meet BCNF. However, consider the following `Employee` table:
**Employee Table**

| EmployeeID | Name | Email | DeskNo |
|---|---|---|---|
| 101 | Tom | [email protected] | 3 |
| 102 | Bob | [email protected] | 4 |
Here, each desk is assigned to exactly one employee, so `DeskNo` uniquely determines `EmployeeID`. `DeskNo` is therefore a determinant that is not declared as a candidate key of the table, which violates BCNF.

To correct this, split the table in two: `Employee` (EmployeeID, Name, Email) and `DeskAssignment` (EmployeeID, DeskNo).
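That split can be sketched in SQL as below (column types and constraint choices are assumptions):

```sql
CREATE TABLE Employee (
    EmployeeID INT PRIMARY KEY,
    Name       VARCHAR(50),
    Email      VARCHAR(100)
);

-- One desk per employee and one employee per desk: both columns are
-- unique, so every determinant in this table is a candidate key.
CREATE TABLE DeskAssignment (
    EmployeeID INT PRIMARY KEY REFERENCES Employee (EmployeeID),
    DeskNo     INT UNIQUE
);
```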
## Denormalization
Denormalization involves combining normalized tables to improve read performance at the expense of write performance and storage efficiency. This technique is often used in data warehousing and reporting systems.
### Normalized Tables
**Customer Table**

| CustomerId | Name |
|---|---|
| 101 | Tom |
| 102 | Bob |

**Order Table**

| OrderId | Date | Amount | Item | CustomerId |
|---|---|---|---|---|
| 1 | 2021-01-01 | 70000 | Iphone | 101 |
| 2 | 2021-02-02 | 50000 | Ipad | 102 |
If we need some data from the Customer table and some from the Order table, we must perform a join:

```sql
-- Note: ORDER is a reserved word in SQL, so the table name must be
-- quoted (or renamed, e.g. to Orders) in most dialects.
SELECT "Order".OrderId, Customer.Name, "Order"."Date", "Order".Amount
FROM "Order"
JOIN Customer ON "Order".CustomerId = Customer.CustomerId;
```
### Denormalized

**Order Table**

| OrderId | CustomerId | Name | Date | Amount | Item |
|---|---|---|---|---|---|
| 1 | 101 | Tom | 2021-01-01 | 70000 | Iphone |
| 2 | 102 | Bob | 2021-02-02 | 50000 | Ipad |
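One way to materialize such a denormalized table is to snapshot the joined result into a single wide table. A sketch, assuming the normalized tables above exist (`CREATE TABLE ... AS SELECT` is widely supported, but the exact syntax varies by dialect):

```sql
-- Copy the joined rows into one wide, read-optimized table.
-- "Order" and "Date" are quoted because both are reserved words
-- in many SQL dialects.
CREATE TABLE OrderDenormalized AS
SELECT o.OrderId,
       o.CustomerId,
       c.Name,
       o."Date",
       o.Amount,
       o.Item
FROM "Order" AS o
JOIN Customer AS c ON o.CustomerId = c.CustomerId;
```

Reads against `OrderDenormalized` no longer need a join, at the cost of keeping the copy in sync with the source tables.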
### Benefits

- **Improved Read Performance:** Combining tables eliminates joins, which can significantly speed up read operations, especially for complex queries or large datasets.
- **Simplified Queries:** Queries no longer need to join multiple tables, making SQL statements easier to write and understand.
- **Reduced Latency:** With fewer joins, less processing is required to fetch data, reducing overall query latency.
### Trade-offs

- **Increased Redundancy:** Customer information is repeated in every order row, increasing storage requirements.
- **Higher Risk of Inconsistency:** Updates to customer information must be propagated to every row where that customer appears; missed rows cause anomalies.
- **More Complex Updates:** Any change to customer data (e.g., an address change) must be applied to multiple rows, making update operations more complex and time-consuming.
Denormalization is a strategic decision for specific use cases, especially where read performance matters more than write performance and storage efficiency. It's commonly used in data warehousing, reporting systems, and applications with heavy read workloads.