- Overview
- Normalization
- Functional Dependencies
- Normal Forms
- Armstrong's Axioms
- Database Architecture
- Data Organization
- References
Database design involves organizing data to minimize redundancy, ensure data integrity, and optimize performance. This chapter covers normalization theory, which provides a systematic approach to database design, as well as physical storage considerations.
Key Goals:
- Eliminate data redundancy
- Prevent anomalies (insertion, deletion, update)
- Ensure data integrity
- Optimize storage and access performance
Normalization is the process of organizing data in a database to reduce redundancy and dependency. It involves decomposing tables into smaller, well-structured tables without losing information.
Non-normalized databases suffer from several critical issues:
Data Redundancy occurs when the same information is stored in multiple places, leading to:
Insertion Anomaly:
- Cannot insert data without all related information
- Example: Cannot add a new department unless an employee is assigned to it
Deletion Anomaly:
- Deleting one piece of information causes unintended loss of other information
- Example: Deleting the last employee in a department removes all department information
Update Anomaly:
- Changes must be made in multiple places
- Inconsistency if updates are not applied everywhere
- Example: Changing a department name requires updating every employee record in that department
Example of Redundant Table:
| EmployeeID | Name | DepartmentID | DepartmentName | DepartmentLocation |
|---|---|---|---|---|
| 1 | Alice | 10 | Sales | Building A |
| 2 | Bob | 10 | Sales | Building A |
| 3 | Carol | 20 | Engineering | Building B |
Problems:
- "Sales" and "Building A" repeated for every sales employee
- If "Sales" moves to "Building C", must update multiple rows
- Deleting all Sales employees loses department information
Information Loss happens when decomposing a table improperly, leading to:
- Inability to reconstruct original data
- Loss of relationships between data
- Incorrect data representation
Example of Lossy Decomposition:
Original Table:
| StudentID | CourseID | Instructor |
|---|---|---|
| 1 | CS101 | Dr. Smith |
Bad Decomposition: Table 1: (StudentID, Instructor) Table 2: (CourseID, Instructor)
Problem: Cannot determine which student took which course with which instructor.
Dependency Loss occurs when functional dependencies that existed in the original table cannot be enforced in the decomposed tables.
Example:
- Original constraint: "Each course has one instructor"
- After bad decomposition: This constraint cannot be enforced
Functional Dependency (FD) is a relationship between attributes where one set of attributes uniquely determines another set of attributes.
For a relation R, X → Y (X functionally determines Y) means:
- For each value of X in R, there is precisely one value of Y in R
- Whenever two tuples have the same value for X, they must have the same value for Y
Notation: X → Y reads as "X determines Y" or "Y is functionally dependent on X"
Y is fully functionally dependent on X if:
- Y is dependent on X, AND
- Y is not dependent on any proper subset of X
Example:
(StudentID, CourseID) → Grade
This is a full dependency because:
- Grade depends on both StudentID AND CourseID
- Grade does not depend on StudentID alone
- Grade does not depend on CourseID alone
Y is partially dependent on X if:
- Y is dependent on X, BUT
- Y is also dependent on a proper subset of X
Example:
(StudentID, CourseID) → StudentName
This is a partial dependency because:
- StudentName depends on (StudentID, CourseID)
- StudentName also depends on just StudentID
Y is transitively dependent on X if:
- X → Z AND Z → Y
- Therefore X → Y through Z
Example:
StudentID → DepartmentID
DepartmentID → DepartmentName
Therefore: StudentID → DepartmentName (transitive)
Keys are used to enforce functional dependencies:
| Key Type | Role in FD |
|---|---|
| Super Key | Any set of attributes that determines all other attributes |
| Candidate Key | Minimal super key (no proper subset is a super key) |
| Primary Key | Chosen candidate key used for tuple identification |
Making X the key in a relation enforces the dependency that X determines all other attributes (X → Y for all Y).
Functional dependencies ensure:
- Uniqueness: Each X value maps to exactly one Y value
- One-to-One Relationship: Between X and Y values in the relation
Implementation:
- Use PRIMARY KEY constraints
- Use UNIQUE constraints
- Define foreign keys appropriately
Normal forms provide a progression of rules for organizing database tables, each building on the previous.
Definition: All domain values in a relation R are atomic (indivisible).
Requirements:
- No repeating groups
- No arrays or lists in a single cell
- Each cell contains a single value
- All entries in a column are of the same data type
Note: All proper relations are automatically in 1NF due to the definition of the relational model.
Example:
Violation of 1NF:
| StudentID | Name | PhoneNumbers |
|---|---|---|
| 1 | Alice | 555-1234, 555-5678 |
Corrected to 1NF:
| StudentID | Name | PhoneNumber |
|---|---|---|
| 1 | Alice | 555-1234 |
| 1 | Alice | 555-5678 |
Definition: R is in 2NF if:
- R is in 1NF, AND
- Every non-key attribute is fully functionally dependent on the key (no partial dependencies)
Target: Eliminate partial dependencies
Process:
- Identify partial dependencies
- Decompose table to remove them
Example:
Violation of 2NF:
Enrollment(StudentID, CourseID, StudentName, CourseName, Grade)
Primary Key: (StudentID, CourseID)
Partial Dependencies:
- StudentID → StudentName
- CourseID → CourseName
Corrected to 2NF:
Student(StudentID, StudentName)
Course(CourseID, CourseName)
Enrollment(StudentID, CourseID, Grade)Definition: R is in 3NF if:
- R is in 2NF, AND
- Every non-key attribute is non-transitively dependent on the key
Target: Eliminate transitive dependencies
Process:
- Identify transitive dependencies
- Decompose table to remove them
Example:
Violation of 3NF:
Employee(EmployeeID, DepartmentID, DepartmentName)
Transitive Dependency:
- EmployeeID → DepartmentID
- DepartmentID → DepartmentName
- Therefore: EmployeeID → DepartmentName (transitive)
Corrected to 3NF:
Employee(EmployeeID, DepartmentID)
Department(DepartmentID, DepartmentName)Definition: R is in BCNF if:
- R is in 3NF, AND
- Every determinant is a candidate key
Determinant: A set of attributes in R on which some other attribute is fully functionally dependent.
BCNF is stricter than 3NF:
- 3NF allows non-key attributes to determine other non-key attributes
- BCNF requires every determinant to be a candidate key
Example:
Violation of BCNF (but in 3NF):
CourseSchedule(StudentID, Course, Instructor)
Primary Key: (StudentID, Course)
Functional Dependencies:
- (StudentID, Course) → Instructor (satisfies 3NF)
- Instructor → Course (violates BCNF, Instructor not a candidate key)
Corrected to BCNF:
StudentInstructor(StudentID, Instructor)
InstructorCourse(Instructor, Course)| Normal Form | Eliminates | Key Requirement |
|---|---|---|
| 1NF | Repeating groups | Atomic values |
| 2NF | Partial dependencies | Full dependency on key |
| 3NF | Transitive dependencies | Non-transitive dependency on key |
| BCNF | All non-candidate key determinants | Every determinant is a candidate key |
Progression:
1NF ⊂ 2NF ⊂ 3NF ⊂ BCNF
Every table in BCNF is also in 3NF, 2NF, and 1NF.
Armstrong's Axioms are a set of inference rules for deriving functional dependencies. They are sound (only valid FDs are derived) and complete (all valid FDs can be derived).
Rule: If Y is a subset of X, then X → Y
Example:
If X = {StudentID, Name}, Y = {StudentID}
Then X → Y (trivial dependency)
Application: Trivial dependencies always hold
Rule: If X → Y, then WX → WY (for any W)
Example:
If StudentID → Name
Then (StudentID, CourseID) → (Name, CourseID)
Application: Adding the same attributes to both sides preserves the dependency
Rule: If X → Y and Y → Z, then X → Z
Example:
If StudentID → DepartmentID
And DepartmentID → DepartmentName
Then StudentID → DepartmentName
Application: Dependencies can be chained
From the primary rules, additional useful rules can be derived:
Rule: If X → Y and X → Z, then X → YZ
Proof:
- X → Y (given)
- X → Z (given)
- X → XY (augmentation of 1)
- XY → YZ (augmentation of 2)
- X → YZ (transitivity of 3 and 4)
Rule: If X → YZ, then X → Y and X → Z
Proof:
- X → YZ (given)
- YZ → Y (reflexivity)
- X → Y (transitivity of 1 and 2)
- Similarly, X → Z
Rule: If X → Y and WY → Z, then WX → Z
Application: Useful for more complex dependency inference
Armstrong's Axioms are used to:
- Derive all functional dependencies from a given set
- Determine candidate keys
- Verify normal forms
- Optimize database schemas
Database architecture describes how data is physically stored and accessed.
Modern databases use a hierarchy of storage technologies:
| Storage Type | Speed | Capacity | Volatility | Cost/GB |
|---|---|---|---|---|
| CPU Registers | Fastest | Bytes-KB | Volatile | Highest |
| CPU Cache (L1/L2/L3) | Very Fast | KB-MB | Volatile | Very High |
| DRAM (RAM) | Fast | GB | Volatile | High |
| SSD | Medium | GB-TB | Non-volatile | Medium |
| HDD | Slow | TB | Non-volatile | Low |
| Tape/Cloud | Very Slow | PB+ | Non-volatile | Lowest |
Characteristics:
- Data Lost Without Power
- Fast access (nanoseconds)
- Limited capacity
- Expensive
Usage:
- Cache frequently accessed data ("hot data")
- Buffer pool
- Query execution workspace
Characteristics:
- Data is Safe after power loss
- Slower access (milliseconds)
- Large capacity
- Inexpensive
Usage:
- Permanent data storage
- Transaction logs
- Database files
The database manages data movement between storage tiers:
┌─────────────────────┐
│ Hot Data (RAM) │ ← Frequently accessed
├─────────────────────┤
│ Warm Data (SSD) │ ← Moderately accessed
├─────────────────────┤
│ Cold Data (HDD) │ ← Rarely accessed
└─────────────────────┘
Strategies:
- Least Recently Used (LRU)
- Most Frequently Used (MFU)
- Clock algorithms
Main Memory Access:
- Latency: ~30 nanoseconds (10^-7 seconds)
- Bandwidth: ~GB/s per channel
Disk Access:
- Latency: ~10 milliseconds (10^-2 seconds)
- Seek time: 3-8 ms
- Rotational delay: 2-3 ms
- Transfer time: 0.5-1 ms
Performance Gap:
- Main memory is approximately 100,000 times faster than disk
- This gap drives database architecture decisions
How data is organized on disk significantly impacts performance.
Modern hard drives consist of:
Components:
- Platters: Circular disks with magnetic coating
- Tracks: Concentric circles on platter surfaces
- Sectors: Fixed-size segments of tracks (typically 512 bytes or 4KB)
- Cylinders: Collection of tracks at the same position across platters
- Read-Write Heads: Magnetic sensors for each platter surface
- Actuator: Moves heads across tracks
Blocks:
- Unit of information transferred between disk and main memory
- Typically 4KB, 8KB, or larger
- Multiple records stored in a single block
- Files consist of multiple blocks connected by address pointers
Characteristics:
- Records stored in no particular order
- New records appended to end of file
- Simplest organization
Performance:
| Operation | Complexity | Notes |
|---|---|---|
| Insert | O(1) | Append to end |
| Search | O(N/2) average | Must scan all blocks |
| Delete | O(N/2) | Find then remove |
| Update | O(N/2) | Find then modify |
Use Cases:
- Small tables
- Tables with mostly sequential scans
- Temporary tables
Characteristics:
- Records ordered by a key attribute
- Insertion requires maintaining order
- Binary search possible
Performance:
| Operation | Complexity | Notes |
|---|---|---|
| Insert | O(N) | Must maintain order |
| Search | O(log N) | Binary search |
| Range Query | O(log N + K) | K = result size |
Advantages:
- Fast searches on sort key
- Efficient range queries
- Sequential access benefits
Disadvantages:
- Expensive insertions
- Only one sort order
Indexes provide fast access paths to data without requiring full table scans.
Structure:
- One index entry per data block
- Index stores: (first key in block, pointer to block)
- Requires data to be sorted
Performance:
- Search: O(log F) where F = number of blocks
- Space: Minimal index size
Structure:
- One index entry per record
- Index stores: (key value, pointer to record)
- Requires data to be sorted
Performance:
- Search: O(log N) where N = number of records
- Space: Larger index size
Structure:
- Built on non-key attributes
- Must be dense (one entry per record)
- Data doesn't need to be sorted on this attribute
Use Cases:
- Point queries on non-key attributes
- Attributes frequently used in WHERE clauses
Concept: Create an index on the index
Structure:
┌──────────────────────┐
│ Top-Level Index │ ← Smallest, fits in memory
├──────────────────────┤
│ Second-Level Index │
├──────────────────────┤
│ First-Level Index │ ← Dense or sparse
├──────────────────────┤
│ Data Blocks │
└──────────────────────┘
Performance:
- Search: O(log_F N) where F = fanout
- Reduces number of disk accesses significantly
B+ Trees are the most common index structure in databases.
Characteristics:
- Balanced: All leaves at same depth
- Sorted: Keys in order within nodes
- High Fanout: Each node contains many keys
- Leaf Links: Leaf nodes linked for range scans
Advantages:
- Guaranteed logarithmic search time
- Efficient insertions and deletions
- Range queries via leaf traversal
- Self-balancing
Structure:
- Internal Nodes: Contain keys and pointers to children
- Leaf Nodes: Contain keys and pointers to data (or data itself)
- Order N: Each node contains between ⌈N/2⌉ and N children
Performance:
- Search: O(log_F N)
- Insert: O(log_F N)
- Delete: O(log_F N)
- Range Query: O(log_F N + K) where K = result size
Hashing provides constant-time average access for point queries.
Structure:
- Hash function: h(key) = bucket number
- Each bucket contains one or more data blocks
- Overflow blocks for collisions
Hash Function Properties:
- Uniform distribution of values
- Minimal collisions
- Efficient computation
Performance:
- Search: O(1) average, O(N) worst case
- Insert: O(1) average
- Range Query: O(N) (no ordering)
Limitations:
- Fixed number of buckets
- Performance degrades as data grows
- No support for range queries
Approaches:
- Extendible Hashing: Adjust directory size dynamically
- Linear Hashing: Grow buckets incrementally
Advantages:
- Adapts to data growth
- Maintains constant-time access
- No full reorganization needed
Course Materials:
- CS 6400: Database Systems Concepts and Design - Georgia Tech OMSCS
