What Is Data Modeling? Explained Simply With Examples
Data modeling is the process of deciding how to organize and structure data before you build a system. A good data model makes queries fast, code simple, and your database easy to understand. A bad one creates problems that last years. This guide covers the core concepts with real examples.
3
main model types: conceptual, logical, physical
ER
entity-relationship diagrams — standard design tool
3NF
third normal form — standard normalization target
NoSQL
document vs relational — different trade-offs
What Is a Data Model?
Most expensive technical debt
Bad data models are one of the most expensive forms of technical debt in software. Changing a table structure in a production database with millions of rows is painful — requiring migrations, downtime risk, and code changes across the entire application. Getting it right upfront costs hours; fixing it later costs weeks.
A data model defines what data exists, how it's organized, and how pieces relate to each other. Think of it as the blueprint for your database — designed before you write a single SQL CREATE TABLE or Mongoose schema.
The Three Levels of Data Models
Conceptual Model
High-level: what entities exist and how they relate. Audience: business stakeholders. No technical details — just boxes and relationship arrows. Example: "Customers place Orders; Orders contain Products."
Logical Model
More detail: attributes, data types, relationships (1:1, 1:N, N:M). Audience: architects and developers. Still database-agnostic — works for SQL or NoSQL. Example: User has email (string), created_at (datetime), orders (1:N).
Physical Model
Implementation-specific: actual table names, column types, indexes, constraints, partitioning. Audience: database engineers. Ready to generate DDL (CREATE TABLE statements). Example: users table with BIGINT PK, VARCHAR(255) UNIQUE for email, BTREE index on created_at.
Entity-Relationship (ER) Modeling
ER modeling is the standard way to design relational databases. You identify entities (things), attributes (properties), and relationships (how things connect). Every ER model maps directly to SQL tables.
-- Entity: User
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
email VARCHAR(255) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
-- Entity: Product
CREATE TABLE products (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(255) NOT NULL,
price DECIMAL(10,2) NOT NULL,
category_id BIGINT REFERENCES categories(id),
stock INT NOT NULL DEFAULT 0
);
-- Entity: Order (bridges users and products — 1:N from users, M:N with products)
CREATE TABLE orders (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id BIGINT NOT NULL REFERENCES users(id),
status ENUM('pending','paid','shipped','delivered') DEFAULT 'pending',
total DECIMAL(10,2) NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
-- Junction table: Many-to-many (orders ↔ products)
-- One order has many products; one product appears in many orders
CREATE TABLE order_items (
order_id BIGINT REFERENCES orders(id),
product_id BIGINT REFERENCES products(id),
quantity INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL, -- snapshot price at time of purchase
PRIMARY KEY (order_id, product_id)
);Relationships: 1:1, 1:N, M:N
| Item | Relationship Type | Example + Implementation |
|---|---|---|
| One-to-One (1:1) | User has one Profile | Foreign key in profile table: profile.user_id UNIQUE (unique FK enforces 1:1) |
| One-to-Many (1:N) | User has many Orders | Foreign key in child: orders.user_id references users.id (most common relationship) |
| Many-to-Many (M:N) | Orders contain many Products; Products appear in many Orders | Junction/bridge table: order_items(order_id, product_id) with composite PK |
| Self-referential | Category has sub-categories | categories.parent_id REFERENCES categories.id (nullable for root categories) |
Normalization — Eliminating Redundancy
Normalization is the process of structuring a relational database to reduce data redundancy and improve integrity. The three normal forms every developer needs to know:
First Normal Form (1NF) — atomic values, unique rows
Each column contains atomic (indivisible) values — no arrays or comma-separated lists stored in a single field. Each row must be unique. Bad: tags = "javascript,react,node" in one column. Good: separate tags table with a junction table linking posts to tags.
Second Normal Form (2NF) — full functional dependency
1NF + every non-key column depends on the ENTIRE primary key (eliminates partial dependencies in composite-key tables). Bad: order_items table with product_name column — product_name depends only on product_id, not the full (order_id, product_id) composite key. Good: join the products table to get the name.
Third Normal Form (3NF) — no transitive dependencies
2NF + no non-key column depends on another non-key column (eliminates transitive dependencies). Bad: users table with city, state, zip — zip determines city and state, so city/state transitively depend on zip through zip, not directly on user_id. Good: separate zip_codes table with (zip, city, state). Join when needed.
Don't over-normalize for read-heavy workloads
Relational vs Document Data Models
| Item | Relational (SQL) | Document (MongoDB, DynamoDB) |
|---|---|---|
| Data structure | Tables with fixed schemas — every row has same columns | Flexible JSON documents — each document can have different fields |
| Relationships | JOINs between normalized tables | Embedded sub-documents or referenced IDs |
| Schema changes | ALTER TABLE — migrations needed, risky at scale | No schema enforcement — add/remove fields freely (flexible but risky without validation) |
| Query power | Complex JOINs, aggregations, window functions, analytics | Simple lookups very fast; multi-collection JOINs much harder |
| Transactions | ACID transactions across tables — strong consistency | Single-document atomic; multi-document transactions are slower and limited |
| Best for | Relational data, financial systems, reporting, analytics | Hierarchical/nested data, flexible schemas, high write throughput, event logs |
Practical: Designing a Blog Data Model
-- Users
CREATE TABLE users (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
username VARCHAR(50) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
-- Posts
CREATE TABLE posts (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
author_id BIGINT NOT NULL REFERENCES users(id),
title VARCHAR(500) NOT NULL,
slug VARCHAR(500) UNIQUE NOT NULL, -- URL-friendly identifier
body TEXT,
status TEXT CHECK(status IN ('draft','published','archived')) DEFAULT 'draft',
published_at TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW()
);
-- Tags (many-to-many with posts)
CREATE TABLE tags (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
name VARCHAR(100) UNIQUE NOT NULL,
slug VARCHAR(100) UNIQUE NOT NULL
);
CREATE TABLE post_tags (
post_id BIGINT REFERENCES posts(id) ON DELETE CASCADE,
tag_id BIGINT REFERENCES tags(id) ON DELETE CASCADE,
PRIMARY KEY (post_id, tag_id)
);
-- Comments (self-referential for threaded replies)
CREATE TABLE comments (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
post_id BIGINT NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
author_id BIGINT REFERENCES users(id),
parent_id BIGINT REFERENCES comments(id), -- NULL = top-level comment
body TEXT NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
-- Performance indexes for common access patterns
CREATE INDEX idx_posts_author ON posts(author_id);
CREATE INDEX idx_posts_published ON posts(published_at) WHERE status = 'published';
CREATE INDEX idx_posts_slug ON posts(slug);
CREATE INDEX idx_comments_post ON comments(post_id);
CREATE INDEX idx_comments_parent ON comments(parent_id);Common Data Modeling Mistakes
Storing arrays in a single column
Putting comma-separated values (tags="react,node,sql") in one column violates 1NF and makes querying, filtering, and indexing impossible. Always use a junction table for multi-value attributes.
Using strings for foreign keys
Storing username or email as foreign keys instead of numeric IDs causes bloat (VARCHAR vs BIGINT in every child row), makes renaming impossible, and is significantly slower to JOIN. Always use surrogate numeric IDs as PKs.
One giant table for everything
Combining unrelated concepts into one mega-table with nullable columns for different entity types (type="user"|"admin"|"bot") is the EAV anti-pattern. Each distinct entity should be its own table.
Missing indexes on foreign keys
PostgreSQL and MySQL do not automatically index foreign key columns. Every FK that you JOIN or filter on needs an explicit index. Forgetting indexes on high-traffic FK columns causes full table scans that kill performance at scale.