ChebbiOS

Databases: A Backend Engineer's Guide

#System Design#Databases#Backend#SQL#NoSQL

As a backend engineer, the database is often the heaviest component in your stack. It’s the source of truth, the bottleneck, and the most difficult piece to scale. This guide covers the mental model you need to choose and use them effectively.

1. The Landscape

We can broadly categorize databases into Relational (SQL) and Non-Relational (NoSQL).

Relational (SQL)

Examples: PostgreSQL, MySQL, SQLite.

  • Structure: Data is stored in tables with strict schemas. Relationships are defined via Foreign Keys.
  • Query Language: SQL (Structured Query Language). Powerful for complex joins and aggregations.
  • Best For: Most general-purpose applications (e-commerce, CMS, user data) where data integrity and structure are paramount.

NoSQL

NoSQL isn’t one thing; it’s a family of databases optimized for specific access patterns.

  1. Key-Value Stores (e.g., Redis, DynamoDB)

    • Model: A giant hash map. get(key) and set(key, value).
    • Use Case: Caching, Session management, Shopping carts. Extremely fast O(1) lookups.
  2. Document Stores (e.g., MongoDB)

    • Model: JSON-like documents. Flexible schema.
    • Use Case: Content management, catalogs, rapid prototyping where the schema evolves quickly.
  3. Columnar Stores (e.g., Cassandra, ClickHouse)

    • Model: Data stored by column rather than row.
    • Use Case: Analytics, Time-series, Logging. Great for aggregating millions of rows (e.g., “calculate average temperature”) but expensive for retrieving single records.
  4. Graph Databases (e.g., Neo4j)

    • Model: Nodes and Edges.
    • Use Case: Social networks, Recommendation engines, Fraud detection.

2. Core Concepts

Transactions (ACID)

When money is involved, you need guarantees. ACID is the gold standard.

  • Atomicity: All or nothing. If a transaction fails halfway, it is rolled back completely.
  • Consistency: The database moves from one valid state to another (respecting constraints).
  • Isolation: Concurrent transactions don’t interfere with each other (depending on the isolation level).
  • Durability: Once committed, data is saved even if the power goes out.

Indexing

An index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space.

  • B-Tree: The default for most SQL databases. Balanced tree structure. Great for range queries (WHERE age > 21) and equality.
  • LSM Tree (Log-Structured Merge-Tree): Used by Cassandra, RocksDB. Optimized for heavy write loads by appending data to a log and merging later.
  • Hash Index: O(1) lookups but supports only exact matches (no ranges).

B-Tree vs BST

Why do databases use B-Trees instead of the simpler Binary Search Tree (BST)? It’s all about minimizing trips to the disk.

Think of your hard drive like a library, and memory (RAM) like your desk.

  • Disk Access is Slow: Walking to the library takes time.
  • Memory Access is Fast: Grabbing a book from your desk is instant.

The Problem with BST (Binary Search Tree)

A BST is like a game of “Higher or Lower”. You ask “Is it 50?”, it says “Lower”. You ask “Is it 25?”, it says “Higher”. In a database with millions of rows, you might have to ask this question 20 times. That means 20 trips to the library.

The Solution: B-Tree

A B-Tree is like a Card Catalog or a Phone Book. Instead of asking one question at a time, you open a drawer and see: “A-D are in Row 1, E-H are in Row 2…”. With one glance (one disk read), you eliminate massive chunks of data. You find your book in just 2 or 3 trips.

B-Tree vs BST

Why do databases use B-Trees instead of the simpler Binary Search Tree (BST)? It’s all about minimizing trips to the disk.

B-tree

7
16
1 2 5 6
9 12
18 21
Average Worst Case
SpaceO(n)O(n)
SearchO(log n)O(log n)
InsertO(log n)O(log n)
DeleteO(log n)O(log n)

B-tree is optimized for systems that read and write large blocks of data. It is commonly used in databases and file systems.

Binary Search Tree

8
3
10
1
6
14
Average Worst Case
SpaceO(n)O(n)
SearchO(log n)O(n)
InsertO(log n)O(n)
DeleteO(log n)O(n)

The major advantage of binary search trees over other data structures is that the related sorting algorithms and search algorithms such as in-order traversal can be very efficient.

Concrete Example

Imagine querying a table with 10 Million rows.

  • BST: Might require ~24 disk reads.
    • Time: 24 x 10ms = 240ms (Per query!)
  • B-Tree: Usually requires 3 disk reads.
    • Time: 3 x 10ms = 30ms

This is why B-Trees are the default for PostgreSQL, MySQL, and Oracle.


3. Scaling Strategies

Scale means handling more users or more data. We have two main knobs to turn.

1. Replication (Read Scaling)

Your app basically reads way more than it writes (e.g., Twitter, Instagram).

  • The Srategy: Have one Leader (Master) that takes writes, and multiple Followers (Slaves) that copy the data.
  • The Win: You can add infinite Followers to serve millions of users reading data.
  • The Cost: Replication Lag. You might post a tweet, refresh the page, and not see it immediately because the Follower hasn’t synced yet.

2. Sharding (Write Scaling)

Your app writes so much data that a single disk fills up (e.g., WhatsApp logs, Analytics).

  • The Strategy: Split the data across multiple servers. Users A-M go to Shard 1, N-Z go to Shard 2.
  • The Win: Infinite write capacity. Storage is limitless.
  • The Cost: Queries that need data from both shards (e.g., “Find all friends of User A”) become insanely complex and slow. Avoid sharding until you absolutely must.

4. How to Choose?

Don’t overthink it. Use this cheat sheet.

ProblemSolutionExample DB
Default ChoiceSQL (Relational)PostgreSQL, MySQL
High CachingKey-ValueRedis, DynamoDB
Unstructured / Rapid DevDocumentMongoDB
Analytics / LogsColumnarCassandra, ClickHouse
Social / RecommendationsGraphNeo4j

Rule of Thumb: Start with PostgreSQL. It supports SQL, JSON (NoSQL-like), and works great as a simple queue. Only move when you hit a specific bottleneck.