Designing Data-Intensive Applications

A deep discussion of how to design data-intensive applications.

  • the SLAs we have to achieve
  • the trade-offs we have to make
  1. Reliable, Scalable, and Maintainable Applications
    1. reliability
      1. hardware faults
        1. multiple instances
        2. RAID
        3. hot-swappable CPUs
      2. software errors
      3. human errors
    2. scalability
      1. system load
      2. performance
    3. maintainability
      1. operability
      2. simplicity
      3. evolvability
  2. data models and query languages
    1. relational model vs. document model
    2. declarative queries and map-reduce querying
  3. storage and retrieval
    1. data structures
      1. hash indexes
      2. SSTables and LSM-Trees
      3. B-Trees
    2. OLTP vs. OLAP
    3. column-oriented storage

specifics

LSM Tree

  • Write-Optimized: LSM trees are designed to handle a high rate of writes. Instead of writing data directly to the main storage, writes are first buffered in memory (in a structure called a MemTable). This allows for fast, sequential writes.
  • Compaction: Periodically, the data in the MemTable is flushed to disk in the form of immutable files called SSTables (Sorted String Tables). These files are then merged and compacted in the background to maintain efficiency and reduce fragmentation.
  • Read Performance: Reads in an LSM tree can be slower than writes because data might be spread across multiple SSTables. To mitigate this, LSM trees often use Bloom filters and other indexing techniques to quickly determine which SSTables contain the relevant data.
  • Levels: LSM trees organize SSTables into levels. New SSTables are initially placed in the first level. As more data is written and compaction occurs, SSTables are merged and moved to higher levels. Each successive level has a larger capacity, and merging data downward keeps the number of files a read must check bounded, which helps balance read and write performance.
  • Use Cases: LSM trees are commonly used in NoSQL databases like Apache Cassandra, HBase, and LevelDB. They are well-suited for applications that require high write throughput and can tolerate slightly slower reads.
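
A minimal sketch of the write and read path described above, in Python (illustrative only; the class and method names are made up): writes land in a MemTable, a full MemTable is flushed as a sorted immutable segment, and reads check the MemTable first and then segments from newest to oldest.

# Minimal LSM-tree sketch: MemTable plus flushed, sorted segments.
# Real engines add a write-ahead log, Bloom filters, and background compaction.
class LSMTree:
    def __init__(self, memtable_limit=4):
        self.memtable = {}       # in-memory buffer for recent writes
        self.sstables = []       # immutable, sorted segments (oldest first)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:                 # newest data first
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # then newest segment to oldest
            if key in sstable:
                return sstable[key]
        return None

    def _flush(self):
        # Write the MemTable out as a sorted, immutable segment.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        # Merge all segments; newer values overwrite older ones.
        merged = {}
        for sstable in self.sstables:
            merged.update(sstable)
        self.sstables = [dict(sorted(merged.items()))]

tree = LSMTree()
for i in range(10):
    tree.put(f"key{i}", i)
assert tree.get("key3") == 3   # found in a flushed segment, not the MemTable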

reliability

A reliable (fault-tolerant) system keeps working correctly even when things go wrong:

  • the application performs the function that the user expected
  • it can tolerate the user making mistakes or using the software in unexpected ways
  • its performance is good enough for the required use case, under the expected load and data volume
  • the system prevents any unauthorized access and abuse

storage & retrieval

  • OLTP vs. OLAP
  • row-based vs. column-based storage
  • different types of indexes
  • read-heavy vs. write-heavy workloads
  • B-Tree vs. LSM Tree
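
A tiny illustration (hypothetical data) of why column-oriented storage suits OLAP: an aggregate over one column only has to touch that column's values, whereas a row store must scan whole rows.

# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "user_id": 7, "amount": 30.0},
    {"order_id": 2, "user_id": 9, "amount": 12.5},
]

# Column-oriented layout: each column is stored contiguously, so
# SUM(amount) only reads one array (and it compresses much better).
columns = {
    "order_id": [1, 2],
    "user_id": [7, 9],
    "amount": [30.0, 12.5],
}

assert sum(r["amount"] for r in rows) == sum(columns["amount"])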

Encoding

  • JSON, XML, CSV
  • BSON, MessagePack, …
  • Thrift, Protobuf
    • schema file
  • Avro
    • supports default values and union types, e.g. union { null, int, string }
    • reader/writer schema resolution
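
A sketch of reader/writer schema resolution (schemas shown as Python dicts; real Avro schemas are JSON, and the record and field names here are made up): data written with the old schema can still be decoded by a reader using the new schema, because the missing field is filled in from its default.

# Writer schema: what actually produced the data on disk / on the wire.
writer_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}

# Reader schema: what the consumer expects. Fields present only in the
# writer schema are skipped; fields present only in the reader schema are
# filled in from their defaults. For a union, the default must match the
# first branch ("null" here).
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "nickname", "type": ["null", "string"], "default": None},
    ],
}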

replication

  • leaderless
    • Voldemort, Cassandra, Riak; CRDTs
    • eventual consistency
    • causality
  • single-leader
    • consistency models
    • MySQL, PostgreSQL, MongoDB, RethinkDB, SQL Server
  • multi-leader
    • Google Docs, CouchDB, Etherpad
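
For leaderless replication, the usual quorum rule is: with n replicas, require w acknowledgements for a write and r responses for a read; if w + r > n, every read set overlaps every write set, so at least one replica returns the latest (versioned) value. A minimal check:

# Quorum condition for Dynamo-style leaderless replication.
def quorum_ok(n, w, r):
    return w + r > n

assert quorum_ok(n=3, w=2, r=2)       # common setting: tolerate one slow node
assert not quorum_ok(n=3, w=1, r=1)   # reads may miss the latest write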

why replicate (pros):

  • keep data geographically close to users – reduce latency
  • tolerate partial failures – increase availability
  • scale out the number of machines that can serve reads – increase read throughput

leader failure

  • determining that the leader has failed
  • choosing a new leader
  • reconfiguring the system to use the new leader
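
Detecting the failure is usually timeout-based. A rough sketch (a hypothetical class, not any particular system's API): nodes send periodic heartbeats, and a node that has been silent for longer than the timeout is suspected dead, which triggers the election and reconfiguration steps above.

import time

class FailureDetector:
    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_heartbeat = {}   # node id -> time of last heartbeat

    def heartbeat(self, node_id):
        self.last_heartbeat[node_id] = time.monotonic()

    def suspected_dead(self, node_id):
        last = self.last_heartbeat.get(node_id)
        return last is None or time.monotonic() - last > self.timeout

detector = FailureDetector(timeout=10.0)
detector.heartbeat("leader-1")
if detector.suspected_dead("leader-1"):
    pass  # choose a new leader (e.g. via consensus) and repoint clients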

statement-based replication pitfalls

  • any statement that calls a nondeterministic function, e.g. NOW() or random()
  • auto-incrementing columns (statements must replay in exactly the same order on every replica)
  • statements that have side effects (triggers, stored procedures, user-defined functions)
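
A toy illustration of the first pitfall (made-up code, not SQL): with statement-based replication each replica re-evaluates the nondeterministic expression, so results diverge; row-based (logical) replication ships the value the leader actually wrote.

import random

def statement():
    # stands in for something like INSERT ... VALUES (random())
    return random.random()

leader_value = statement()
follower_value = statement()   # almost certainly differs from leader_value

# Row-based replication instead ships the evaluated value, so every
# follower stores exactly what the leader stored.
replicated_row = {"value": leader_value}
assert replicated_row["value"] == leader_value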

replication lag guarantees

  • read-your-own-writes (serve the user's own recent modifications from the leader)
  • monotonic reads (route each user's reads to the same replica)
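
A sketch of how a routing layer might provide both guarantees (hypothetical class and parameters): pin a user's reads to the leader for a short window after their own write, and otherwise hash the user to a fixed follower so their reads never go backwards in time.

import time

class ReadRouter:
    def __init__(self, replicas, leader, grace_period=5.0):
        self.replicas = replicas            # follower node names
        self.leader = leader
        self.grace_period = grace_period    # how long to pin a writer to the leader
        self.last_write_at = {}             # user id -> time of their last write

    def record_write(self, user_id):
        self.last_write_at[user_id] = time.monotonic()

    def pick_replica(self, user_id):
        # Read-your-own-writes: shortly after a user writes, serve their
        # reads from the leader, which definitely has the write.
        last = self.last_write_at.get(user_id)
        if last is not None and time.monotonic() - last < self.grace_period:
            return self.leader
        # Monotonic reads: always hash the same user to the same follower.
        return self.replicas[hash(user_id) % len(self.replicas)]

router = ReadRouter(replicas=["follower-1", "follower-2"], leader="leader")
router.record_write("alice")
assert router.pick_replica("alice") == "leader"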

multi-leader

Clients with offline operation (e.g. calendar apps): every device has a local database that acts as a leader and accepts writes, and an asynchronous multi-leader replication process syncs the replicas of your calendar across all of your devices.

conflict resolution

  • avoid conflicts
  • last write wins
  • merge conflict resolution
  • interactive conflict resolution (prompt the user; synchronous and inefficient)
  • CRDTs (conflict-free replicated data types)
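
A minimal CRDT example (a grow-only counter, sketched from its standard definition): each replica only increments its own slot, and merging takes the per-replica maximum, so replicas converge no matter in which order updates and merges arrive.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}   # replica id -> its local count

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        # Element-wise maximum: commutative, associative, idempotent.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5   # both replicas converge without conflict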

partition (sharding)

  • hotspot – rebalancing
  • request routing
    • Redis Cluster (inter-node communication; nodes redirect clients to the right owner)
    • routing proxy
    • cluster-aware client that maintains cluster-node metadata
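
A sketch of hash partitioning plus a routing table (the partition count, node names, and hash choice are arbitrary): the key's hash picks one of a fixed number of partitions, and a routing proxy or cluster-aware client looks up which node currently owns that partition; rebalancing only moves whole partitions between nodes.

import hashlib

NUM_PARTITIONS = 8
# Partition -> node assignment; this is what changes during rebalancing.
routing_table = {p: f"node-{p % 3}" for p in range(NUM_PARTITIONS)}

def partition_for(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def node_for(key):
    return routing_table[partition_for(key)]

print(node_for("user:42"))   # the node that owns this key's partition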

transactions

  • ACID vs. BASE
  • 2PC
  • 3PC
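
A bare-bones two-phase commit sketch (hypothetical in-memory participants; a real implementation must persist every decision before acting on it): the coordinator first asks all participants to prepare, and only if every vote is yes does it send commit, otherwise it aborts everyone.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: promise to commit if asked.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):   # phase 1: collect votes
        for p in participants:                   # phase 2: commit everywhere
            p.commit()
        return "committed"
    for p in participants:                       # any "no" (or timeout) aborts all
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("orders-db"), Participant("email-queue")]))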

distributed systems

every request can succeed, fail, or time out – and after a timeout you cannot tell whether it was actually processed

how to determine whether a node is dead

  • unreliable networks
  • unreliable clocks
    • Lamport timestamps
  • lease expiry
    • fencing tokens
  • Byzantine fault tolerance
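
Because wall clocks are unreliable, ordering is usually done with logical clocks instead. A minimal Lamport timestamp sketch: increment on every local event, attach the counter to outgoing messages, and on receipt jump to max(local, received) + 1, giving an order that respects causality.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time   # timestamp attached to the outgoing message

    def receive(self, message_time):
        self.time = max(self.time, message_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()              # a's send happens-before b's receipt...
assert b.receive(t) > t   # ...so b's timestamp is strictly larger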

consensus algorithms

  • CAP theorem

what to do

  1. figure out which tools and which approaches are the most appropriate for the task at hand
    1. database
    2. index
    3. cache

problems

encoding

how those formats are used for data storage and for communication

  • backward compatibility
  • forward compatibility

Existing fields (and their tag numbers) cannot be deleted or reused; new fields can only be added, each with a fresh tag number, so old and new code can still read each other's data.

// version 1
message Request {
  int64 orderId = 1;
  int64 userId = 2;
}

// version 2: existing field numbers stay the same; the new field gets a
// fresh tag number, so old readers skip it and new readers treat it as
// missing in old data
message Request {
  int64 orderId = 1;
  int64 userId = 2;

  // new parameter
  string comment = 3;
}

misc

How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?

quotes

In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log).
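
A rough sketch of the WAL idea in that quote (hypothetical helper functions, with a plain dict standing in for the B-tree): append and fsync the intended change to the log before touching the data structure, so that after a crash the log can be replayed to redo the changes.

import json
import os
import tempfile

def apply_with_wal(wal_path, database, key, value):
    record = json.dumps({"key": key, "value": value})
    with open(wal_path, "a") as wal:
        wal.write(record + "\n")
        wal.flush()
        os.fsync(wal.fileno())   # the change is durable before pages are modified
    database[key] = value        # now mutate the actual data structure

def recover(wal_path, database):
    # After a crash, replay the log to redo any changes that may have been lost.
    if not os.path.exists(wal_path):
        return
    with open(wal_path) as wal:
        for line in wal:
            record = json.loads(line)
            database[record["key"]] = record["value"]

wal_path = os.path.join(tempfile.gettempdir(), "ddia-example.wal")
db = {}
apply_with_wal(wal_path, db, "k1", "v1")

crashed_db = {}
recover(wal_path, crashed_db)
assert crashed_db["k1"] == "v1"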
