A deep discussion of how to design data-intensive applications.
- the SLAs we have to achieve
- the trade-offs we have to make
- Reliable, Scalable, and Maintainable Applications
- reliability
- hardware faults
- multiple instances
- RAID
- hot-swappable CPUs
- software errors
- human errors
- scalability
- system load
- performance
- maintainability
- operability
- simplicity
- evolvability
- data models and query languages
- relational model vs. document model
- declarative queries and map-reduce querying
- storage and retrieval
- data structures
- hash indexes
- SSTables and LSM-Trees
- B-Trees
- OLTP vs. OLAP
- column-oriented storage
- data structures
LSM-Tree specifics
- Write-Optimized: LSM trees are designed to handle a high rate of writes. Instead of writing data directly to the main storage, writes are first buffered in memory (in a structure called a MemTable). This allows for fast, sequential writes.
- Compaction: Periodically, the data in the MemTable is flushed to disk in the form of immutable files called SSTables (Sorted String Tables). These files are then merged and compacted in the background to maintain efficiency and reduce fragmentation.
- Read Performance: Reads in an LSM tree can be slower than writes because data might be spread across multiple SSTables. To mitigate this, LSM trees often use Bloom filters and other indexing techniques to quickly determine which SSTables contain the relevant data.
- Levels: LSM trees organize SSTables into levels. New SSTables are initially placed in the first level. As more data is written and compaction occurs, SSTables are merged and moved to higher levels. Each level has a larger capacity and fewer SSTables, which helps balance read and write performance.
- Use Cases: LSM trees are commonly used in NoSQL databases like Apache Cassandra, HBase, and LevelDB. They are well-suited for applications that require high write throughput and can tolerate slightly slower reads.
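The write/flush/read path described above can be sketched as a toy LSM tree. This is illustrative only (class and parameter names are made up); real engines add a WAL, bloom filters, tombstones, and leveled compaction:

```python
import bisect

class SimpleLSMTree:
    """Toy LSM tree: writes buffer in an in-memory memtable, then
    flush to immutable sorted runs ("SSTables"); reads check the
    newest data first."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}       # in-memory write buffer
        self.sstables = []       # sorted (key, value) lists, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flush the memtable to an immutable sorted run (sequential write).
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search SSTables newest-first; each run is sorted, so binary search.
        for table in reversed(self.sstables):
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None
```

Note how a read may have to consult several runs before giving up, which is why real implementations keep a bloom filter per SSTable to skip runs that cannot contain the key.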
reliability
a reliable (fault-tolerant) system:
- the application performs the function that the user expected
- it can tolerate the user making mistakes or using the software in unexpected ways
- its performance is good enough for the required use case, under the expected load and data volume
- the system prevents any unauthorized access and abuse
storage & retrieval
- OLTP vs. OLAP
- row-based vs. column-based
- different types of indexes
- heavy read or heavy write
- B-Tree vs. LSM Tree
Encoding
- JSON, XML, CSV
- BSON, MessagePack, …
- Thrift, Protobuf
- schema file
- Avro
- support default value
union {null, int, string}
- reader, writer schema resolution
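As a sketch, an Avro record schema combining a union type with a default value might look like this (record and field names are hypothetical; note that the default must match the union's first branch):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "nickname", "type": ["null", "string"], "default": null}
  ]
}
```

During reader/writer schema resolution, a reader whose schema includes `nickname` can still decode old records that lack it, filling in the default.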
replication
- leaderless
- Voldemort, Cassandra, Riak, CRDTs
- eventual consistency
- causality
- single-leader
- consistency models
- MySQL, PostgreSQL, MongoDB, RethinkDB, SQL Server
- multi-leader
- Google Docs, CouchDB, Etherpad
pros:
- geographically close to users – reduce latency
- tolerance of datacenter outages and network partitions – availability
- scale out the number of machines – increase read throughput
leader failure
- determining that the leader has failed
- choosing a new leader
- reconfiguring the system to use the new leader
statement-based replication pitfalls
- any statement that calls a nondeterministic function, e.g. NOW() or random()
- auto-incrementing columns (statements must execute in exactly the same order on every replica)
- statements that have side effects (triggers, stored procedures, user-defined functions)
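Nondeterministic statements break statement-based replication because each replica re-executes the SQL text and can get a different result. A toy illustration (the `execute` helper is hypothetical, standing in for statement replay):

```python
import random

def execute(statement, rng):
    # Hypothetical replay of "INSERT ... VALUES (random())":
    # the replica evaluates random() itself.
    return rng.random()

leader_rng = random.Random(1)
replica_rng = random.Random(2)   # different internal state

statement = "INSERT INTO t VALUES (random())"
leader_row = execute(statement, leader_rng)
replica_row = execute(statement, replica_rng)

assert leader_row != replica_row  # the replicas have diverged
```

The usual fix is for the leader to evaluate the nondeterministic value once and ship the resulting row (row-based / logical replication) instead of the statement text.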
replication-lag approaches
- read-your-own-writes (read the user's own modifications from the leader)
- monotonic reads (each user always reads from the same replica)
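Monotonic reads can be approximated by routing each user to a stable replica, e.g. by hashing the user id, so the user never sees replicated data "move backwards in time". A minimal sketch (replica names are hypothetical):

```python
import hashlib

replicas = ["replica-1", "replica-2", "replica-3"]  # hypothetical names

def replica_for_user(user_id: str) -> str:
    """Route every read for a given user to the same replica."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return replicas[digest[0] % len(replicas)]

# The same user always hits the same replica:
assert replica_for_user("alice") == replica_for_user("alice")
```

This only guarantees monotonicity while the replica set is stable; rebalancing or failover needs extra care.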
multi-leader
clients with offline operation (e.g. calendar apps): every device has a local database that acts as a leader (accepts writes), and an asynchronous multi-leader replication process syncs the replicas of your calendar across all of your devices
conflict resolution
- avoid conflicts
- last write wins
- merge conflict resolution
- interactive conflict resolution (requires a user round-trip; inefficient)
- CRDTs (conflict-free replicated data types)
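As one concrete CRDT example, a grow-only counter (G-Counter) merges without conflicts: each node increments only its own slot, and merge takes the element-wise max, which is commutative, associative, and idempotent. A toy sketch:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node; merge = element-wise max."""

    def __init__(self, node_id, n_nodes):
        self.node_id = node_id
        self.counts = [0] * n_nodes

    def increment(self):
        # Each node only ever touches its own slot.
        self.counts[self.node_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Idempotent, order-independent merge.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

a = GCounter(0, 2)
b = GCounter(1, 2)
a.increment(); a.increment()   # node 0 counts 2
b.increment()                  # node 1 counts 1
a.merge(b); b.merge(a)         # anti-entropy sync in both directions
assert a.value() == b.value() == 3
```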
partition (sharding)
- hotspot – rebalancing
- request routing
- Redis Cluster (nodes gossip routing metadata to one another)
- routing proxy
- client maintains cluster-node metadata
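Hash partitioning, the usual way to avoid hotspots, can be sketched as follows (partition count is hypothetical; Redis Cluster does something similar with CRC16 of the key mod 16384 slots):

```python
import hashlib

N_PARTITIONS = 8  # hypothetical cluster size

def partition_for_key(key: str) -> int:
    """Hash the key and take it mod the partition count.
    Spreads load evenly, but makes range scans expensive
    because adjacent keys land on different partitions."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N_PARTITIONS

# The same key always routes to the same partition:
assert partition_for_key("user:42") == partition_for_key("user:42")
```

Note that changing `N_PARTITIONS` remaps almost every key, which is why real systems rebalance by moving whole fixed partitions between nodes rather than re-hashing with a new modulus.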
transactions
- ACID vs. BASE
- 2PC
- 3PC
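A toy sketch of two-phase commit (in-process only, with no network, timeouts, or coordinator-crash handling; those gaps are exactly what 3PC and consensus protocols try to address):

```python
class Participant:
    """Toy 2PC participant: votes in the prepare phase, then
    commits or aborts as the coordinator instructs."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    # Phase 1: ask everyone to prepare (vote).
    votes = [p.prepare() for p in participants]
    decision = all(votes)          # commit only on a unanimous yes
    # Phase 2: broadcast the decision.
    for p in participants:
        p.finish(decision)
    return decision

ok = two_phase_commit([Participant("db1"), Participant("db2")])
assert ok
```

The weakness is the window between phases: a prepared participant must block, holding locks, until it hears the coordinator's decision.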
distributed systems
success, failure, timeout
how to determine whether a node is dead (a timeout is ambiguous: slow network or crashed node?)
- unreliable network
- unreliable clock
- Lamport timestamps
- lease expiry
- fencing tokens
- Byzantine fault tolerance
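A minimal Lamport clock sketch (single-process toy): increment on local events, and on receiving a message take the max of the local and received times plus one, giving an ordering consistent with causality:

```python
class LamportClock:
    """Lamport logical clock."""

    def __init__(self):
        self.time = 0

    def tick(self):                  # local event
        self.time += 1
        return self.time

    def send(self):                  # stamp an outgoing message
        return self.tick()

    def receive(self, msg_time):     # merge with a received stamp
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()
t_recv = b.receive(t_send)
assert t_recv > t_send   # the receive is ordered after the send
```

Note that Lamport timestamps give a total order consistent with causality, but from two timestamps alone you cannot tell whether events were causally related or concurrent.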
consensus algorithms
- CAP
what to do
- figure out which tools and which approaches are the most appropriate for the task at hand
- database
- index
- cache
problems
encoding
how those formats are used for data storage and for communication
- backward compatibility
- forward compatibility
existing fields cannot be deleted (their tags must keep their meaning); new fields may only be added as optional or with defaults, appended so that old code can ignore them
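For example, forward compatibility means an old reader simply ignores fields it doesn't know about in records written by newer code. A sketch with a hypothetical dict-based record format (real Thrift/Protobuf match on numeric field tags, not names):

```python
# Fields that this (old) version of the code understands:
OLD_KNOWN_FIELDS = {"id", "name"}

def old_reader(record: dict) -> dict:
    """Drop unknown fields instead of failing on them."""
    return {k: v for k, v in record.items() if k in OLD_KNOWN_FIELDS}

# Record written by newer code that added an "email" field:
new_record = {"id": 7, "name": "ada", "email": "ada@example.com"}
assert old_reader(new_record) == {"id": 7, "name": "ada"}
```

Backward compatibility is the mirror image: new code reading old records must supply defaults for the fields old writers never produced.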
misc
How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?
quotes
In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log).
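The WAL idea in that quote can be sketched as: append and fsync each update to a log *before* applying it in memory, then replay the log on restart. A toy key-value store (class name hypothetical; a real WAL also handles checksums, truncation, and log compaction):

```python
import json
import os
import tempfile

class TinyKVWithWAL:
    """Toy key-value store with a write-ahead log for crash recovery."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._recover()

    def _recover(self):
        # Replay the log to rebuild in-memory state after a crash.
        if os.path.exists(self.wal_path):
            with open(self.wal_path) as f:
                for line in f:
                    k, v = json.loads(line)
                    self.data[k] = v

    def put(self, key, value):
        with open(self.wal_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())      # durable before acknowledging
        self.data[key] = value        # only then apply in memory

path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = TinyKVWithWAL(path)
db.put("a", 1)
db2 = TinyKVWithWAL(path)  # simulate a restart after a crash
assert db2.data["a"] == 1
```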