DDIA - Foundations of Data Systems

Reliable, Scalable and Maintainable Applications

  • Reliability - The system should continue to work correctly
    • Tolerating hardware & software faults
    • human error
  • Scalability
    • Load
    • performance latency percentiles
    • throughput
  • Maintainbility
    • Operability
    • Simplicity
    • Evolability

Reliablility

Scalability

Maintainability

  • Fixing bugs
  • Keeping the systems operational
  • investigating failures
  • adapting to new platforms
  • modifying for new use cases
  • repaying technical debts
  • adding new features

Operationality - Make life easy for operations

  • monitoring the health of the system and quickly restoring services if it goes into a bad state
  • tracking down the cause of problems, such as system failures or degraded performance
  • keeping software and platforms up to date, including security patches
  • keeping tabs on how different systems affect each other
  • anticipating future problems and solving them before they occur
  • establishing good practices and tools for deveployment, configuration management
  • performing complex maintenance tasks, such as moving an application from one platform to another
  • maintaining the security of the system as configuration changes are made
  • defining processes that make operations predictable and help keep the production environment stable
  • preserving the organization’s knowledge about the system

Data Models and Query Languages

The question is how to we make an abstraction out of real world problems.

Relational Versus Document Model

Relational Model: Data is organized into relations, where each relation is an unordered collection of tuples.

For those OOP language, we need to use ORM (Object-relational mapping) to fix the mismatch with the object with the database schema. (Hibernate, Mybatis)

Anything that is meaningful to humans may need to change sometime in the future—and if that information is duplicated, all the redundant copies need to be updated.

Why we need NoSQL

  1. a need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
  2. a widespread preference for free and open source software over commercial ones
  3. specialized query operations that are not well supported by the relational model
  4. frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model

Discussions about JSON

For a data structure like a resume, which is mostly a self-contained document, a JSON representation can be quite appropriate.

1
2
3
4
5
6
7
{
  "user_id": 123,
  "name": "example name",
  // compare to relational foreign table and keys
  "positions": [],
  "educations": []
}

Query Language for Data

Graph-like Data Models

Storage and Retrieval

Data Structures

  • Append Only Logs
  • Index (Hash, Offset)
  • Data file segment with compaction & Merge (for more write less read scenario)
    • file format (csv)
    • deleting records (tombstone)
    • crash recovery (in-memory cache restart recovery)
    • partitially written records
      • bitcask files include checksums, allowing such corrupted parts of the log to be detected and ignored
    • concurrency control
      • only one thread allowed to write

Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on magnetic spinning-disk hard drives. Concurrency and crash recovery are much simpler if segment files are append only or immutable. Merging old segments avoids the problem of data getting fragmented over time.

Limitations of HashTable index

  • the hash table must fit in memory
  • hash table index requires a lot of random access I/O, not fit on-disk
  • range queries are not efficient

SSTable and LSM-Trees

constructing and maintain SST

  • when a write comes in, add it to an in-memory balanced tree data structure
  • when the memtable gets bigger than some threshold - typically a few megabytes- write it out to disk as an SST file. (this can be done efficiently because the tree already maintains the key-value pairs sorted by key. the new SST file becomes the most recent segment of the database. While the SST is being written out to disk, writes can continue to a new memtable instance)
  • in order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
  • from time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.

the LSM-tree algorithm can be slow when looking up keys that do not exist in the database: have to check the memtable, then the segments all the way back to the oldest before make sure that the key does not exist. (BloomFilter)

B-Tree

A four-level of 4kb pages with a branching factor of 500 can store up to 256TB.

Transaction Processing or Analytics

Column Oriented Storage

Encoding and Evolution

Formats of Encoding Data

Modes of Dataflow

Licensed under CC BY-NC-SA 4.0
Get Things Done
Built with Hugo
Theme Stack designed by Jimmy