Designing Data-Intensive Applications Note

Steps into system design

Posted by Chester on March 13, 2020

CH 1

Reliability

This topic is about faults and failures.

  • the system performs the correct function
  • it tolerates user mistakes
  • it handles the required load/volume
    fault ≠ failure: a fault is one component deviating from its spec; a failure is when the system as a whole stops providing service
    fault tolerance relies on error handling
    Hardware faults -> redundancy. On AWS it is fairly common for machines to fail without warning, so software fault tolerance is more critical in that environment.

Scalability

  • Define the load => the Twitter example
  • Posting a tweet: 4.6k requests/sec on average, 12k requests/sec at peak
  • Home timeline: viewing tweets from the accounts a user follows: 300k requests/sec
    1) Read directly from the database (assemble the timeline at read time)
    2) Design the system to spend effort at write time (posting a tweet) to reduce the work at read time (home timeline update)
    A hybrid of the two approaches is revisited in CH 12
  • Performance
  • Latency: the internal perspective of the delay; the duration a request spends waiting to be handled (i.e., awaiting service).
  • Response time: the user's view of the delay. It includes queueing delay, network delay, etc., so we should think of response time as a distribution rather than a single reliable number (see the sketch below).
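A small sketch of that point: percentiles of a (simulated) response-time distribution, where the tail tells a very different story from the mean. The numbers are made up.

```python
# Hypothetical response times (ms): mostly fast, occasionally very slow.
import random

random.seed(0)
samples = sorted(random.expovariate(1 / 50) for _ in range(10_000))

def percentile(sorted_values, p):
    """Return the p-th percentile (0 < p < 100) of a pre-sorted list."""
    index = int(len(sorted_values) * p / 100)
    return sorted_values[min(index, len(sorted_values) - 1)]

print(f"mean: {sum(samples) / len(samples):.1f} ms")
for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(samples, p):.1f} ms")
```

The mean looks fine even when p99/p99.9 are painful, which is exactly why a single number is misleading.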

scaling up = vertical scaling (moving to a more powerful machine)
scaling out = horizontal scaling (distributing the load across more machines)

Maintainability

  • Operability
    • monitoring the health of the system
    • tracking down problems
    • keeping software up to date

Keep routine tasks easy; save the team's productivity for high-value tasks:
  • visibility (monitoring)
  • CI/CD
  • good default behavior (a principle worth following in all kinds of design)

  • Simplicity: avoid the following:
    • inconsistent naming/terminology
    • tangled dependencies
    • tight coupling between modules
    • accidental complexity: abstraction is the way to remove it, creating an easy-to-understand interface and hiding the details of the implementation
  • Evolvability

CH 2 Data Models

This chapter discusses how to choose an appropriate data model for the application.

  • RDBMS
  • NoSQL: a tricky name. NoSQL isn't a single technology; anything that doesn't belong to the RDBMS family gets called NoSQL.
  • Graph-like data models

Design categories

  • One to many: nested JSON. Query a unique ID once and pull out all the related information; the document model handles this use case easily (see the sketch after this list).
  • Many to one: normalize to remove redundant data. This isn't natural in the document model (NoSQL).
  • Many to many: books & authors. One writer might have multiple works, and one book might have multiple authors.
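A minimal sketch of the one-to-many case as a single self-contained document, loosely modeled on the book's résumé example (field names are illustrative):

```python
import json

# One-to-many: one user, many positions, all inside one document
# fetched by a single key lookup; no joins needed.
user_profile = {
    "user_id": 251,
    "first_name": "Bill",
    "positions": [
        {"job_title": "Co-chair", "organization": "Gates Foundation"},
        {"job_title": "Co-founder", "organization": "Microsoft"},
    ],
}

print(json.dumps(user_profile, indent=2))
```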

RDBMS vs. document databases

  • Schema flexibility
  • Query types supported
  • Data locality: a whole document is stored contiguously, so fetching a single record is efficient

Declarative vs. imperative
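A tiny contrast of the two styles, loosely following the book's animal-table example:

```python
animals = [
    {"name": "great white", "family": "Sharks"},
    {"name": "salmon", "family": "Salmonidae"},
]

# Imperative: we spell out *how* to walk the data, step by step.
sharks = []
for animal in animals:
    if animal["family"] == "Sharks":
        sharks.append(animal)
print(sharks)

# Declarative: we state *what* we want and let the engine pick the
# access path, e.g. in SQL:
#   SELECT * FROM animals WHERE family = 'Sharks';
```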

Questions

  1. Why don't many-to-many relationship queries work well on a document database?

CH 3

Storage!

Two BIG families of database design

  • log-structured = append-only (e.g., LSM-trees)
  • page-oriented = update-in-place (e.g., B-trees)

    BITCASK

  • stores the whole hash map (key -> byte offset) in RAM
  • updates the index whenever new data is appended to the DB (see the sketch below)
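A toy Bitcask-style sketch (the file name and record format are made up): an append-only log plus an in-RAM dict mapping each key to the byte offset of its latest record.

```python
import os

class TinyBitcask:
    """Append-only log file plus an in-RAM hash index (key -> byte offset)."""

    def __init__(self, path="tiny_bitcask.log"):
        self.path = path
        self.index = {}
        open(self.path, "ab").close()         # create the log if missing

    def set(self, key, value):
        record = f"{key},{value}\n".encode()
        with open(self.path, "ab") as f:
            offset = f.tell()                 # appends always land at EOF
            f.write(record)                   # append-only, never rewrite
        self.index[key] = offset              # point the index at the record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                    # one disk seek per lookup
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = TinyBitcask()
db.set("user:1", "alice")
db.set("user:1", "bob")        # the old value stays in the log...
print(db.get("user:1"))        # ...but the index points at "bob"
```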

-> New issue: the hash map can outgrow memory -> LSM-tree / SSTable

SSTable/LSMtree

Sorted String Table: key-value pairs are stored sorted by key. Once we store each segment in key order, boom! The SSTable emerges as the solution.

  • We can merge segments of any size efficiently; the idea behind this is merge sort (see the sketch below).
  • We no longer need to keep every key in the in-memory index. Even if the key we are interested in doesn't appear in the index, the neighboring keys bound a small enough interval on disk to scan.

However, the follow-up problem is that we need to rework the write path: maintain a balanced tree (e.g., a red-black tree, the memtable) in memory, and once the data structure grows past some threshold, write the whole tree out to disk as an SSTable segment, which is definitely a sequential write.
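A minimal sketch of the segment-merge step: because both segments are sorted by key, compaction is just a merge sort that keeps the newer value when a key appears in both. The segment contents are made up.

```python
import heapq

def merge_segments(older, newer):
    """Merge two sorted (key, value) segments; newer wins on duplicate keys."""
    merged = {}
    # heapq.merge walks both sorted inputs in a single pass; with a key
    # function, ties are yielded in iterable order, so 'newer' comes last.
    for key, value in heapq.merge(older, newer, key=lambda kv: kv[0]):
        merged[key] = value
    return sorted(merged.items())

old_seg = [("apple", 1), ("cherry", 3)]
new_seg = [("apple", 9), ("banana", 2)]
print(merge_segments(old_seg, new_seg))
# [('apple', 9), ('banana', 2), ('cherry', 3)]
```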

Transactions vs. analytics: what are the differing requirements behind them?

A transaction needn't necessarily have ACID (atomicity, consistency, isolation, and durability) properties.

OLTP vs. OLAP

| Property             | Transaction processing systems (OLTP)            | Analytic systems (OLAP)                   |
| -------------------- | ------------------------------------------------ | ----------------------------------------- |
| Main read pattern    | Small number of records per query, fetched by key | Aggregate over large number of records    |
| Main write pattern   | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream         |
| Primarily used by    | End user/customer, via web application           | Internal analyst, for decision support    |
| What data represents | Latest state of data (current point in time)     | History of events that happened over time |

Data warehouse

A clean playground for analytic tasks, kept separate from the OLTP systems. Speed means something different for each workload: quick reads/writes for OLTP versus quick counts/stats for OLAP.

Ad hoc analytics: very specific analyses, often generated by a user for a one-time purpose.

Summary

DB
  1. OLTP
     A. log-structured
     B. update-in-place
  2. OLAP


CH 4

Data structures exist virtually; landing the data in the real world is a different story. This chapter covers the basic ideas of how to encode data and store it on real media (HD/RAM/CPU).

Language-specific encoding formats

  • Python <=> pickle
  • allows in-memory objects to be stored directly on disk with minimal overhead
  • arbitrary code execution issue (a security risk: unpickling untrusted data can run arbitrary code)
  • forward/backward compatibility: e.g., if we use TensorFlow 1.x and store an object with pickle, loading it under a different library version will break

    In general, it's a bad idea to use a language-specific encoding format (see the sketch below).
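A minimal pickle round-trip illustrating both the convenience and the hazard; the record is hypothetical:

```python
import pickle

record = {"user_id": 42, "tags": {"a", "b"}}

blob = pickle.dumps(record)      # in-memory object -> bytes, one call
restored = pickle.loads(blob)    # bytes -> an equivalent object
assert restored == record

# Convenient, but: loads() can execute arbitrary code embedded in the
# bytes, so never unpickle data from an untrusted source, and the blob
# is tied to the Python classes (and library versions) that produced it.
```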

JSON, XML, and their binary variants

  • Surprisingly, these formats can't deal with very big numbers (e.g., integers above 2^53 lose precision in JSON parsers that use floats).
  • CSV can't handle some escape characters (?!)
  • JSON and XML support optional additional schemas

Without an additional schema, the field names need to be stored in each record (see the sketch below).
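A small illustration of that overhead, using plain JSON:

```python
import json

records = [
    {"user_id": 1, "name": "alice"},
    {"user_id": 2, "name": "bob"},
]

print(json.dumps(records))
# Every record repeats the strings "user_id" and "name"; a schema-based
# binary format replaces them with small numeric field tags.
```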

Binary schema

Thrift, Protocol Buffers, and Avro each come with their own code generator. The data structure is described in a schema, e.g. the Thrift interface definition language (IDL). Instead of storing field names in every record, the predefined schema provides tag numbers that map the field names to the encoded data (see the sketch below).

How do they deal with forward/backward compatibility?

  • Forward: old code can simply ignore fields whose tag numbers it doesn't recognize.
  • Backward: because every field has a unique tag number that is never reused, new code can always read old data; new tag numbers won't be an issue.
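To make the tag-number idea concrete, here is a hand-rolled sketch in the spirit of Thrift/Protocol Buffers; it is not any real wire format, and the schema and helpers are hypothetical.

```python
# Hypothetical tag-number encoding: field *names* never appear on the
# wire; the shared schema maps tags back to names on decode.
SCHEMA = {1: "user_id", 2: "name"}            # tag -> field name

def encode(record):
    name_to_tag = {name: tag for tag, name in SCHEMA.items()}
    return [(name_to_tag[field], value) for field, value in record.items()]

def decode(pairs):
    out = {}
    for tag, value in pairs:
        if tag in SCHEMA:                     # forward compatibility:
            out[SCHEMA[tag]] = value          # unknown tags are skipped
    return out

wire = encode({"user_id": 7, "name": "eve"})
wire.append((99, "field from a newer schema"))  # tag 99 is unknown here
print(decode(wire))                             # {'user_id': 7, 'name': 'eve'}
```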

Dataflow

  • Database: nodes exchange information through the database.
  • REST and RPC: the microservices architecture. Services expose an application-specific API that only allows inputs and outputs predetermined by the business logic; unlike a database, clients can't send arbitrary queries. SOAP is an alternative.

    (diagram: service a communicates with service b via REST)

  • Message-passing: in this mode, the system has a message broker (see the sketch below).
    • Increases system reliability: the broker can buffer messages for an overloaded or unavailable microservice.
    • Can redeliver the same request after a crash.
    • One request can be sent to multiple receivers.

    The distributed actor model: logic is encapsulated in actors that exchange asynchronous messages; a distributed actor framework essentially combines a message broker with the actor programming model.
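A minimal in-process sketch of the broker idea: a queue.Queue stands in for the broker (a real system would use Kafka, RabbitMQ, or similar), decoupling the sender from the consumer.

```python
import queue
import threading

broker = queue.Queue()                    # stand-in for a message broker

def consumer():
    while True:
        msg = broker.get()                # blocks until a message arrives
        if msg is None:                   # None = shutdown signal
            break
        print(f"processed: {msg}")

worker = threading.Thread(target=consumer)
worker.start()

broker.put("resize image 42")             # the sender never talks to the
broker.put("send welcome email")          # consumer directly; the broker
broker.put(None)                          # buffers and decouples them
worker.join()
```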

Summary

  • Why do we need to care so much about encoding? Rolling upgrades are an essential part of evolving a service. We have to assume that different application versions running on different nodes can still process each other's data formats, both backward and forward.

CH 5 Replication

  • single leader
  • multi-leader
  • leaderless

Eventual consistency?

Single Leader

The single-leader solution allows write queries to be applied only on the leader server. This method has one major downside: only one node can handle write commands, so if a client can't reach the leader, it can't write any changes to the database.

  • Synchronous: the leader waits for the follower to confirm the write before reporting success.
  • Asynchronous: the leader sends the change but does not wait for the follower's response.

Replication Log

Replication Lag

Multiple Leader

Dealing with conflicts becomes an inevitable issue in a multi-leader scheme.

  • Conflict avoidance: route all writes for a given record through the same leader; from a single user's perspective, the whole system is effectively a single-leader scheme.
  • Converging toward a consistent state: 1. LWW, last write wins, picking the write with the highest unique ID (e.g. timestamp/UUID) 2. merging the values together 3. recording the conflict and letting application-specific code resolve it later (see the sketch below)

Conflict resolution logic
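A minimal sketch of option 1, last write wins: tag every write with a (timestamp, replica_id) pair and keep the highest. The names are hypothetical; note the book's caveat that LWW silently discards the losing writes.

```python
def lww_merge(versions):
    """versions: iterable of (timestamp, replica_id, value); highest tag wins."""
    return max(versions, key=lambda v: (v[0], v[1]))[2]

conflicting_writes = [
    (1700000001, "replica-a", "title = B"),
    (1700000002, "replica-b", "title = C"),   # the later write
]
print(lww_merge(conflicting_writes))  # 'title = C'; 'title = B' is silently lost
```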

CH 8 Trouble with Distributed Systems

PART III

  • Services (online): response time and availability matter most.
  • Batch processing (offline): throughput (how much data vs. how long it takes to finish the job).
  • Stream processing (near-real-time): like batch processing, but it operates on events shortly after they happen, so latency is lower (and each input is smaller).

CH 10 Batch Processing

Appetizer

The Unix-like philosophy (see the sketch after this list):

  1. Chain programs together with pipes to meet more complicated requirements, instead of modifying old programs with new features.
  2. Expect the output of every program to become the input of another program.
  3. Don't hesitate to throw away the clumsy parts of a program and rebuild them, as early as possible.
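As a toy illustration of that philosophy, here is the book's classic log-analysis pipeline (cat access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 5) re-imagined as a small Python filter; the log format and field position are assumptions.

```python
# top_urls.py: a hypothetical Unix-style filter. Reads log lines from
# stdin, counts the requested URL (field 7 in a common access-log
# layout), and prints the five most-requested URLs.
# Usage: cat access.log | python top_urls.py
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:                  # consume a stream of lines
    fields = line.split()
    if len(fields) > 6:
        counts[fields[6]] += 1          # field 7: the requested URL

for url, n in counts.most_common(5):    # top five most-requested URLs
    print(n, url)
```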

MapReduce

MapReduce treats its input files as immutable and writes each output file exactly once (no random writes). Storage is HDFS/GFS; the central metadata server in HDFS is the NameNode. (See the word-count sketch below.)
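A minimal single-process sketch of the MapReduce programming model (the classic word count); real MapReduce shards the map and reduce phases across a cluster and persists the results to HDFS.

```python
from collections import defaultdict

def mapper(document):
    """Map phase: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield word.lower(), 1

def reducer(key, values):
    """Reduce phase: fold all values emitted for one key."""
    return key, sum(values)

documents = ["the cat sat", "the cat ran"]

# The 'shuffle' step: group every mapper emission by key.
groups = defaultdict(list)
for doc in documents:
    for key, value in mapper(doc):
        groups[key].append(value)

print(sorted(reducer(k, v) for k, v in groups.items()))
# [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]
```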

Job execution

CH 10 questions

  • What's the difference between MapReduce and Spark?
    • The I/O strategy is different: MapReduce materializes intermediate results to disk between stages, so a task that needs a partial result reads it from the distributed filesystem.
    • Spark, however, keeps the data in memory in most use cases, which gives much better performance for iterative and near-real-time computation.
  • What's a daemon process?
    • A daemon process has no interactive terminal attached; it runs unobtrusively in the background.