Designing Data-Intensive Applications Note

Steps into system design

Posted by Chester on March 13, 2020

CH 1

Reliability

This topic is about faults and failures.

  • the system performs the correct function
  • it tolerates user mistakes
  • it handles the required load/volume
    fault ≠ failure: a fault is one component deviating from its spec; a failure is when the system as a whole stops providing service
    fault tolerance relies on error handling
    Hardware faults -> redundancy. On AWS it is fairly common for machines to fail without warning, so software fault tolerance is more critical in that environment.

Scalability

  • Define the load => the Twitter example
  • Posting a tweet: 4.6k requests/sec on average, 12k requests/sec at peak
  • Home timeline: viewing tweets from the accounts a user follows: 300k requests/sec
    1) Read directly from the database (assemble the timeline at read time)
    2) Design the system to spend effort at write time (posting a tweet) to reduce the work at read time (home timeline update)
    A hybrid of the two approaches is revisited in CH 12
  • Performance
  • Latency: the internal perspective of the delay; the duration a request spends waiting to be handled (i.e., awaiting service).
  • Response time: the user's view of the delay. It includes queueing delay, network delay, etc., so we should think of response time as a distribution rather than a single reliable number (see the sketch below).
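A small sketch of that point: percentiles of a (simulated) response-time distribution, where the tail tells a very different story from the mean. The numbers are made up.

```python
# Hypothetical response times (ms): mostly fast, occasionally very slow.
import random

random.seed(0)
samples = sorted(random.expovariate(1 / 50) for _ in range(10_000))

def percentile(sorted_values, p):
    """Return the p-th percentile (0 < p < 100) of a pre-sorted list."""
    index = int(len(sorted_values) * p / 100)
    return sorted_values[min(index, len(sorted_values) - 1)]

print(f"mean: {sum(samples) / len(samples):.1f} ms")
for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(samples, p):.1f} ms")
```

The mean looks fine even when p99/p99.9 are painful, which is exactly why a single number is misleading.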

scaling up = vertical scaling (moving to a more powerful machine)
scaling out = horizontal scaling (distributing the load across more machines)

Maintainability

  • Operability
    • monitoring the health of the system
    • tracking down problems
    • keeping software up to date

Keep routine tasks easy; save the team's productivity for high-value tasks:
  • visibility (monitoring)
  • CI/CD
  • good default behavior (a principle worth following in all kinds of design)

  • Simplicity: avoid the following:
    • inconsistent naming/terminology
    • tangled dependencies
    • tight coupling between modules
    • accidental complexity: abstraction is the way to remove it, creating an easy-to-understand interface and hiding the details of the implementation
  • Evolvability

CH 2 Data Models

This chapter discusses how to choose an appropriate data model for the application.

  • RDBMS
  • NoSQL: a tricky name. NoSQL isn't a single technology; anything that doesn't belong to the RDBMS family gets called NoSQL.
  • Graph-like data models

Design categories

  • One to many: nested JSON. Query a unique ID once and pull out all the related information; the document model handles this use case easily (see the sketch after this list).
  • Many to one: normalize to remove redundant data. This isn't natural in the document model (NoSQL).
  • Many to many: books & authors. One writer might have multiple works, and one book might have multiple authors.
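A minimal sketch of the one-to-many case as a single self-contained document, loosely modeled on the book's résumé example (field names are illustrative):

```python
import json

# One-to-many: one user, many positions, all inside one document
# fetched by a single key lookup; no joins needed.
user_profile = {
    "user_id": 251,
    "first_name": "Bill",
    "positions": [
        {"job_title": "Co-chair", "organization": "Gates Foundation"},
        {"job_title": "Co-founder", "organization": "Microsoft"},
    ],
}

print(json.dumps(user_profile, indent=2))
```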

RDBMS vs. document databases

  • Schema flexibility
  • Query types supported
  • Data locality: a whole document is stored contiguously, so fetching a single record is efficient

Declarative vs. imperative
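A tiny contrast of the two styles, loosely following the book's animal-table example:

```python
animals = [
    {"name": "great white", "family": "Sharks"},
    {"name": "salmon", "family": "Salmonidae"},
]

# Imperative: we spell out *how* to walk the data, step by step.
sharks = []
for animal in animals:
    if animal["family"] == "Sharks":
        sharks.append(animal)
print(sharks)

# Declarative: we state *what* we want and let the engine pick the
# access path, e.g. in SQL:
#   SELECT * FROM animals WHERE family = 'Sharks';
```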

Questions

  1. Why don't many-to-many relationship queries work well on a document database?

CH 3

Storage!

Two BIG families of database design

  • log-structured = append-only (e.g., LSM-trees)
  • page-oriented = update-in-place (e.g., B-trees)

    BITCASK

  • stores the whole hash map (key -> byte offset) in RAM
  • updates the index whenever new data is appended to the DB (see the sketch below)
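A toy Bitcask-style sketch (the file name and record format are made up): an append-only log plus an in-RAM dict mapping each key to the byte offset of its latest record.

```python
import os

class TinyBitcask:
    """Append-only log file plus an in-RAM hash index (key -> byte offset)."""

    def __init__(self, path="tiny_bitcask.log"):
        self.path = path
        self.index = {}
        open(self.path, "ab").close()         # create the log if missing

    def set(self, key, value):
        record = f"{key},{value}\n".encode()
        with open(self.path, "ab") as f:
            offset = f.tell()                 # appends always land at EOF
            f.write(record)                   # append-only, never rewrite
        self.index[key] = offset              # point the index at the record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                    # one disk seek per lookup
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = TinyBitcask()
db.set("user:1", "alice")
db.set("user:1", "bob")        # the old value stays in the log...
print(db.get("user:1"))        # ...but the index points at "bob"
```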

-> New issue: the hash map can outgrow memory -> LSM-tree / SSTable

SSTable/LSMtree

Sorted String Table: key-value pairs are stored sorted by key. Once we store each segment in key order, boom! The SSTable emerges as the solution.

  • We can merge segments of any size efficiently; the idea behind this is merge sort (see the sketch below).
  • We no longer need to keep every key in the in-memory index. Even if the key we are interested in doesn't appear in the index, the neighboring keys bound a small enough interval on disk to scan.

However, the follow-up problem is that we need to rework the write path: maintain a balanced tree (e.g., a red-black tree, the memtable) in memory, and once the data structure grows past some threshold, write the whole tree out to disk as an SSTable segment, which is definitely a sequential write.
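A minimal sketch of the segment-merge step: because both segments are sorted by key, compaction is just a merge sort that keeps the newer value when a key appears in both. The segment contents are made up.

```python
import heapq

def merge_segments(older, newer):
    """Merge two sorted (key, value) segments; newer wins on duplicate keys."""
    merged = {}
    # heapq.merge walks both sorted inputs in a single pass; with a key
    # function, ties are yielded in iterable order, so 'newer' comes last.
    for key, value in heapq.merge(older, newer, key=lambda kv: kv[0]):
        merged[key] = value
    return sorted(merged.items())

old_seg = [("apple", 1), ("cherry", 3)]
new_seg = [("apple", 9), ("banana", 2)]
print(merge_segments(old_seg, new_seg))
# [('apple', 9), ('banana', 2), ('cherry', 3)]
```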

Transactions vs. analytics: what are the differing requirements behind them?

A transaction needn't necessarily have ACID (atomicity, consistency, isolation, and durability) properties.

OLTP vs. OLAP

| Property             | Transaction processing systems (OLTP)            | Analytic systems (OLAP)                   |
| -------------------- | ------------------------------------------------ | ----------------------------------------- |
| Main read pattern    | Small number of records per query, fetched by key | Aggregate over large number of records    |
| Main write pattern   | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream         |
| Primarily used by    | End user/customer, via web application           | Internal analyst, for decision support    |
| What data represents | Latest state of data (current point in time)     | History of events that happened over time |

Data warehouse

A clean playground for analytic tasks, kept separate from the OLTP systems. Speed means something different for each workload: quick reads/writes for OLTP versus quick counts/stats for OLAP.

Ad hoc analytics: very specific analyses, often generated by a user for a one-time purpose.

Summary

DB
  1. OLTP
     A. log-structured
     B. update-in-place
  2. OLAP


CH 4

Data structures exist virtually; landing the data in the real world is a different story. This chapter covers the basic ideas of how to encode data and store it on real media (HD/RAM/CPU).

Language-specific encoding formats

  • Python <=> pickle
  • allows in-memory objects to be stored directly on disk with minimal overhead
  • arbitrary code execution issue (a security risk: unpickling untrusted data can run arbitrary code)
  • forward/backward compatibility: e.g., if we use TensorFlow 1.x and store an object with pickle, loading it under a different library version will break

    In general, it's a bad idea to use a language-specific encoding format (see the sketch below).
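A minimal pickle round-trip illustrating both the convenience and the hazard; the record is hypothetical:

```python
import pickle

record = {"user_id": 42, "tags": {"a", "b"}}

blob = pickle.dumps(record)      # in-memory object -> bytes, one call
restored = pickle.loads(blob)    # bytes -> an equivalent object
assert restored == record

# Convenient, but: loads() can execute arbitrary code embedded in the
# bytes, so never unpickle data from an untrusted source, and the blob
# is tied to the Python classes (and library versions) that produced it.
```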

JSON, XML, and their binary variants

  • Surprisingly, these formats can't deal with very big numbers (e.g., integers above 2^53 lose precision in JSON parsers that use floats).
  • CSV can't handle some escape characters (?!)
  • JSON and XML support optional additional schemas

Without an additional schema, the field names need to be stored in each record (see the sketch below).
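A small illustration of that overhead, using plain JSON:

```python
import json

records = [
    {"user_id": 1, "name": "alice"},
    {"user_id": 2, "name": "bob"},
]

print(json.dumps(records))
# Every record repeats the strings "user_id" and "name"; a schema-based
# binary format replaces them with small numeric field tags.
```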

Binary schema

Thrift, Protocol Buffers, and Avro each come with their own code generator. The data structure is described in a schema, e.g. the Thrift interface definition language (IDL). Instead of storing field names in every record, the predefined schema provides tag numbers that map the field names to the encoded data (see the sketch below).

How do they deal with forward/backward compatibility?

  • Forward: old code can simply ignore fields whose tag numbers it doesn't recognize.
  • Backward: because every field has a unique tag number that is never reused, new code can always read old data; new tag numbers won't be an issue.
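To make the tag-number idea concrete, here is a hand-rolled sketch in the spirit of Thrift/Protocol Buffers; it is not any real wire format, and the schema and helpers are hypothetical.

```python
# Hypothetical tag-number encoding: field *names* never appear on the
# wire; the shared schema maps tags back to names on decode.
SCHEMA = {1: "user_id", 2: "name"}            # tag -> field name

def encode(record):
    name_to_tag = {name: tag for tag, name in SCHEMA.items()}
    return [(name_to_tag[field], value) for field, value in record.items()]

def decode(pairs):
    out = {}
    for tag, value in pairs:
        if tag in SCHEMA:                     # forward compatibility:
            out[SCHEMA[tag]] = value          # unknown tags are skipped
    return out

wire = encode({"user_id": 7, "name": "eve"})
wire.append((99, "field from a newer schema"))  # tag 99 is unknown here
print(decode(wire))                             # {'user_id': 7, 'name': 'eve'}
```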

Dataflow

  • Database: nodes exchange information through the database.
  • REST and RPC: the microservices architecture. Services expose an application-specific API that only allows inputs and outputs predetermined by the business logic; unlike a database, clients can't send arbitrary queries. SOAP is an alternative.

    (diagram: service a communicates with service b via REST)

  • Message-passing: in this mode, the system has a message broker (see the sketch below).
    • Increases system reliability: the broker can buffer messages for an overloaded or unavailable microservice.
    • Can redeliver the same request after a crash.
    • One request can be sent to multiple receivers.

    The distributed actor model: logic is encapsulated in actors that exchange asynchronous messages; a distributed actor framework essentially combines a message broker with the actor programming model.
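A minimal in-process sketch of the broker idea: a queue.Queue stands in for the broker (a real system would use Kafka, RabbitMQ, or similar), decoupling the sender from the consumer.

```python
import queue
import threading

broker = queue.Queue()                    # stand-in for a message broker

def consumer():
    while True:
        msg = broker.get()                # blocks until a message arrives
        if msg is None:                   # None = shutdown signal
            break
        print(f"processed: {msg}")

worker = threading.Thread(target=consumer)
worker.start()

broker.put("resize image 42")             # the sender never talks to the
broker.put("send welcome email")          # consumer directly; the broker
broker.put(None)                          # buffers and decouples them
worker.join()
```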

Summary

  • Why do we need to care so much about encoding? Rolling upgrades are an essential part of evolving a service. We have to assume that different application versions running on different nodes can still process each other's data formats, both backward and forward.

CH 5 Replication

  • single leader
  • multi-leader
  • leaderless

Eventual consistency?

Single Leader

The single-leader solution allows write queries to be applied only on the leader server. This method has one major downside: only one node can handle write commands, so if a client can't reach the leader, it can't write any changes to the database.

  • Synchronous: the leader waits for the follower to confirm the write before reporting success.
  • Asynchronous: the leader sends the change but does not wait for the follower's response.

Replication Log

Replication Lag

Multiple Leader

Dealing with conflicts becomes an inevitable issue in a multi-leader scheme.

  • Conflict avoidance: route all writes for a given record through the same leader; from a single user's perspective, the whole system is effectively a single-leader scheme.
  • Converging toward a consistent state: 1. LWW, last write wins, picking the write with the highest unique ID (e.g. timestamp/UUID) 2. merging the values together 3. recording the conflict and letting application-specific code resolve it later (see the sketch below)

Conflict resolution logic
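A minimal sketch of option 1, last write wins: tag every write with a (timestamp, replica_id) pair and keep the highest. The names are hypothetical; note the book's caveat that LWW silently discards the losing writes.

```python
def lww_merge(versions):
    """versions: iterable of (timestamp, replica_id, value); highest tag wins."""
    return max(versions, key=lambda v: (v[0], v[1]))[2]

conflicting_writes = [
    (1700000001, "replica-a", "title = B"),
    (1700000002, "replica-b", "title = C"),   # the later write
]
print(lww_merge(conflicting_writes))  # 'title = C'; 'title = B' is silently lost
```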

CH 8 Trouble with Distributed Systems

PART III

  • Services (online): response time and availability matter most.
  • Batch processing (offline): throughput (how much data vs. how long it takes to finish the job).
  • Stream processing (near-real-time): like batch processing, but it operates on events shortly after they happen, so latency is lower (and each input is smaller).

CH 10 Batch Processing

Appetizer

The Unix-like philosophy (see the sketch after this list):

  1. Chain programs together with pipes to meet more complicated requirements, instead of modifying old programs with new features.
  2. Expect the output of every program to become the input of another program.
  3. Don't hesitate to throw away the clumsy parts of a program and rebuild them, as early as possible.
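As a toy illustration of that philosophy, here is the book's classic log-analysis pipeline (cat access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 5) re-imagined as a small Python filter; the log format and field position are assumptions.

```python
# top_urls.py: a hypothetical Unix-style filter. Reads log lines from
# stdin, counts the requested URL (field 7 in a common access-log
# layout), and prints the five most-requested URLs.
# Usage: cat access.log | python top_urls.py
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:                  # consume a stream of lines
    fields = line.split()
    if len(fields) > 6:
        counts[fields[6]] += 1          # field 7: the requested URL

for url, n in counts.most_common(5):    # top five most-requested URLs
    print(n, url)
```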

MapReduce

MapReduce treats its input files as immutable and writes each output file exactly once (no random writes). Storage is HDFS/GFS; the central metadata server in HDFS is the NameNode. (See the word-count sketch below.)
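A minimal single-process sketch of the MapReduce programming model (the classic word count); real MapReduce shards the map and reduce phases across a cluster and persists the results to HDFS.

```python
from collections import defaultdict

def mapper(document):
    """Map phase: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield word.lower(), 1

def reducer(key, values):
    """Reduce phase: fold all values emitted for one key."""
    return key, sum(values)

documents = ["the cat sat", "the cat ran"]

# The 'shuffle' step: group every mapper emission by key.
groups = defaultdict(list)
for doc in documents:
    for key, value in mapper(doc):
        groups[key].append(value)

print(sorted(reducer(k, v) for k, v in groups.items()))
# [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]
```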

Job execution

CH 10 questions

  • What's the difference between MapReduce and Spark?
    • The I/O strategy is different: MapReduce materializes intermediate results to disk between stages, so a task that needs a partial result reads it from the distributed filesystem.
    • Spark, however, keeps the data in memory in most use cases, which gives much better performance for iterative and near-real-time computation.
  • What's a daemon process?
    • A daemon process has no interactive terminal attached; it runs unobtrusively in the background.