RSM (Reliability, Scalability, Maintainability) - tradeoffs exist between each of them
Scalability - the property of a system to maintain its desired performance as resources are added to handle more work
False assumptions that people new to distributed applications make (the classic fallacies of distributed computing): the network is reliable; latency is zero; bandwidth is infinite; the network is secure; topology doesn't change; there is one administrator; transport cost is zero; the network is homogeneous.
In a highly distributed system, you can choose only two amongst - Consistency, Availability, and Partition Tolerance (tolerating the network breaking up). Networks are always unreliable, so the choice comes down to Consistency vs Availability in the presence of partitions.
“A data store is available if and only if all get and set requests eventually return a response that’s part of their specification. This does not permit error responses, since a system could be trivially available by always returning an error.” ― CAP FAQ
Video: What is CAP Theorem?
PACELC Theorem: in case of a Partition (“PAC”) we need to choose between C(onsistency) and A(vailability); (E)lse (even when the system is running normally, in the absence of partitions) we need to choose between (L)atency and (C)onsistency.
We do have CA in non-distributed systems like RDBMS (MySQL, Postgres, etc.) - it is called ACID there, and it's a fundamental property of SQL transactions.
Availability is ensured by: redundancy, i.e. replicating components so that the failure of one doesn't take the system down.
Availability is the percentage of uptime of a system. The overall availability of a system containing multiple components prone to failure depends on how they are arranged:
In sequence: A = A1 * A2
In parallel: A = 1 - ((1 - A1) * (1 - A2))
(so basically, in parallel the system goes down only if all the components go down). Ex - if a system has two components each having 99.00% availability, then in sequence the overall system's availability goes down (98.01%), but in parallel it goes up (99.99%).
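The series/parallel formulas above can be sketched as (a minimal illustration, not tied to any particular system):

```python
def series_availability(*components):
    """Overall availability when every component must be up (requests flow through all of them)."""
    a = 1.0
    for c in components:
        a *= c
    return a

def parallel_availability(*components):
    """Overall availability when the system is down only if ALL redundant components are down."""
    down = 1.0
    for c in components:
        down *= (1.0 - c)
    return 1.0 - down

# Two components at 99.00% each:
print(series_availability(0.99, 0.99))    # 0.9801 -> availability drops in sequence
print(parallel_availability(0.99, 0.99))  # 0.9999 -> availability rises in parallel
```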
Consistency is ensured by: sharing updates between multiple replicas (one master many slaves, or all masters)
Some techniques for achieving consistency:
Max Consistency - all writes are available globally to everyone reading from any part of the system. In simple words, ensures that every node or replica has the same view of data at a given time, irrespective of which client has updated the data.
In a distributed system, we read/write to multiple replicas of a data store from multiple client servers, handling of these incoming queries/updates needs to be performed such that data in all the replicas is in a consistent state.
In decreasing order of consistency, increasing order of efficiency:
Strong: an operation like `sum(1,2)` can utilize multiple keys via `read(1)` and `read(2)`, and the overall ordering of operations will decide the `sum` output.
Weak consistency (Causal & Eventual): these levels are too loose, hence we often pair them with some sort of stronger guarantee like:
Quorum Consistency (Flexible): we can configure `R` and `W` nodes to have strong or weak consistency depending on the use case (notes)
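A quick sketch of the R + W > N rule behind quorum consistency (the function name and the strong/eventual classification are illustrative, not from any particular datastore):

```python
def quorum_mode(n, r, w):
    """Classify a quorum configuration of N replicas with R read-acks and W write-acks.
    R + W > N guarantees every read quorum overlaps the latest write quorum (strong reads);
    otherwise a read may land entirely on stale replicas (eventual)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return "strong" if r + w > n else "eventual"

print(quorum_mode(3, 2, 2))  # strong: the typical majority quorum
print(quorum_mode(3, 1, 1))  # eventual: fast, but a read may miss the newest write
```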
Within a single database node, when we modify a piece of data from multiple transactions, isolation levels come into the picture.
From the POV of the Follower 2 replica in the diagram above: to the servers, the system acts as if it were a single entity. The read of `x` can be answered immediately without blocking, even though the write was the first operation coming into the system. Both the followers and the leader will converge to `x=2` eventually.

Load Balancer (LB): can be used in between web server and service, service and database, etc. It knows which services are up and routes traffic only to them, doing health-check heartbeats for the same.
Types: layer-4 and layer-7 (/networking/notes/#load-balancers)
Balancing Algorithms: static and dynamic
https://www.enjoyalgorithms.com/blog/types-of-load-balancing-algorithms
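The static vs dynamic split can be sketched roughly like this (server names `s1`/`s2` and the class names are hypothetical):

```python
import itertools

class RoundRobin:
    """Static: cycle through servers in a fixed order, regardless of load."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)
    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Dynamic: send each request to the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}
    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server
    def release(self, server):
        self.active[server] -= 1

rr = RoundRobin(["s1", "s2"])
print([rr.pick() for _ in range(4)])  # ['s1', 's2', 's1', 's2']

lc = LeastConnections(["s1", "s2"])
first = lc.pick()   # 's1' (ties broken by insertion order)
second = lc.pick()  # 's2', since s1 now has an active connection
```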
To avoid it being a single-point-of-failure (SPOF), keep another LB as a fail-over (active-active or active-passive).
/microservices/patterns/#load-balancing
`JOIN`s are often done in the application code.
Reference: http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html
BASE (Basically Available, Soft state, Eventually consistent) is used in NoSQL to describe its properties:
Types of NoSQL Stores: key-value, document, columnar (wide-column), and graph.
Big Data stores like HBase and Google BigQuery store column-wise rather than row-wise (illustration), hence they are called Columnar Databases. Cassandra also belongs to the column-family of databases. They are very fast in adding and processing data which requires only a subset of columns rather than fetching the whole row.
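A toy illustration of why columnar layout helps such queries: aggregating one column touches only that column's contiguous list instead of every full row (table and column names are made up):

```python
# Row-wise layout: each record is stored together.
rows = [
    {"id": 1, "country": "IN", "amount": 120},
    {"id": 2, "country": "US", "amount": 300},
    {"id": 3, "country": "IN", "amount": 80},
]

# Columnar layout of the same table: one contiguous list per column.
columns = {
    "id": [1, 2, 3],
    "country": ["IN", "US", "IN"],
    "amount": [120, 300, 80],
}

# Aggregating a single column reads just that list, not every full row:
print(sum(columns["amount"]))  # 500
```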
Types of Databases - Fireship YT
NoSQL Patterns: http://horicky.blogspot.com/2009/11/nosql-patterns.html
Scaling up to your first 10 million users: https://www.youtube.com/watch?v=kKjm4ehYiMs
Partitioning a database can be done for various reasons, ranging from functional decomposition to storage considerations associated with a denormalized (i.e. single large) database.
Partitioning Criteria: hash-based, list, round robin, composite
Sharding: divide database into multiple ones based on criteria (often non-functional such as geographic proximity to user).
`JOIN`s across shards are very expensive. Consistent hashing can help in resolving uneven shard access patterns. Needs a layer-7 balancer to route read/write requests.
Federation: divide database into multiple ones based on functional boundaries
Denormalization: duplicate columns are often kept in multiple tables to make reads easier, though writes suffer. We often use Materialized Views, which are cached tables stored when we query across partitioned databases, i.e. the result of a complex `JOIN` is precomputed and stored.
Goals: Uniform distribution of data among nodes and minimal rerouting of traffic if nodes are added or removed. Consistent Hashing is used.
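A minimal consistent-hashing sketch, assuming MD5 as the hash function and 100 virtual nodes per physical node (both arbitrary choices):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes. A key maps to the first
    node clockwise on the ring; adding/removing a node remaps only the keys
    that were adjacent to it, satisfying the "minimal rerouting" goal."""
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Each physical node owns `vnodes` points on the ring for uniform spread.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, chr(0x10FFFF)))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["db1", "db2", "db3"])
owner = ring.get("user:42")
ring.remove(owner)
# only keys that lived on `owner` move; every other key keeps its node
```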
How to copy data?
Two Generals Problem - the problem with systems that rely on `ACK` for consistency: one node doesn't know whether the update has been performed on the other node if the `ACK` doesn't reach back. Resolution - hierarchical replication.
Problems in replication:
Split-Brain Problem: happens when there is an even number of masters in a system; half of them have a value of `x=1` and the other half has `x=2`, but we need a single unambiguous value of `x`. Auto-reconciliation like timestamp comparison can be done, sometimes manual reconciliation too. Better to have an odd number of nodes to avoid this problem - in such a system, even if there is a split, a strict (majority) quorum is possible on one side.
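The majority-quorum rule that makes an odd node count safer can be sketched as:

```python
def has_strict_quorum(total_nodes, reachable_nodes):
    """A partition side may keep accepting writes only if it holds a strict majority."""
    return reachable_nodes > total_nodes // 2

# 4 masters split 2/2: neither side has a majority -> split brain, both stall or diverge
print(has_strict_quorum(4, 2))                          # False
# 5 masters split 3/2: exactly one side can hold a majority
print(has_strict_quorum(5, 3), has_strict_quorum(5, 2))  # True False
```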
Write Amplification Problem (WA): when replicating data from a master to replicas, we have to write `n` times if there are `n` replicas in the system. In such a system, what happens when one of the replicas fails to ack the write? Should we wait for it (indefinite wait) or move on (stale data on that replica)?
Sampling: only a subset of the collected data should be persisted to the database rather than the entire dataset. This approach is suitable when not all data points are critical, helps reduce storage. Used in Time Series Databases.
Aggregation: aggregate multiple events into fewer ones using Services and write those to DB. We can have multi-tier aggregation as well to combine multiple data events into fewer and fewer ones in each layer. Requires additional aggregator services which increases complexity.
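Both techniques above can be sketched in a few lines (event shape, field names, and the every-n-th sampling policy are illustrative; real TSDBs and aggregators use richer strategies):

```python
from collections import defaultdict

def downsample(points, n):
    """Sampling: keep every n-th data point (systematic sampling) to cut storage ~n-fold."""
    return points[::n]

def aggregate(events):
    """Aggregation (first tier): collapse raw click events into per-page counts.
    A second tier could sum these per-page counts across aggregator instances."""
    counts = defaultdict(int)
    for e in events:
        counts[e["page"]] += 1
    return [{"page": p, "count": c} for p, c in sorted(counts.items())]

print(downsample(list(range(10)), 5))  # [0, 5]
print(aggregate([{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]))
# [{'page': '/cart', 'count': 1}, {'page': '/home', 'count': 2}]
```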
We can also have a layer-7 Load Balancer to do partitioning and multi-tier aggregation together:
It is the copying of data from one or more sources into one destination which represents the data differently from any of the sources.
A typical ETL process is just an ETL pipeline (a DAG of tasks) which applies transformations and outputs data that can be queried.
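A toy ETL pipeline under these assumptions (the source data, field names, and transformations are all made up):

```python
def extract():
    """Extract: pull raw rows from a source (hardcoded here for illustration)."""
    return [{"name": " Alice ", "spend": "120"}, {"name": "bob", "spend": "80"}]

def transform(rows):
    """Transform: clean strings and cast types."""
    return [{"name": r["name"].strip().title(), "spend": int(r["spend"])} for r in rows]

def load(rows, destination):
    """Load: write transformed rows into the destination store (a list here)."""
    destination.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # [{'name': 'Alice', 'spend': 120}, {'name': 'Bob', 'spend': 80}]
```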
Two types of ETL processing: batch and streaming (real-time).
ETL pipeline architectures:
Transfer data from an old DB to a new one with minimal downtime.
Naive Approach:
1. Stop old DB
2. DB_copy from old
3. DB_upload to new
4. Point services to new DB
Downside is the downtime of database during migration.
Simple Approach:
Utilities like pg_dump can take a backup without stopping the database server, so it remains operational during the backup.
Creating a backup (dump) of a large database will take some time. Any inserts or updates that come in during this backup window will have to be populated to the new database, i.e. entries after the timestamp x at which the backup started. These new entries will be much smaller in number than the entire existing database backup.
1. DB_copy to new (starts at x)
2. DB_copy to new with timestamp > x (starts at y)
3. Add trigger on old to send queries to new (starts at z)
4. DB_copy to new with: y < timestamp < z
If update queries on data yet to be backed up happens with trigger, we can do UPSERT (INSERT if not present, UPDATE otherwise).
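An UPSERT can be expressed with SQL's `ON CONFLICT` clause; a sketch using Python's bundled sqlite3 (table and column names are hypothetical; requires SQLite >= 3.24):

```python
import sqlite3

# UPSERT during migration: INSERT a row; if its key already exists (e.g. the
# backup already copied it), UPDATE instead so the trigger's newer write wins.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'old-name')")  # row from the backup

conn.execute(
    "INSERT INTO users (id, name) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
    (1, "new-name"),  # newer write arriving via the trigger
)
print(conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0])  # new-name
```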
If we’ve less data, we can run backup at x
and add trigger to old at y
, skipping step 2 above (second iteration of DB_copy). After that, we only need to copy data from backup start uptil trigger was added i.e x < timestamp < y
.
Adding triggers early on isn't a good design in large databases, as it will cause a lot of misses if we read from the new DB and stale data if we read from the old DB. So add triggers only after some data has migrated to the new DB.
Optimized Approach:
Setup CDC or SQL Triggers first, and then transfer data from old to new using backup.
After the new DB is populated we want to point clients to the new DB instead of old DB, all read/write should happen on new DB.
Use `VIEW` tables as a proxy, created from the new DB (transformations can be done to mimic the old DB), but name them the same as in the old DB because the code still expects a connection to the old DB. Rename the old DB to “old_DB_obsolete” (unpoint).
Once everything is working fine, refactor service code to entirely use the new DB (some downtime to restart services). We can delete the proxy `VIEW`s after that.
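A sketch of the view-as-proxy trick using sqlite3 (the old table `customers`, new table `customers_v2`, and the column rename are all invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The new DB uses a new schema/table name...
conn.execute("CREATE TABLE customers_v2 (id INTEGER, full_name TEXT)")
conn.execute("INSERT INTO customers_v2 VALUES (1, 'Ada Lovelace')")

# ...but a VIEW named like the OLD table keeps unmodified service code working:
conn.execute(
    "CREATE VIEW customers AS "
    "SELECT id, full_name AS name FROM customers_v2"
)
print(conn.execute("SELECT name FROM customers").fetchone()[0])  # Ada Lovelace
```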
Diff Cloud Providers:
When migrating from Azure to AWS, we can’t add a trigger as in simple approach or create a view proxy as in optimized approach.
We use a DB Proxy server in such a case. Reads are directed to the old DB, writes to both old and new. Any views needed are present in the proxy server. The rest of the steps remain the same.
It requires two restarts though - one to point services to the DB Proxy, and the other to point services to AWS when we're done with the migration.
Data can be cached on the client side in the browser. Saves server storage space.
Data can be stored on the client as well. Ex - WhatsApp stores backups on any third-party like Google Drive. Analytics can be a problem though with this approach.
It is the ability to understand the internal state or condition of a complex system based solely on knowledge of its external outputs, specifically its telemetry.
The 3 pillars of observability (o11y): logs, metrics, and traces.
Treat logs as a continuous stream: centralize logs with traces, and gather metrics in a timeseries DB
Other aspects of good o11y:
Anomaly Detection: detecting faults
Root Cause Analysis: finding out cause of faults, correlation in metrics
[5 mins] List and confirm FRs and NFRs (approx. 3 to 5) and Capacity Estimation (optional; preferably do later as and when required)
[2 mins] Identify core entities
[5 mins] Jot API spec and interfaces
[5 mins] Track data flow (optional)
[10 - 15 mins] High level design - components and architecture (drawing boxes)
[10 mins] Deep dives - specific technologies, edge cases, identifying and addressing issues and bottlenecks, interviewer's probes

Additional stuff - security, o11y and alerts, a11y, analytics.
Functional Requirements (FR) - business use-cases, features
Non-Functional Requirements (NFR) - consistency, availability, performance (latency and throughput), fault-tolerance, security, privacy, maintainability
You can use C4 model for visualising software architecture.
Reference: