Scale From Zero to Millions of Users: A Complete System Design Walkthrough
Designing a system that supports millions of users is challenging — it is a journey that requires continuous refinement and endless improvement. In this post, we build a system that supports a single user and gradually scale it up to serve millions of users. After reading this, you will master a handful of techniques that will help you crack system design interview questions.
A journey of a thousand miles begins with a single step. Building a complex system is no different.
Step 1: The Single Server Setup
To start with something simple, everything runs on a single server — web app, database, and cache all on the same machine.
How the request flow works
To understand this setup, let’s trace every step of a single request:
- Users access websites through domain names such as api.mysite.com. The Domain Name System (DNS) is a paid service provided by third parties and not hosted on our servers. DNS translates human-readable domain names into numeric IP addresses.
- An IP address is returned to the browser or mobile app. In the example, IP address 15.125.23.214 is returned.
- Once the IP address is obtained, Hypertext Transfer Protocol (HTTP) requests are sent directly to your web server.
- The web server returns HTML pages or JSON responses for rendering.
Where traffic comes from
Traffic to your web server comes from two sources: web applications and mobile applications.
Web application — it uses a combination of server-side languages (Java, Python, etc.) to handle business logic and storage, and client-side languages (HTML + JavaScript) for presentation in the browser.
Mobile application — HTTP is the communication protocol between the mobile app and the web server. JavaScript Object Notation (JSON) is the most commonly used API response format due to its simplicity. Here is an example:
GET /users/12
{
  "id": 12,
  "firstName": "John",
  "lastName": "Smith",
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": 10021
  },
  "phoneNumbers": ["212 555-1234", "646 555-4567"]
}
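A client consuming this response might parse it roughly as follows. This is a minimal sketch using Python's standard `json` module; the variable names are illustrative and not part of any real API.

```python
import json

# Response body from the GET /users/12 example above.
response_body = """
{
  "id": 12,
  "firstName": "John",
  "lastName": "Smith",
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": 10021
  },
  "phoneNumbers": ["212 555-1234", "646 555-4567"]
}
"""

user = json.loads(response_body)                       # JSON object becomes a dict
full_name = f"{user['firstName']} {user['lastName']}"
city = user["address"]["city"]                         # nested objects become nested dicts
first_phone = user["phoneNumbers"][0]                  # JSON arrays become lists
```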
What breaks first on a single server
A single server is fine for development and early-stage products. Two problems emerge as traffic grows:
- No separation of concerns — the web server and database compete for the same CPU and RAM. A traffic spike starves the database of resources, and vice versa. You cannot scale them independently.
- No redundancy — if this one server crashes, your entire product is offline. Every user gets an error until you restart it. This is called a Single Point of Failure (SPOF).
The fix for the first problem: separate the web tier from the data tier.
Step 2: Separating the Database
With the growth of your user base, one server is no longer enough. We need multiple servers: one for web/mobile traffic, the other for the database. Separating web/mobile traffic (web tier) and database (data tier) servers allows them to be scaled independently.
Which database should you use?
You can choose between a relational database and a non-relational database.
Relational databases (SQL) are also called RDBMS or SQL databases. The most popular ones are MySQL, Oracle, and PostgreSQL. Relational databases represent and store data in tables and rows. You can perform JOIN operations across different tables using SQL. Relational databases have been around for over 40 years and have worked well historically — they are the best default choice for most developers.
Non-relational databases (NoSQL) — popular ones include CouchDB, Neo4j, Cassandra, HBase, and Amazon DynamoDB. These databases are grouped into four categories:
- Key-value stores — Redis, Memcached
- Graph stores — Neo4j
- Column stores — Cassandra, HBase
- Document stores — MongoDB, CouchDB
Join operations are generally not supported in non-relational databases. Non-relational databases might be the right choice if:
- Your application requires super-low latency (sub-millisecond response)
- Your data is unstructured, or you do not have any relational data
- You only need to serialize and deserialize data (JSON, XML, YAML)
- You need to store a massive amount of data that doesn’t fit on a single machine
For everything else, start with PostgreSQL or MySQL. They are proven, well-understood, and their consistency guarantees prevent entire classes of bugs that NoSQL databases can introduce.
Step 3: Vertical Scaling vs. Horizontal Scaling
When your single web server starts struggling under load, you have two paths forward.
Vertical scaling — referred to as “scale up” — means adding more power (CPU, RAM) to your existing server. When traffic is low, vertical scaling is a great option and its main advantage is simplicity.
Unfortunately, vertical scaling comes with serious limitations:
- Vertical scaling has a hard limit. It is impossible to add unlimited CPU and memory to a single server.
- Vertical scaling does not have failover and redundancy. If one server goes down, the website/app goes down completely.
Horizontal scaling — referred to as “scale out” — allows you to scale by adding more servers into your pool of resources. It is more desirable for large-scale applications due to the limitations of vertical scaling.
In the previous single-server design, users are connected to the web server directly. Users are unable to access the website if the web server is offline. In another scenario, if many users access the web server simultaneously and it reaches its load limit, users experience slower responses or fail to connect. A load balancer is the best technique to address these problems.
Step 4: Load Balancer — Distributing Traffic Across Servers
A load balancer evenly distributes incoming traffic among web servers that are defined in a load-balanced set.
As shown above, users connect to the public IP of the load balancer directly. With this setup, web servers are no longer directly reachable by clients.
For better security, private IPs are used for communication between servers. A private IP is an IP address reachable only between servers in the same network — it is unreachable over the public internet. The load balancer communicates with web servers through these private IPs.
After adding a load balancer and a second web server, we successfully solve the no-failover problem and improve the availability of the web tier:
- If server 1 goes offline, all traffic is routed to server 2. This prevents the website from going offline. We also add a new healthy web server to the pool to balance the load.
- If the website traffic grows rapidly and two servers are not enough, the load balancer can handle this gracefully. You only need to add more servers to the web server pool, and the load balancer automatically starts sending requests to them.
The most common load balancing strategy is round-robin (1→2→1→2). More sophisticated strategies: least connections (route to the server with fewest active connections), IP hash (same client always hits same server), weighted round-robin (send more traffic to more powerful servers).
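To make these strategies concrete, here is a minimal sketch of three of them. The class names and the `pick`/`release` methods are assumptions made for illustration; real load balancers expose these strategies as configuration options, not as code you write yourself.

```python
import hashlib
from itertools import cycle


class RoundRobin:
    """Hand out servers in a fixed rotation: 1 -> 2 -> 1 -> 2."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def pick(self, client_ip=None):
        return next(self._rotation)


class LeastConnections:
    """Route to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self, client_ip=None):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1      # caller should call release() when done
        return server

    def release(self, server):
        self.active[server] -= 1


class IpHash:
    """Map a given client IP to the same server every time."""

    def __init__(self, servers):
        self.servers = list(servers)

    def pick(self, client_ip):
        digest = hashlib.sha256(client_ip.encode()).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]
```

Note that IP hash only keeps a client on the same server while the server list is unchanged; adding or removing a server changes the modulus and remaps many clients.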
Now the web tier looks good. What about the data tier? The current design has one database, so it does not support failover and redundancy. Database replication is a common technique to address this problem.
Step 5: Database Replication — Surviving Database Failures
“Database replication can be used in many database management systems, usually with a master/slave relationship between the original (master) and the copies (slaves).”
A master database generally only supports write operations (INSERT, UPDATE, DELETE). A slave database gets copies of the data from the master and only supports read operations (SELECT). All data-modifying commands must be sent to the master database. Most applications require a much higher ratio of reads to writes — thus, the number of slave databases in a system is usually larger than the number of master databases.
Advantages of database replication
- Better performance — In the master-slave model, all writes and updates happen in master nodes, whereas read operations are distributed across slave nodes. This model improves performance because it allows more queries to be processed in parallel.
- Reliability — If one of your database servers is destroyed by a natural disaster such as a typhoon or an earthquake, data is still preserved. You do not need to worry about data loss because data is replicated across multiple locations.
- High availability — By replicating data across different locations, your website remains in operation even if a database is offline, as you can access data stored in another database server.
What happens when a database goes offline?
If only one slave database is available and it goes offline, read operations will be directed to the master database temporarily. As soon as the issue is found, a new slave database will replace the old one. In case multiple slave databases are available, read operations are redirected to other healthy slave databases.
If the master database goes offline, a slave database will be promoted to be the new master. All database operations will be temporarily executed on the new master. A new slave database will replace the old one for data replication immediately. In production systems, promoting a new master is more complex because the data in a slave database might not be up to date — the missing data needs to be updated by running data recovery scripts. Although more complex replication methods like multi-master and circular replication could help, those setups are beyond the scope of this tutorial.
After adding the load balancer and database replication, the overall request flow is:
- A user gets the IP address of the load balancer from DNS
- A user connects to the load balancer with this IP address
- The HTTP request is routed to either Server 1 or Server 2
- A web server reads user data from a slave database
- A web server routes any data-modifying operations to the master database (write, update, delete)
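The read/write split and the failover rules described above can be sketched as a small router. This is an illustration under simplifying assumptions: servers are plain strings, health is tracked in a hypothetical `offline` set, and promotion ignores the data-recovery work a production system would need.

```python
import random


class ReplicatedDatabase:
    """Routes writes to the master and reads to healthy slaves."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE")

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self.offline = set()          # slaves currently marked unhealthy

    def _healthy_slaves(self):
        return [s for s in self.slaves if s not in self.offline]

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master
        healthy = self._healthy_slaves()
        # With no healthy slave, reads temporarily fall back to the master.
        return random.choice(healthy) if healthy else self.master

    def promote_slave(self):
        """If the master fails, promote a healthy slave to be the new master."""
        healthy = self._healthy_slaves()
        if not healthy:
            raise RuntimeError("no healthy slave to promote")
        self.master = healthy[0]
        self.slaves.remove(self.master)
        return self.master
```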
Now it is time to improve the load and response time. This can be done by adding a cache layer and shifting static content (JS/CSS/image/video files) to a CDN.
Step 6: Cache — Stop Hitting the Database for the Same Data
A cache is a temporary storage area that stores the result of expensive responses or frequently accessed data in memory so that subsequent requests are served more quickly.
Every time a new web page loads, one or more database calls are executed to fetch data. The application performance is greatly affected by calling the database repeatedly. The cache can mitigate this problem.
The cache tier is a temporary data store layer, much faster than the database. The benefits of having a separate cache tier include:
- Better system performance
- Ability to reduce database workload
- Ability to scale the cache tier independently
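A common caching strategy is the read-through cache. After receiving a request, a web server first checks whether the cache has the response available. If it does, it sends the data back to the client. If not, it queries the database, stores the response in the cache, and sends it back to the client. (When the application code, rather than the cache layer itself, performs this lookup and fill, the pattern is also known as cache-aside.)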
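The lookup order described above can be sketched in a few lines. The dictionary stands in for a real cache server, and `query_database` is a hypothetical placeholder for an expensive database call.

```python
class ReadThroughCache:
    def __init__(self, load_from_db):
        self._load_from_db = load_from_db   # called only on a cache miss
        self._store = {}                    # in-memory stand-in for a cache server
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._load_from_db(key)     # fall back to the database
        self._store[key] = value            # remember the result for next time
        return value


db_calls = []


def query_database(user_id):
    """Hypothetical database lookup; records each call for illustration."""
    db_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


cache = ReadThroughCache(query_database)
first = cache.get(12)    # miss: goes to the database
second = cache.get(12)   # hit: served from memory
```

After the two calls, the database has been queried only once; every further read of the same key is served from memory until the entry is removed.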
Interacting with cache servers is simple because most cache servers provide APIs for common programming languages. Here are typical Memcached APIs:
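from pymemcache.client.base import Client  # assumption: pymemcache client library
cache = Client(("localhost", 11211))  # assumption: Memcached running locally on the default port
SECONDS = 1
cache.set('myKey', 'hi there', 3600 * SECONDS)  # store the value with a one-hour expiry
cache.get('myKey')  # returns the stored value as bytes, or None on a miss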
Considerations for using cache
When to use cache. Consider using cache when data is read frequently but modified infrequently. Since cached data is stored in volatile memory, a cache server is not ideal for persisting data. For instance, if a cache server restarts, all the data in memory is lost. Thus, important data should be saved in persistent data stores.
Expiration policy. It is a good practice to implement an expiration policy. Once cached data expires, it is removed from the cache. Without an expiration policy, cached data stays in memory permanently. The expiration time should be neither too short (the system reloads data from the database too frequently) nor too long (the data can become stale).
Consistency. This involves keeping the data store and the cache in sync. Inconsistency can happen because data-modifying operations on the data store and cache are not in a single transaction. When scaling across multiple regions, maintaining consistency between the data store and cache is challenging. (See the paper “Scaling Memcache at Facebook” for further details.)
Mitigating failures. A single cache server represents a potential single point of failure (SPOF) — a part of a system that, if it fails, will stop the entire system from working. As a result, multiple cache servers across different data centers are recommended to avoid SPOF. Another recommended approach is to overprovision the required memory by certain percentages — this provides a buffer as memory usage increases.
Eviction policy. Once the cache is full, any requests to add items to the cache might cause existing items to be removed. This is called cache eviction. Least-recently-used (LRU) is the most popular cache eviction policy. Other policies — Least Frequently Used (LFU) or First In First Out (FIFO) — can be adopted to satisfy different use cases.
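As an illustration of LRU eviction, here is a minimal sketch built on Python's `collections.OrderedDict`. The `LRUCache` class is an assumption for this example; production caches such as Memcached and Redis implement their own eviction internally.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()   # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")        # "a" is now more recent than "b"
cache.set("c", 3)     # cache is full, so "b" is evicted
```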
Step 7: Content Delivery Network (CDN)
A CDN is a network of geographically dispersed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, and JavaScript files.
Dynamic content caching is a relatively new concept. This tutorial focuses on how to use CDN to cache static content.
Here is how CDN works at a high level: when a user visits a website, a CDN server closest to the user will deliver static content. Intuitively, the further users are from CDN servers, the slower the website loads. For example, if CDN servers are in San Francisco, users in Los Angeles will get content faster than users in Europe.
CDN workflow step-by-step
- User A tries to get image.png using an image URL. The URL’s domain is provided by the CDN provider (e.g., https://mysite.cloudfront.net/logo.jpg or https://mysite.akamai.com/image-manager/img/logo.jpg).
- If the CDN server does not have image.png in its cache, the CDN server requests the file from the origin, which can be a web server or online storage like Amazon S3.
- The origin returns image.png to the CDN server, along with an optional HTTP Cache-Control header whose Time-to-Live (TTL) describes how long the image is cached.
- The CDN caches the image and returns it to User A. The image remains cached in the CDN until the TTL expires.
- User B sends a request to get the same image.
- The image is returned from the CDN cache as long as the TTL has not expired. User B gets it instantly from the nearby edge node.
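The workflow above can be simulated with a toy edge node. Everything here is an assumption made for illustration: the `EdgeCache` class, the fake clock, the 60-second TTL, and a third request (`user_c`) added to show what happens after the TTL expires.

```python
class EdgeCache:
    """A toy CDN edge node that caches origin responses until their TTL expires."""

    def __init__(self, fetch_from_origin, clock):
        self._fetch = fetch_from_origin   # returns (content, ttl_seconds)
        self._clock = clock               # returns the current time in seconds
        self._cache = {}                  # path -> (content, expires_at)

    def get(self, path):
        entry = self._cache.get(path)
        if entry is not None and self._clock() < entry[1]:
            return entry[0], "HIT"
        content, ttl = self._fetch(path)  # miss or expired: go to the origin
        self._cache[path] = (content, self._clock() + ttl)
        return content, "MISS"


now = [0]                                 # fake clock keeps the example deterministic
origin_requests = []


def origin(path):
    origin_requests.append(path)
    return f"<bytes of {path}>", 60       # ask the edge to cache for 60 seconds


edge = EdgeCache(origin, clock=lambda: now[0])
user_a = edge.get("/image.png")           # first request: fetched from the origin
user_b = edge.get("/image.png")           # within the TTL: served from the edge
now[0] = 61
user_c = edge.get("/image.png")           # TTL expired: fetched from the origin again
```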
Considerations for using a CDN
Cost. CDNs are run by third-party providers, and you are charged for data transfers in and out of the CDN. Caching infrequently used assets provides no significant benefits so you should consider moving them out of the CDN.
Setting an appropriate cache expiry. For time-sensitive content, setting a cache expiry time is important. The cache expiry time should neither be too long nor too short. If it is too long, the content might no longer be fresh. If it is too short, it can cause repeat reloading of content from origin servers to the CDN.
CDN fallback. You should consider how your website/application copes with CDN failure. If there is a temporary CDN outage, clients should be able to detect the problem and request resources directly from the origin.
Invalidating files. You can remove a file from the CDN before it expires by:
- Invalidating the CDN object using APIs provided by CDN vendors
- Using object versioning to serve a different version of the object: add a parameter to the URL such as a version number, for example image.png?v=2
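One way to automate object versioning is to derive the version parameter from the file's content instead of maintaining a counter by hand. The sketch below is a variation on the `?v=2` idea, not the only approach; the `versioned_url` helper is hypothetical, and whether the query string is part of the cache key depends on how the CDN is configured.

```python
import hashlib


def versioned_url(path, content):
    """Append a short content hash so a changed file gets a new URL."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{path}?v={digest}"


old_url = versioned_url("image.png", b"old image bytes")
new_url = versioned_url("image.png", b"new image bytes")
```

Because the URL changes whenever the bytes change, clients request the new object immediately while the old one simply ages out of the CDN.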
After adding CDN and cache: static assets (JS, CSS, images) are no longer served by web servers — they are fetched from the CDN for better performance. The database load is lightened by caching data.
Step 8: Stateless Web Tier — The Key to Horizontal Scaling
Now it is time to consider scaling the web tier horizontally. To do this, we need to move state (for instance, user session data) out of the web tier. A good practice is to store session data in persistent storage such as a relational database or NoSQL. Each web server in the cluster can then access state data from databases. This is called a stateless web tier.
Stateful architecture (the problem)
A stateful server and a stateless server have some key differences. A stateful server remembers client data (state) from one request to the next. A stateless server keeps no state information.
In the stateful diagram above, User A’s session data and profile image are stored in Server 1. To authenticate User A, HTTP requests must be routed to Server 1. If a request is sent to Server 2, authentication would fail because Server 2 does not contain User A’s session data.
The issue is that every request from the same client must be routed to the same server. This can be done with sticky sessions in most load balancers — however, this adds overhead. Adding or removing servers is much more difficult, and it is also challenging to handle server failures.
Stateless architecture (the solution)
In the stateless architecture, HTTP requests from users can be sent to any web server, which fetches state data from a shared data store. State data is stored in a shared data store and kept out of web servers. A stateless system is simpler, more robust, and scalable.
The shared data store could be a relational database, Memcached/Redis, or NoSQL. In this design, a NoSQL data store is chosen because it is easy to scale. Once the state data has been moved out of the web servers, auto-scaling of the web tier is easily achieved by adding or removing servers based on traffic load.
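A minimal sketch of the stateless pattern follows. The `SharedSessionStore` class is an in-memory stand-in for Redis or a NoSQL database, and the class and method names are assumptions for illustration only.

```python
import secrets


class SharedSessionStore:
    """In-memory stand-in for a shared store such as Redis or a NoSQL database."""

    def __init__(self):
        self._sessions = {}

    def create(self, user_data):
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = user_data
        return session_id

    def load(self, session_id):
        return self._sessions.get(session_id)


class WebServer:
    """Holds no per-user state; every request looks the session up in the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id):
        session = self.store.load(session_id)
        if session is None:
            return f"{self.name}: 401 not authenticated"
        return f"{self.name}: 200 hello {session['user']}"


store = SharedSessionStore()
server1 = WebServer("server1", store)
server2 = WebServer("server2", store)

sid = store.create({"user": "User A"})
```

Since both servers read from the same store, User A is authenticated regardless of which server the load balancer picks, and servers can be added or removed without losing sessions.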
Step 9: Multiple Data Centers — Geographic Redundancy
Your website grows rapidly and attracts a significant number of users internationally. To improve availability and provide a better user experience across wider geographical areas, supporting multiple data centers is crucial.
In normal operation, users are geoDNS-routed to the closest data center, with a split traffic of x% in US-East and (100–x)% in US-West. GeoDNS is a DNS service that allows domain names to be resolved to IP addresses based on the location of a user.
In the event of any significant data center outage, we direct all traffic to a healthy data center. If data center 2 (US-West) is offline, 100% of the traffic is routed to data center 1 (US-East).
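The routing and failover behavior can be sketched as a lookup that prefers the nearest healthy data center. The IP addresses, region names, and the `NEAREST` table are hypothetical values chosen for this example; a real GeoDNS service derives location from the resolver or client address.

```python
# Hypothetical load balancer IPs for each data center.
DATA_CENTERS = {"us-east": "203.0.113.10", "us-west": "198.51.100.10"}

# Hypothetical mapping from a user's region to data centers, nearest first.
NEAREST = {
    "new-york": ["us-east", "us-west"],
    "seattle": ["us-west", "us-east"],
}


def resolve(user_region, healthy):
    """Return the IP of the nearest healthy data center for this user."""
    for dc in NEAREST[user_region]:
        if dc in healthy:
            return DATA_CENTERS[dc]
    raise RuntimeError("no healthy data center available")


normal = resolve("seattle", healthy={"us-east", "us-west"})   # nearest: us-west
failover = resolve("seattle", healthy={"us-east"})            # us-west is offline
```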
Technical challenges in multi-data center setup
Several technical challenges must be resolved:
Traffic redirection. Effective tools are needed to direct traffic to the correct data center. GeoDNS can be used to direct traffic to the nearest data center depending on where a user is located.
Data synchronization. Users from different regions could use different local databases or caches. In failover cases, traffic might be routed to a data center where data is unavailable. A common strategy is to replicate data across multiple data centers. A previous study shows how Netflix implements asynchronous multi-data center replication.
Test and deployment. With multi-data center setup, it is important to test your website/application at different locations. Automated deployment tools are vital to keep services consistent through all the data centers.
To further scale our system, we need to decouple different components of the system so they can be scaled independently. A message queue is a key strategy employed by many real-world distributed systems to solve this problem.
Step 10: Message Queue — Decouple and Scale Independently
A message queue is a durable component, stored in memory, that supports asynchronous communication. It serves as a buffer and distributes asynchronous requests. The basic architecture of a message queue is simple:
- Input services (called producers/publishers) create messages and publish them to a message queue
- Other services (called consumers/subscribers) connect to the queue and perform actions defined by the messages
Decoupling makes the message queue a preferred architecture for building scalable and reliable applications. With the message queue, the producer can post a message to the queue when the consumer is unavailable to process it. The consumer can read messages from the queue even when the producer is unavailable.
A real-world example: photo customization
Consider an application that supports photo customization — cropping, sharpening, blurring, etc. Those customization tasks take time to complete.
- Web servers publish photo processing jobs to the message queue
- Photo processing workers pick up jobs from the queue and asynchronously perform photo customization tasks
- The producer (web server) and the consumer (worker) can be scaled independently
- When the size of the queue becomes large, more workers are added to reduce the processing time
- However, if the queue is empty most of the time, the number of workers can be reduced
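The producer/consumer flow above can be sketched with Python's standard `queue` and `threading` modules. This is an in-process illustration only: `queue.Queue` is not durable and does not cross machine boundaries, so a real deployment would use a dedicated message broker. The job format and the worker count are assumptions.

```python
import queue
import threading

jobs = queue.Queue()          # in-process stand-in for a message queue
results = []
results_lock = threading.Lock()


def worker():
    """Consumer: pull photo jobs off the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        processed = f"{job['photo']}:{job['action']}:done"   # placeholder for real work
        with results_lock:
            results.append(processed)
        jobs.task_done()


# Producer: the web server publishes jobs and returns immediately.
for i in range(6):
    jobs.put({"photo": f"photo{i}.jpg", "action": "blur"})

# Scale consumers independently of producers by changing the worker count.
workers = [threading.Thread(target=worker) for _ in range(3)]
for t in workers:
    t.start()
for _ in workers:
    jobs.put(None)            # one shutdown signal per worker
jobs.join()                   # block until every queued item has been handled
for t in workers:
    t.join()
```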
Other real-world uses: email delivery (don’t make users wait for SMTP), video transcoding (YouTube processes uploads asynchronously), search indexing (Elasticsearch indexes new documents in the background), push notifications, and audit logging.
Step 11: Logging, Metrics, and Automation
When working with a small website that runs on a few servers, logging, metrics, and automation support are good practices but not a necessity. However, now that your site has grown to serve a large business, investing in these tools is essential.
Logging. Monitoring error logs is important because it helps identify errors and problems in the system. You can monitor error logs at the per-server level or use tools to aggregate them to a centralized service for easy search and viewing.
Metrics. Collecting different types of metrics helps gain business insights and understand the health status of the system. Some useful metrics are:
- Host level metrics: CPU, memory, disk I/O, etc.
- Aggregated level metrics: for example, the performance of the entire database tier, cache tier, etc.
- Key business metrics: daily active users, retention, revenue, etc.
Automation. When a system gets big and complex, we need to build or leverage automation tools to improve productivity. Continuous integration is a good practice in which each code check-in is verified through automation, allowing teams to detect problems early. Automating your build, test, and deploy processes can also significantly improve developer productivity.
The design now includes a message queue (which helps make the system more loosely coupled and failure resilient) plus logging, monitoring, metrics, and automation tools.
As the data grows every day, your database gets more overloaded. It is time to scale the data tier.
Step 12: Database Sharding — Scaling the Data Tier
There are two broad approaches for database scaling: vertical scaling and horizontal scaling.
Vertical scaling (scale up) means adding more power (CPU, RAM, DISK) to an existing machine. There are some powerful database servers available — according to Amazon Relational Database Service (RDS), you can get a database server with 24 TB of RAM. This kind of powerful database server could store and handle lots of data. For example, stackoverflow.com in 2013 had over 10 million monthly unique visitors, but it only had 1 master database.
However, vertical scaling comes with serious drawbacks:
- You can add more hardware, but there are hardware limits — if you have a large user base, a single server is not enough
- Greater risk of single point of failure
- The overall cost is high — powerful servers are much more expensive
Horizontal scaling — also known as sharding — is the practice of adding more servers.
Sharding separates large databases into smaller, more easily managed parts called shards. Each shard shares the same schema, though the actual data on each shard is unique to that shard.
User data is allocated to a database server based on user IDs. Anytime you access data, a hash function is used to find the corresponding shard. In our example, user_id % 4 is used as the hash function. If the result equals 0, Shard 0 is used. If the result equals 1, Shard 1 is used. And so on.
The most important factor to consider when implementing a sharding strategy is the choice of the sharding key (also known as partition key). The sharding key consists of one or more columns that determine how data is distributed. A sharding key allows you to retrieve and modify data efficiently by routing database queries to the correct database. When choosing a sharding key, one of the most important criteria is to choose a key that can evenly distribute data.
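The `user_id % 4` routing rule and the importance of an evenly distributed key can be illustrated with a short sketch. The function name and the sample ID ranges are assumptions chosen for the example.

```python
from collections import Counter

NUM_SHARDS = 4


def shard_for(user_id):
    """Map a user ID to a shard using the user_id % 4 rule from the text."""
    return user_id % NUM_SHARDS


# Sequential user IDs spread evenly across the four shards.
even = Counter(shard_for(uid) for uid in range(1, 1001))

# A poorly distributed key: if every ID is a multiple of 4, one shard gets everything.
skewed = Counter(shard_for(uid) for uid in range(0, 4000, 4))
```

With sequential IDs each shard receives 250 of the 1,000 users, while the skewed key set places all 1,000 records on Shard 0.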
Sharding challenges
Sharding is a great technique to scale the database but it is far from a perfect solution. It introduces complexities and new challenges:
Resharding data — Resharding is needed when: (1) a single shard can no longer hold more data due to rapid growth, (2) certain shards might experience shard exhaustion faster than others due to uneven data distribution. When shard exhaustion happens, it requires updating the sharding function and moving data around. Consistent hashing (covered in Chapter 5) is a commonly used technique to solve this problem.
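To see why resharding is painful with a modulo function, and why consistent hashing helps, the sketch below counts how many keys change shards when a fifth shard is added. The `HashRing` class is a simplified illustration under assumed parameters (SHA-256 hashing, 100 virtual nodes per shard), not a production implementation.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, shards, vnodes=100):
        # Each shard is placed on the ring at many points (virtual nodes).
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        index = bisect.bisect(self._points, _hash(str(key))) % len(self._ring)
        return self._ring[index][1]


keys = range(1000)

# Modulo sharding: going from 4 to 5 shards remaps most keys.
moved_modulo = sum(1 for k in keys if k % 4 != k % 5)

# Consistent hashing: only keys claimed by the new shard move.
before = HashRing([f"shard{i}" for i in range(4)])
after = HashRing([f"shard{i}" for i in range(5)])
moved_ring = sum(1 for k in keys if before.shard_for(k) != after.shard_for(k))
```

With modulo sharding, 800 of the 1,000 keys land on a different shard after the change. On the ring, a key moves only if the new shard now owns its position, so far fewer keys need to be migrated.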
Celebrity problem — This is also called a hotspot key problem. Excessive access to a specific shard could cause server overload. Imagine data for Katy Perry, Justin Bieber, and Lady Gaga all end up on the same shard. For social applications, that shard will be overwhelmed with read operations. To solve this problem, we may need to allocate a shard for each celebrity. Each shard might even require further partition.
Join and de-normalization — Once a database has been sharded across multiple servers, it is hard to perform JOIN operations across database shards. A common workaround is to de-normalize the database so that queries can be performed in a single table.
The Full Architecture: Zero to Millions of Users
Summary: How We Scale to Millions of Users
Scaling a system is an iterative process. More fine-tuning and new strategies are needed to scale beyond millions of users. For example, you might need to optimize your system and decouple it into even smaller services. The techniques covered in this post should provide a good foundation for tackling new challenges.
Here is a summary of how we scale our system to support millions of users:
| Principle | What It Means |
|---|---|
| Keep web tier stateless | Store sessions in shared storage (Redis/SQL/NoSQL), not in server memory |
| Build redundancy at every tier | No single point of failure — load balancers, DB replicas, multiple cache nodes |
| Cache data as much as you can | CDN for static assets, Redis for dynamic data, reduce DB load |
| Support multiple data centers | Geographic redundancy protects against regional outages |
| Host static assets in CDN | Images, CSS, JS from edge nodes — not your origin server |
| Scale your data tier by sharding | Partition by a key that distributes load evenly across shards |
| Split tiers into individual services | Decouple components so they scale independently |
| Monitor your system and use automation tools | Logging, metrics, CI/CD — you cannot improve what you cannot measure |
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
What’s Next
In the next post, we cover Back-of-the-Envelope Estimation — how to quickly calculate QPS, storage requirements, and bandwidth before designing a system. This skill separates engineers who guess from those who reason from numbers.