From Single Server to Multi-Server: A Full-Stack Developer's Production Architecture Upgrade

Recording the Process of Upgrading from a Single Server to a Multi-Server Architecture in the Company

I. Preface

I am currently responsible for full-stack development of an online education system at the company. The tech stack is Node.js + Nest.js + MySQL + Redis. Previously, everything ran on a single 4-core 16GB server. In the early stages, the user volume was small, and there were no major issues. This year, user growth has been relatively fast, with daily active users increasing from a few hundred to five or six thousand. The single machine is starting to struggle. API response times have drifted from around 200ms to over 2 seconds. Logs occasionally show memory warnings, WebSocket connections frequently drop, and the user experience during classes is poor, with continuous feedback about system problems.

After some research, we decided to overhaul the architecture. The core idea is to separate compute from storage: put a load balancer in front of the application servers and move the databases to managed services, so that nothing is crammed onto a single machine anymore. The goals are that the service stays up if any single server goes down, that concurrency capacity improves, and that the database is no longer a single point of failure.

The architecture runs in Volcengine's Guangzhou region, with 2 cloud servers as compute nodes, one in Availability Zone A and one in Availability Zone B. A load balancer exposes a single public IP, which the domain's DNS record points to.

MySQL and Redis are moved off the servers onto managed instances. This improves data safety and performance, and makes backups and rollbacks easier.

To keep latency between the servers and the databases low, they are all deployed in the same region and VPC and connected over internal network IPs. We also plan to use Volcengine's Full Site Acceleration (DCDN) to speed up API access from different regions. The overall architecture is as follows:

[Figure: overall architecture diagram]

II. Step 1: Create Cloud Server Instances (2 instances)

2.1 Console Entry

Log in to the Volcengine console, find Cloud Server ECS under the Compute menu, and click Create Instance.

2.2 First Instance Configuration

| Configuration Item | Value | Description |
| --- | --- | --- |
| Billing Method | Pay-as-you-go (testing) / Subscription (production) | Cost estimation based on subscription |
| Region | South China 1 (Guangzhou) | Guangzhou region supports MySQL and Redis services |
| Availability Zone | Availability Zone A | Different from the second instance |
| Specification | ecs.g3ie.large | 4 cores 16GB, adjusted down from ecs.g3ie.xlarge based on requirements |
| Image | Ubuntu 22.04 LTS | Consistent with the original environment |
| System Disk | 40GB SSD | Kept unchanged |
| Network | Select default VPC and subnet | Ensure network connectivity |
| Public IP | Allocate | Convenient for temporary debugging and direct server connection for troubleshooting |
| Security Group | Not specified yet, to be created later | Configure firewall rules separately |

The first one was named kcl-server-1, selected South China 1 (Guangzhou) Availability Zone A. The specification chosen was ecs.g3ie.large, 4 cores 16GB. This configuration was determined to be suitable after stress testing; smaller would easily max out the CPU, larger would be a waste of money. The image used was Ubuntu 22.04 LTS, with a 40GB SSD system disk. The network uses the default VPC. The security group is not bound yet; it will be configured uniformly later.

2.3 Second Instance Configuration

| Configuration Item | Value |
| --- | --- |
| Name | kcl-server-2 |
| Region | South China 1 (Guangzhou) |
| Availability Zone | Availability Zone B |
| Specification | ecs.g3ie.large (4 cores 16GB) |
| Other | Consistent with the first instance |

The configuration for the second instance, kcl-server-2, is identical. The only difference is that the Availability Zone is set to B. This way, even if one availability zone in Guangzhou has issues, the other can take over. After both instances were created, we confirmed they were in the same VPC and could communicate via the internal network. Initially, we selected different subnets, and the internal network wasn't working, which took half a day to troubleshoot. This time, we learned our lesson.

III. Step 3: Create Cloud Database MySQL Edition (Including Read/Write Splitting)

The database is separated out and placed on the cloud, not crammed together with the servers. Previously, on a single machine, MySQL and Node.js competed for CPU and memory, affecting both sides.

3.1 Primary Node Instance Configuration

| Configuration Item | Value | Description |
| --- | --- | --- |
| Billing Method | Pay-as-you-go (testing) / Subscription (production) | |
| Region | South China 1 (Guangzhou) | Same region as the cloud servers |
| Availability Zone | Availability Zone B | |
| Database Version | MySQL 8.0 | |
| Series | High Availability Edition | Primary-standby architecture, automatic failover |
| Specification | 2 cores 8GB | Aligned with server specifications, cost control |
| Storage Space | 30GB | Adjusted based on data volume |
| Network | Select the same VPC as the cloud servers | |

The primary instance was selected in Guangzhou Availability Zone B, MySQL 8.0, High Availability Edition, with automatic primary-standby failover. Specification is 2 cores 8GB, storage 30GB, network in the same VPC as the servers. Internal network latency measured around 0.2ms, which is negligible.

3.2 Add Read Replica Node (Implement Read/Write Splitting)

After the primary instance is created, extend read capacity by adding a read replica node.

| Configuration Item | Value | Description |
| --- | --- | --- |
| Node Type | Read Replica Node | |
| Availability Zone | Availability Zone A | Different from the primary node for improved disaster recovery |
| Specification | 2 cores 8GB | Consistent with the primary node specification |
| Quantity | 1 (can be scaled as needed) | Supports up to 10 read replica nodes |

After creating the primary database, we added one read replica node, also with 2 cores 8GB, placed in Availability Zone A. This is mainly for scenarios like the AI teaching assistant, which has more reads than writes. During stress testing, the primary database's CPU spiked to over 80% under concurrent load, with read and write requests competing. It improved significantly after adding the read replica node.

⚠️ Note: Adding a read replica node will cause a brief connection interruption. It is recommended to perform this operation during off-peak hours.

3.3 Enable Database Proxy (Automatic Read/Write Splitting)

The database proxy is the core component for read/write splitting. It provides a unified connection endpoint and automatically routes read/write requests to the appropriate nodes.

| Configuration Item | Value | Description |
| --- | --- | --- |
| Number of Proxy Nodes | 2 (recommended) | Ensure high availability of the proxy service itself |
| Read/Write Splitting | Enabled | Automatically forwards read requests to read replica nodes |
| Transaction Splitting | Enabled (optional) | Separates read requests within transactions for further optimization |
| Connection Pool | Enable as needed | Reduce the overhead of frequent connection establishment |

After enabling the database proxy, the application only needs to connect to the unified address provided by the proxy. The proxy automatically routes SELECT statements to the read replica node and INSERT/UPDATE/DELETE statements to the primary node. There is no need to distinguish between the primary and read-only databases in the code, saving a lot of effort. The proxy was configured with 2 nodes to prevent the proxy itself from becoming a single point of failure.

Database Proxy Address: After creation, the system generates a proxy connection address (proxy-xxx.volcengine.com). The application accesses the database through this address without needing to know the specific addresses of the primary and read-only databases.
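To make the routing rule above concrete, here is a minimal sketch of how a read/write-splitting proxy might classify statements. It is an illustration only, not Volcengine's actual proxy logic: the function name routeStatement and the rule of pinning everything inside a transaction to the primary are assumptions, and the sketch ignores the optional transaction splitting feature, hints, and replica lag.

```javascript
/**
 * Decide which node a single SQL statement should be sent to.
 * Hypothetical helper for illustration; a real proxy tracks far more state.
 *
 * @param {string} sql - one SQL statement
 * @param {boolean} inTransaction - whether the session has an open transaction
 * @returns {'primary' | 'replica'}
 */
function routeStatement(sql, inTransaction = false) {
    const text = sql.trim().toUpperCase();
    const firstWord = text.split(/\s+/)[0];

    // Assumption: statements inside a transaction stay on the primary.
    if (inTransaction) return 'primary';

    // Locking reads must see and lock the latest rows, so they go to the primary.
    if (firstWord === 'SELECT' && /\bFOR\s+UPDATE\b/.test(text)) return 'primary';

    // Plain reads can be served by a read replica.
    if (firstWord === 'SELECT') return 'replica';

    // INSERT/UPDATE/DELETE, DDL, and anything unrecognized default to the primary.
    return 'primary';
}
```

Defaulting unknown statements to the primary is the conservative choice: a misrouted read only costs some primary capacity, while a misrouted write would fail on a read-only node.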

3.4 Connection Information (Obtained after creation)

| Configuration Item | Value | Purpose |
| --- | --- | --- |
| Primary Node Internal Address | 172.xx.xx.xx:3306 (auto-assigned) | For management/maintenance (generally not directly connected) |
| Proxy Connection Address | proxy-xxx.volcengine.com:3306 | Application connection address (recommended) |
| Database Name | kcl_database (manually created) | |
| Username | kcl_app_user (manually created) | |
| Read Replica Node Internal Address | 172.xx.xx.xx:3306 (auto-assigned) | For management/maintenance (generally not directly connected) |

Best Practice: The application should always use the proxy connection address for connections. This allows the proxy component to automatically handle read/write splitting, eliminating the need to distinguish between the primary and read-only databases in the code.

IV. Step 4: Create Cache Database Redis Edition

4.1 Instance Configuration

| Configuration Item | Value | Description |
| --- | --- | --- |
| Region | South China 1 (Guangzhou) | Same region as the cloud servers |
| Availability Zone | Availability Zone A | Different availability zone from MySQL |
| Instance Type | Primary-Standby Instance | High availability guarantee |
| Version | Redis 7.0 | |
| Specification | 2GB | Consistent with the original environment |
| Network | Select the same VPC as the cloud servers | |

Redis uses a primary-standby instance, 2GB specification. The specification can be adjusted later based on project usage. Version 7.0, placed in Availability Zone A.

The primary-standby setup is for high availability; if the primary goes down, the standby automatically takes over.

4.2 Connection Information (Obtained after creation)

| Configuration Item | Value |
| --- | --- |
| Internal Address | 172.xx.xx.xx (auto-assigned) |
| Port | 6379 |
| Connection Password | Must be set during creation |

V. Step 5: Create Load Balancer

| Configuration Item | Value |
| --- | --- |
| Name | kcl-clb |
| Region | South China 1 (Guangzhou) |
| Network Type | Public Network |
| Specification | Small I |
| Public IP | Automatically assigned by the system |
| Backend Protocol/Port | HTTP:8050 |
| Listener Protocol/Port | HTTP:80 |
| Idle Timeout | 1800 seconds |
| Health Check Path | /health |

Log in to the Volcengine console, find Load Balancer under the Network & CDN menu, and click Create Instance. Select the region as South China 1 (Guangzhou), network type as Public Network. The system will automatically assign a public IP, and all user traffic will come through this IP. The specification selected is Small I, which is sufficient for the current scale.

5.1 Listener Configuration

Listener configuration: The listening port is 80. When users access the load balancer's public IP or domain name, it goes through HTTP port 80. The backend port is 8050. After receiving the request, the load balancer forwards it to Nginx's port 80. Nginx then forwards the /api interface to the server's port 8050. This will be covered in the Nginx configuration later.

5.2 Health Check

The health check path was set to /health. This interface is implemented in the Node.js application; returning a 200 status indicates the service is healthy. The load balancer periodically sends requests to this address for health checks. If it fails to receive a normal response for several consecutive attempts, it will remove that server from the backend group, routing all traffic to the other healthy machine. Once the server recovers, it will be automatically added back.
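A minimal sketch of such a /health endpoint follows. It uses Node's built-in http module instead of Nest.js so that it stays self-contained; the function names handleRequest and createServer and the JSON response body are assumptions for illustration, since the only requirement stated here is that the endpoint returns 200 when the service is healthy.

```javascript
const http = require('node:http');

/**
 * Build the response for an incoming request.
 * Kept as a pure function so it can be tested without opening a socket.
 *
 * @param {string} method - HTTP method, e.g. 'GET'
 * @param {string} url - request URL, possibly with a query string
 * @returns {{ status: number, body: string }}
 */
function handleRequest(method, url) {
    const path = url.split('?')[0];
    // Accept GET and HEAD, since health probes may use either method.
    if ((method === 'GET' || method === 'HEAD') && path === '/health') {
        return { status: 200, body: JSON.stringify({ status: 'ok' }) };
    }
    return { status: 404, body: JSON.stringify({ error: 'not found' }) };
}

/** Wrap handleRequest in an HTTP server; the caller decides when to listen. */
function createServer() {
    return http.createServer((req, res) => {
        const { status, body } = handleRequest(req.method, req.url);
        res.writeHead(status, { 'Content-Type': 'application/json' });
        res.end(body);
    });
}

// Usage (not executed here): createServer().listen(8050);
```

In a real application the handler would usually also verify that MySQL and Redis are reachable before answering 200, so that a server with a broken database connection is taken out of rotation.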

5.3 Configure Backend Server Group

After creating the listener, you need to configure the backend server group. In the load balancer console, find the instance you just created, click into it, and there is a "Backend Server Group" menu. Click "Create Backend Server Group". Select the protocol as HTTP, port as 8050, and the scheduling algorithm as Weighted Round Robin (default weights are 10, traffic is evenly distributed between the two machines).

Then, add the two cloud servers created earlier. Specifically, click "Add Backend Server", check kcl-server-1 and kcl-server-2 in the popup, uniformly set the port to 8050, and keep the default weight of 10. After saving, the load balancer will distribute requests to the two servers in a 1:1 ratio. If one machine has higher specifications later, its weight can be increased to handle more traffic.
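To illustrate how weights translate into a traffic ratio, here is a sketch of one common weighted round-robin variant (the "smooth" variant popularized by Nginx). Whether Volcengine's load balancer uses this exact algorithm is not stated here, so treat it as an illustration of the idea; createWeightedRoundRobin is a hypothetical name.

```javascript
/**
 * Create a picker that returns backend names in proportion to their weights.
 * Implements smooth weighted round-robin: on each pick, every backend's
 * running score grows by its weight, the highest score wins, and the winner's
 * score is reduced by the total weight.
 *
 * @param {{ name: string, weight: number }[]} servers
 * @returns {() => string} function returning the next backend name
 */
function createWeightedRoundRobin(servers) {
    const state = servers.map((s) => ({ ...s, current: 0 }));
    const totalWeight = state.reduce((sum, s) => sum + s.weight, 0);

    return function next() {
        let best = null;
        for (const s of state) {
            s.current += s.weight;
            // Strict comparison: on a tie, the earlier backend wins.
            if (best === null || s.current > best.current) best = s;
        }
        best.current -= totalWeight;
        return best.name;
    };
}

// Equal weights of 10 alternate between the two servers (1:1).
const pick = createWeightedRoundRobin([
    { name: 'kcl-server-1', weight: 10 },
    { name: 'kcl-server-2', weight: 10 },
]);
const firstFour = [pick(), pick(), pick(), pick()];
```

With weights of 20 and 10 instead, the first server would receive two out of every three requests.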

VI. Step 6: Security Group Rules

The security group was configured separately.

| Rule Name | Policy | Protocol | Port | Source IP | Purpose |
| --- | --- | --- | --- | --- | --- |
| allow-http | Allow | TCP | 80 | 0.0.0.0/0 | HTTP traffic |
| allow-https | Allow | TCP | 443 | 0.0.0.0/0 | HTTPS (to be configured later) |
| allow-ssh | Allow | TCP | 22 | Your office network IP | SSH management |

HTTP port 80 is open to 0.0.0.0/0. HTTPS port 443 is reserved for later certificate configuration. SSH port 22 only allows access from the office's public IP to avoid exposure to the public internet and scanning.
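The effect of these rules can be modeled in a few lines. The sketch below assumes a default-deny policy for anything not explicitly allowed and uses 203.0.113.10 (an address from the documentation range) as a stand-in for the office IP; isAllowed, inCidr, and ipToInt are hypothetical helpers, not part of any Volcengine SDK.

```javascript
/** Convert a dotted IPv4 address to an unsigned 32-bit integer. */
function ipToInt(ip) {
    return ip.split('.').reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

/** Check whether an IPv4 address falls inside a CIDR block such as 10.0.0.0/8. */
function inCidr(ip, cidr) {
    const [base, bitsText] = cidr.split('/');
    const bits = Number(bitsText);
    // A /0 prefix matches every address; handled separately because a
    // 32-bit shift by 32 does not produce a zero mask in JavaScript.
    if (bits === 0) return true;
    const mask = (0xffffffff << (32 - bits)) >>> 0;
    return ((ipToInt(ip) & mask) >>> 0) === ((ipToInt(base) & mask) >>> 0);
}

// Inbound rules mirroring the table above; the SSH source is a placeholder.
const rules = [
    { name: 'allow-http', protocol: 'tcp', port: 80, source: '0.0.0.0/0' },
    { name: 'allow-https', protocol: 'tcp', port: 443, source: '0.0.0.0/0' },
    { name: 'allow-ssh', protocol: 'tcp', port: 22, source: '203.0.113.10/32' },
];

/** Return true if any rule allows the connection; everything else is denied. */
function isAllowed(ruleList, { protocol, port, sourceIp }) {
    return ruleList.some(
        (rule) => rule.protocol === protocol && rule.port === port && inCidr(sourceIp, rule.source),
    );
}
```

For example, isAllowed(rules, { protocol: 'tcp', port: 22, sourceIp: '8.8.8.8' }) is false, while the same request from 203.0.113.10 is allowed.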

VII. Step 7: Application Configuration on the Servers

Log in to both servers separately and deploy the project code. After pulling the code from the repository, the main task is to change the database and Redis connection addresses. Previously, public IPs or localhost were used; now they all need to be changed to internal network addresses.

7.1 Environment Variable Configuration

The project uses a config.js or .env file. Change the database and Redis addresses to the internal network addresses provided by Volcengine:

const config = {
    mysql: {
        host: '172.xx.xx.xx',      // MySQL proxy internal address
        port: 3306,
        username: 'kcl_app_user',  // Application account created in section 3.4
        password: 'Your database password',
        database: 'kcl_database',
    },
    redis: {
        host: '172.xx.xx.xx',      // Redis instance internal address
        port: 6379,
        password: 'Your Redis password',
        db: 8,
    },
    // Other configuration items remain unchanged
}

After starting the project, confirm that the service is running normally on port 8050 so that the load balancer's health check can pass. Perform the same operation on both servers.
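Since both servers need identical settings, one option is to read the connection details from environment variables and fail fast when something is missing. The sketch below is one possible shape for that, not the project's actual config module: the variable names (MYSQL_HOST, REDIS_HOST, and so on) and the helpers requireEnv and loadConfig are assumptions.

```javascript
/**
 * Read a required variable, throwing a descriptive error if it is absent.
 * @param {Record<string, string | undefined>} env
 * @param {string} name
 */
function requireEnv(env, name) {
    const value = env[name];
    if (value === undefined || value === '') {
        throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
}

/**
 * Build the same config shape as above from environment variables.
 * Ports and the Redis db index fall back to the values used in this article.
 */
function loadConfig(env = process.env) {
    return {
        mysql: {
            host: requireEnv(env, 'MYSQL_HOST'),       // database proxy address
            port: Number(env.MYSQL_PORT || 3306),
            username: requireEnv(env, 'MYSQL_USER'),
            password: requireEnv(env, 'MYSQL_PASSWORD'),
            database: requireEnv(env, 'MYSQL_DATABASE'),
        },
        redis: {
            host: requireEnv(env, 'REDIS_HOST'),       // Redis internal address
            port: Number(env.REDIS_PORT || 6379),
            password: requireEnv(env, 'REDIS_PASSWORD'),
            db: Number(env.REDIS_DB || 8),
        },
    };
}
```

Keeping passwords out of the repository this way also means the same code can be deployed to both servers without edits.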

VIII. Step 8: Nginx Configuration (Web Frontend)

8.1 Install Nginx

sudo apt-get update
sudo apt-get install -y nginx

8.2 Configuration File Location

The configuration file is placed at /etc/nginx/config/kcl_web.conf

8.3 Complete Nginx Configuration

The purpose is to host the frontend web code while also forwarding /api to the server's interface port. Note that an SSL certificate needs to be applied for in advance. I used certbot.

server {
    listen 80;
    listen 443 ssl http2;
    server_name xxx.com;
    
    # SSL certificate configuration
    ssl_certificate /etc/letsencrypt/live/xxx.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/xxx.com/privkey.pem;
    ssl_session_timeout 5m;
    ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE:ECDH:AES:HIGH:!NULL:!aNULL:!MD5:!ADH:!RC4;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_prefer_server_ciphers on;
    
    location / {
        root /mnt/new_disk/project/prod/kcl-app-web/dist; 
        index index.html;
        try_files $uri $uri/ /index.html;
        
        # Never cache HTML entry points, so new releases are picked up immediately
        if ($request_filename ~* ^.*?\.(html|htm)$) {
            add_header Cache-Control "private, no-store, no-cache, must-revalidate, proxy-revalidate";
        }
        
        # Cache static assets for a long time
        if ($request_filename ~* ^.*?\.(js|css|jpg|jpeg|png|gif|ico|txt|svg|woff|woff2|ttf|eot|otf)$) {
            expires max;
            add_header Cache-Control "public, max-age=31536000, immutable";
        }
    }
    
    # API proxy
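    location /api {
        proxy_pass http://127.0.0.1:8050;
        # Pass the original host and client IP through to the Node.js application
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        
        # Answer preflight (OPTIONS) requests directly
        if ($request_method = OPTIONS) {
            return 204;
        }
    }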
    
    # WebSocket forwarding
    location /socket.io/ {
        proxy_pass http://127.0.0.1:8050/socket.io/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
    }
}

8.4 Enable Configuration

# Check configuration syntax for correctness
sudo nginx -t

# Reload Nginx configuration
sudo nginx -s reload

IX. Step 9: Domain and DNS Configuration

9.1 DNS Records

Add an A record in the domain registrar's console, pointing the domain to the load balancer's public IP.

| Configuration Item | Value |
| --- | --- |
| Domain | xxx.com |
| Record Type | A Record |
| Resolved Value | 34.128.xxx.xxx |
| TTL | 600 seconds (recommended) |

9.2 Verification

After DNS resolution takes effect, access http://xxx.com in a browser to check if the page loads correctly, if the API interfaces return data normally, and if the WebSocket connection can be established. If there are any issues, first check if the services on the servers are running normally, if the ports are correct, and if the load balancer's health check is passing.

X. Future Plans

The overall setup is not too complicated. The Volcengine console operation logic is fairly clear, and the documentation is mostly accurate. After setting up the basic architecture, there are several things that need to be done sequentially, listed by priority:

10.1 Docker Unified Deployment

Currently, the two servers are deployed manually. This will be cumbersome for future maintenance. Code updates require logging into both machines to pull the code and restart, which is error-prone. We plan to introduce Docker for unified deployment:

  1. Build Project Image: Write a Dockerfile that packages the Node.js project and its dependencies into an image, and push it to Volcengine's Container Registry (CR). Tag each image, e.g., kcl-app:v1.0.0, and create a new tag for every release.

  2. Containerized Deployment: Install Docker on both servers. Write a docker-compose.yml file defining the Node.js application and Nginx as services. Supply configuration and environment variables through mounted config files and env files rather than hardcoding them in the image.

  3. Unified Container Management: For future code updates, build a new image, run docker pull on both servers to fetch it, and run docker-compose up -d to restart the containers. Operations on both machines can be batched with a simple script, eliminating the need for manual git pull and pm2 restart.

  4. Image Version Management: Keep the last 5 versions of images in the CR repository for quick rollback to a previous stable version if issues arise.

  5. Consider Kubernetes Later: If the number of nodes continues to increase, manually managing containers becomes cumbersome. We can look into Volcengine's VKE (Container Service) to incorporate the servers into a Kubernetes cluster for unified scheduling, with rolling updates and auto-scaling handled by K8s.
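The "keep the last 5 versions" rule from item 4 can be sketched as a small helper that decides which tags are old enough to delete. It assumes tags follow the vMAJOR.MINOR.PATCH pattern used in the kcl-app:v1.0.0 example; selectTagsToPrune and the other function names are hypothetical, and actually deleting images from the registry is left out.

```javascript
/** Parse a tag like "v1.10.0" into [1, 10, 0]; throws on any other format. */
function parseVersion(tag) {
    const match = /^v(\d+)\.(\d+)\.(\d+)$/.exec(tag);
    if (!match) throw new Error(`Unexpected tag format: ${tag}`);
    return match.slice(1).map(Number);
}

/** Numeric comparison, so that v1.10.0 sorts after v1.2.0. */
function compareVersions(a, b) {
    const va = parseVersion(a);
    const vb = parseVersion(b);
    for (let i = 0; i < 3; i++) {
        if (va[i] !== vb[i]) return va[i] - vb[i];
    }
    return 0;
}

/**
 * Return the tags that fall outside the newest `keep` versions.
 * @param {string[]} tags - all tags currently in the repository
 * @param {number} keep - how many of the newest versions to retain
 * @returns {string[]} tags that can be pruned, newest first
 */
function selectTagsToPrune(tags, keep = 5) {
    const newestFirst = [...tags].sort((a, b) => compareVersions(b, a));
    return newestFirst.slice(keep);
}
```

Comparing the numeric parts rather than the raw strings matters once a minor or patch number reaches two digits, since a plain string sort would place v1.10.0 before v1.2.0.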

10.2 Database Related

  1. MySQL Automatic Backup: Configure an automatic backup strategy in the MySQL console, set to back up once daily at 2 AM, retaining backups for 7 days. Additionally, perform a manual full backup just in case.

  2. MySQL Slow Query Monitoring: Enable the slow query log and set the threshold to 2 seconds. Review the list of slow queries weekly and optimize time-consuming SQL statements.

  3. Database Proxy Monitoring: Monitor the database proxy's connection count and request latency. If the proxy node's CPU is too high, consider increasing the number of proxy nodes.

10.3 HTTPS and Security

  1. HTTPS Certificate Configuration: The SSL certificate application is in progress. Once issued, add an HTTPS listener (port 443) to the load balancer and configure the certificate. Configure HTTP port 80 to force redirect to HTTPS.

  2. Tighten Security Group Policies: The current security group rules are relatively broad. Refine them based on the principle of least privilege. The security groups for the database and Redis should only allow internal IP access from the two cloud servers, not direct public network access. The load balancer's security group should only open ports 80 and 443, blocking all other ports.

  3. Server System Updates: Schedule monthly system patch updates, executing apt update && apt upgrade -y. Verify updates in a test environment first.

  4. Log Collection Configuration: Configure the cloud log service to collect Nginx access logs and Node.js application logs into a log center for easy keyword search and troubleshooting.

10.4 High Availability Verification

  1. Multi-Availability Zone Disaster Recovery Verification: During a low-traffic period, manually shut down the server in Availability Zone A and observe whether the load balancer automatically switches all traffic to the machine in Availability Zone B, verifying cross-availability zone disaster recovery.

  2. MySQL Primary-Standby Switchover Drill: Schedule a maintenance window to manually trigger a MySQL primary-standby switchover in the console. Measure the switchover time and check whether the application reconnects automatically, to gain confidence in the failover path.

  3. Server Image Backup: Create image snapshots of the system disks for both servers so they can be quickly restored if a machine fails.

10.5 Monitoring and Alerting

  1. Cloud Monitoring Alert Configuration: Configure alerts in Cloud Monitoring for the following metrics, sending SMS notifications when thresholds are exceeded:

    • MySQL CPU usage > 80%
    • MySQL connection count > 1000
    • MySQL disk usage > 85%
    • MySQL slow queries count > 10/minute
    • Server CPU usage > 85%
    • Server memory usage > 90%
    • Server disk usage > 85%
    • Number of abnormal backend servers in load balancer > 0

  2. Container Resource Monitoring: If Docker is adopted, configure container-level monitoring, paying attention to the CPU and memory usage of each container to prevent one container from consuming all resources on the machine.

  3. Cloud Resource Cost Alert: Set a budget alert in the Cost Center. Send a notification when the monthly cost exceeds 80% of the budget to avoid cost overruns due to resource over-provisioning.
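The alert thresholds listed in item 1 amount to a simple rule table. The sketch below shows how such rules could be evaluated against a metrics snapshot; the metric keys (mysql.cpuPercent and so on) and the evaluateAlerts function are made up for illustration and do not correspond to Cloud Monitoring's real metric names or API.

```javascript
// Thresholds copied from the list above; an alert fires when value > limit.
const THRESHOLDS = [
    { metric: 'mysql.cpuPercent', limit: 80 },
    { metric: 'mysql.connections', limit: 1000 },
    { metric: 'mysql.diskPercent', limit: 85 },
    { metric: 'mysql.slowQueriesPerMinute', limit: 10 },
    { metric: 'server.cpuPercent', limit: 85 },
    { metric: 'server.memoryPercent', limit: 90 },
    { metric: 'server.diskPercent', limit: 85 },
    { metric: 'clb.unhealthyBackends', limit: 0 },
];

/**
 * Return one entry per metric that exceeds its threshold.
 * Metrics missing from the snapshot are skipped rather than treated as zero.
 *
 * @param {Record<string, number>} metrics - current values keyed by metric name
 * @param {{ metric: string, limit: number }[]} thresholds
 * @returns {{ metric: string, value: number, limit: number }[]}
 */
function evaluateAlerts(metrics, thresholds = THRESHOLDS) {
    return thresholds
        .filter(({ metric, limit }) => metric in metrics && metrics[metric] > limit)
        .map(({ metric, limit }) => ({ metric, value: metrics[metric], limit }));
}
```

Note that the comparison is strict: server memory at exactly 90% does not fire, matching the "> 90%" wording of the rule.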

10.6 Performance and Capacity

  1. Performance Stress Testing: Use wrk or JMeter to stress test the actual QPS limit of the production environment, identify the bottleneck, and prepare for future scaling.

  2. Capacity Evaluation: Based on stress test results and business growth trends, evaluate resource requirements for the next six months and plan capacity expansion in advance. If MySQL consistently maxes out the 2-core 8GB specification, consider upgrading to 4 cores 16GB or adding more read replica nodes.

  3. Redis Cache Strategy Optimization: Check Redis memory usage and eviction policy. Ensure key expiration times are set reasonably to avoid memory exhaustion. The current configuration is 2GB. If memory usage stays consistently high, consider upgrading to 4GB.
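When reviewing the Redis cache strategy, the hit rate can be derived from the keyspace_hits and keyspace_misses counters that Redis reports in the output of INFO stats. The sketch below parses that text and computes the ratio; parseInfo and cacheHitRate are hypothetical helper names, and fetching the INFO output from the server is left to whichever Redis client the project uses.

```javascript
/**
 * Parse the "key:value" lines of a Redis INFO response into an object.
 * Section headers (lines starting with '#') and blank lines are skipped.
 */
function parseInfo(infoText) {
    const stats = {};
    for (const line of infoText.split(/\r?\n/)) {
        if (line === '' || line.startsWith('#')) continue;
        const idx = line.indexOf(':');
        if (idx === -1) continue;
        stats[line.slice(0, idx)] = line.slice(idx + 1);
    }
    return stats;
}

/**
 * Compute hits / (hits + misses) from INFO output.
 * @param {string} infoText - raw text returned by INFO stats
 * @returns {number | null} ratio between 0 and 1, or null if there were no lookups
 */
function cacheHitRate(infoText) {
    const stats = parseInfo(infoText);
    const hits = Number(stats.keyspace_hits || 0);
    const misses = Number(stats.keyspace_misses || 0);
    const total = hits + misses;
    return total === 0 ? null : hits / total;
}
```

Both counters are cumulative since the server started (or since the stats were last reset), so comparing two samples taken some time apart gives a better picture of the current hit rate than a single reading.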

10.7 Volcengine Full Site Acceleration DCDN Configuration

Full Site Acceleration (Dynamic Content Delivery Network, DCDN) is a network acceleration service launched by Volcengine. It can be thought of as an upgraded version of a traditional CDN. Enabling it can effectively optimize API access speeds.

The core difference is:

Traditional CDN is only good at accelerating static resources like images, videos, and web files. It can cache these on edge nodes and return them directly. However, for dynamic requests like user login, order queries, or form submissions, the CDN cannot cache them and must pass them through to the origin server. This can be slow, especially across countries or networks.

DCDN combines these two tasks: