Hi everyone!

Have you ever wondered what happens when your primary server goes down? Your users get errors, your business loses money, and everyone panics. That’s why we need Disaster Recovery (DR).

In this blog, I will show you how to build a Cold Standby DR architecture on Google Cloud Platform. We’ll create a system that can switch to a backup region when things go wrong. The best part? It’s all automated with Terraform!

What is Cold Standby?

Before we start, let me explain what Cold Standby means.

Think about a highway. Your main road (primary) handles all the traffic every day. But there’s also an alternative road (standby). It is fully built, but the barriers are closed, so no cars use it.

One day, the main highway has a big accident. Traffic is stuck. What do you do? You open the barriers on the alternative road and redirect all cars there.

That’s Cold Standby. The backup road exists, but nobody drives on it. When disaster happens, we open the barriers and switch traffic to it.

Architecture Diagram

What We’re Building

Primary region in us-central1 with 2 running instances
Standby region in us-east1 with 0 instances (cold)
Global Load Balancer to route traffic
Automatic snapshots every hour
Failover scripts to switch regions
Email alerts when something goes wrong
Monitoring dashboard to see everything
Uptime checks that run every minute

Prerequisites

Before starting, make sure you have:

Google Cloud account with billing enabled
gcloud CLI installed on your computer
Terraform version 1.5.0 or higher. Install it from terraform.io.
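
If you want to confirm the Terraform requirement before going further, a small check like the one below can help. This is a minimal sketch and not part of the repository: version_ge is a helper name I made up, and it assumes your sort supports -V (GNU coreutils and recent BSD/macOS versions do).

```shell
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Relies on `sort -V` for version-aware ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

REQUIRED="1.5.0"

if command -v terraform >/dev/null 2>&1; then
  # The first line of `terraform version` looks like "Terraform v1.7.4".
  INSTALLED=$(terraform version | head -n 1 | sed 's/^Terraform v//')
  if version_ge "$INSTALLED" "$REQUIRED"; then
    echo "Terraform $INSTALLED is new enough"
  else
    echo "Terraform $INSTALLED is older than $REQUIRED, please upgrade"
  fi
else
  echo "terraform not found on PATH"
fi
```

The comparison is done on the whole version string, so 1.10.x is correctly treated as newer than 1.5.0.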

Step 1: Clone the Repository

First, let’s get the code. Open your terminal (or Cloud Shell) and run:

git clone https://github.com/misskecupbung/gcp-dr-cold-standby.git
cd gcp-dr-cold-standby

# Take a look at the folder structure
ls

You’ll see:

terraform/: all infrastructure code
scripts/: deployment and failover scripts
app/: simple demo application

Step 2: Run the Setup Script

The setup script does a lot of work for you. It detects your GCP project and your email, then creates the configuration file.

./scripts/setup.sh

What happens behind the scenes?

Gets your current GCP project ID
Gets your email from gcloud config
Creates terraform.tfvars with your settings
Shows you the configuration

Feel free to modify the terraform.tfvars file. You can change the regions, instance sizes, or add your own domain name. Open it with any editor and adjust what you need.
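
To give a rough picture of what the setup step produces, here is a minimal sketch that builds a terraform.tfvars from your gcloud configuration. I have not copied this from setup.sh: render_tfvars is a made-up helper, and the variable names inside the file (project_id, alert_email, primary_region, standby_region) are assumptions, so check the generated file in your checkout for the real ones.

```shell
# Hypothetical sketch of the setup step. The variable names written to the
# file are assumptions, not the repository's actual names.
render_tfvars() {
  # $1 = project ID, $2 = alert email
  cat <<EOF
project_id     = "$1"
alert_email    = "$2"
primary_region = "us-central1"
standby_region = "us-east1"
EOF
}

# Typical usage (needs an authenticated gcloud):
#   PROJECT=$(gcloud config get-value project 2>/dev/null)
#   EMAIL=$(gcloud config get-value account 2>/dev/null)
#   render_tfvars "$PROJECT" "$EMAIL" > terraform/terraform.tfvars

# Preview with placeholder values:
render_tfvars "my-project" "me@example.com"
```

Keeping the rendering in a function makes it easy to preview the file before writing it anywhere.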

Step 3: Deploy the Infrastructure

Time to create everything in Google Cloud Platform!

./scripts/deploy.sh

When you see “Do you want to apply these changes?”, type yes and press Enter.

This takes about 5–10 minutes.

What gets created

VPC network with 2 subnets (one per region)
Firewall rules for HTTP and health checks
Instance templates with startup scripts
Managed Instance Groups (MIG) in both regions
Global HTTP Load Balancer
Snapshot schedules for backups
Monitoring dashboard with 4 charts
Three alert policies (service down, region unhealthy, high latency)
Email notification channel for alerts
Uptime checks every minute

When it finishes, you’ll see the Load Balancer IP:

load_balancer_ip="35.227.222.112"

Save this IP. We’ll need it for testing.
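
If you would rather keep the IP in a shell variable than copy it by hand, something like the following works. It is a small sketch: parse_lb_ip is a name I invented, and the commented terraform output alternative assumes the Terraform output is really called load_balancer_ip, as the printed line suggests.

```shell
# Pull the address out of a line such as: load_balancer_ip="35.227.222.112"
parse_lb_ip() {
  sed -n 's/^load_balancer_ip *= *"\{0,1\}\([0-9.]*\)"\{0,1\} *$/\1/p'
}

# Example with the line printed by the deploy step:
LB_IP=$(echo 'load_balancer_ip="35.227.222.112"' | parse_lb_ip)
echo "Load balancer: http://$LB_IP/"

# Alternative, assuming the output name matches the printed line:
#   LB_IP=$(terraform -chdir=terraform output -raw load_balancer_ip)
```

The later curl commands can then use http://$LB_IP/health instead of a hard-coded address.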

Step 4: Verify the Deployment

Let’s make sure everything works. Run the verification script:

./scripts/verify-deployment.sh

Or test manually with curl:

curl http://35.227.222.112/health

Or open the same URL in your browser.

You should see:

{"hostname":"dr-primary-vm-mmsx","region":"us-central1","status":"healthy","timestamp":"2026-02-28T12:54:13.354665"}

Notice the region shows us-central1. That’s our primary region serving traffic.
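
Since the region field is what tells us which side is serving traffic, it is handy to check it in a script instead of by eye. Here is a minimal sketch under a few assumptions: region_of and assert_region are names I made up, and the sed pattern only handles the compact JSON shape shown above (jq would be more robust if you have it installed).

```shell
# Extract the "region" value from the compact JSON the demo app returns.
region_of() {
  sed -n 's/.*"region":"\([^"]*\)".*/\1/p'
}

# Succeeds when the JSON on stdin reports the expected region ($1).
assert_region() {
  actual=$(region_of)
  if [ "$actual" = "$1" ]; then
    echo "OK: served from $actual"
  else
    echo "MISMATCH: expected $1, got '$actual'"
    return 1
  fi
}

# Sample response copied from the health check output:
SAMPLE='{"hostname":"dr-primary-vm-mmsx","region":"us-central1","status":"healthy","timestamp":"2026-02-28T12:54:13.354665"}'
printf '%s\n' "$SAMPLE" | assert_region us-central1

# Against the live endpoint (LB_IP is your load balancer address):
#   curl -fsS "http://$LB_IP/health" | assert_region us-central1
```

The same check with us-east1 as the argument is useful later, once traffic has moved to standby.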

Check the GCP Console

Go to Compute Engine > Instance Groups. You should see the primary group with 2 running instances and the standby group with 0.

Step 5: Explore the Load Balancer

Go to Network Services > Load Balancing in GCP Console.

Click on dr-url-map-d639c08e to see details:

Frontend: Global IP on port 80
Backend: Two backend services (primary and standby)
Health checks: Running every 5 seconds

Notice only the primary backend shows healthy instances. The standby has zero instances.

Step 6: Check the Snapshots

Go to Compute Engine > Snapshots.

The system creates automatic snapshots every hour. If disaster happens, we can restore data from these snapshots.

Check the snapshot policy:

gcloud compute resource-policies list

Step 7: Test Failover

Now let’s simulate a disaster. We’ll pretend the primary region is down and switch to standby.

Start the Failover

./scripts/failover.sh

Type yes when asked to confirm.

What the script does:

Disables autoscaling (for manual control)
Creates emergency snapshots
Scales down primary to 0 instances
Scales up standby to 2 instances
Switches Load Balancer to standby backend
Verifies everything works
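
To make those steps more concrete, here is a rough sketch of how such a sequence could look with plain gcloud commands. This is not the contents of failover.sh. The resource names are hypothetical, switching the load balancer by changing the URL map’s default service is my assumption about the mechanism, and the emergency snapshot step is left out because it depends on disk names I don’t know. The script only prints the commands unless you set DRY_RUN=0.

```shell
# Hypothetical names: replace with the ones Terraform created in your project.
PRIMARY_MIG="dr-primary-mig"
PRIMARY_REGION="us-central1"
STANDBY_MIG="dr-standby-mig"
STANDBY_REGION="us-east1"
URL_MAP="dr-url-map"
STANDBY_BACKEND="dr-standby-backend"

# Print commands by default; set DRY_RUN=0 to really execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

failover() {
  # 1. Take manual control of both groups.
  run gcloud compute instance-groups managed stop-autoscaling "$PRIMARY_MIG" --region="$PRIMARY_REGION" || return 1
  run gcloud compute instance-groups managed stop-autoscaling "$STANDBY_MIG" --region="$STANDBY_REGION" || return 1
  # 2. (Emergency snapshots omitted in this sketch.)
  # 3. Scale primary down and standby up.
  run gcloud compute instance-groups managed resize "$PRIMARY_MIG" --size=0 --region="$PRIMARY_REGION" || return 1
  run gcloud compute instance-groups managed resize "$STANDBY_MIG" --size=2 --region="$STANDBY_REGION" || return 1
  # 4. Wait for the standby group to settle.
  run gcloud compute instance-groups managed wait-until "$STANDBY_MIG" --stable --region="$STANDBY_REGION" || return 1
  # 5. Point the load balancer's default route at the standby backend.
  run gcloud compute url-maps set-default-service "$URL_MAP" --default-service="$STANDBY_BACKEND" --global
}

failover
```

Printing the plan first is a cheap way to review exactly what would change before touching a real environment.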

Verify the Failover

Test the application again:

curl http://35.227.222.112/

Or open the same URL in your browser.

Now you should see:

{"hostname":"dr-standby-vm-6gk0","message":"DR Cold Standby Lab - Application Running","region":"us-east1","timestamp":"2026-02-28T13:04:12.558283"}

The region changed to us-east1! Traffic is now going to the standby region.

Check the Console:

Primary MIG: 0 instances
Standby MIG: 2 instances running

That’s failover working!

Step 8: Test Failback

The disaster is over, so let’s go back to primary.

./scripts/failback.sh

Type yes to confirm.

What happens:

Scales up primary to 2 instances
Waits for instances to be healthy
Switches Load Balancer back to primary
Scales down standby to 0 instances
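
The “wait until healthy” idea is essentially a polling loop. I don’t know how failback.sh implements its wait, so treat the following as a generic sketch: retry and served_from are names I made up, and served_from assumes the /health endpoint returns the compact JSON shown earlier. Here I use it to confirm that the load balancer is answering from the primary region again.

```shell
# Retry a command up to $1 times, sleeping $2 seconds between attempts.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Succeeds when the load balancer at $1 answers /health from region $2.
served_from() {
  curl -fsS --max-time 5 "http://$1/health" 2>/dev/null | grep -q "\"region\":\"$2\""
}

# After failback, poll for up to about 5 minutes (30 tries, 10 s apart):
#   retry 30 10 served_from "$LB_IP" us-central1 && echo "back on primary"
```

A bounded number of attempts matters here: if primary never becomes healthy, the loop gives up with a non-zero exit code instead of hanging forever.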

Test again:

curl http://35.227.222.112/health

Or open the same URL in your browser.

You should see something like:

{"hostname":"dr-primary-vm-084q","region":"us-central1","status":"healthy","timestamp":"2026-02-28T13:39:02.604194"}

We’re back to primary!

Step 9: Run Automated DR Test

For regular DR testing, use the automated test script:

./scripts/test-failover.sh --simulate-failure --full

This script:

Validates current state
Performs failover
Checks standby is working
Performs failback
Verifies primary is working
Generates a test report
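
If you are curious how a step-by-step test with a report can be put together, here is a minimal sketch of the pattern. It is not the real test-failover.sh: run_step, REPORT_FILE, and FAILED are names I made up, and the step commands in the usage comment are placeholders you would replace with your own checks.

```shell
REPORT_FILE="${REPORT_FILE:-dr-test-report.txt}"
FAILED=0

# Run one named step and append PASS/FAIL plus its duration to the report.
run_step() {
  local name="$1" start end status
  shift
  start=$(date +%s)
  if "$@"; then
    status=PASS
  else
    status=FAIL
    FAILED=$((FAILED + 1))
  fi
  end=$(date +%s)
  printf '%-4s %-28s %ss\n' "$status" "$name" "$((end - start))" >> "$REPORT_FILE"
}

# Usage with placeholder commands (check_primary and friends are hypothetical):
#   : > "$REPORT_FILE"
#   run_step "validate current state" check_primary
#   run_step "failover"               do_failover
#   run_step "standby is serving"     check_standby
#   run_step "failback"               do_failback
#   run_step "primary is serving"     check_primary
#   cat "$REPORT_FILE"
#   [ "$FAILED" -eq 0 ]
```

Recording the duration of each step is a simple way to see how long failover and failback really take in your environment.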

Step 10: Check Monitoring and Alerts

The lab configures real email alerts that notify you when something goes wrong. You’ll receive an email when the primary region goes down, and another when it returns to healthy.

Monitoring Dashboard

Go to Monitoring > Dashboards in GCP Console.

You’ll find a dashboard called DR Cold Standby Dashboard that shows:

Load balancer request count
Load balancer latency (p99)
Uptime check status
VM CPU utilization

Uptime Checks

Go to Monitoring > Uptime checks.

The system checks your application every minute, so a failed health check shows up here within about a minute.

This page lists the uptime checks, including the DR heartbeat for the primary region and the Load Balancer Health check.

Email Alerts

Go to Monitoring > Alerting.

The lab creates 3 alert policies:

Service Unavailable: Triggers when uptime check fails for 5 minutes. You get an email saying your app is down!
Primary Region Unhealthy: Triggers when primary MIG has 0 instances. This tells you to consider failover.
High Latency: Triggers when load balancer latency goes above 5 seconds. Something might be wrong.

When I tested the failover, I actually received an email alert! I also received an email when the system became healthy again.

This is what production systems need. You don’t want to find out your app is down from customers.

Cleanup

When you’re done, delete everything to avoid charges:

./scripts/cleanup.sh

Or manually:

cd terraform
terraform destroy

Type yes to confirm. All resources will be deleted.

Thanks for reading! Drop a comment if you have any questions or feedback!

Build Your Own Disaster Recovery on GCP: Cold Standby Architecture was originally published in Google Cloud - Community on Medium.