"We can't afford downtime during events." That was the first thing James D. told us when we started building FleetGrid, an operations platform for PopUp Capital's event management business.
He wasn't exaggerating. PopUp Capital runs three event locations simultaneously. Vendors check in via the platform. Equipment gets tracked in real-time. Booth assignments update dynamically. If the system goes down during a Saturday afternoon event, 200+ vendors at each location lose access to their booth information, check-in breaks, and the operations team falls back to paper printouts they stopped maintaining three events ago.
The question wasn't "should we have zero-downtime deployments?" It was "how do we make zero-downtime deployments work reliably for a team of four developers who are also building features?"
Why Multi-Location Makes Everything Harder
Single-tenant SaaS deployments are straightforward. You have one environment, one database, one set of users. Deploy at 3 AM when nobody's online. Done.
Multi-location changes the math:
- Time zone differences — "Deploy at 3 AM" means 3 AM in one time zone but prime time in another. PopUp Capital's locations span Eastern and Central time zones.
- Shared database state — Locations share some data (vendor master records, equipment inventory) but own other data locally (booth assignments, local schedules). A migration that locks the shared tables affects all locations.
- Different feature readiness — Location A might need a new check-in feature immediately, while Location B is mid-event and needs the old workflow to stay stable.
- Rollback scope — If the deployment breaks Location A, do you roll back everywhere? What if Location B is working fine with the new version?
These aren't theoretical problems. We hit every one of them during FleetGrid's first six months.
Blue-Green Deployment: The Foundation
We run two identical production environments: Blue and Green. At any time, one serves live traffic and the other sits idle (or receives the next deployment).
The deployment flow:
- Deploy new version to the idle environment (let's say Green)
- Run automated health checks against Green (API responses, database connectivity, critical user flows)
- Run synthetic transactions — automated scripts that simulate real user actions (vendor check-in, equipment transfer, booth assignment update)
- If all checks pass, switch the load balancer to Green
- Monitor for 15 minutes
- If anything breaks, switch back to Blue (takes ~10 seconds)
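The health-check gate in step 3 can be sketched as a small runner that executes each synthetic transaction against the idle environment and refuses the cutover if any flow fails. This is an illustrative sketch, not FleetGrid's actual code; the check names and `run` interface are assumptions:

```javascript
// Run each synthetic transaction against the idle (Green) environment.
// A check is any object with a name and an async run(baseUrl) that throws
// on failure. The deploy proceeds only if every check passes.
async function runSyntheticChecks(baseUrl, checks) {
  const results = [];
  for (const check of checks) {
    const started = Date.now();
    try {
      await check.run(baseUrl);
      results.push({ name: check.name, ok: true, ms: Date.now() - started });
    } catch (err) {
      results.push({ name: check.name, ok: false, ms: Date.now() - started, error: String(err) });
    }
  }
  return { passed: results.every((r) => r.ok), results };
}
```

Recording per-check latency alongside pass/fail is deliberate: a check that passes but takes 10x longer than usual is itself a signal worth blocking on.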
Why blue-green over rolling deployment?
Rolling deployments update servers one at a time, meaning both old and new versions run simultaneously during the rollout. For FleetGrid, this meant every API endpoint would need to be backward-compatible with both versions during the transition — a maintenance burden our 4-person team couldn't sustain. Blue-green gives us a clean cutover: everyone sees the same version, always.
The tradeoff: blue-green costs more because you maintain two full environments. For FleetGrid on AWS, the idle environment costs roughly $400/month in ECS Fargate tasks and RDS standby instances. That's the cost of sleeping well during events.
AWS Infrastructure for Blue-Green
```yaml
# ECS Blue-Green with CodeDeploy
Resources:
  BlueTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      HealthCheckPath: /api/health
      HealthCheckIntervalSeconds: 10
      HealthyThresholdCount: 2

  GreenTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      HealthCheckPath: /api/health
      HealthCheckIntervalSeconds: 10
      HealthyThresholdCount: 2

  DeploymentGroup:
    Type: AWS::CodeDeploy::DeploymentGroup
    Properties:
      DeploymentStyle:
        DeploymentType: BLUE_GREEN
        DeploymentOption: WITH_TRAFFIC_CONTROL
      BlueGreenDeploymentConfiguration:
        TerminateBlueInstancesOnDeploymentSuccess:
          Action: KEEP_ALIVE  # Keep Blue running for rollback
          TerminationWaitTimeInMinutes: 60
```
Key detail: we keep the old environment alive for 60 minutes after cutover. This gives us a full hour of rollback capability. After 60 minutes with no issues, the old environment scales down to minimum capacity (1 task) as a warm standby.
Database Migrations: The Hard Part
Blue-green deployment handles application code cleanly. Databases are the hard part. You can't run two different schema versions against the same database without breaking something.
Expand-and-Contract Pattern
Every schema change follows three phases:
Phase 1: Expand — Add the new column/table alongside the existing one. Both old and new application versions can work with the database. The new column is nullable or has a default value.
Phase 2: Migrate — Backfill existing data from the old column to the new one. This runs as a background job, not a blocking migration. For large tables, process in batches of 1,000 rows with a 100ms delay between batches to avoid locking.
Phase 3: Contract — After confirming all application servers use the new column (via feature flag or version check), drop the old column in a separate deployment. Never in the same deployment as the code change.
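The Phase 2 backfill can be sketched as a batched `UPDATE` loop. This assumes a node-postgres-style client (`client.query(sql, params)` returning a `rowCount`); the table and column names are hypothetical:

```javascript
const BATCH_SIZE = 1000;     // rows per UPDATE
const BATCH_DELAY_MS = 100;  // pause between batches to release locks

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Copy old_column into new_column one batch at a time. Each UPDATE only
// touches rows that haven't been migrated yet, so the job is safe to
// stop and restart at any point.
async function backfill(client) {
  let updated;
  do {
    const res = await client.query(
      `UPDATE vendors SET new_column = old_column
       WHERE id IN (
         SELECT id FROM vendors
         WHERE new_column IS NULL AND old_column IS NOT NULL
         LIMIT $1
       )`,
      [BATCH_SIZE]
    );
    updated = res.rowCount;
    if (updated > 0) await sleep(BATCH_DELAY_MS);
  } while (updated > 0);
}
```

Because the `WHERE new_column IS NULL` predicate makes each batch idempotent, a crash mid-backfill costs nothing: rerun the job and it picks up where it left off.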
The rule we never break
Never combine schema changes and code changes in the same deployment. Schema goes first, code follows. This means if the code deployment fails and we roll back, the schema is still compatible with the old code. We learned this the hard way on a Wednesday night when a combined migration + code deploy left us with code pointing at a column that had been renamed.
Migration Locking Strategy
PostgreSQL's ALTER TABLE acquires an ACCESS EXCLUSIVE lock by default. On a table with 500k rows, that can block reads for 30+ seconds — an eternity during an event.
Our approach:
- Add columns with `ALTER TABLE ... ADD COLUMN ... DEFAULT NULL` — no table rewrite (and since PostgreSQL 11, even a constant non-null default is instant)
- Create indexes with `CREATE INDEX CONCURRENTLY` — doesn't lock the table
- Set a lock timeout: `SET lock_timeout = '5s'` so migrations fail fast rather than queueing behind live queries
- Drop columns only during maintenance windows (Sunday night after events)
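The lock-timeout guard can be sketched as a wrapper our migration runner might use — again assuming a node-postgres-style client; the function name is illustrative:

```javascript
// Run a DDL statement behind a 5s lock timeout. If the ALTER can't
// acquire its lock within 5 seconds, PostgreSQL aborts the statement
// instead of queueing behind (and blocking) live event traffic.
async function runDdlWithLockTimeout(client, ddl) {
  await client.query('BEGIN');
  try {
    // SET LOCAL scopes the timeout to this transaction only.
    await client.query("SET LOCAL lock_timeout = '5s'");
    await client.query(ddl);
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err; // fail fast; retry later rather than block queries
  }
}
```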
Feature Flags for Per-Location Rollouts
The third piece of our deployment stack: feature flags. We use a simple database-backed feature flag system (no third-party service needed for our scale).
```javascript
// Feature flag check with location awareness
const isEnabled = await featureFlags.check({
  flag: 'new-checkin-flow',
  locationId: req.location.id,
  userId: req.user.id,  // Optional: per-user rollout
  percentage: 25        // Optional: percentage rollout
});

if (isEnabled) {
  return newCheckinFlow(req, res);
}
return legacyCheckinFlow(req, res);
```
This lets us:
- Deploy code to all locations simultaneously (the code is identical everywhere)
- Enable features per-location (Location A gets the new check-in flow on Monday, Location B gets it Thursday after their event ends)
- Disable features instantly if something breaks (flip the flag, no redeployment)
- Run percentage rollouts for risky features (10% of users first, then 50%, then 100%)
We track feature flag state changes in the same audit log as deployments. When investigating an incident, we can see exactly which features were enabled for which locations at any point in time.
The Canary Pattern: Location-by-Location
Even with blue-green and feature flags, we don't deploy to all locations at once. FleetGrid uses canary deployments with location-based routing:
- Canary location — Deploy to the lowest-traffic location first. For PopUp Capital, that's the Tuesday afternoon market. Monitor for 30 minutes.
- Second location — If canary is stable, deploy to the second location. Monitor for 30 minutes.
- Full rollout — Deploy to the remaining location. Monitor for 60 minutes (this is usually the highest-traffic one).
Total deployment time: ~2 hours from first canary to full rollout. For a team of four, this is done during business hours (not 3 AM), because the canary pattern means we catch problems with real traffic but limited blast radius.
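The location-by-location loop can be sketched like this — `switchTraffic`, `isHealthy`, and `wait` are illustrative stand-ins for the load balancer switch, the monitoring poll, and the per-location watch window:

```javascript
// Roll out to locations in order (lowest traffic first). Any unhealthy
// monitoring window rolls that location back and aborts the rollout
// before the remaining locations are touched.
async function canaryRollout(locations, { switchTraffic, isHealthy, wait }) {
  for (const loc of locations) {
    await switchTraffic(loc.id, 'green');
    await wait(loc.monitorMinutes);
    if (!(await isHealthy(loc.id))) {
      await switchTraffic(loc.id, 'blue'); // roll this location back
      return { ok: false, failedAt: loc.id };
    }
  }
  return { ok: true };
}
```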
| Deployment Strategy | Downtime Risk | Rollback Time | Infrastructure Cost |
|---|---|---|---|
| In-place deployment | High (2-5 min per deploy) | 5-15 min (redeploy old version) | Baseline |
| Rolling deployment | Medium (mixed versions) | 2-5 min (roll forward to old) | +10-20% |
| Blue-green + canary | Near zero | ~10 seconds | +80-100% |
The 2 AM Rollback: When Theory Meets Reality
Two months after launch, we pushed a performance optimization that changed how FleetGrid queries vendor assignments. The optimization was solid — 40% faster query times in staging. We deployed it through the full canary pipeline during business hours. All three locations looked healthy.
At 2 AM, our alerting fired. Location B's database CPU spiked to 95%. The optimization had triggered a query plan change in PostgreSQL that performed well on small result sets (which is what we saw during the day) but fell off a cliff when the nightly batch job ran analytics queries against the full dataset.
The fix? One command:
```bash
aws deploy stop-deployment --deployment-id d-ABC123 --auto-rollback-enabled
```
Ten seconds later, all traffic was back on the previous version. CPU dropped to normal within a minute. Total user impact: zero — the batch job retried automatically after the rollback.
We fixed the query the next morning (added a composite index and forced a specific query plan for the batch job), redeployed through the canary pipeline, and this time included the nightly batch in our synthetic transaction tests.
The lesson isn't "test more"
The lesson is: build deployment systems that assume failures will happen. The canary pipeline didn't prevent the bug. But the blue-green architecture made the bug a non-event instead of a 3-hour scramble to hotfix production at 2 AM.
Monitoring Stack: What We Watch
Zero-downtime deployment only works if you can detect problems before users report them. Our monitoring for FleetGrid:
Application-level:
- API response time p95 per location (alert at >500ms, normal is ~120ms)
- Error rate per location (alert at >1%, normal is <0.1%)
- Active WebSocket connections (drop indicates client-side breakage)
Infrastructure-level:
- Database CPU and connection pool utilization
- ECS task health and restart count
- Redis memory usage and queue depth
Business-level:
- Check-ins per hour per location (deviation from historical average signals a problem)
- CSV import success rate (our CSV processing pipeline has its own health checks)
- Vendor data sync latency between locations
Business-level metrics are the ones that catch the subtle bugs. A deployment might have healthy API responses and clean error rates, but if check-ins per hour drops 30% at one location, something is wrong — even if no errors are being thrown.
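A deviation check like the check-ins-per-hour alert can be sketched as a pure comparison against the historical average for the same weekday and hour (the 30% threshold mirrors the example above; the function name is illustrative):

```javascript
// Compare the current hour's check-in count with the historical average
// for the same weekday/hour. Alert on a drop of more than 30%, even if
// no errors are being thrown anywhere.
function checkinDeviation(current, historicalAvg) {
  if (historicalAvg <= 0) return { deviation: 0, alert: false }; // no baseline yet
  const deviation = (current - historicalAvg) / historicalAvg;
  return { deviation, alert: deviation < -0.3 };
}
```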
Deployment Pipeline in Practice
Here's what a typical FleetGrid deployment looks like from commit to full rollout:
- 09:00 — PR merged to main
- 09:02 — CI/CD pipeline: lint, test, build Docker image
- 09:08 — Image pushed to ECR, tagged with commit SHA
- 09:10 — Deploy to Green (idle) environment
- 09:12 — Automated health checks pass (API, DB, Redis)
- 09:13 — Synthetic transactions pass (check-in, import, sync)
- 09:15 — Canary: switch Location C (lowest traffic) to Green
- 09:45 — 30-min monitoring clear → switch Location A to Green
- 10:15 — 30-min monitoring clear → switch Location B to Green
- 11:15 — 60-min full monitoring clear → scale down Blue to standby. Deployment complete; total time: ~2 hours
This entire flow is automated via GitHub Actions + AWS CodeDeploy. The only manual step: a human approves the canary promotion after reviewing the monitoring dashboard. We considered full automation but decided that a human eye on the dashboard during rollout is worth the 2 minutes of wait time per location.
Frequently Asked Questions
What is a blue-green deployment strategy?
Blue-green deployment runs two identical production environments. Blue serves live traffic while Green receives the new version. After testing Green, you switch the load balancer. If anything breaks, switch back to Blue in seconds. This eliminates downtime because the switch is instant.
How do you handle database migrations with zero downtime?
Expand-and-contract pattern. Add the new column alongside the old one, backfill data as a background job, then drop the old column in a separate deployment. Never rename or drop columns in the same deployment as code changes.
What's the difference between rolling and blue-green deployment?
Rolling updates servers one at a time — both versions run simultaneously during rollout. Blue-green keeps two complete environments and switches all traffic at once. Rolling is cheaper but requires backward-compatible APIs. Blue-green is cleaner for multi-location SaaS.
How do feature flags help with deployments?
Feature flags decouple deployment from release. Deploy code with a feature hidden behind a flag, verify stability, then enable per-location. If problems arise, disable the flag without redeploying. Essential for multi-location SaaS where locations need features on different schedules.
How do you test multi-location deployments before going live?
Canary pattern: deploy to the lowest-traffic location first, monitor for 30-60 minutes, then proceed to remaining locations. Each location gets its own health check window. Synthetic transactions (automated test scenarios) run against each location after deployment.
Next Steps
Zero-downtime deployment isn't a luxury for multi-location SaaS. It's a requirement. The infrastructure cost (~$400/month for FleetGrid's scale) is a fraction of the business impact of even one hour of downtime during an event.
- Book a 30-minute architecture call — we'll review your current deployment setup and identify the highest-impact improvements
- CSV Processing Optimization Case Study — the data pipeline behind FleetGrid
- HIPAA-Compliant App Development — zero-downtime deployment patterns for healthcare SaaS
- SaaS MVP Development Services — how we architect for production from day one