
Cloud & Infrastructure

Backend Migration & System Cleanup

SaaS platform · B2B · Production system serving paying users

Overview

The client ran a Node.js backend serving a B2B SaaS product with several thousand active users. The application was deployed to a single EC2 instance via manual SSH sessions. There was no CI/CD pipeline, no structured logging, no alerting, and the staging environment was configured differently from production — so bugs found in staging were not reliable indicators of production behavior.

System Architecture

GitHub (push to main) → CI Pipeline (build + test + scan) → Container Registry (ECR · versioned images) → Staging (parity with prod) → Production (ECS · auto-scaling)

Production is observed by CloudWatch (structured logs), threshold-based Alerts, and Health Checks every 30s.

The Problem

Deployments were manual: an engineer would SSH into the production server, pull the latest code, run a build, and restart the process. This took 20–30 minutes and was error-prone. A bad deploy meant SSH-ing back in and reverting manually.

There was no structured logging. The application used console.log for everything. When an incident occurred, the on-call engineer had to SSH into the server and grep through log files to figure out what happened. Two incidents in the past quarter had taken over four hours to resolve.

The staging environment ran a different OS version and a different Node version, and had a different set of database schema migrations applied. Bugs that passed staging regularly appeared in production.

The on-call rotation was dreaded. Engineers would trade shifts to avoid it. The CTO described it as the single biggest morale problem on the team.

Technical Approach

01

Replaced unstructured logging with structured JSON output

Every log entry now includes a timestamp, severity level, request ID, user ID (when applicable), and structured context. This took about a week to migrate across the codebase — mostly mechanical work, but critical for everything else.
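The entry shape described above can be sketched as a minimal JSON logger; the exact field names (requestId, userId) follow the description but the schema details are an assumption.

```typescript
// Minimal structured-logging sketch: one JSON object per line on stdout,
// which log shippers (e.g. the CloudWatch agent) can ingest directly.
type Severity = "debug" | "info" | "warn" | "error";

interface LogContext {
  requestId?: string;
  userId?: string;
  [key: string]: unknown; // arbitrary structured context
}

function log(level: Severity, message: string, context: LogContext = {}): string {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...context,
  };
  const line = JSON.stringify(entry);
  console.log(line);
  return line;
}

// Usage:
log("error", "payment webhook failed", { requestId: "req-123", userId: "u-42" });
```

In practice a library such as pino provides the same line-per-entry JSON output with less code; the point here is only the shape of each entry.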

02

Set up centralized log aggregation and alerting

Logs ship to CloudWatch with structured queries. We defined alert thresholds for error rates, response times, and specific failure patterns. The on-call engineer gets a notification with context — not a generic 'server down' ping.
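With structured entries in place, error-rate queries become straightforward. A CloudWatch Logs Insights query along these lines (field names assumed to match the logger's schema) could back a threshold alert:

```
fields @timestamp, level, message, requestId
| filter level = "error"
| stats count(*) as errors by bin(5m)
```

A metric filter on the same pattern can feed a CloudWatch alarm so the on-call notification carries the matching log context.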

03

Containerized the application

We Dockerized the application so the exact same image runs in development, staging, and production. This eliminated the environment-parity problem entirely. The 'works on staging but not production' class of bugs disappeared.
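A typical multi-stage Dockerfile for a TypeScript service looks roughly like this; the paths and build script name are placeholders, not the client's actual configuration:

```dockerfile
# Build stage: install all deps and compile TypeScript
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production deps only, same image everywhere
FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=build /app/package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
CMD ["node", "dist/server.js"]
```

Because the one image built in CI is promoted unchanged from staging to production, OS and Node versions can no longer drift between environments.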

04

Built a deployment pipeline with rollback

Push to main triggers: build → test → security scan → deploy to staging → manual approval → deploy to production. A bad deploy can be rolled back in under two minutes. The 20-minute manual SSH deploy became a one-click operation.
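The staged flow above maps naturally onto a GitHub Actions workflow; job names and deploy commands below are illustrative placeholders, and the manual approval gate uses a protected `production` environment:

```yaml
on:
  push:
    branches: [main]

jobs:
  build-test-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
      - run: npm audit --audit-level=high   # stand-in for the security scan step

  deploy-staging:
    needs: build-test-scan
    runs-on: ubuntu-latest
    steps:
      - run: echo "push image to ECR and update the staging ECS service"

  deploy-production:
    needs: deploy-staging
    environment: production   # protected environment = manual approval gate
    runs-on: ubuntu-latest
    steps:
      - run: echo "update the production ECS service"
```

Rollback stays fast because every image in ECR is versioned: redeploying the previous tag is the same one-click operation as a forward deploy.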

05

Added health checks and an incident response runbook

Each service has a /health endpoint checked every 30 seconds. We wrote a runbook covering the five most common incident types with step-by-step resolution procedures. New on-call engineers can follow the runbook without needing institutional knowledge.
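A /health endpoint of this kind can be sketched with Node's built-in http module; the dependency checks shown (and their names) are hypothetical stand-ins for real pings:

```typescript
import http from "node:http";

interface HealthReport {
  status: "ok" | "degraded";
  uptimeSeconds: number;
  checks: Record<string, boolean>;
}

// Run every dependency check; any failure marks the service degraded.
async function buildHealthReport(
  checks: Record<string, () => Promise<boolean>>
): Promise<HealthReport> {
  const results: Record<string, boolean> = {};
  for (const [name, check] of Object.entries(checks)) {
    results[name] = await check().catch(() => false);
  }
  const healthy = Object.values(results).every(Boolean);
  return {
    status: healthy ? "ok" : "degraded",
    uptimeSeconds: Math.round(process.uptime()),
    checks: results,
  };
}

// Wire it to an HTTP server; the load balancer polls this every 30s.
const server = http.createServer(async (req, res) => {
  if (req.url === "/health") {
    const report = await buildHealthReport({
      database: async () => true, // placeholder: replace with a real DB ping
    });
    res.writeHead(report.status === "ok" ? 200 : 503, {
      "content-type": "application/json",
    });
    res.end(JSON.stringify(report));
  } else {
    res.writeHead(404);
    res.end();
  }
});
```

Returning 503 on a failed check lets the load balancer route traffic away automatically, which is what makes the 30-second poll useful beyond alerting.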

Tech Stack

runtime

Node.js · TypeScript

containers

Docker · AWS ECR

infrastructure

AWS ECS · Terraform · ALB

observability

CloudWatch · Structured JSON logging · Health checks

ci/cd

GitHub Actions · Staged rollout pipeline

Engagement Details

Timeline

6 weeks

Team

1 senior engineer (AxionvexTech) + 1 internal DevOps-leaning engineer on the client side

Outcomes

Mean time to resolution

Before

2–4 hours per incident

After

Under 15 minutes for most incidents

Deploy time

Before

20–30 minutes, manual SSH

After

Under 5 minutes, automated with one-click rollback

Staging reliability

Before

Bugs passed staging regularly

After

Environment parity — staging catches what production would see

On-call morale

Before

Engineers traded shifts to avoid on-call

After

First rotation after launch: engineer said it was the first time they did not dread the shift

The CTO told us that the infrastructure work changed how the team felt about the product. Before, they were scared of their own system. After, they had the confidence to ship faster because they could see what was happening and fix it quickly when something went wrong.
