A multi-tenant SaaS handling sensitive financial data has to hold SOC 2 Type 2 and PCI-DSS at the same time - and still ship every day. Most teams treat that as a trade-off. We made it a property of how the platform builds and deploys.
The challenge
The product is a multi-tenant platform for regulated financial and professional-services firms, with 80-plus tenants on board. Its customers handle their own clients’ sensitive financial data, so they expect a serious security posture: SOC 2 Type 2, PCI-DSS, continuous threat detection, and disaster recovery you can actually prove. Anything less and the platform is unsellable into that market.
By the time we led the DevOps function, the estate had grown into a sprawl - dozens of serverless APIs and a handful of Fargate workloads, each with its own deploy story, spread across dev, qa, uat and prod. Two problems sat on top of each other.
- Releasing was a per-pipeline chore. Engineers picked AWS regions by hand, approvals were tied to a few hardcoded individuals, and there was no consistent versioning or changelog across services. Nobody could answer “what is in UAT right now?” without going to ask.
- The compliance program needed evidence, not aspiration. Org-wide guardrails, audited DR tests, software-supply-chain controls - the kind of proof an assessor signs off on, not a slide deck.
Holding two frameworks while shipping fast is hard because the usual fixes fight: lock everything down and velocity dies; optimize for speed and the audit trail evaporates. The work was to make the fast path and the compliant path the same path.
Our approach
Promote a whole environment, not a pipeline at a time
The centrepiece is what we called GROUP deployments. Instead of triggering each service’s pipeline by hand, an engineer can deploy or promote a whole environment - or a whole service type - in one action. A small set of reusable GitHub Actions workflows covers the shapes the platform actually has: a Serverless workflow for the Lambda APIs, an ECS-Service workflow for Fargate, a Frontend workflow, and an Infra workflow, all sharing one promotion path from dev to qa to uat to prod. The deliberate trade-off: a few opinionated shared workflows to maintain, instead of one bespoke pipeline per service that drifts.
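As a rough illustration of what a group promotion has to work out, the sketch below plans one promotion step per service along the shared path. It is not the platform's actual workflow code: the service catalogue, the workflow type labels, and the function names are all hypothetical, and only the dev to qa to uat to prod ordering comes from the text above.

```python
from typing import Dict, List, Optional

# The promotion path named in the article; everything else here is illustrative.
PROMOTION_PATH = ["dev", "qa", "uat", "prod"]

# Hypothetical service catalogue: service name -> workflow type.
SERVICES: Dict[str, str] = {
    "billing-api": "serverless",
    "ledger-api": "serverless",
    "report-worker": "ecs-service",
    "portal": "frontend",
}


def next_environment(current: str) -> Optional[str]:
    """Return the environment that follows `current`, or None after prod."""
    index = PROMOTION_PATH.index(current)  # raises ValueError for unknown names
    if index + 1 < len(PROMOTION_PATH):
        return PROMOTION_PATH[index + 1]
    return None


def plan_group_promotion(
    source_env: str, service_type: Optional[str] = None
) -> List[Dict[str, str]]:
    """Build one promotion step per service, optionally limited to one type."""
    target_env = next_environment(source_env)
    if target_env is None:
        return []  # nothing to promote beyond prod
    return [
        {"service": name, "workflow": kind, "from": source_env, "to": target_env}
        for name, kind in sorted(SERVICES.items())
        if service_type is None or kind == service_type
    ]
```

Calling plan_group_promotion("qa") plans a whole-environment promotion into uat, while plan_group_promotion("qa", "serverless") limits the same action to one service type.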
Delete the inputs a human can get wrong
A lot of friction came from small, repeated decisions, so we removed them. The AWS region is no longer a pipeline input - it is auto-mapped from the target environment, which kills an entire class of “deployed to the wrong place” incidents. Releases flow through Semantic Versioning and Conventional Commits, generating changelogs and GitHub Releases automatically, all surfaced on an internal deployments dashboard so the state of every environment is one glance away.
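The two ideas in this section can be sketched in a few lines: resolve the region from the environment instead of asking for it, and derive the next version from Conventional Commits messages. This is a minimal sketch under stated assumptions, not the platform's release tooling. The region names, the mapping itself, and the function names are hypothetical, and the commit parsing covers only the header forms needed to pick a Semantic Versioning bump.

```python
import re
from typing import List, Optional

# Hypothetical mapping; the article does not say which region backs which environment.
REGION_BY_ENVIRONMENT = {
    "dev": "eu-west-1",
    "qa": "eu-west-1",
    "uat": "eu-west-2",
    "prod": "eu-west-2",
}


def resolve_region(environment: str) -> str:
    """Look the region up from the environment; fail loudly on unknown names."""
    try:
        return REGION_BY_ENVIRONMENT[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


# Conventional Commits header: type, optional (scope), optional "!", then ": text".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]+\))?(?P<breaking>!)?: .+")


def bump_level(messages: List[str]) -> Optional[str]:
    """Return "major", "minor", "patch", or None if nothing warrants a release."""
    level = None
    for message in messages:
        header = message.splitlines()[0] if message else ""
        match = HEADER.match(header)
        if match is None:
            continue  # not a Conventional Commits header; ignore it
        if match["breaking"] or "BREAKING CHANGE:" in message:
            return "major"
        if match["type"] == "feat":
            level = "minor"
        elif match["type"] == "fix" and level is None:
            level = "patch"
    return level


def next_version(current: str, messages: List[str]) -> str:
    """Apply the bump implied by the commits to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in current.split("."))
    level = bump_level(messages)
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current
```

Because the region is looked up rather than supplied, a typo in the environment name fails the run instead of deploying somewhere unintended.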
Make compliance a side-effect of how you ship
Manual approvals moved off hardcoded names and onto GitHub Teams, with separate approver groups per environment. That is operationally simpler - approval rights are now group membership that survives staff changes - and it hands auditors a clean, group-based record of who could approve what, with no manual write-up. The compliance program ran alongside, under an AWS Tier-1 managed-security service, with software-supply-chain controls (SBOMs plus a dependency-vulnerability tracker) wired into the same delivery flow.
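To make the group-based model concrete, here is a small sketch of approval rights expressed as team membership, plus the "who could approve what" record that falls out of it. Everything named here is an assumption for illustration: the team slugs, the members, and the choice to leave dev ungated are hypothetical, and in practice the membership would live in GitHub Teams rather than in a dictionary.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical team slugs, one approver group per gated environment.
APPROVER_TEAM_BY_ENVIRONMENT: Dict[str, str] = {
    "qa": "qa-approvers",
    "uat": "uat-approvers",
    "prod": "prod-approvers",
}

# Hypothetical membership; staff changes only ever touch this mapping.
TEAM_MEMBERS: Dict[str, Set[str]] = {
    "qa-approvers": {"alice", "bob"},
    "uat-approvers": {"bob", "carol"},
    "prod-approvers": {"carol", "dave"},
}


def can_approve(user: str, environment: str) -> bool:
    """A user may approve only via membership of that environment's team."""
    team = APPROVER_TEAM_BY_ENVIRONMENT.get(environment)
    if team is None:
        return False  # no approver team configured: deny by default
    return user in TEAM_MEMBERS.get(team, set())


def access_record() -> List[Tuple[str, str, str]]:
    """List (environment, team, member) rows: who could approve what."""
    rows = []
    for environment, team in APPROVER_TEAM_BY_ENVIRONMENT.items():
        for member in sorted(TEAM_MEMBERS.get(team, set())):
            rows.append((environment, team, member))
    return rows
```

The design point is that no individual is named in the pipeline definition, so a joiner or leaver changes team membership and nothing else.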
Under the hood
The mechanics matter more than the labels:
- An engineer merges and tags; the release is versioned and changelogged automatically.
- A single GROUP action promotes that release along the dev to qa to uat to prod path, with the correct region resolved from the environment.
- The right GitHub Team is required to approve at each environment gate, producing the access record auditors want as a by-product.
- Underneath, a multi-account AWS landing zone managed with Terraform (via atmos, with a parallel Terramate stack over reusable components) enforces org guardrails and Firewall Manager policies, so isolation and controls are inherited, not reapplied per service.
- Disaster recovery is evidence-backed: DynamoDB point-in-time recovery and RDS restoration runbooks, exercised in annual DR tests with incident-simulation logs to show assessors.
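One way to keep the DR claim honest between assessments is to check the evidence itself for staleness. The sketch below flags any required system that has no successful test in the last year. It is an illustration only: the record shape, field names, and system labels are hypothetical, and the 365-day window is an assumed reading of "annual".

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List

MAX_AGE = timedelta(days=365)  # assumed reading of "annual"


@dataclass(frozen=True)
class DrTestRecord:
    system: str         # hypothetical label, e.g. "dynamodb-pitr" or "rds-restore"
    run_on: date        # day the restore exercise was performed
    succeeded: bool     # whether the restore met its runbook criteria
    log_reference: str  # pointer to the incident-simulation log


def stale_systems(
    records: List[DrTestRecord], required: List[str], today: date
) -> List[str]:
    """Return required systems with no successful test inside MAX_AGE."""
    latest_success: Dict[str, date] = {}
    for record in records:
        if not record.succeeded:
            continue  # a failed exercise is not evidence of recoverability
        previous = latest_success.get(record.system)
        if previous is None or record.run_on > previous:
            latest_success[record.system] = record.run_on
    return sorted(
        system
        for system in required
        if system not in latest_success
        or today - latest_success[system] > MAX_AGE
    )
```

Run on a schedule, a check like this would surface a lapsed test long before an assessor asks for the log.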
We also hardened identity - migrating authentication onto a dedicated identity provider and adopting policy-based authorization - and later brought GenAI into the engineering workflow, with AI-assisted pull-request reviewers on a managed foundation-model service plus internal MCP servers and agents.
The outcome
Audits stopped being events and became a property of how the platform already works. The team ships at pace through a single reviewable promotion; the evidence assessors want - approval records, changelogs, tested DR - is generated as a side-effect of shipping rather than assembled in a panic before the assessment. The platform carried SOC 2 Type 2 and PCI-DSS while continuing to deliver across its full multi-tenant estate.
Key takeaways
- Make environment promotion a first-class action. GROUP deploy and promote across dev/qa/uat/prod beats orchestrating pipelines one service at a time.
- Delete avoidable inputs. Auto-mapping the region from the environment removes a class of human error entirely.
- Tie approvals to groups, not people. Easier to operate, and the audit record writes itself.
- DR is only real if you test it. PITR and restore runbooks plus annual test evidence turn a claim into proof.
- Compliance and velocity are the same path. Bake guardrails and evidence into the delivery flow and you stop choosing between shipping and passing.