Operational trust

We meet you at industry standards

Cryptography is the easy part. Operational discipline — the way an engineering team runs a production system, recovers from incidents, and submits itself to outside scrutiny — is what makes a security claim credible. We treat hosting, change management, observability, and third-party audit as load-bearing parts of the security story, not as afterthoughts.

Executive summary

EngineeringID runs behind Cloudflare with origin access restricted to Cloudflare proxy ranges. SOC 2 Type 1 readiness is in progress on a self-assessed basis; no external audit has taken place yet. Enterprise customers receive service credits as defined in their Enterprise contract, and incidents are handled under an active incident-response runbook. Every change passes through CI checks (warnings-as-errors compilation, full test suite, dependency audit) before deploy.

Our commitments

Five rules for the production system

01

Every deploy is reproducible and audited

Every production deploy ties to a specific git SHA, a specific release tag, and a CI run that gated it. A rollback is a single deploy of a known-good SHA — never a manual revert.

02

The origin is not directly reachable

The Fly origin only accepts traffic from Cloudflare proxy ranges. Requests that bypass Cloudflare are rejected at the origin, so there is no surprise public surface even if the origin address leaks.

03

Backups are tested, not just taken

We restore from backup on a regular cadence and verify the restore is functional. A backup that has never been restored is a hope, not a recovery plan.

04

Third parties test what we test

External penetration testing is on the roadmap. Findings, when produced, will be tracked in the same project board our internal security work uses.

05

Status, not just uptime

The public status page reports API, sealing, verification, AI assistant, and webhook delivery as separate components rather than as a single global up/down signal. Incidents are posted in real time with engineer commentary.

Implementation — hosting

Where the bits actually run

Layer | Choice | Details
Origin | Hardened cloud host | Per-app microVM isolation; secrets in a managed secret store
Edge | Cloudflare | WAF, DDoS protection, rate limiting at the edge
Origin protection | Cloudflare-only allow-list | Origin rejects non-Cloudflare source IPs; Cloudflare Tunnel (formerly Argo Tunnel) optional
Database | PostgreSQL with pgvector | Managed Fly Postgres with vector-similarity extension for embeddings
Object storage | S3-compatible (Tigris) | Server-side AES-256 at the storage layer
Compute isolation | Per-app Firecracker VM | Hardware-virtualized isolation between EngineeringID and other Fly tenants
Geographic region | US primary today | Single-region deployment today; multi-region residency is on the roadmap
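
The origin-protection row above comes down to a source-address check. What follows is a minimal sketch of that check, not a description of how the production origin enforces it. The three CIDR blocks are an illustrative subset of Cloudflare's published IPv4 ranges, and the function name is hypothetical; a real deployment would load the current list from Cloudflare and would more likely enforce it at the network layer than in application code.

```python
import ipaddress

# Illustrative subset of Cloudflare's published IPv4 proxy ranges.
# Assumption: a real deployment loads the current list instead of hard-coding it.
CLOUDFLARE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("173.245.48.0/20", "103.21.244.0/22", "104.16.0.0/13")
]


def is_allowed_source(source_ip: str) -> bool:
    """Return True only if source_ip falls inside an allow-listed range."""
    try:
        addr = ipaddress.ip_address(source_ip)
    except ValueError:
        return False  # Malformed addresses are rejected, never trusted.
    return any(addr in network for network in CLOUDFLARE_RANGES)
```

With this subset, `is_allowed_source("104.16.0.1")` is accepted, while `is_allowed_source("8.8.8.8")` and any unparseable string are rejected. The default is deny: an address must match a range to pass.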

Implementation — change management

How code reaches production

Stage | Mechanism | Details
Pre-commit | mix precommit | Compile with warnings-as-errors, format check, full test suite
CI | Same precommit + dependency audit | Every PR runs the same checks the developer ran locally
Branch protection | Required review + green CI | No force-push, no skip-CI on the main branch
Deploy artifact | Specific git SHA | Every release ties to one immutable commit
Rollback | Redeploy a known-good SHA | Single command; no manual revert; no behavioral surprises
Migrations | Reviewed; non-destructive default | Destructive operations require explicit annotation and review
Secrets management | Fly secrets / Cloudflare env | Never in repo; never in build artifact; rotation tracked
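
The deploy and rollback rows above can be read as one rule: a rollback is an ordinary deploy of a SHA that has already shipped. The sketch below models that rule with a hypothetical in-memory ledger. The class and method names are assumptions made for illustration and do not describe the actual release tooling.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Deploy:
    sha: str          # Git commit the release was built from
    ci_passed: bool   # Whether the gating CI run was green


@dataclass
class DeployLedger:
    history: list = field(default_factory=list)

    def deploy(self, sha: str, ci_passed: bool) -> Deploy:
        """Record a deploy; refuse any SHA whose CI run was not green."""
        if not ci_passed:
            raise ValueError(f"refusing to deploy {sha}: CI did not pass")
        release = Deploy(sha=sha, ci_passed=ci_passed)
        self.history.append(release)
        return release

    def rollback(self, known_good_sha: str) -> Deploy:
        """Roll back by redeploying a SHA that was already released."""
        if not any(d.sha == known_good_sha for d in self.history):
            raise ValueError(f"{known_good_sha} was never deployed")
        # A rollback is just another ledger entry, so the audit trail stays linear.
        return self.deploy(known_good_sha, ci_passed=True)
```

Because `rollback` appends a new entry instead of rewriting history, the ledger always shows what ran in production and in what order.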

Implementation — observability & response

When things go wrong, what we do

Area | Mechanism | Details
Logs | Structured JSON | Per-request request_id; user/org/IP redacted at query layer for privacy
Metrics | Telemetry via Phoenix.LiveDashboard | Request rates, query timings, and process stats surfaced in the live dashboard
Error tracking | Sentry | Elixir stack traces with request context; no PII in payloads
Alerting | On-call rotation | Severity-driven escalation per the incident-response runbook
Incident classification | Severity 1-4 | Severity definitions in runbook; escalation thresholds documented
Customer communication | Status page + email | Severity 1-2 incidents post within 15 minutes of detection
Post-incident review | Blameless postmortem | Public for severity 1-2; shared with affected customers regardless
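
The communication and review rows above encode two severity-driven rules, sketched below under stated assumptions. The 15-minute window and the severity 1-2 threshold come from the table. The function names are hypothetical, and treating severity 3-4 as having no fixed posting deadline is an assumption, since the table does not state one.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# From the table: severity 1-2 incidents are posted within 15 minutes of
# detection, and their postmortems are public.
PUBLIC_SEVERITIES = {1, 2}
POST_DEADLINE = timedelta(minutes=15)


def status_post_deadline(severity: int, detected_at: datetime) -> Optional[datetime]:
    """Return the latest time a status post is due, or None if no deadline applies."""
    if severity not in range(1, 5):
        raise ValueError("severity must be between 1 and 4")
    if severity in PUBLIC_SEVERITIES:
        return detected_at + POST_DEADLINE
    return None  # Assumption: severities 3-4 carry no fixed posting deadline.


def postmortem_is_public(severity: int) -> bool:
    """Severity 1-2 reviews are public; every review is shared with affected customers."""
    return severity in PUBLIC_SEVERITIES
```

Keeping the thresholds in named constants means a change to the runbook is a one-line change here, which makes drift between the runbook and the tooling easier to spot.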

The full picture

What is built, what is being built, and what we chose not to build

Live today

Reproducible deploys via fly deploy

Live

Every release ties to one git SHA; rollback is a redeploy of a previous SHA, not a manual revert.

Origin restricted to Cloudflare proxy ranges

Live

Requests that bypass Cloudflare are rejected at the origin. The origin is not part of the public DNS attack surface.

Managed Fly Postgres backups

Live

Backups managed by Fly Postgres with point-in-time recovery available.

Structured logs, telemetry, Sentry error tracking

Live

Per-request request_id flows through the stack. Telemetry events surface in Phoenix.LiveDashboard.

Pre-commit + CI gates with warnings-as-errors

Live

Every PR compiles clean and passes the full test suite before review; the main branch enforces the same.
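
The structured-logging item in this group describes JSON log lines keyed by a per-request request_id. The sketch below shows one possible shape for such a line, together with a read-time redaction step of the kind a query layer might apply. The field names `user`, `org`, and `ip`, and both function names, are hypothetical; the production stack is Elixir, so this Python is illustrative only.

```python
import json

REDACTED_FIELDS = ("user", "org", "ip")  # Hypothetical sensitive field names.


def log_line(request_id: str, message: str, **fields) -> str:
    """Serialize one structured log entry as a single JSON line."""
    entry = {"request_id": request_id, "message": message, **fields}
    return json.dumps(entry, sort_keys=True)


def redacted_view(line: str) -> dict:
    """Parse a log line and mask sensitive fields before showing it to a reader."""
    entry = json.loads(line)
    for name in REDACTED_FIELDS:
        if name in entry:
            entry[name] = "[REDACTED]"
    return entry
```

Redacting on read keeps the stored line intact for authorized forensic use while the default view hides identifying fields. The `request_id` is left untouched so that one request can still be traced across services.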

Building now

SOC 2 Type 1 readiness

Building now

Self-assessed readiness in progress; external auditor not yet engaged.

Type 1 attests to control design at a point in time. Type 2 (operating effectiveness over a 3-12 month observation window) is a separate, later engagement.

Public status page with per-component uptime

Building now

Separate signals for API, sealing, verification, AI, webhook delivery. Real-time incident posting with engineer commentary.

Replaces the current internal-only status surface.

Third-party penetration testing

Roadmap

External penetration test by a third-party firm. No firm engaged yet.

A first engagement is planned alongside the SOC 2 Type 1 readiness work.

Synchronous database replication

Building now

Writes acknowledge only after at least one replica has accepted them; targets single-host failure tolerance.

Automated backup restoration drills

Building now

Periodic restore exercises to a non-prod environment to verify functionality, not just file presence.

Disaster recovery plan with documented RTO / RPO

Building now

Recovery time and recovery point objectives published per environment, with annual DR exercises that test restoration end-to-end.
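
The restoration-drill item above distinguishes a restore that works from a backup file that merely exists. The sketch below illustrates that distinction using SQLite from the Python standard library as a stand-in for PostgreSQL. The `documents` table, the expected row count, and the function name are hypothetical; a real drill would restore into a non-prod Postgres instance and run application-level checks.

```python
import sqlite3


def restore_and_verify(backup: sqlite3.Connection, expected_rows: int) -> bool:
    """Restore a backup into a scratch database and check that it is usable."""
    scratch = sqlite3.connect(":memory:")  # Stand-in for a non-prod environment.
    try:
        backup.backup(scratch)  # Copy the backup's contents into scratch.
        integrity = scratch.execute("PRAGMA integrity_check").fetchone()[0]
        rows = scratch.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
        # Functional check: the data must be readable and complete,
        # not merely present on disk.
        return integrity == "ok" and rows == expected_rows
    finally:
        scratch.close()
```

The drill passes only when the restored copy answers a real query with the expected result. That is the difference between verifying functionality and verifying file presence.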

Roadmap

ISO 27001 certification

Roadmap

Following SOC 2. ISO 27001 maps closely to the same controls; the gap is documentation rigor, not implementation.

HIPAA Business Associate Agreement

Roadmap

For customers in healthcare verticals. Most controls are already in place; the BAA is a contractual layer on top of the technical posture.

EU and APAC regional residency

Roadmap

Per-region database + storage so EU customer data stays in EU and APAC stays in APAC. Driven by GDPR and comparable regional requirements.

Customer-private cloud deployment

Roadmap

A single-tenant deployment running in the customer's own AWS / Azure / GCP account, with the same operational model. For the small set of customers whose compliance needs preclude shared multi-tenant infrastructure.

Considered & rejected

Self-hosted Kubernetes cluster

Considered & rejected

Operating Kubernetes is a full-time job for a security and reliability team larger than ours.

Why we rejected it: Kubernetes gives flexibility, but the security and reliability burden of running it well — patching, etcd backups, network policies, RBAC drift — is a meaningful headcount commitment. Our managed host gives us per-app microVM isolation, edge proximity, and managed PostgreSQL without that overhead. We will revisit if our scale changes the tradeoff.

Skipping pre-commit checks for "urgent" hotfixes

Considered & rejected

A hotfix that bypasses the same checks every other deploy uses is a hotfix that introduces a regression in the next hour.

Why we rejected it: every "we'll skip CI just this once" is followed by a postmortem about a missed test. The right answer is: make CI fast enough that nobody asks. Our pre-commit completes in seconds; CI completes in single-digit minutes. There is no "we don't have time" path.

Single-region deployment without DR plan

Considered & rejected

A region outage that takes EngineeringID down for an afternoon costs more in customer trust than the DR cost saves.

Why we rejected it: cross-region replication adds latency and cost, but it is the table-stakes answer to "what happens when our primary region goes down." Customers do not accept "it was the cloud provider" as an availability story. The deployment is single-region today; the disaster recovery plan being built now is what closes that gap.

"Trust us" attestations in lieu of third-party audit

Considered & rejected

A self-attestation has the same evidentiary weight as no attestation at all.

Why we rejected it: customers ask for SOC 2 because they want a third party to have looked at us, not because they want us to write a confident page. The audit is the point. Our readiness work is preparation for that audit, not a substitute for it; the external auditor is not yet engaged.

Compliance mappings

Controls this surface maps to

SOC 2 A1.1

Availability — SLA commitments

Service credits per Enterprise contract; status page reflects per-component reality

SOC 2 A1.2

Availability — Recovery

Fly Postgres backups with point-in-time recovery; automated restoration drills (in progress)

SOC 2 CC8.1

Change Management

Per-PR review + green CI; reproducible deploys per git SHA

SOC 2 CC4.1

Monitoring — Independent assessment

Annual third-party penetration testing (on the roadmap; not yet engaged)

ISO 27001 A.12.1.2

Change management

Documented change-management process with peer review

ISO 27001 A.17.1.1

Information security continuity

Synchronous database replication and DR exercises (both in progress)

ISO 27001 A.18.2.1

Independent review of information security

External pen test on roadmap; SOC 2 Type 1 readiness in progress

HIPAA §164.308(a)(7)

Contingency Plan

Backup, DR, and emergency-mode procedures; DR plan with documented RTO/RPO in progress

For compliance teams

Questions you do not need to call to ask

When will the SOC 2 Type 2 report be available?
Not yet. SOC 2 Type 1 readiness is in progress on a self-assessed basis, and an external auditor is not yet engaged. A Type 2 report requires a 3-12 month observation window and is a separate, later engagement. We will share the final report under NDA with prospects on completion. In the meantime, we can share the readiness assessment and our control matrix on request.
What is the uptime commitment?
Service credits and SLA terms are defined in the Enterprise contract. We do not publish a public uptime SLA today.
What is the RTO / RPO?
Documented per-environment RTO/RPO targets are part of the disaster recovery plan currently in development. We will publish documented values alongside the SOC 2 report.
Where is customer data stored?
Today, US primary region only. EU and APAC regional residency are on the roadmap.
Do you support customer-private cloud deployment?
Single-tenant deployment in a customer-controlled AWS / Azure / GCP account is on the roadmap. We expect this to serve a small number of customers whose compliance posture precludes any multi-tenant infrastructure regardless of isolation model.
What happens during a security incident?
Severity 1-2 incidents trigger the on-call rotation per the incident-response runbook. Status page posts within 15 minutes of detection with engineer commentary. Affected customers receive direct email. Post-incident review is blameless and shared with affected customers; severity 1-2 reviews are public.
How do you handle subprocessors?
Our subprocessor list (Fly.io, Cloudflare, Stripe, Postmark, Sentry, etc.) is published and updated when it changes. Enterprise customers receive 30 days' notice of new subprocessors with a right to object. Each subprocessor holds its own SOC 2 or ISO 27001 attestation, which we have reviewed.

An operations team that takes the boring parts seriously

Talk to our security team about your compliance posture, or read the audit summary once it lands.