DevSecOps and SRE
DevSecOps Principle
Security, reliability, and operations are not final gates. They are design constraints and engineering responsibilities embedded throughout delivery.
Secure Development Baseline
Adopt practices inspired by the NIST Secure Software Development Framework (SSDF):
- Prepare the organization with roles, policies, tooling, and training.
- Protect code and build artifacts.
- Produce well-secured software through secure design, secure coding, code review, testing, and vulnerability management.
- Respond to vulnerabilities with root-cause learning and timely remediation.
Security Practices
Threat Modeling
Use lightweight threat modeling for:
- New internet-facing services.
- Authentication and authorization changes.
- Payment, PII, health, or regulated data.
- AI/agent workflows.
- Third-party integrations.
- Major architecture changes.
Ask:
- What are we protecting?
- Who can attack it?
- What can go wrong?
- How would we detect it?
- How would we recover?
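The trigger list above can be turned into an automated reminder, for example a check on change labels. The following is a minimal sketch, assuming changes carry free-form tags; the tag names and the function `threat_model_reasons` are hypothetical and only mirror the triggers listed in this section.

```python
# Hypothetical change tags, one per trigger in the list above.
THREAT_MODEL_TRIGGERS = frozenset({
    "internet-facing",
    "authn-authz",
    "regulated-data",
    "ai-agent",
    "third-party-integration",
    "major-architecture",
})


def threat_model_reasons(change_tags: set[str]) -> list[str]:
    """Return the tags on a change that call for a threat model, sorted for stable output."""
    return sorted(THREAT_MODEL_TRIGGERS & set(change_tags))


if __name__ == "__main__":
    reasons = threat_model_reasons({"ai-agent", "docs", "internet-facing"})
    if reasons:
        print("Threat model recommended because of:", ", ".join(reasons))
```

An empty result means none of the triggers apply; it does not mean the change is risk-free.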
Supply Chain
Minimum expectations:
- Dependency scanning.
- Lockfiles.
- SBOM for critical products.
- Signed or trusted build artifacts where practical.
- Secret scanning.
- Container image scanning.
- License checks.
- Review of high-risk transitive dependencies.
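As one illustration of the lockfile expectation, the sketch below flags entries that are not pinned to an exact version. It assumes a pip-style requirements file with `==` pins; the function name and the sample package names are hypothetical, and other ecosystems have their own lockfile formats and tooling.

```python
def unpinned_requirements(lines: list[str]) -> list[str]:
    """Return requirement entries that do not pin an exact version with '=='."""
    unpinned = []
    for raw in lines:
        # Drop trailing comments, surrounding whitespace, and line-continuation backslashes.
        line = raw.split("#", 1)[0].strip().rstrip("\\").strip()
        if not line or line.startswith("-"):
            continue  # blank line, comment, or option line such as --hash=... or -r other.txt
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    sample = [
        "examplepkg==1.2.3 \\",
        "    --hash=sha256:0000",
        "otherpkg>=2.0",
        "# tooling",
        "thirdpkg",
    ]
    print(unpinned_requirements(sample))
```

A check like this complements dependency scanning; it does not replace it, because a pinned version can still be vulnerable.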
CI/CD Security
- Least privilege for pipelines.
- Protected branches.
- Required checks.
- Secret-free logs.
- Environment separation.
- Approval gates for production when risk requires it.
- Auditable deployment history.
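Secret-free logs are easier to keep when redaction happens in the logging layer rather than at each call site. The sketch below uses Python's standard `logging.Filter`; the pattern list is an assumption and would need tuning to the secret formats a given pipeline actually handles.

```python
import logging
import re

# Assumed key names; extend this list for the secrets your pipelines use.
SECRET_PATTERN = re.compile(
    r"(password|token|api[_-]?key|secret)(\s*[=:]\s*)(\S+)",
    re.IGNORECASE,
)


def redact(text: str) -> str:
    """Replace the value of key=value or key: value secrets with a placeholder."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Scrub secrets from the fully formatted message before any handler sees it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # the message is already formatted
        return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("deploy")
    log.addFilter(RedactingFilter())
    log.info("calling registry with token=%s", "abc123")
```

Pattern-based redaction is a safety net only. Secret scanning and not passing secrets to log calls in the first place remain the primary controls.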
SRE Principle
Reliability is a product feature. It must be deliberately designed, measured, and traded off against speed and cost.
SLOs, SLIs, And Error Budgets
Start with what users care about:
- Availability.
- Latency.
- Correctness.
- Freshness.
- Durability.
- Throughput.
- Recovery time.
Define:
- SLI: the measurement.
- SLO: the target.
- Error budget: the acceptable miss rate, that is, 100% minus the SLO (a 99.9% SLO leaves a 0.1% budget).
- Policy: what changes when the budget burns too fast.
Example:
| User Journey | SLI | SLO | Error Budget Policy |
|---|---|---|---|
| Login | Proportion of login requests that succeed | 99.9% monthly | Pause risky releases if the burn rate exceeds the agreed threshold |
| Search | P95 search latency | P95 under 500 ms | Prioritize performance work when budget is half consumed |
| Payment | Proportion of transactions that complete correctly | 99.99% monthly | Declare an incident immediately if correctness drops below the SLO |
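The arithmetic behind these terms is short enough to sketch. The example below assumes a 30-day window and a ratio-based SLI (good events over total events); the function names and the release-pause threshold are illustrative assumptions, not values prescribed by this page.

```python
def error_budget(slo: float) -> float:
    """Fraction of events allowed to fail, e.g. an SLO of 0.999 leaves about 0.001."""
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Full-outage minutes the budget permits over the window (30 days assumed)."""
    return error_budget(slo) * window_days * 24 * 60


def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate divided by the budgeted error rate.

    A value of 1.0 consumes the budget exactly by the end of the window;
    higher values exhaust it proportionally sooner.
    """
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / error_budget(slo)


def should_pause_releases(rate: float, threshold: float) -> bool:
    """Apply an error budget policy; the threshold is chosen by the team."""
    return rate > threshold


if __name__ == "__main__":
    # Login journey from the table: 99.9% monthly.
    print(f"downtime budget: {allowed_downtime_minutes(0.999):.1f} minutes")
    rate = burn_rate(bad_events=50, total_events=10_000, slo=0.999)
    print(f"burn rate: {rate:.1f}")
    print("pause releases:", should_pause_releases(rate, threshold=2.0))  # hypothetical threshold
```

With a 99.9% SLO, a 0.5% observed error rate burns the budget about five times faster than planned, which is the kind of signal the policy column is meant to act on.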
Observability
Every important service needs:
- Metrics.
- Logs.
- Traces.
- Dashboards tied to user journeys.
- Alerting on symptoms, not only causes.
- Runbooks.
- Ownership metadata.
Good alerts are actionable, urgent, owned, and linked to diagnosis.
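The four properties of a good alert can be checked mechanically when alert definitions are reviewed. This is a minimal sketch, assuming alerts are described by a simple record; the `AlertRule` fields, the allowed severity values, and the example URL are hypothetical and not tied to any particular monitoring tool.

```python
from dataclasses import dataclass

ALLOWED_SEVERITIES = {"page", "ticket"}  # assumed urgency levels


@dataclass
class AlertRule:
    name: str
    action: str       # what the responder is expected to do
    severity: str     # how urgent the alert is
    owner: str        # team or rotation that receives it
    runbook_url: str  # link to diagnosis steps


def review_alert(rule: AlertRule) -> list[str]:
    """Return the reasons an alert falls short of actionable, urgent, owned, and linked."""
    problems = []
    if not rule.action.strip():
        problems.append("not actionable: no responder action described")
    if rule.severity not in ALLOWED_SEVERITIES:
        problems.append("urgency unclear: severity must be 'page' or 'ticket'")
    if not rule.owner.strip():
        problems.append("not owned: no owner set")
    if not rule.runbook_url.strip():
        problems.append("not linked to diagnosis: no runbook URL")
    return problems


if __name__ == "__main__":
    rule = AlertRule(
        name="login-error-rate",
        action="Roll back the latest login release",
        severity="page",
        owner="identity-team",
        runbook_url="https://example.com/runbooks/login-errors",
    )
    print(review_alert(rule) or "alert passes review")
```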
Incident Response
For significant incidents:
- Declare severity.
- Assign incident commander.
- Assign separate people to internal communication, diagnosis, mitigation, and customer updates.
- Keep a timestamped timeline.
- Mitigate user impact first; do not wait for a complete root-cause analysis.
- Capture follow-up actions.
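A timestamped timeline is simple to keep if every note is recorded through one small structure. The sketch below is one possible shape, assuming UTC timestamps; the `Incident` class and its field names are hypothetical rather than part of any incident tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    title: str
    severity: str
    commander: str
    timeline: list[tuple[datetime, str]] = field(default_factory=list)
    follow_ups: list[str] = field(default_factory=list)

    def log(self, note: str, at: Optional[datetime] = None) -> None:
        """Record a timeline entry, stamped with the current UTC time unless `at` is given."""
        self.timeline.append((at or datetime.now(timezone.utc), note))

    def render_timeline(self) -> str:
        """Return the entries in chronological order, one per line."""
        ordered = sorted(self.timeline, key=lambda entry: entry[0])
        return "\n".join(f"{ts.isoformat()} {note}" for ts, note in ordered)


if __name__ == "__main__":
    incident = Incident(title="Login failures", severity="SEV2", commander="on-call lead")
    incident.log("Severity declared")
    incident.log("Rollback started as mitigation")
    incident.follow_ups.append("Add alert on login success ratio")
    print(incident.render_timeline())
```

Entries logged out of order are sorted when rendered, so late notes with an explicit `at` value still land in the right place for the postmortem.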
Blameless Postmortems
Postmortems should identify:
- Impact.
- Timeline.
- Detection gap.
- Contributing factors.
- What went well.
- What went poorly.
- Action items with owners and due dates.
- System improvements.
Blameless does not mean actionless. It means the organization learns the truth without fear and still follows through.
Toil Reduction
Track repetitive manual work:
- Deployments.
- Data fixes.
- Access requests.
- Support triage.
- Environment setup.
- Release notes.
- Incident reporting.
Automate or eliminate recurring toil. Toil is a tax on innovation.
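Tracking toil becomes actionable once each task has a rough monthly cost attached. The sketch below ranks tasks by hours spent per month; the `ToilItem` fields and the sample numbers are made up for illustration, and real figures would come from the team's own tracking.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    task: str
    runs_per_month: int
    minutes_per_run: float

    @property
    def hours_per_month(self) -> float:
        """Total manual time this task costs per month, in hours."""
        return self.runs_per_month * self.minutes_per_run / 60


def rank_toil(items: list[ToilItem]) -> list[ToilItem]:
    """Order tasks by monthly cost, most expensive first."""
    return sorted(items, key=lambda item: item.hours_per_month, reverse=True)


if __name__ == "__main__":
    # Illustrative numbers only.
    backlog = [
        ToilItem("Release notes", runs_per_month=4, minutes_per_run=45),
        ToilItem("Deployments", runs_per_month=20, minutes_per_run=30),
        ToilItem("Access requests", runs_per_month=40, minutes_per_run=5),
    ]
    for item in rank_toil(backlog):
        print(f"{item.task}: {item.hours_per_month:.1f} h/month")
```

Time spent is only one input. A rare but error-prone task, such as a manual data fix, may deserve automation before a frequent but safe one.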
Team Reference Guide
How To Explain This Page
DevSecOps and SRE are both about trust. DevSecOps asks whether we can design, build, release, and operate software securely. SRE asks whether the product keeps the reliability promises that matter to users.
Security and reliability fail when they are treated as final gates. They work best when they are built into daily engineering: threat modeling, secure pipelines, dependency checks, observability, incident response, SLOs, and postmortems.
Guidelines For Teams
- Threat model new internet-facing, data-sensitive, AI-enabled, or architecture-significant changes.
- Keep secrets out of code, logs, and screenshots.
- Scan dependencies and containers where relevant.
- Define SLOs for critical user journeys.
- Alert on user-impacting symptoms rather than technical noise.
- Run incidents with clear roles: commander, communicator, investigator, mitigator.
- Make postmortems blameless but action-oriented.
- Track toil and automate repeated manual work.
What Good Looks Like
The team can answer: what are we protecting, what reliability promise are we making, how would we know if users are harmed, and how would we recover?
Reflection Questions
- Which service has no clear reliability promise?
- Which security check happens too late?
- Which manual operational task should be automated next?