Recruitment Room Team
AWS Site Reliability Engineer (SRE)
Job Description
We’re looking for an AWS Site Reliability Engineer (SRE) to help us build and operate highly reliable, secure, and scalable cloud platforms. This role is ideal for someone who thrives at the intersection of software engineering, cloud infrastructure, and operations, and enjoys automating everything.
As an AWS SRE, you’ll be a key player in shaping our cloud environment, mentoring engineers, and ensuring our AWS workloads are secure, cost-efficient, and always available.
Duties and responsibilities:
You will be responsible for building and operating resilient infrastructure, automating operational processes, and driving continuous improvements in system performance and availability. This role requires a balance of hands-on technical expertise, problem-solving skills, and a passion for delivering highly reliable services that support critical business operations.
- Reliability & Uptime
  - Design, implement, and maintain highly available and resilient AWS cloud infrastructure.
  - Monitor system health and performance, ensuring services meet SLAs.
  - Respond to and resolve production incidents, performing root cause analysis and implementing long-term fixes.
- Automation & Scalability
  - Build automation for deployment, monitoring, scaling, and recovery using Infrastructure as Code (Terraform, AWS CDK, CloudFormation).
  - Automate repetitive operational tasks to reduce toil and improve system reliability.
  - Implement CI/CD pipelines to ensure smooth and reliable delivery of applications.
- Monitoring & Observability
  - Configure and manage observability solutions (CloudWatch, Grafana, etc.).
  - Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
  - Develop proactive alerting and anomaly detection mechanisms.
- Security & Compliance
  - Apply AWS security best practices, including IAM governance, secrets management, encryption, and compliance monitoring.
  - Work closely with InfoSec teams to ensure systems adhere to regulatory standards (e.g., PCI DSS, POPIA, GDPR, ISO/IEC 27001).
  - Perform regular audits of cloud resources, ensuring alignment with organizational policies.
- Performance & Cost Optimization
  - Continuously optimize cloud infrastructure for performance, efficiency, and cost-effectiveness.
  - Analyse usage patterns and right-size resources or recommend reserved/spot instances where appropriate.
  - Provide visibility into AWS spend and assist teams in cost governance.
- Incident & Problem Management
  - Drive post-incident reviews, documenting learnings and improving runbooks.
  - Develop self-healing and fault-tolerant systems to minimize the impact of failures.
- Collaboration & Continuous Improvement
  - Partner with development teams to embed reliability, scalability, and observability into applications.
  - Advocate and implement SRE best practices across the organization.
  - Mentor engineers on AWS, DevOps, and reliability engineering practices.
Required experience:
- Strong experience (more than 5 years) with AWS services (EC2, ECS/EKS, Lambda, RDS, DynamoDB, S3, CloudFront, VPC, Route 53, IAM).
- Expertise in Infrastructure as Code (Terraform, AWS CDK, CloudFormation).
- Proficiency in monitoring & observability tools (CloudWatch, Grafana, ELK/OpenSearch).
- Experience with CI/CD pipelines (GitHub Actions, GitLab CI, AWS CodePipeline).
- Knowledge of containerization & orchestration (Docker, Kubernetes, ECS, EKS).
- Strong scripting/coding skills (Python, Bash, Go, etc.).
- Experience with incident management & on-call operations.
Required qualifications:
- AWS Professional certifications.
- Experience running Kubernetes/EKS in production.
- Knowledge of compliance frameworks (ISO/IEC 27001, SOC 2, PCI DSS, POPIA).
Key competencies:
- Problem-solving mindset with a focus on root cause analysis and prevention.
- Strong communication skills to collaborate across engineering, security, and business teams.
- Ability to prioritize reliability, scalability, and performance in production systems.
- Continuous improvement mindset, with a passion for automation and efficiency.
Why this role:
As an AWS Site Reliability Engineer, you’ll be at the centre of our mission to deliver secure, reliable, and scalable digital services. This is not just about keeping systems running; it’s about designing for resilience, driving automation, and enabling our business to innovate at speed while staying safe and compliant.