Recruitment Room Team

Platform Engineer – Sandton

Sandton, Gauteng

Job Description


Key Responsibilities
Reliability & Operations

  • Own uptime, performance, and monitoring for all production applications.
  • Manage Heroku pipelines, CI/CD, review apps, and production environments.
  • Operate Celery workers and queues, monitor their health, and handle missed task check-ins.
  • Define and track service level objectives (SLOs) for availability, latency, and task success rate.
  • Maintain runbooks and a centralised wiki for incident response, and lead post-mortems.
  • Run periodic disaster recovery drills and coordinate penetration tests.

 
Platform Engineering

  • Keep environments current (Heroku stacks, Postgres/Redis versions, DigitalOcean/AWS base images).
  • Manage daily backups and ensure restore tests and disaster recovery runbooks are in place.
  • Standardise infrastructure (Terraform or scripts for DigitalOcean/AWS; app.json for Heroku).
  • Manage Cloudflare for DNS, edge security, and performance optimisation.
  • Tune performance (DB indices, query optimisation, cache usage, Celery queue design).
  • Optimise infrastructure costs across Heroku, DigitalOcean, and AWS.

 
Developer Experience & CI/CD

  • Maintain CI pipelines with type checking, linting, and security scanning.
  • Enforce test coverage and automate deploy checks (smoke tests, migration health, error budgets).
  • Support developers with tooling for local/staging environments and build self-service dashboards (e.g., Celery queue status).
  • Collaborate with developers to streamline workflows and educate them on secure coding practices.

Security & Compliance

  • Own vulnerability management and dependency patching cadence.
  • Manage access reviews, secrets, MFA/SSO, and enforce least-privilege IAM policies.
  • Implement encryption for data at rest and in transit (e.g., S3 server-side encryption).
  • Contribute evidence and responses for security questionnaires and SOC 2 audits.
  • Maintain a “security pack” with architecture, sub-processors, and DR/backup processes.

 
Monitoring & Alerting

  • Configure Sentry ownership rules, Cron Monitors, and release health.
  • Centralise metrics/logs (Heroku metrics, Papertrail, Sentry, APM, Prometheus/New Relic).
  • Set up alerts on golden signals (latency, errors, traffic, saturation) and avoid alert fatigue.
  • Conduct capacity planning and track resource usage trends.

 
Vendor & External Services

  • Evaluate and manage vendor relationships (e.g., Mailgun, Twilio) to ensure service level agreements (SLAs) and performance targets are met.
  • Assess new tools/services to enhance platform capabilities (e.g., observability, security).
  • Track costs, security posture, and integration quality for all third-party services.

 

Must-Have

  • Cloud infrastructure management: 3+ years operating production apps on Heroku, AWS, DigitalOcean, or similar.
  • CI/CD pipelines: Hands-on experience with GitHub Actions, Heroku CI, or equivalent; solid Git fundamentals.
  • Monitoring & incident response: Experience with Sentry, Papertrail (or similar), logs, and uptime/performance dashboards.
  • Security fundamentals: Understanding of IAM, encryption in transit/at rest, MFA/SSO, and secure configuration practices.
  • Disaster recovery & backups: Experience implementing and operating automated backups, restore testing, and writing/maintaining incident runbooks.
  • Communication & collaboration: Ability to document processes clearly and work closely with developers in a small team.

 
Strong Plus

  • Infrastructure as Code & automation: Experience with Terraform, Docker, or equivalent tooling.
  • Asynchronous workloads: Familiarity with Celery, Redis, or other task queues and message brokers.
  • Scaling & cost optimisation: Capacity planning, performance tuning, and managing infra spend.
  • Compliance frameworks: Exposure to SOC 2, GDPR, or supporting client security questionnaires.
  • Incident management: Participation in on-call rotations, leading post-mortems, or serving as incident commander.

Nice-to-Have

  • Proficiency in Python; familiarity with Django/Flask.
  • Experience with DNS/CDN/edge security (e.g., Cloudflare).
  • Observability platforms (Prometheus, Grafana, New Relic).
  • Static analysis and code quality tools (mypy, Bandit, SonarQube).
  • Prior exposure to multi-tenant SaaS environments.
  • Certifications (AWS Certified DevOps Engineer, CKS, or equivalent).