Site Reliability Engineer

Who we are:

We create the data products & technology that make advertising work better for people. Choreograph is a global data products and technology company, purpose-built for an era that demands an innovative approach to data management, usage, and brand growth.

Data is the fuel that powers growth. Companies that make the best use of their data gain lasting advantages over their competitors while connecting with customers more effectively.

Our goal is to help future-focused businesses use their data in ways that meet savvy customers’ expectations while building trust and understanding.

We are over 700 data scientists and strategists, technologists, and product gurus. You will find us in over 17 markets around the world. We offer a modular product suite, empowering marketers to drive sustainable, data-enabled growth.



About Choreograph Create:

Choreograph Create is a dynamic advertising platform built for WPP agencies, enabling brands to achieve better outcomes through the power of addressable, data-driven creative. The Create platform caters for the full end-to-end process of producing, activating, and measuring the success of dynamic advertising campaigns. This includes modular template creation, management of complex decisioning logic, and the activation of campaigns through programmatic integrations.



About this role:

The site reliability engineer has a broad set of responsibilities: helping the team design a platform that works reliably and with low latency across multiple data centers; helping the team design and implement software that covers the key qualities of software architecture (reliability, scalability, resiliency, performance); designing and running assessments and tests to verify these qualities; supporting the team in recovering quickly from outages (incidents); implementing automation tools for CI/CD pipelines; and helping the team develop good practices around monitoring and incident response.

Who we are looking for:

We are looking for a site reliability engineer who is excited by the opportunity to contribute to the growth of the Choreograph Create platform. The site reliability engineer must have hands-on experience debugging both automated and human processes, and experience working in both software engineering and automation. The engineer enjoys teaching and practicing site reliability concepts with team members, can find a balance in all things, and has experience managing stateful distributed systems.



Role requirements:

  • Degree in Computer Science (or equivalent);
  • 6+ years of experience (or equivalent) in the field of site reliability (and programming);
  • Designing and implementing software that improves stability, scalability, availability, and latency using well-established frameworks/solutions/platforms (AWS or GCP, ideally both; Terraform, Docker, K8s, Helm, Prometheus, Grafana, Datadog, Elasticsearch, Kibana, Redis, Kafka, chaos engineering tools);
  • Setting up system health monitoring and automated processes to prevent outages (e.g. define SLIs (Service Level Indicators) and related SLOs (Service Level Objectives) to alert about production health issues);
  • Defining corrective actions and supporting quick recovery from actual outages (e.g. incident management, postmortems, cross-team collaboration);
  • Implementing automation tools for continuous integration/delivery/deployment of secure code (e.g. CI/CD knowledge, Git, Docker, K8s, ephemeral environments, GitLab pipelines, GitLab SAST);
  • Helping the team to develop good practices around monitoring and response (e.g. on-call escalation procedure, alert configuration through Prometheus/Alertmanager, incident declaration process);
  • On top of all the above, it’s a plus if you have:
    • Knowledge of how to design, build, and run assessments/tests to verify stability, scalability, availability, and latency (e.g. maturity self-assessment practices);
    • The ability to program in one or more high-level languages (such as Python or Clojure), with a proven track record in automation and an algorithmic approach to solving problems;
    • In-depth knowledge and experience in at least one of: troubleshooting, host-based networking, Linux or UNIX engineering, systems programming, distributed systems, databases, cloud computing; and a desire to learn more.



Success attributes:

  • Show why you applied for this position;
  • Show your high energy and passion for the SRE (Site Reliability Engineering) practice;
  • Show that you are a motivated self-starter who is self-reliant, resilient, curious, and ambitious;
  • Demonstrate that you are comfortable and thrive in a fast-paced, entrepreneurial, start-up environment, fitting into our culture.


Harris Muhammed


