About Us
We Reimagine Everything.
We are a multinational technology consulting firm. We help companies
scale their operations, achieve technology innovation,
elevate their brand and transform their business model.
We are here to challenge the status quo, flip the script, and blur all the
lines in order to create customized end-to-end tech solutions, from
software to hardware. We are a team of over 500 engineers from around
the world with one shared goal: to leverage and crisscross technology,
creative thinking, and industry-specific expertise to help our customers
become and remain high performers in their industries. Basically, we
take care of it all from A to Z.
Our expert engineers have contributed to 8 US patents and developed
award-winning innovative tech solutions, serving 80M+ users for over 100
clients worldwide, including top US Fortune 500 companies.
Job Description
This is a remote position.
We are seeking a highly skilled Senior Site Reliability Engineer
(SRE) to join our Platform Engineering team. The ideal candidate will
have a strong understanding of DevOps and Service Level Management (SLM)
metrics, as well as experience working on event-driven infrastructure
projects using tools like Terraform, New Relic, Kubernetes, AWS, and
Kafka.
As a representative of Platform Engineering, you will play a
critical role working with other engineering teams to ensure our
platform infrastructure tooling fulfils their needs and has a positive
impact on Developer Experience. You will also help them determine the
right settings and thresholds for triggering alerts or automations on
their applications.
Key Responsibilities:
- Scalability and High Availability: Design, implement, and
maintain scalable and highly available systems using load balancing,
auto-scaling patterns, canary releases, and blue-green deployments.
- Monitoring, Logging, and Observability: Develop and
maintain monitoring and logging dashboards using tools like New Relic,
Prometheus, Grafana, and Datadog.
- Ensure observability through metrics, tracing, log aggregation, and alerting.
- Alerting and Automation: Help teams determine the right
settings and thresholds for triggering alerts or automations on their
applications.
- Understand that each application has different performance
requirements, such as varying acceptable response times or resource
constraints.
- System Performance and Reliability: Monitor, optimize, and ensure system reliability and performance using tools like New Relic.
- Apply DORA metrics to measure and improve development and operational performance.
- Ensure compliance with SLM metrics like SLAs, SLOs, and SLIs by tracking uptime, response times, and resolution times.
- Resiliency: Implement and advocate for "Chaos" engineering practices to ensure system resiliency.
- Collaboration: Work with cross-functional teams to enhance
platform engineering practices and gather the right information for
metrics analysis.
Requirements
- Proven experience working with Infrastructure-as-Code tooling, like Terraform, for infrastructure management.
- Strong understanding of scalability and high availability
patterns, including load balancing, auto-scaling, canary releases, and
blue-green deployments.
- Strong understanding of DevOps metrics (like DORA) and
their application in measuring and improving development and operational
performance.
- Strong understanding of Service Level Management (SLM)
metrics (like SLAs, SLOs, and SLIs) and their importance in defining,
monitoring, and ensuring compliance of the services bound to them.
- Experience with monitoring, logging, and observability tools like New Relic, Prometheus, Grafana, and Datadog.
- Experience working with Kafka and improving the performance of
event-driven, real-time data processing and streaming projects and
architectures.
- Familiarity with tooling used for SLM, DevOps, and DORA metrics, like Apache DevLake, Grafana, and New Relic.
- Experience working with AWS, Azure or GCP for cloud infrastructure management.
- Experience working with CI/CD pipeline tools such as GitHub Actions, Jenkins, GitLab CI, or similar.
- Strong analytical skills to analyze and interpret metrics and drive improvements.
- Strong communication skills to effectively collaborate with team members and stakeholders.
Nice-to-haves:
- Familiarity with Observability-as-Code tooling and practices.
- Familiarity with "Chaos" engineering practices for system resiliency.