Site-Reliability Engineer

Location: Toronto

Job ID: 39027

Job Description

Requirements:

Proven experience optimizing batch workloads for performance, reliability, and cost.
Proficiency with CI/CD pipelines (GitHub Actions, Azure DevOps, Jenkins) and Infrastructure as Code (Terraform, Ansible).
Proven experience with containers and orchestration (Docker, Kubernetes).
Excellent incident management and root cause analysis skills.
Linux Systems Expertise: Kernel/OS tuning, networking, filesystem optimization, process management, and troubleshooting.
Dynatrace Mastery: Custom dashboards, KPIs, anomaly detection, tagging strategy, and alerting configuration.
Experience with modern development languages (Python, Java, etc.).
Airflow Expertise: DAG design best practices, SLA management, scheduler/executor tuning, and scaling strategies.