Site-Reliability Engineer
Location: Toronto
Job ID: 2124
Job Description
Requirements:
- Proven experience optimizing batch workloads for performance, reliability, and cost.
- Proficiency with CI/CD pipelines (GitHub Actions, Azure DevOps, Jenkins) and Infrastructure as Code (Terraform, Ansible).
- Proven experience with containers and orchestration (Docker, Kubernetes).
- Excellent incident management and root cause analysis skills.
- Linux Systems Expertise: Kernel/OS tuning, networking, filesystem optimization, process management, and troubleshooting.
- Dynatrace Mastery: Custom dashboards, KPIs, anomaly detection, tagging strategy, and alerting configuration.
- Experience with modern development languages (Python, Java, etc.).
- Airflow Expertise: DAG design best practices, SLA management, scheduler/executor tuning, and scaling strategies.