Site Reliability Engineer

YouMail Position #1007

Location:	USA/Remote
Department:	IT Operations Team

Job Description

The YouMail Operations team is looking for Site Reliability Engineers to build and run YouMail's services, which are currently migrating to Kubernetes. The best candidates will have strong cloud architecture and operations skills, along with hands-on experience running bare-metal Linux systems. The SRE team needs a broad range of knowledge to integrate software with YouMail services. We're looking for hardworking and passionate people to join this team. If that sounds like you, we'd love to hear from you.

Requirements

Key Qualifications

Strong sense of ownership, customer service, and integrity demonstrated through clear communication and positive action.
Ability to program in high-level languages such as Java, Ruby, Python, Perl, and C.
Deep understanding of cloud architecture and operations, including migration, resilience, maintainability, and cost efficiency.
Understanding of standard networking protocols and concepts such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load-balancing strategies.
Deep understanding of the Linux operating system, including the kernel, memory, processes, threads, static and shared libraries, IPC, and signals.
Passion for eliminating repetitive manual processes using automation.
Experience with monitoring tools such as Grafana, Nagios, and Kibana.
Knowledge of AWS (EC2, S3, EKS), Rancher, and Kubernetes is a must for success on the team.

What You'll Do

Fault handling and escalation: identify and respond to faults on systems and networks, liaise with third-party suppliers, and handle escalation through to resolution.
Implement and manage access control and security services.
Develop tools and processes to improve efficiency and reduce toil.
Develop, test, and debug automated tasks (apps, systems, infrastructure).
Conduct performance tests, and identify and document application optimizations.
Find and resolve scaling issues by restricting apps, tuning memory, disk, and CPU usage, or adding more capacity to our network.
Actively look for system improvements, recommend changes, and implement them.