SRE Expert

Tehran
Full Time
Bachelor's Degree
Expert

Description/Tasks

A Site Reliability Engineer (SRE) plays a pivotal role in ensuring that an organization's IT services and infrastructure are highly available, scalable, and efficient. This position often involves a blend of development, operations, and troubleshooting tasks.

System Reliability and Availability: Ensure high availability and reliability of services and infrastructure. This includes proactive monitoring, incident response, and post-mortem analysis to prevent recurrence of incidents.

Performance Management: Monitor and optimize system performance to meet service level objectives (SLOs) and service level agreements (SLAs). This involves understanding and managing the capacity and scalability of services.

Incident Management and Response: Lead the response to system outages and performance issues, including on-call duties. Develop automation tools that speed up incident resolution and help prevent recurrence.

Automation and Tooling: Design and implement automation tools and frameworks to reduce manual operational work. This could include scripts for deployment, monitoring, and infrastructure management.

Cross-functional Collaboration: Work closely with development teams to design and implement scalable, reliable, and efficient systems. This involves providing input on architectural decisions, optimizing resource utilization, and ensuring system resilience.

Continuous Improvement: Continuously analyze current processes and systems for improvement opportunities. Implement best practices for system reliability and availability.

Disaster Recovery and Backup: Develop and maintain disaster recovery and backup plans, and test them regularly to ensure system resilience.

Documentation: Maintain detailed documentation of the system architecture, configurations, processes, and service records so that knowledge is shared and accessible within the team.

Requirements/Skills

Education: A bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.

Experience: Proven experience in a site reliability engineering role or similar, with a strong background in software development and system administration.

Technical Skills:

- Proficiency in one or more programming languages.

- Experience with cloud services and with containerization and orchestration tools (Docker, Kubernetes).

- Strong understanding of networking principles and protocols.

- Experience with continuous integration and deployment (CI/CD) practices.

Problem-Solving Skills: Ability to troubleshoot and resolve complex technical issues under pressure.

Communication Skills: Excellent verbal and written communication skills, with the ability to effectively communicate technical concepts to non-technical stakeholders.

Teamwork: Ability to work collaboratively in a cross-functional team and interact effectively with developers, operations teams, and management.

Job Benefits

Loans

Health insurance

Game room

Snacks

Breakfast

Lunch

Occasional packages and gifts

Learning stipends

Resting space