Site Reliability Engineer (Reliability Engineer), Software Shared Services
Company: ViewSonic International Corporation
We are seeking a highly skilled Senior Site Reliability Engineer (Sr. SRE) to enhance the reliability, scalability, and performance of our systems. The ideal candidate will have a strong background in cloud infrastructure, automation, and observability tools, with the capability to lead technical initiatives and mentor team members. While this role does not require direct people management, leadership in project execution and technical guidance is essential.

Key Responsibilities:
• Technical Leadership: Lead the design, implementation, and optimization of complex systems to ensure high availability and performance.
• CI/CD Pipeline Development: Develop and maintain robust Continuous Integration and Continuous Deployment pipelines to streamline software releases.
• Cloud Architecture Management: Support the design and management of scalable and secure cloud architectures, preferably on AWS.
• Automation and Infrastructure as Code (IaC): Implement and manage infrastructure using tools like Terraform and Terragrunt to ensure consistent and repeatable deployments.
• Observability Implementation: Set up and maintain monitoring, logging, and alerting systems using tools such as Prometheus and Grafana to ensure system health and performance.
• On-Call and Incident Management: Develop and enforce on-call policies using platforms like PagerDuty or Opsgenie, and manage stakeholder communications during incidents.
• SLI/SLO/SLA Definition: Define and monitor Service Level Indicators, Objectives, and Agreements to align with business reliability goals.
• Performance Optimization: Analyze system performance metrics and execute tuning strategies to enhance efficiency and scalability.
• Mentorship: Provide technical guidance and mentorship to junior SREs, fostering a culture of continuous learning and improvement.

Qualifications:
• Education: Bachelor’s degree in Computer Science, Software Engineering, or a related field; advanced degrees are a plus.
• Experience: 5+ years in Site Reliability Engineering, DevOps, or related roles, with a proven track record of leading technical projects.
• Cloud Expertise: In-depth experience with cloud services, preferably AWS, including EC2, S3, RDS, Lambda, and VPC configurations.
• Automation Skills: Proficiency in Infrastructure as Code (IaC) tools such as Terraform and Terragrunt.
• CI/CD Proficiency: Hands-on experience setting up and managing CI/CD pipelines using tools like Jenkins, GitLab CI, or similar.
• Observability Tools: Strong knowledge of monitoring and observability tools, including Prometheus, Grafana, ELK stack, or similar.
• On-Call Management: Experience in establishing and managing on-call rotations and incident response protocols using platforms like PagerDuty or Opsgenie.
• SLI/SLO/SLA Management: Ability to define, implement, and monitor service reliability metrics aligned with business objectives.
• No-Touch Policy Implementation: Demonstrated experience in implementing automation strategies to reduce manual interventions in production environments, enhancing operational efficiency and security.
• Programming Skills: Proficiency in scripting and programming languages such as Python, Go, or similar for automation and tool development.
• Soft Skills: Excellent problem-solving abilities, strong communication skills, and the ability to work collaboratively in a team environment.

Preferred Qualifications:
• Scalability Principles: Understanding of scalability concepts and experience in designing systems that handle growth efficiently.
• Error Budgeting: Familiarity with error budgeting practices to balance innovation and reliability.
• Security Awareness: Knowledge of security best practices in cloud environments.
• Disaster Recovery Planning: Experience in designing and implementing disaster recovery and business continuity plans.

Company address: New Taipei City, Taiwan
Other: None
2025-01-14