DevOps Engineer
Salary range: 80,000 ~ 120,000 PHP / month
Job Title: DevOps Engineer
Summary:
As a DevOps Professional at Iron Mountain, you will play a crucial role in providing technical support for our computer applications and hardware, including PCs, servers, and mainframes. You will be responsible for answering system-related queries, collaborating with network services, software systems engineering, and application development teams to restore service and identify issues. Your role will also involve managing the observability strategy, application sustainment, and global service delivery to enhance the performance and reliability of our applications.
Key Responsibilities:
Technical
- Design, build, and maintain cloud infrastructure (e.g., AWS, GCP, Azure).
- Automate infrastructure provisioning using tools like Terraform, Ansible, or CloudFormation.
- Develop and maintain continuous integration/continuous deployment pipelines.
- Work with developers to streamline deployment processes using tools like Jenkins or GitLab.
- Analyze system performance and plan for future capacity needs.
- Optimize application performance by fine-tuning system configurations and identifying bottlenecks.
- Write scripts and tools to automate routine tasks (e.g., system updates, scaling).
- Develop self-healing systems that can automatically resolve issues.
- Design and implement backup and disaster recovery strategies.
- Ensure systems are built with redundancy to handle failures and maintain uptime.
- Implement security best practices for systems and applications.
- Regularly perform security audits and manage patching for vulnerabilities.
Troubleshooting
- Respond to incidents, outages, and degraded performance in real time.
- Investigate root causes and implement corrective actions to prevent future issues.
- Participate in on-call rotations to handle production issues, escalating as needed.
- Document issues and incidents in postmortems, identifying key learnings.
- Diagnose and resolve complex infrastructure and application issues (e.g., network latency, database performance).
- Use debugging tools (e.g., tcpdump, strace) to analyze system-level problems.
- Analyze logs and metrics to identify patterns and root causes of failures.
- Use log aggregation tools (e.g., ELK stack, Splunk) to efficiently search for issues.
- Diagnose and troubleshoot network issues, such as DNS resolution failures, packet loss, or connectivity problems.
- Understand the interplay between network architecture and application performance.
Observability & Performance Management
- Implement and maintain monitoring systems (e.g., Prometheus, Grafana, Datadog).
- Ensure proper logging and metrics collection for infrastructure and applications.
- Manage the observability strategy with Engineering/Development and SRE teams to enhance application availability, performance, and reliability.
- Define and manage log-based metrics, alerts, and dashboards using Datadog.
- Support applications built with Google Cloud logging, Identity & Access Management, Cloud network, and projects.
Application Sustainment & Global Service Delivery:
- Ensure continuous availability of critical applications, monitoring uptime and performance against SLOs.
- Collaborate with development teams to ensure that new releases are stable and do not introduce regressions.
- Lead efforts in troubleshooting and resolving application-related incidents.
- Oversee the lifecycle management of applications, including upgrades, patches, and version rollouts.
- Ensure compatibility between application versions and underlying infrastructure.
- Maintain thorough documentation of application architectures, configurations, and known issues.
- Develop and update a knowledge base to provide guidelines for resolving common application issues.
- Plan and implement strategies for scaling services to meet growing demand in specific regions.
- Work with product and engineering teams to ensure that infrastructure is capable of supporting future growth globally.
Required Skills, Background and Experience:
- Bachelor's degree in Computer Science, Engineering, or a related field (4-year degree)
- 5+ years of experience in Information Technology
- Minimum 2 years of experience as a Site Reliability Engineer (SRE)
- Experience with Agile Scrum methodologies
- Cloud Platforms: Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure
- Motivated individual who learns quickly, takes pride in building a new product, and can engage others to accelerate technical solutions
- Familiarity with implementation design patterns and performance challenges involved in supporting a globally available SaaS product
- Experience in working with remote distributed teams
- Experience with CI/CD tooling such as Terraform, Helm, Jenkins, ArgoCD, GitLab CI/CD, Maven, Artifactory
- Experience using, building, and optimizing an observability stack with Grafana, Datadog, and Prometheus, and alerting and notification tools such as OpsGenie
- Experience with monitoring and logging: Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), Kafka, Loki, CloudWatch
- Experience interfacing with and deploying services to cloud platforms
- Extensive experience working with managed Kubernetes services in the Cloud (AWS, GCP and Azure).
- Experience in developing scalable microservices and API gateways
- Experience with server-oriented architectures and web platform applications, with the ability to define and integrate with APIs and REST services
- Deep understanding of cloud networking concepts such as VPC, ingress, subnetting, TCP/IP, DNS, load balancers, network topologies, and CIDR notation.
- Experience in software application development using Python, Java, or .NET.
- Familiarity with secret management platforms like Thycotic/Delinea Secret Server, Akeyless, and/or Thycotic DevOps Secret Vault.
Additional Requirements:
- Strong problem-solving skills and ability to work under pressure.
- Excellent communication and collaboration skills.
- Ability to analyze complex data and situations to make informed decisions.
- Scrum Master, PMP, or Agile SAFe certification (optional)
Company address:
No. 110, Lane 228, Section 2, Nankan Road, Luzhu District, Taoyuan City
Other:
Our purpose: To protect and elevate the power of our customers’ work
We protect, unlock, and extend the value of your information and assets throughout their lifecycle. But we see this as so much more... it’s your work. And in that work lies the insight and power to accelerate your business and drive your organisation forward. At Iron Mountain, we elevate the power of your work.
How we do it:
- Acting as your one strategic partner to manage the full lifecycle of your assets, from creation to disposition
- Optimising “anywhere, anytime” workspaces as real estate investments change
- Improving productivity and efficiency through workflow automation and information governance
- Lowering risk exposure across IT assets, personal devices, sensitive files, and records
- Reducing the use of fossil fuels through sustainable practices
- Bridging the gap between physical and digital, and unlocking value from data to make informed business decisions
At Iron Mountain, we understand that our most valuable asset is our customers' trust. They can rely on us to help them keep in compliance with regulatory changes and manage your information throughout its lifecycle with retention policies and associated advisory services. Whether that information is in physical or electronic format.
From the everyday to the extraordinary, we protect what you value most - and securely store information and assets with a chain of custody that ensures they are accessible and recoverable when needed.
2025-03-04