Job description
As a Senior DevOps Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our company's digital infrastructure and services. You will work closely with cross-functional teams, including software engineers and system administrators, to design, build, and maintain highly available systems that can handle large-scale traffic and deliver exceptional user experiences.
Your primary focus will be on automating operational tasks, optimizing system performance, and monitoring system health.
Responsibilities:
System Reliability: Monitor and maintain the reliability and availability of the company's digital infrastructure, including servers, networks, databases, and applications.
Documentation: Create and maintain comprehensive documentation for system configurations, procedures, and troubleshooting guides.
Automation: Develop and maintain automation tools and frameworks to streamline operational processes and reduce manual intervention. Automate repetitive tasks and build self-healing systems.
Capacity Planning: Collaborate with the infrastructure team to perform capacity planning and ensure that systems have sufficient resources to handle expected growth and traffic spikes.
Collaboration: Work closely with software engineering and DevOps teams to promote a culture of collaboration and shared responsibility for system reliability and performance.
Deployment and Release Management: Develop and improve deployment and release processes to ensure smooth and error-free deployments. Implement canary releases and A/B testing strategies.
Performance Optimization: Identify system bottlenecks and performance issues, and work with development teams to optimize system performance and scalability.
Continuous Monitoring: Implement monitoring solutions to track system health, performance, and availability. Proactively identify and resolve issues before they impact users.
Incident Management: Respond to and resolve incidents in a timely manner, ensuring minimal downtime and impact on users. Conduct post-incident analysis and implement preventive measures to avoid future incidents.