【Platform Team】Site Reliability Engineer

We are looking for a Site Reliability Engineer (SRE) to keep our cloud-based commerce platform up, running, and healthy. As an SRE for iKala Commerce, you will be responsible for everything from our cloud infrastructure and operating systems to developing tools for code deployment and service monitoring. You will also review our code and system design and partner with developers to build our applications.

The SRE is an integral member of our product development team. You will be part of the team that makes crucial decisions about how to manage and scale complex, high-performance distributed systems. You will also bring your own perspective on our backend systems and continually develop innovative ways to improve how we manage the underlying infrastructure.

Our ideal candidate is able to develop applications independently, but is even more eager to accelerate the whole team by building systems that improve performance and operational efficiency. Ultimately, you should be involved in all stages of software development to define and improve our SLOs, SLAs, and SLIs.

Our current tech stack includes: GCP, Terraform, Kubernetes, Helm, ArgoCD, GitLab CI/CD, Grafana LGTM.

【Key Responsibilities】

Designing and implementing infrastructure for collecting metrics, crunching data, and improving service monitoring to detect problems before they’re visible to our customers.
Building systems to automate our server lifecycle, from configuration management and CI/CD to server bootstrap and decommission.
Troubleshooting, performing root cause analysis, and resolving production issues from the application and network layers all the way down to the system level.
Participating in solution design and advising other developers when building new features so that they’re scalable, maintainable, and performant.
Improving the observability of our applications through monitoring, alerting, logging, tracing, and profiling, and building such observability features into a common platform.
Practicing sustainable incident response and blameless postmortems.
Proactively identifying and reducing issues through design, testing, and implementation of software-based solutions.

More info: https://www.ikala.ai