Posted about 6 days ago

SRE / Site Reliability Engineer (Middle / Senior)

Roles & Responsibilities:

  • Ensuring the smooth operation of software, environments and company services
  • Analyzing and improving the performance and availability of products
  • Identifying bottlenecks in the architecture and infrastructure
  • Improving system alerting and incident management
  • Improving the monitoring systems based on SLIs (Prometheus, Icinga, Grafana, etc.)
  • Formalizing SLIs according to the main business requirements
  • Defining SLOs for services and for the infrastructure as a whole
  • Minimizing system recovery time and data loss (RTO and RPO)
  • Analyzing incidents in the production environment
  • Capacity management

Requirements:

  • 3+ years of work experience implementing, troubleshooting, and supporting infrastructure software and distributed systems
  • Experience supporting software written in Golang, Python, and Ruby
  • More than 2 years of experience with virtualization and containerization technologies (containerd, Docker, k8s)
  • Experience setting up CI pipelines of varying complexity (Jenkins) with CD to different environments
  • Experience creating and maintaining fault-tolerant systems with logging, monitoring, and alerting coverage
  • Understanding of the "infrastructure as code" principle and the ability to test it (Ansible, Terraform)
  • Knowledge of network security principles (IPsec, WAF, IPS)
  • Experience maintaining blockchain nodes
  • Availability in US time zones is required

Our Tech Stack:

  • Infrastructure: Bare-metal / AWS
  • Databases: ClickHouse / MySQL
  • SCM: Git / GitHub
  • Message broker: Kafka
  • Repository: Nexus
  • CI/CD: Jenkins
  • Monitoring: Icinga 2, Grafana, Prometheus, VictoriaMetrics, ELK
  • Orchestration: k8s, Ansible, Terraform
  • Containers: LXC, Docker
  • Scripting: Python, Golang, Ruby, Groovy
  • OS: Debian/Ubuntu
  • Others: Docker Compose, IPsec

Hiring Process:

  1. Exploratory Interview
  2. Technical Interview I
  3. Technical Interview II
  4. Challenge
  5. HR decision