Lead Site Reliability Engineer
Company: Kolter Solutions
Location: Orlando
Posted on: March 17, 2023
Job Description:
Kolter Solutions is seeking a Lead Site Reliability
Engineer.
Location: Remote.
Responsibilities:
- Scale up and mature a team of Site Reliability
Engineers
- Build and execute a vision to implement tools for monitoring,
efficient deployments, and remediation SOPs
- Strategize and prioritize roadmaps to drive reliability of
platform(s)
- Collaborate with key stakeholders across Products, IT, and
Security on initiatives to drive operational excellence,
instrumentation, security, growth, reliability and
scalability
- Promote the collaboration of all engineering teams in the
sustainability of the platform and help promote a culture of
quantifiable continuous improvements
- Drive service reliability by developing and enabling metric
visibility using KPIs and system/component level SLAs
- Serve as a change agent for driving service prioritization and
help promote a culture of continuous improvement measured by
operational metrics and KPIs. Provide a single pane of glass across
all critical operational components
- Drive end-to-end resolution of production incidents including
root cause analysis, and prevention and correction plans
- Support business infrastructure to ensure service availability,
including outside of business hours as needed
- Optimize services across the company to manage costs, including
right-sizing and deprecating systems
- Contribute to the technology strategy by guiding the production
and development technical architecture; maintain high quality
standards, especially with technology, and foster a culture of
long-term thinking and innovation
- Support and unblock the SRE team in delivering on its goals;
you will oversee the technical scoping and planning for the team,
and help guide and empower the development approach within the
team
- Ensure the SRE team is high performing, with a healthy,
inclusive and collaborative culture; coach engineers on the team
and guide them through a fulfilling career
- Research new technologies to solve tomorrow's deployment,
monitoring, and scaling needs
- Run the production environment by monitoring availability and
taking a holistic view of system health
- Build software and systems to manage platform infrastructure
and applications
- Improve reliability, quality, and time-to-market of our suite
of software solutions
- Measure and optimize system performance, with an eye toward
pushing our capabilities forward, getting ahead of customer needs,
and innovating for continual improvement
- Provide primary operational support and engineering for
multiple large-scale distributed software applications
- Manage and participate in 24x7 on-call rotations to ensure site
reliability and performance
- Define best practices for monitoring, alerting, and incident
management
- Lead and participate in root cause analysis and documenting
procedures
Required:
- 10+ years of relevant professional experience in highly
available, public facing SaaS / EMR environments
- Exemplary written and oral communication skills
- Experience leading highly dynamic on-call teams, coaching,
mentoring, and promoting cross team collaboration
- Experience managing multiple projects and priorities
simultaneously
- Proven track record of improving reliability, availability,
incident/crisis management and performance of cloud
services
- Experience troubleshooting and developing highly available
systems that utilize load balancing, horizontal scalability, and
high availability
- Strong technical expertise in troubleshooting, cloud stacks,
operating systems, networking, virtualization, and
containers
- Working knowledge of CI/CD, DevOps, and sophisticated software
deployment techniques
- Experience defining reliability metrics and operations processes,
including problem management and automation
- Experience implementing chaos engineering
Knowledge:
- 4+ years' experience managing a team working with cloud
infrastructure (in particular AWS) in a secure environment
(ISO27001, SOC 2 type 2, GDPR, etc.)
- 3+ years of technical operations experience, with a background
in SaaS and cloud-based platforms
- Experience dealing with environments that leverage container
orchestration tools, e.g. Kubernetes
- Experience building scalable and fault tolerant
systems
- Experience migrating from Data Centers to Cloud based
solutions, and migrating solutions from other cloud
providers
- Past experience successfully leading one or more DevOps
projects (CI/CD, pipeline tools, operations management, etc.) to
completion through tools like Jenkins, Helm, Terraform,
etc.
- Experience with system health monitoring tools such as New
Relic, OpsGenie, Uptime Robot, or StackDriver
- Have actively managed hosting at scale at multiple companies,
including costs and 24x7 uptime
- Experience with distributed storage technologies such as NFS,
HDFS, Ceph, and Amazon S3, as well as dynamic resource management
frameworks (Apache Mesos, Kubernetes, Yarn)
- Proficiency with scripting and/or programming languages;
Python, Java, C/C++, Ruby, and JavaScript preferred
- Familiarity with Intranet tools and processes including
Confluence, Jira, and Microsoft Teams
Kolter Solutions is a leading professional staffing company based
in Central Florida. We place highly skilled individuals in
contract, contract-to-hire, and direct-hire positions at clients
nationwide.
Kolter Solutions has proudly been recognized among the "Best Places
to Work" by the Orlando Business Journal and Staffing Industry
Analysts (SIA). We are also in the Fast 50 2020 fastest-growing
companies in Central Florida!
We offer:
- Full Health Benefits
- Vision
- Dental
- 401(k)
- Pet Insurance
- Life Insurance
- Supplemental Benefits such as short-term disability, accidental
insurance, and supplemental dental and vision.
- Employee Discounts
- Referral Program
Kolter Solutions is an Equal Opportunity Employer. We believe in
hiring a diverse workforce and sustaining an inclusive,
people-first culture. We are committed to non-discrimination on any
protected basis, such as disability and veteran status, or any
other basis covered under federal, state or local applicable
law.