Team Lead - Site Reliability Engineering
Overall Job Summary
As Team Lead of Site Reliability Engineering, you will manage and oversee the engineering teams supporting a large-scale distributed application portfolio across on-prem and cloud environments. With a focus on increasing efficiency, eliminating downtime, optimizing cost, and maintaining performance at scale, you will lead our performance management, application security, and reliability processes while managing the health of core E-Commerce systems, site performance, and reliability solutions.
Essential Duties and Responsibilities
- Manages end-to-end availability and reliability of E-Commerce services, systems, platforms, and infrastructure, and ensures they are designed and operated in an optimal manner
- Maintains security and performance of mission-critical applications and services that are part of the E-Commerce ecosystem
- Partners with Information Security to manage application security, vulnerability remediation, and site compliance
- Partners with Cloud and Infrastructure teams to build and maintain environments and to optimize usage and cost with an optimal scaling strategy
- Manages the performance strategy, test execution, and remediation of critical site findings
- Establishes application and synthetic monitoring, alerting, and the execution of failover capabilities and automated self-healing and recovery
- Provides day-to-day support for multiple environments, ensuring readiness for project development and test activities
- Employs strong site reliability principles and practices, and continuously improves processes through automation
- Partners with internal and external teams to ensure all change and release activities are reviewed for trouble-free rollout and reduced risk
- Owns day-to-day health, uptime, monitoring and reliability of the website and related services
- Leads continuous improvement efforts that create an operating environment with dynamic monitoring, alerting, failover capabilities, and automated self-healing and recovery
- Participates in and maintains 24x7 on-call rotations for Site Reliability
- May perform other duties as assigned
Required Qualifications
Experience: 9+ years of experience in performance engineering, application monitoring, and security for an organization with large, complex information systems is preferred. 6+ years of experience in B2B or B2C customer-facing software design and development. 3+ years of experience in cloud PaaS/IaaS environments (Azure, GCP), release management, vulnerability management, and automation.
Education: Bachelor’s degree in Computer Science or related field is required. Any suitable combination of education and experience will be considered.
Preferred Knowledge, Skills or Abilities
- Strong experience with IBM/HCL WebSphere Commerce, IBM Sterling Commerce, Solr, and related build and deployment processes. Experience with HCL Commerce Version 9 is a plus.
- Strong experience with IBM HTTP Server, IBM WebSphere Application Server, IBM MQ, and Deployment Manager ND/Liberty software.
- Strong experience in developing and implementing comprehensive monitoring solutions that provide full visibility into the different platform and application components, using tools and services such as Kubernetes, Prometheus, Grafana, ECK/ELK, Dynatrace, Rigor, Quantum Metric, and similar tools.
- Ability to evaluate performance trends and expected changes in demand and capacity, and to establish appropriate scalability plans.
- Ability to evaluate production traffic patterns and tune the performance test workload mix and strategy to keep systems and applications in continuous readiness.
- Experience designing, implementing, and maintaining Kubernetes, AKS, and Azure cloud platforms through cost-efficient models.
- Experience with containerization, certificate management, Kafka, ZooKeeper, Vaults, pipeline automation, Fisheye, Crucible, and performance and QA test tool integrations.
- Strong experience with Azure cloud PaaS/IaaS environments.
- Strong ability to work independently, work in a fast-paced environment, and manage workload prioritization to deliver high quality work products on time with minimal direction.
- Strong communication skills, both written and verbal.
- Strong critical thinking skills, with the ability to apply proven problem-solving approaches to most problems.
- Experience with iOS and Android mobile app security and performance management is a plus.
Working Conditions
- Hybrid / Flexible working conditions
- Must be able to work some nights and weekends
- Occasional travel required
Physical Requirements
- Sitting
- Standing (not walking)
- Walking
- Kneeling/Stooping/Bending
- Reaching overhead
- Lifting up to 20 pounds
Disclaimer
This job description represents an overview of the responsibilities for the above-referenced position. It is not intended to represent a comprehensive list of responsibilities. A team member should perform all duties as assigned by his/her supervisor.
Company Info
Nearest Major Market: Nashville