Lead Eng, Site Reliability
Overall Job Summary
As a Lead Site Reliability Engineer, you will play a vital role in implementing modern engineering and DevOps techniques to operate a large-scale distributed application portfolio across on-premises and cloud environments, with the goal of increasing efficiency, eliminating downtime, optimizing cost, and maintaining performance at scale. You will provide hands-on technical expertise to design, deploy, secure, and optimize cloud services and deliver the best customer experience. This role is also responsible for maintaining and reporting on the health of the core E-Commerce systems, page performance, and customer experience analytics, while working as an adviser to help identify, educate on, and foster best-in-class site reliability solutions.
Essential Duties and Responsibilities (Min 5%)
- Leads end-to-end availability, security and performance of mission-critical applications and services that are part of the E-Commerce ecosystem
- Drives changes and release activities related to site stability with other teams (internal and external), partnering with the Change Management group to ensure smooth and trouble-free roll out of releases and changes.
- Partners with Information Security to manage application security, vulnerability remediation, and compliance activities with other teams (internal and external).
- Partners with vendors to ensure all critical patches are tested and applied in both Non-Production and Production environments in time to avoid any business and customer impacts.
- Partners with leads and architects across the organization to define the performance strategy and executes performance test activities with other teams (internal and external); partners with QA Performance Test Engineers to ensure all changes are tested in both Non-Production and Production to avoid any business and customer impacts.
- Establishes application and synthetic monitoring and alerting, and executes failover capabilities and automated self-healing and recovery.
- Manages and maintains performance environments, ensuring that these environments are properly set up, configured, and highly available for each project as scheduled.
- Communicates state of reliability to prioritize technical debt and improvements on technology team roadmaps.
- Supports day-to-day health, uptime, monitoring and reliability of the website and related services
- Leads, models, and drives SRE culture and behaviors
- Shares a 24x7 on-call Production support rotation with the team and responds to service incidents.
- May perform other duties as assigned
Required Qualifications
Experience:
- 7+ years of experience in B2B or B2C customer-facing software design, development, and deployments.
- 7+ years of experience in performance engineering and application monitoring for an organization with large and complex information systems is preferred.
- 5+ years of experience with application security threat and vulnerability management and bot traffic management for large-scale B2C or B2B applications.
Education: Bachelor’s degree in Computer Science or related field is required. Any suitable combination of education and experience will be considered.
Preferred knowledge, skills or abilities
- Strong experience with IBM/HCL WebSphere Commerce, IBM Sterling Commerce, SOLR and related build and deployment processes. HCL Commerce Version 9 experience is a plus.
- Strong experience with IBM HTTP Server, IBM WebSphere Application Server, IBM MQ & Deployment manager ND/Liberty software.
- Strong hands-on experience in developing and implementing comprehensive monitoring solutions that provide full visibility into the different platform and application components, using tools and services like Kubernetes, Prometheus, Grafana, ECK/ELK, Dynatrace, Rigor, Quantum Metrics, and other similar tools.
- Strong hands-on experience in identifying and troubleshooting availability and performance issues at multiple layers of deployment, including infrastructure, operating environment, network, application, and integration systems, and in solving customer issues on production deployments.
- Ability to evaluate performance trends and expected changes in demand and capacity and to establish the appropriate scalability plans.
- Ability to evaluate production traffic patterns and tune the performance test workload mix and strategy to keep the systems and applications in continuous readiness mode.
- Strong hands-on experience in developing and implementing comprehensive performance and security solutions using Akamai Performance Management & Security Solutions.
- Strong hands-on experience designing, implementing, and maintaining Kubernetes, AKS, and the Azure Cloud platform through cost-efficient models.
- Strong experience with containerization, certificate management, Kafka, Zookeeper, Vaults, pipeline automation, Fisheye, Crucible, and Performance & QA test tool integrations.
- Strong experience with Azure cloud PaaS/IaaS environments.
Working Conditions
- Normal office working conditions
- Must be able to work some nights and weekends
- Occasional travel required
Physical Requirements
- Sitting
- Standing (not walking)
- Walking
- Kneeling/Stooping/Bending
- Reaching overhead
- Lifting up to 20 pounds
Disclaimer
This job description represents an overview of the responsibilities for the above-referenced position. It is not intended to represent a comprehensive list of responsibilities. A team member should perform all duties as assigned by his/her supervisor.
Nearest Major Market: Nashville