Remote Lead Eng, Site Reliability

Overall Job Summary

As a Lead Site Reliability Engineer, you will play a vital role in implementing modern engineering and DevOps techniques to operate a large-scale distributed application portfolio across on-premises and cloud environments, increasing efficiency, eliminating downtime, optimizing cost, and maintaining performance at scale. You will provide hands-on technical expertise to design, deploy, secure, and optimize cloud services and deliver the best customer experience. This role is also responsible for maintaining and reporting on the health of the core E-Commerce systems, page performance, and customer experience analytics, while working as an adviser to help identify, educate on, and foster best-in-class site reliability solutions.

Essential Duties and Responsibilities (Min 5%)

  • Leads end-to-end availability, security, and performance of mission-critical applications and services that are part of the E-Commerce ecosystem.
  • Drives changes and release activities related to site stability with other teams (internal and external), partnering with the Change Management group to ensure a smooth and trouble-free rollout of releases and changes.
  • Partners with Information Security to manage application security, vulnerability remediation, and compliance activities with other teams (internal and external).
  • Partners with vendors to ensure all critical patches are tested and applied in both Non-Production and Production environments in time to avoid any business or customer impact.
  • Partners with leads and architects across the organization to define the performance strategy and executes performance test activities with other teams (internal and external); partners with QA Performance Test Engineers to ensure all changes are tested in both Non-Production and Production to avoid any business or customer impact.
  • Establishes application and synthetic monitoring and alerting, and executes failover capabilities and automated self-healing and recovery.
  • Manages and maintains performance environments, ensuring that these environments are properly set up, configured, and highly available for each project as scheduled.
  • Communicates state of reliability to prioritize technical debt and improvements on technology team roadmaps.
  • Supports day-to-day health, uptime, monitoring, and reliability of the website and related services.
  • Leads, models, and drives SRE culture and behaviors.
  • Shares a 24x7 On-Call Production support rotation with the team and responds to service incidents.
  • May perform other duties as assigned.


7+ years of experience required
Bachelor's degree is preferred
Any suitable combination of education and experience will be considered.

High Demand IT Specialized Skills

Platform Knowledge

Preferred knowledge, skills or abilities

  • Strong experience with IBM/HCL WebSphere Commerce, IBM Sterling Commerce, Solr, and related build and deployment processes. HCL Commerce Version 9 experience is a plus.
  • Strong experience with IBM HTTP Server, IBM WebSphere Application Server, IBM MQ, and Deployment Manager ND/Liberty software.
  • Strong hands-on experience in developing and implementing comprehensive monitoring solutions that provide full visibility into the different platform and application components, using tools and services such as Kubernetes, Prometheus, Grafana, ECK/ELK, Dynatrace, Rigor, Quantum Metric, and similar tools.
  • Strong hands-on experience in identifying and troubleshooting availability and performance issues at multiple layers of deployment (infrastructure, operating environment, network, application, and integration systems) and in solving customer issues on production deployments.
  • Ability to evaluate performance trends and expected changes in demand and capacity, and to establish appropriate scalability plans.
  • Ability to evaluate production traffic patterns and tune the performance test workload mix and strategy to keep the systems and applications in continuous readiness mode.
  • Strong hands-on experience in developing and implementing comprehensive performance and security solutions using Akamai Performance Management & Security Solutions.
  • Strong hands-on experience with Kubernetes, AKS, and the Azure Cloud platform, including designing, implementing, and maintaining them through cost-efficient models.
  • Strong experience with containerization, certificate management, Kafka, ZooKeeper, vaults, pipeline automation, Fisheye, Crucible, and performance and QA test tool integrations.
  • Strong experience with cloud PaaS/IaaS environments on Azure.

Working Conditions

  • Normal office working conditions
  • Must be able to work some nights and weekends
  • Occasional travel required

Physical Requirements

  • Sitting
  • Standing (not walking)
  • Walking
  • Kneeling/Stooping/Bending
  • Reaching overhead
  • Lifting up to 20 pounds


This job description represents an overview of the responsibilities for the above-referenced position. It is not intended to represent a comprehensive list of responsibilities. A team member should perform all duties as assigned by his/her supervisor.




Our Mission and Values are more than just words on the wall - they’re the one constant in an ever-changing environment and the bedrock on which we build our culture. They're the core of who we are and the foundation of every decision we make. It’s not just what we do that sets us apart, but how we do it.



We believe in managing your time for business and personal success, which is why we empower our Team Members to lead balanced lives through our total rewards benefits offerings for full-time and eligible part-time TSC and Petsense Team Members. We care about what you care about!



A lot of care goes into providing legendary service at Tractor Supply Company, which is why our Team Members are our top priority. Want a career with a clear path for growth? Your Opportunity is Out Here at Tractor Supply and Petsense.


Nearest Major Market: Nashville