Senior System Administrator / Site Reliability Engineer (SRE) – Unix/Linux

Job Type: Full-time

Skills Required: Azure

Location: Austin, Texas

Salary: $100,000 - $130,000

Date Posted: October 31, 2025

RPD Systems

RPD Systems is seeking a Senior System Administrator / Site Reliability Engineer (SRE) specializing in Unix/Linux to ensure the reliability and performance of cloud services. This hybrid role focuses on automation, security, and operational excellence within the VA's Enterprise Cloud.

Job description

Senior System Administrator / Site Reliability Engineer (SRE) – Unix/Linux

Only Local to Austin, TX

Exp: 13

Role: Senior System Administrator / Site Reliability Engineer

Focus: Reliability, Performance, Security, and Automation of Mission-Critical Cloud Services

Classification: Hybrid (Software Engineering, Systems Engineering, and Operations)

Position Overview

The Senior System Administrator / Site Reliability Engineer (SRE) is a critical role within the VA’s Enterprise Cloud, responsible for ensuring the resilience, performance, reliability, and compliance of cloud services supporting Veterans and VA stakeholders. This position bridges the gap between software development and infrastructure operations, delivering highly available, secure, and efficient cloud platforms in alignment with VA’s modernization strategy and stringent federal compliance mandates. The successful candidate will be a deep expert in Unix/Linux systems and proficient in cloud platforms, using engineering principles to solve operational problems and drive significant automation.

Reliability Engineering & Operations

• Observability & Monitoring: Proactively monitor system health, availability, and performance using industry-standard observability tools (e.g., Prometheus, Grafana, Datadog, Splunk).

• Incident Management: Respond to alerts and incidents, triage issues rapidly, and lead on-call rotations to ensure 24/7 uptime.

• Postmortems: Conduct and document Root Cause Analysis (RCA) and participate in blameless postmortems to prevent recurrence and contribute to knowledge-sharing.

• SLA Management: Establish and rigorously enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).

• Performance Optimization: Identify and reduce sources of latency, bottlenecks, and single points of failure. Conduct load and stress testing to validate system performance under peak demand.

Automation and Infrastructure as Code (IaC)

• Automation: Automate manual operational tasks, including deployments, scaling, and configuration using tools such as Ansible, Terraform, or Puppet.

• IaC Management: Manage and maintain Infrastructure as Code (IaC) to ensure consistency, repeatability, and version control across all development and production environments.

• CI/CD Optimization: Optimize and maintain Continuous Integration/Continuous Deployment (CI/CD) pipelines for reliable, repeatable software delivery.

• Self-Healing Systems: Architect and build self-healing, fault-tolerant systems to minimize downtime and enhance system resilience.

System & Cloud Security

• Cloud Platforms: Must be proficient in cloud platforms (specifically AWS and/or Azure) with demonstrated experience in deploying and managing production workloads.

• Compliance & Patching: Ensure system compliance with all organizational and regulatory requirements. Perform timely patching of operating systems, containers, and dependencies to address security vulnerabilities.

• Security Best Practices: Implement robust access controls, utilize secrets management, and enforce the principle of least privilege.

Capacity Planning & Mentorship

• Resource Management: Monitor and analyze resource utilization (CPU, memory, storage, network) to accurately anticipate scaling needs and plan for growth.

• Cost Optimization: Proactively optimize cloud costs by rightsizing instances, implementing autoscaling, and leveraging reserved/spot instance strategies.

• Collaboration: Partner closely with software development teams to embed reliability, scalability, and fault tolerance into new services from the design phase.

• Mentorship: Mentor and guide team members on best practices for observability, automation, and effective incident handling.

Required Technical Qualifications

• Experience: Proven experience as a System Administrator or SRE on Unix/Linux based systems.

• Cloud: Proficiency in deploying and managing production workloads on AWS and/or Azure.

• Automation: Hands-on expertise with Infrastructure as Code (IaC) tools (e.g., Terraform) and configuration management (e.g., Ansible, Puppet).

• Observability: Direct experience with modern monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog, Splunk).

• Methodology: Strong understanding and application of SRE principles, including SLOs, error budgets, and blameless postmortems.
