RPD Systems is seeking a Senior System Administrator / Site Reliability Engineer (SRE) specializing in Unix/Linux to ensure the reliability and performance of cloud services. This hybrid role focuses on automation, security, and operational excellence within the VA's Enterprise Cloud.
Senior System Administrator / Site Reliability Engineer (SRE) – Unix/Linux Only Local to Austin, TX
Exp: 13
Role: Senior System Administrator / Site Reliability Engineer
Focus: Reliability, Performance, Security, and Automation of Mission-Critical Cloud Services
Classification: Hybrid (Software Engineering, Systems Engineering, and Operations)

Position Overview
The Senior System Administrator / Site Reliability Engineer (SRE) is a critical role within the VA’s Enterprise Cloud, responsible for ensuring the resilience, performance, reliability, and compliance of cloud services supporting Veterans and VA stakeholders. This position bridges the gap between software development and infrastructure operations, delivering highly available, secure, and efficient cloud platforms in alignment with VA’s modernization strategy and stringent federal compliance mandates. The successful candidate will be a deep expert in Unix/Linux systems and proficient in cloud platforms, using engineering principles to solve operational problems and drive significant automation.

Reliability Engineering & Operations
• Observability & Monitoring: Proactively monitor system health, availability, and performance using industry-standard observability tools (e.g., Prometheus, Grafana, Datadog, Splunk).
• Incident Management: Respond to alerts and incidents, triage issues rapidly, and lead on-call rotations to ensure 24/7 uptime.
• Postmortems: Conduct and document Root Cause Analysis (RCA) and participate in blameless postmortems to prevent recurrence and contribute to knowledge-sharing.
• SLA Management: Establish and rigorously enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
• Performance Optimization: Identify and reduce sources of latency, bottlenecks, and single points of failure. Conduct load and stress testing to validate system performance under peak demand.

Automation and Infrastructure as Code (IaC)
• Automation: Automate manual operational tasks, including deployments, scaling, and configuration using tools such as Ansible, Terraform, or Puppet.
• IaC Management: Manage and maintain Infrastructure as Code (IaC) to ensure consistency, repeatability, and version control across all development and production environments.
• CI/CD Optimization: Optimize and maintain Continuous Integration/Continuous Deployment (CI/CD) pipelines for reliable, repeatable software delivery.
• Self-Healing Systems: Architect and build self-healing, fault-tolerant systems to minimize downtime and enhance system resilience.

System & Cloud Security
• Cloud Platforms: Must be proficient in cloud platforms (specifically AWS and/or Azure) with demonstrated experience in deploying and managing production workloads.
• Compliance & Patching: Ensure system compliance with all organizational and regulatory requirements. Perform timely patching of operating systems, containers, and dependencies to address security vulnerabilities.
• Security Best Practices: Implement robust access controls, utilize secrets management, and enforce the principle of least privilege.

Capacity Planning & Mentorship
• Resource Management: Monitor and analyze resource utilization (CPU, memory, storage, network) to accurately anticipate scaling needs and plan for growth.
• Cost Optimization: Proactively optimize cloud costs by rightsizing instances, implementing autoscaling, and leveraging reserved/spot instance strategies.
• Collaboration: Partner closely with software development teams to embed reliability, scalability, and fault tolerance into new services from the design phase.
• Mentorship: Mentor and guide team members on best practices for observability, automation, and effective incident handling.

Required Technical Qualifications
• Experience: Proven experience as a System Administrator or SRE on Unix/Linux based systems.
• Cloud: Proficiency in deploying and managing production workloads on AWS and/or Azure.
• Automation: Hands-on expertise with Infrastructure as Code (IaC) tools (e.g., Terraform) and configuration management (e.g., Ansible, Puppet).
• Observability: Direct experience with modern monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog, Splunk).
• Methodology: Strong understanding and application of SRE principles, including SLOs, error budgets, and blameless postmortems.

RNR IT Solutions, Inc. is seeking a DevOps / Site Reliability Engineer (SRE) in Dallas, Texas, to design and maintain CI/CD pipelines and manage cloud infrastructure. The ideal candidate will have extensive experience in DevOps practices and cloud technologies.
Taproot Solutions is seeking a Senior System Administrator / Site Reliability Engineer with expertise in Unix/Linux to ensure high availability and performance of critical systems. The role requires strong skills in cloud environments, automation, and DevOps practices.
Paradyme Management is seeking a DevOps/Site Reliability Engineer (SRE) with Secret Clearance to manage and optimize Kubernetes clusters and cloud infrastructure. The role involves collaboration across teams to ensure reliability and scalability of AI solutions.
The Systems Administrator II at Arandell is responsible for managing server technologies to ensure service availability and system performance. This on-site role requires strong analytical and communication skills, along with proficiency in various technologies.
B12 Consulting is seeking a Senior Linux/Unix Systems Architect/Administrator to design and manage enterprise-scale infrastructure solutions. The role involves hands-on administration of approximately 400 Linux servers in a mission-critical environment.