Senior DevOps/SRE Engineer

Job Type: Full-time

Skills Required: Python, GitHub, CI/CD, +3 more

Location: Ellicott City, Maryland

Salary: No salary information was found.

Date Posted: October 17, 2025

VITG

VITG is seeking a Senior DevOps/SRE Engineer to ensure the reliability and performance of enterprise services across cloud and on-prem environments. The role involves implementing SRE principles, developing monitoring solutions, and collaborating with cross-functional teams.

Job description

We are seeking a skilled mid-level Senior DevOps Site Reliability Engineer (SRE) to ensure the reliability, availability, and performance of enterprise services hosted across Cloud Service Providers (CSPs) and on-prem data centers. The SRE is responsible for the practical implementation of Site Reliability Engineering (SRE) principles through best practices, operations, and monitoring. Speed and stability are carefully balanced, and the SRE team acts as versatile problem solvers, filling gaps in knowledge and expertise to ensure efficient software operations. If you are a proactive problem solver with a passion for continuous learning and innovation, join us as we endeavor to increase the dynamism and efficacy of our DevOps practices.

Applicant Requirements:
• Must be a US citizen or must be authorized to work in the United States.
• Must have lived in the USA for three (3) of the last five (5) years.
• Must be able to obtain a US federal government badge and be eligible for Public Trust clearance.
• Must be able to pass a VITG background check, including a drug test.

We’re looking for candidates who:
• Demonstrate hands-on expertise in SRE principles, with a strong understanding of maintaining quality and stability of enterprise services in a continuous development environment.
• Possess experience designing and developing solutions using various AWS services.
• Possess experience in developing scripts in Shell/Bash and Python and deploying them as Step/Lambda functions.
• Possess experience working with, monitoring, and administering observability tools like Splunk, Datadog, and New Relic.
• Possess extensive knowledge in troubleshooting issues while leveraging monitoring tools like Splunk, Datadog, New Relic, AWS services, etc.
• Possess skill in analyzing, identifying, and documenting root causes.
• Possess a strong technical background and are able to provide clear explanations of technical concepts verbally and in writing.
• Demonstrate ability and passion to learn new technologies quickly and perform Proofs of Concept (POCs) based on project needs.
• Apply strong problem-solving skills in monitoring system performance, troubleshooting issues, crisis management, etc.
• Produce high-quality work independently and collaboratively.
• Excel in a fast-paced environment.
• Demonstrate effective communication and collaboration, and are team players.

Job Responsibilities:
• Design and develop monitoring solutions leveraging approved AWS services using Infrastructure as Code (IaC) tools.
• Develop and maintain CI/CD pipelines using GitHub and Jenkins.
• Develop serverless functions and scripts using Python, curl, and/or Bash.
• Leverage observability best practices to proactively identify potential software issues and implement preventive measures to minimize the potential for system incidents and outages.
• Set and monitor critical metrics to gain insights into system reliability, including latency, traffic, errors, and saturation levels.
• Learn and adopt new technologies to perform POCs (Proofs of Concept) based on project needs.
• Provide guidance, training, and support for external development teams to manage their infrastructure independently.
• Develop, publish, and maintain all required documentation in the repository and ticketing system (i.e., Confluence and Jira).
• Respond quickly and effectively to critical incidents, and conduct post-incident reviews to identify root causes and implement preventive measures.
• Collaborate effectively with cross-functional teams and communicate SRE concepts and recommendations clearly to both technical and non-technical stakeholders.
• Participate in reliability-based release management processes.
• Plan, participate in, and manage on-call rotations to ensure prompt response to reported performance and reliability issues.
• Attend ongoing and ad hoc meetings with internal and external stakeholders.
• Stay up-to-date with the latest industry trends, technologies, and best practices related to SRE, DevOps, and infrastructure management.

Our Tech Stack (Must have):
• CI/CD: GitHub, CI/CD, Jenkins, Terraform, CloudFormation, Containers, Docker
• Cloud Infrastructure: AWS, Azure
• Monitoring & Alerting: Datadog, AWS CloudWatch (including canaries and X-Ray), Splunk (Enterprise, ITSI and On-Call), New Relic
• OS: Windows servers, Amazon Linux, Red Hat, Citrix VDI

Certifications:
• AWS Certified SysOps/DevOps Associate or equivalent AWS certification (Required)
• Splunk Core Certified Certification (Strongly Preferred)
• Datadog Certification (Strongly Preferred)

Job Type: Full Time (No 1099 or C2C)

Salary: BOE

Benefits:
• 401(k) with employer contribution
• Medical/Dental/Vision insurance (option for full coverage for employee)
• Life, ST/LT insurance
• Professional development opportunities
• Company-paid holidays and paid vacation (PTO)

Schedule:
• 8 hour shift during core business hours
• May include minimal after hours support depending on on-call schedule

Work Type:
• Currently hybrid remote in Ellicott City, MD 21043
• Minimum 2 days in office weekly

Requirements

• Must be a US citizen or must be authorized to work in the United States

• Must have lived in the USA for three (3) of the last five (5) years

• Must be able to obtain a US federal government badge and be eligible for Public Trust clearance

• Must be able to pass a VITG background check, including a drug test

• Demonstrate hands-on expertise in SRE principles, with a strong understanding of maintaining quality and stability of enterprise services in a continuous development environment

• Must possess experience designing and developing solutions using various AWS services

• Must possess experience in developing scripts in Shell/Bash and Python and deploying them as Step/Lambda functions

• Must possess experience working with, monitoring, and administering observability tools like Splunk, Datadog, and New Relic

• Possess extensive knowledge in troubleshooting issues while leveraging monitoring tools like Splunk, Datadog, New Relic, AWS services, etc.

• Possess skill in analyzing, identifying, and documenting root causes

• Possess a strong technical background and be able to provide clear explanations of technical concepts verbally and in writing

• Demonstrate ability and passion to learn new technologies quickly and perform Proofs of Concept (POCs) based on project needs

• Apply strong problem-solving skills in monitoring system performance, troubleshooting issues, crisis management, etc.

• Produce high-quality work independently and collaboratively

• Excel in a fast-paced environment

• Demonstrate effective communication and collaboration, and be a team player

• CI/CD: GitHub, CI/CD, Jenkins, Terraform, CloudFormation, Containers, Docker

• Cloud Infrastructure: AWS, Azure

• OS: Windows servers, Amazon Linux, Red Hat, Citrix VDI

• AWS Certified SysOps/DevOps Associate or equivalent AWS certification (Required)

• Job Type: Full Time (No 1099 or C2C)
