Aicadium is seeking a DevOps/SRE Engineer to operationalize and optimize AI-driven applications and infrastructure. This role requires collaboration with cross-functional teams to ensure high availability and performance of AI products.
As a Production Operations Engineer, you will play a critical role in operationalizing, maintaining, scaling, and optimizing our AI-driven applications and supporting infrastructure. With a blend of software development and infrastructure skills, you will work closely with cross-functional teams, including software engineers, data scientists, and platform engineers, to ensure the delivery and operation of highly available, low-latency, and optimally performing AI products. Your expertise will be crucial in developing solutions, automating processes, monitoring system health, troubleshooting, and managing incidents to ensure our products deliver a seamless experience for our clients.

Key Responsibilities:
• Software Development and Operations:
  • Collaborate with Software Engineers to design, implement, and maintain scalable, efficient, and secure systems using a React, Python, Docker, and Kubernetes stack.
  • Optimize application performance by profiling and tuning frontend and backend services for speed, scalability, and resilience.
• System Monitoring & Maintenance:
  • Monitor production systems and services, ensuring optimal uptime and performance.
  • Implement monitoring tools and dashboards for proactive incident detection.
• Infrastructure Automation:
  • Automate repetitive tasks, deployment processes, and infrastructure provisioning using tools such as Ansible, Terraform, or similar.
  • Develop and maintain CI/CD pipelines to facilitate smooth deployments.
• Incident Management & Troubleshooting:
  • Respond to system incidents, troubleshoot issues, and work towards timely resolutions.
  • Conduct root cause analysis (RCA) of system failures and develop strategies to prevent future incidents.
• Performance Optimization:
  • Optimize AI model deployment and data pipelines for speed, efficiency, and cost-effectiveness.
  • Collaborate with data scientists and engineers to ensure AI systems are running efficiently in production environments.
• Scalability & Reliability:
  • Design and implement scalable infrastructure solutions for AI applications.
  • Ensure system reliability, fault tolerance, and high availability through effective architecture and best practices.
• Security & Compliance:
  • Work with security teams to ensure all systems are compliant with company security protocols and industry standards.
  • Implement security best practices across production environments.

Required Qualifications:
• Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
• 5 years of experience in a combination of software development, production operations, DevOps, infrastructure engineering, or security roles.
• Strong application development experience in JavaScript and Python, particularly with RESTful API design and development using a service-oriented architecture.
• Strong experience with cloud platforms (AWS, GCP, or Azure).
• Proficiency in containerization and container orchestration technologies (e.g., Docker, Kubernetes).
• Solid understanding of CI/CD pipelines and automation tools (GitHub Actions, ArgoCD, Jenkins, GitLab CI, etc.).
• Experience with infrastructure as code (Terraform, Ansible, etc.).
• Hands-on experience with monitoring and logging tools (Datadog, Prometheus, Grafana, ELK stack, etc.).
• Strong experience with Bash and similar scripting languages.
• Solid understanding of frontend frameworks, particularly React, and their interaction with backend services.
• Strong problem-solving skills and attention to detail.
• Experience in handling large-scale distributed systems.

Preferred Qualifications:
• Proficiency in working with NoSQL databases (MongoDB) and understanding of document-based data models.
• Knowledge of object storage systems and experience with S3-compatible APIs (MinIO) for storing and managing large-scale unstructured data.
• Experience in supporting AI/ML pipelines and production systems.
• Knowledge of data engineering and distributed data systems (e.g., Kafka, Spark, Hadoop).
• Understanding of GPU-based infrastructure for AI workloads.
• Familiarity with security best practices in cloud and AI environments.
• Flexible work environment and culture that promotes work-life balance.

About Us:
Aicadium is a global technology company delivering AI-powered industrial computer vision products into the hands of enterprises. With offices in Singapore and San Diego, California, and an international team of data scientists, engineers, and business strategists, Aicadium is operationalizing AI within organizations where advanced machine learning innovations were previously out of reach.

Team:
Join a growing team of data scientists, machine learning engineers, and software engineers in an agile development environment. Work together with some of the best in the field to tackle challenging projects and operationalize the solutions you develop across a variety of industries and use cases.

Culture:
We work in a casual and collaborative startup environment. Every member of the team plays a key role in shaping the solutions we develop and creating positive business value for the companies we work with. We are building a hub of the best talent in San Diego, CA, but we are open to working with people all over the U.S.

Benefits:
Aicadium offers a great benefits package in addition to your salary. Benefits include PTO, health insurance, vision and dental insurance, life and AD&D insurance, 401k with matching, and more!