DevOps Engineer - AIOps

Endava

São Paulo

On-site

Full-time

1w ago

We are seeking a hands-on Site Reliability Engineer (SRE) / AI Platform DevOps Engineer to own infrastructure provisioning, CI/CD automation, telemetry pipelines, and production deployment for AI-powered services, agents, and orchestration systems.

This is an SRE-heavy, infrastructure-first role, focused on ensuring AI systems operating in production are:

Reliable
Observable
Scalable
Secure
Cost-efficient
Safe to deploy and operate

You will play a critical role in building and maintaining the platform foundation that enables AI services to run safely and efficiently at scale.

Key Responsibilities:

Infrastructure Provisioning & Automation:

Design and manage cloud infrastructure using Infrastructure as Code (Terraform or similar)
Provision and maintain Kubernetes clusters and supporting services
Automate environment setup across development, staging, and production
Manage networking, IAM, secrets, storage, and compute scaling
Ensure high availability, resilience, and disaster recovery readiness

CI/CD & Deployment Engineering:

Build and maintain CI/CD pipelines for AI services, agent frameworks, orchestrators, and model artifacts
Implement automated testing and reliability validation gates
Enable blue/green and canary deployments
Build safe rollback mechanisms for services and models
Integrate reliability and health checks into deployment workflows

Model & Agent Deployment Governance:

Package, version, and deploy models into containerized environments
Manage model artifact storage and promotion across environments
Monitor model performance and detect degradation
Support retraining cycle integration and model refresh workflows
Ensure safe rollout and rollback of model versions
Implement monitoring for inference latency, throughput, and cost

Data Pipelines for Telemetry & Observability:

Design and maintain data pipelines to ingest, clean, and process high-volume telemetry (logs, metrics, traces, events)
Enable structured telemetry for AI and orchestration workflows
Ensure reliability for real-time and batch processing
Optimize pipeline scalability and performance

AIOps Platform Integration:

Evaluate, deploy, and integrate AIOps platforms
Improve anomaly detection, correlation, and alert intelligence
Reduce alert noise and improve signal quality
Integrate AIOps outputs into operational workflows and incident management

Intelligent Incident Automation:

Automate incident detection and remediation workflows
Build self-healing scripts and intelligent runbooks
Reduce MTTD and MTTR through automation
Integrate AI-driven root cause analysis insights into operational tooling
Improve prevention of recurring incidents

Production Reliability & SRE Excellence:

Define and manage SLIs, SLOs, and error budgets
Implement monitoring, dashboards, and alerting systems
Participate in on-call rotation
Lead incident triage and root cause analysis
Improve resilience, scaling, and failure handling
Implement circuit breakers, rate limits, and failover mechanisms

Security & Governance:

Implement least-privilege access controls
Manage secrets and credential rotation
Enforce environment isolation
Ensure auditability and compliance for AI systems

Required Experience:

5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles
Strong hands-on experience with cloud platforms (AWS, Azure, or GCP)
Proven expertise with Kubernetes and containerized workloads
Experience with Infrastructure as Code (Terraform, CloudFormation, etc.)
Strong CI/CD implementation experience (GitHub Actions, GitLab CI, Jenkins, etc.)
Experience building observability stacks (Prometheus, Grafana, OpenTelemetry, ELK, Datadog, etc.)
Experience defining and managing SLIs/SLOs and error budgets
Hands-on experience with incident response and production support
Strong scripting skills (Python, Bash, or similar)

AI/ML Platform Experience (Strongly Preferred):

Experience deploying and managing AI/ML services in production
Familiarity with model packaging, versioning, and artifact management
Understanding of model lifecycle management and retraining workflows
Experience monitoring inference performance, latency, and cost
Exposure to AIOps tools and intelligent alerting systems

Additional Skills:

Strong understanding of distributed systems reliability patterns
Knowledge of security best practices in cloud-native environments
Experience implementing high-availability and disaster recovery strategies
Excellent problem-solving and root cause analysis skills
Strong communication skills and ability to collaborate across engineering and AI teams

Discover some of the global benefits that empower our people to become the best version of themselves:

Finance: Competitive salary package, share plan, company performance bonuses, value-based recognition awards, referral bonus;
Career Development: Career coaching, global career opportunities, non-linear career paths, internal development programmes for management and technical leadership;
Learning Opportunities: Complex projects, rotations, internal tech communities, training, certifications, coaching, online learning platform subscriptions, pass-it-on sessions, workshops, conferences;
Work-Life Balance: Hybrid work and flexible working hours, employee assistance programme;
Health: Global internal wellbeing programme, access to wellbeing apps;
Community: Global internal tech communities, hobby clubs and interest groups, inclusion and diversity programmes, events and celebrations.

At Endava, we’re committed to creating an open, inclusive, and respectful environment where everyone feels safe, valued, and empowered to be their best. We welcome applications from people of all backgrounds, experiences, and perspectives—because we know that inclusive teams help us deliver smarter, more innovative solutions for our customers. Hiring decisions are based on merit, skills, qualifications, and potential. If you need adjustments or support during the recruitment process, please let us know.

Technology is our how. And people are our why. For over two decades, we have been harnessing technology to drive meaningful change.

By combining world-class engineering, industry expertise and a people-centric mindset, we consult and partner with leading brands from various industries to create dynamic platforms and intelligent digital experiences that drive innovation and transform businesses.

From prototype to real-world impact - be part of a global shift by doing work that matters.

About Endava

Endava

A software development outsourcing company that creates dynamic platforms and intelligent digital experiences for businesses.

10,001+

Employees

London

Headquarters

$1.5B

Valuation

Reviews

4.1

28 reviews

Work Life Balance

4.0

Compensation

4.3

Culture

4.1

Career

4.0

Management

3.8

73%

Recommend to a Friend

Pros

Interesting projects and challenges

Opportunity for career growth

Competitive compensation and benefits

Cons

Some organizational bureaucracy

Room for improvement in processes

Work-life balance varies by team

Salary Ranges

91 data points

Junior/L3

Junior/L3 · Data Analyst

0 reports

$41,790

total / year

Base

Stock

Bonus

$35,522

$48,058

Interview Experience

1 interview

Difficulty

3.0

/ 5

Duration

14-28 weeks

Interview Process

First round interview (30 minutes)
