Senior Cloud Architect

Wipro

Mountain View, United States

On-site

Full-time

2w ago

Job Description

Job Description: Cloud Architect (GPU/TPU Infrastructure)

Location: Mountain View, CA

Experience Level: 10–15+ Years

Engineering Function: Cloud Infrastructure / AI & Data Engineering

Role Objective:

You will be the lead architect responsible for designing scalable, high-performance cloud infrastructure optimized for AI/ML workloads. Your goal is to architect environments that maximize the compute efficiency of NVIDIA H100/B200 (GPUs) and Google Cloud TPUs, ensuring low-latency communication and high-throughput data pipelines for enterprise-scale AI.

Key Responsibilities:

Cluster Design: Architect and deploy large-scale GPU/TPU clusters using Kubernetes (GKE/EKS) or specialized orchestrators like Slurm.
High-Performance Networking: Design the interconnect fabric (e.g., InfiniBand, RoCE v2, or Google’s ICI) to prevent "communication bottlenecks" during distributed training.
Storage Optimization: Implement high-speed data solutions (e.g., Lustre, Weka, or GPFS) to feed massive datasets to accelerators without starving the processors.
Cost & Capacity Orchestration: Balance performance vs. cost by implementing "Spot" instance strategies, autoscaling, and resource quotas to prevent $100k+ overruns.
Framework Integration: Optimize the infrastructure for AI frameworks like PyTorch, JAX, and TensorFlow, ensuring proper driver/library (CUDA, cuDNN) compatibility.
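
The cost-versus-performance trade-off named under Cost & Capacity Orchestration can be sketched as a pre-launch budget check. This is only an illustration under stated assumptions: the `JobPlan` fields, the hourly rate, the Spot discount, and the preemption overhead are hypothetical values chosen for the arithmetic, not real cloud prices or any provider's API.

```python
from dataclasses import dataclass


@dataclass
class JobPlan:
    """Hypothetical description of one training job; every field is illustrative."""
    gpus: int                   # number of accelerators reserved
    hours: float                # expected wall-clock hours without interruptions
    on_demand_rate: float       # USD per GPU-hour at on-demand pricing (assumed)
    spot_discount: float        # fraction saved on Spot, e.g. 0.5 means 50% cheaper
    preemption_overhead: float  # extra runtime fraction lost to restarts on Spot


def projected_cost(plan: JobPlan, use_spot: bool) -> float:
    """Return the projected cost in USD for running the plan."""
    if not use_spot:
        return plan.gpus * plan.hours * plan.on_demand_rate
    # Spot is cheaper per hour but runs longer because of preemptions.
    spot_rate = plan.on_demand_rate * (1.0 - plan.spot_discount)
    spot_hours = plan.hours * (1.0 + plan.preemption_overhead)
    return plan.gpus * spot_hours * spot_rate


def within_budget(plan: JobPlan, use_spot: bool, budget_usd: float) -> bool:
    """Admission check: reject the job before launch if it would exceed the budget."""
    return projected_cost(plan, use_spot) <= budget_usd


# Assumed numbers: 64 GPUs for 100 hours at 10 USD per GPU-hour.
plan = JobPlan(gpus=64, hours=100.0, on_demand_rate=10.0,
               spot_discount=0.5, preemption_overhead=0.2)
print("on-demand USD:", projected_cost(plan, use_spot=False))
print("spot USD:", projected_cost(plan, use_spot=True))
print("on-demand fits 50k budget:", within_budget(plan, False, 50_000))
```

A check of this kind would typically sit alongside resource quotas, so that an over-budget job is refused at submission rather than discovered on the invoice.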

Technical Requirements & Skills:

Compute: Expertise in NVIDIA HGX/DGX architectures and Google TPU v5p/Trillium pods.
Orchestration: Mastery of Kubernetes (specifically Device Plugins for GPUs) and Terraform/Ansible for "Infrastructure as Code."
Networking: Deep understanding of RDMA (Remote Direct Memory Access) and non-blocking Clos topologies.
AI Workloads: Familiarity with Distributed Training techniques (Data Parallelism, Model Parallelism, Pipeline Parallelism).
Cloud Platforms: Professional Certifications in GCP with a focus on high-performance compute (HPC) instances.

Experience Screening:

Distributed Training at Scale: Proven experience managing jobs across 128+ GPUs or multiple TPU pods.
Telemetry & Monitoring: Experience setting up Prometheus/Grafana dashboards specifically for GPU metrics (utilization, memory bandwidth, thermal throttling).
Security: Implementing "Confidential Computing" and secure data enclaves for sensitive AI training data.

Technical Interview Scorecard: GPU/TPU Cloud Architect

Compute & Accelerator Architecture:

Focus: Understanding the "metal" and how the OS interacts with it.

The Question: "Explain the architectural difference between an NVIDIA H100 GPU and a Google TPU v5p. In what scenarios would you recommend one over the other for a client?"
What to look for: Mentions of HBM3 (High Bandwidth Memory), systolic arrays (TPU) vs. Streaming Multiprocessors (GPU), and the difference between CUDA (vendor-locked) and JAX/XLA (portable/optimized for TPUs).
Red Flag: Treating a GPU/TPU like a standard CPU instance that just "runs faster."

Distributed Training & Interconnects:

Focus: Networking is almost always the bottleneck in AI.

The Question: "A client’s LLM training job is showing high 'GPU Wait' times during the All-Reduce step. How do you diagnose and fix this at the infrastructure level?"
What to look for: Discussion of RDMA (Remote Direct Memory Access), InfiniBand vs. RoCE v2, and ensuring a non-blocking Clos topology. They should mention checking for "noisy neighbors" on the network or incorrect NIC-to-GPU mapping.
Red Flag: Suggesting more RAM or a faster CPU; these rarely fix inter-node communication lag.
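
Why link bandwidth dominates the All-Reduce step can be illustrated with the usual bandwidth-only cost model for a ring all-reduce, in which each worker moves roughly 2(N-1)/N times the payload over its link. The sketch below is a back-of-the-envelope estimate, not a measurement: it ignores latency, congestion, and the fact that a library such as NCCL may choose a different algorithm, and the model size and link speeds are assumed values.

```python
def ring_allreduce_seconds(payload_bytes: float, workers: int,
                           link_bytes_per_s: float) -> float:
    """Estimate ring all-reduce time from per-worker traffic and link bandwidth.

    Bandwidth-only model: reduce-scatter plus all-gather each move
    (workers - 1) / workers of the payload through every worker's link.
    """
    if workers < 2:
        return 0.0  # a single worker has nothing to exchange
    traffic_bytes = 2.0 * (workers - 1) / workers * payload_bytes
    return traffic_bytes / link_bytes_per_s


# Assumed workload: 7e9 gradients stored in 2 bytes each, across 64 workers.
gradient_bytes = 7e9 * 2

# Assumed link speeds: 400 Gb/s and 100 Gb/s, converted to bytes per second.
fast_link = 400e9 / 8
slow_link = 100e9 / 8

print("400 Gb/s:", ring_allreduce_seconds(gradient_bytes, 64, fast_link), "s")
print("100 Gb/s:", ring_allreduce_seconds(gradient_bytes, 64, slow_link), "s")
```

In this model the estimate scales inversely with link bandwidth and does not depend on host RAM or CPU speed, which is consistent with the red flag above.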

Orchestration & Scheduling:

Focus: Kubernetes is the standard, but it wasn't built for AI.

The Question: "How do you handle 'Gang Scheduling' in a Kubernetes environment for a job that requires 64 GPUs across 8 nodes?"
What to look for: Familiarity with tools like Kueue, Volcano, or Slurm. They should explain that in AI, all pods must start simultaneously; if one node fails to spin up, the entire job must wait or fail to avoid wasting compute.
Red Flag: Assuming standard K8s Horizontal Pod Autoscaling (HPA) works for deep learning jobs.
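
The all-or-nothing behaviour behind gang scheduling can be sketched in a few lines. This is a toy placement function written for illustration, not the API of Kueue, Volcano, or Slurm: the function name `gang_place`, the node names, and the free-GPU counts are all hypothetical.

```python
from typing import Dict, Optional


def gang_place(free_gpus: Dict[str, int], pods: int,
               gpus_per_pod: int) -> Optional[Dict[str, int]]:
    """Return {node: pods_assigned} only if every pod fits; otherwise None.

    No partial placement is ever returned, so the job never holds some
    GPUs idle while waiting for the rest.
    """
    placement: Dict[str, int] = {}
    remaining = pods
    for node, free in sorted(free_gpus.items()):
        if remaining == 0:
            break
        # How many whole pods this node can host with its free GPUs.
        fit = min(free // gpus_per_pod, remaining)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
    return placement if remaining == 0 else None


# The scenario from the question: 64 GPUs as 8 pods of 8 GPUs on 8 nodes.
cluster = {f"node-{i}": 8 for i in range(8)}
print(gang_place(cluster, pods=8, gpus_per_pod=8))

# One node loses half of its GPUs: the whole job is held back.
cluster["node-3"] = 4
print(gang_place(cluster, pods=8, gpus_per_pod=8))
```

A real scheduler would also reserve the chosen nodes atomically and handle queueing and preemption; the sketch only shows the admission decision.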

Storage & Data I/O:

Focus: Feeding the beast.

The Question: "An H100 can process data at massive speeds. How do you design the storage layer to ensure the GPU isn't 'starving' for data?"
What to look for: Knowledge of GPUDirect Storage (GDS), parallel file systems like Lustre or WekaIO, and the use of local NVMe SSDs for caching intermediate checkpoints.
Red Flag: Suggesting standard S3/Object storage for direct training without a caching or high-speed middle layer.
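
Whether a storage layer "starves" the accelerators comes down to simple arithmetic: the aggregate read rate the GPUs demand versus what the storage tier delivers. The sketch below assumes hypothetical per-GPU sample rates, sample sizes, and storage throughput figures purely to show the calculation.

```python
def required_read_gb_per_s(gpus: int, samples_per_s_per_gpu: float,
                           bytes_per_sample: float) -> float:
    """Aggregate read throughput (GB/s, 1 GB = 1e9 bytes) needed to keep all GPUs fed."""
    return gpus * samples_per_s_per_gpu * bytes_per_sample / 1e9


def is_starved(required_gb_per_s: float, delivered_gb_per_s: float) -> bool:
    """True when the storage tier cannot deliver what the accelerators consume."""
    return delivered_gb_per_s < required_gb_per_s


# Assumed workload: 64 GPUs, 2000 samples/s per GPU, 150 kB per sample.
need = required_read_gb_per_s(gpus=64, samples_per_s_per_gpu=2000.0,
                              bytes_per_sample=150_000.0)
print("required GB/s:", need)

# Assumed tiers: an uncached object store at 5 GB/s, a parallel FS at 40 GB/s.
print("object store starves GPUs:", is_starved(need, 5.0))
print("parallel FS starves GPUs:", is_starved(need, 40.0))
```

The same comparison can be repeated for checkpoint writes, where local NVMe caching changes the delivered figure rather than the required one.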

Scoring Rubric

Novice (score 1-2): Understands Cloud (EC2/S3) but treats GPUs as "black boxes." No RDMA knowledge.
Intermediate: Can set up a GPU node and run a container; understands CUDA versions.
Advanced: Understands multi-node scaling, InfiniBand, and the impact of the software stack (NCCL/RCCL).
Expert: Can design a 1024-GPU "AI Factory" from scratch, including power, cooling, and high-speed fabric.

About Wipro

Wipro

Public

A technology services and consulting company focused on building solutions that address clients' digital transformation needs.

Employees: 10,001+
Headquarters: Bengaluru
Valuation: $8.5B

Reviews

3.1 (10 reviews)

Work-life balance: 3.5
Compensation: 2.3
Culture: 3.8
Career: 2.5
Management: 2.2
Would recommend to a friend: 45%

Pros

Good training and learning opportunities

Flexible work hours and remote options

Supportive colleagues and teamwork

Cons

Low and uncompetitive compensation

Limited growth and career advancement opportunities

Poor management direction and support

Salary range

41,395 data points

Mid/L4

Mid/L4 · Analyst - Business Process L2

1 report

$128,283

Total annual compensation

Base salary

$111,550

Stock

Bonus

$128,283

Interview experience

5 interviews

Difficulty: 2.0 / 5
Duration: 14-28 weeks
Offer rate: 40%

Experience

Positive 100%
Neutral 0%
Negative 0%

Interview process

Application Review

Online Assessment/Aptitude Test

Technical Interview

HR Interview

Offer

Common questions

Coding/Algorithm

Technical Knowledge

Behavioral/STAR

Past Experience

Culture Fit
