Engineering | San Francisco/Remote | Full-time

Cluster Network Engineering Lead

Lead the design and operation of the network fabrics that make Sesterce's AI clusters run at full speed.

Role

You will lead the design and operation of Sesterce's AI cluster network fabrics — owning InfiniBand and RoCE architectures, RDMA performance, and fabric reliability for frontier-scale training and inference workloads.

What you will do

Define the long-range architecture for AI fabric evolution across RoCE, NVIDIA InfiniBand, Clos and spine-leaf topologies, mesh interconnects, and emerging optical fabric designs; lead migration from 100G to 400G, 800G, and 1.6T
Own congestion control strategy including PFC, ECN, DCQCN, queue management, path diversity, and routing policy for high-performance collective traffic
Drive topology design decisions that improve NCCL all-reduce performance, collective completion times, tail latency stability, and fault containment
Establish observability standards for fabric telemetry, queue behavior, packet loss, jitter, retry behavior, and end-to-end job impact; set cable plant strategy across fiber topology, optics qualification, and DAC/AOC standards
Lead vendor engagement for switches, optics, NICs, and fabric management tooling; define upgrade and migration playbooks; mentor principal and staff engineers and serve as the final escalation point for fabric architecture decisions

What we are looking for

Demonstrated experience in hyperscale networking, HPC fabrics, RDMA systems, or distributed systems networking at large scale (hundreds to hundreds of thousands of accelerators)
Deep expertise in ECMP, adaptive routing, queueing theory, network telemetry pipelines, InfiniBand, and optical systems
Proven track record designing and operating AI or HPC interconnects; strong working knowledge of network behavior under distributed training frameworks and collective communication libraries (NCCL, UCX)
Experience leading major technology transitions across multiple hardware generations and mixed-vendor environments
Ability to establish credibility with hardware vendors, datacenter teams, software platform leaders, and executive stakeholders