Role
You will lead the design and operation of Sesterce's AI cluster network fabrics — owning InfiniBand and RoCE architectures, RDMA performance, and fabric reliability for frontier-scale training and inference workloads.
What you will do
- Define the long-range architecture for AI fabric evolution across RoCE, NVIDIA InfiniBand, Clos and spine-leaf topologies, mesh interconnects, and emerging optical fabric designs; lead migration from 100G to 400G, 800G, and 1.6T
- Own congestion control strategy including PFC, ECN, DCQCN, queue management, path diversity, and routing policy for high-performance collective traffic
- Drive topology design decisions that improve NCCL all-reduce performance, collective completion times, tail latency stability, and fault containment
- Establish observability standards for fabric telemetry, queue behavior, packet loss, jitter, retry behavior, and end-to-end job impact
- Set cable plant strategy across fiber topology, optics qualification, and DAC/AOC standards
- Lead vendor engagement for switches, optics, NICs, and fabric management tooling
- Define upgrade and migration playbooks
- Mentor principal and staff engineers and serve as the final escalation point for fabric architecture decisions
What we are looking for
- Demonstrated experience in hyperscale networking, HPC fabrics, RDMA systems, or distributed systems networking at large scale (hundreds to hundreds of thousands of accelerators)
- Deep expertise in ECMP, adaptive routing, queueing theory, network telemetry pipelines, InfiniBand, and optical systems
- Proven track record designing and operating AI or HPC interconnects; strong working knowledge of network behavior under distributed training frameworks and collective communication libraries (NCCL, UCX)
- Experience leading major technology transitions across multiple hardware generations and mixed-vendor environments
- Ability to establish credibility with hardware vendors, datacenter teams, software platform leaders, and executive stakeholders