US Infrastructure & Operations Technical Lead

Work from home · Full-time role · Hiring

Role Overview

We are seeking a US Infrastructure Operations Technical Lead to drive the operational excellence, technical leadership, and growth of Radiant’s US Infrastructure Operations function. This is a hands-on player-manager role designed for an infrastructure-focused engineering leader with a strong Site Reliability Engineering mindset and deep understanding of large-scale distributed infrastructure environments.

Working closely with the UK Infrastructure Operations Manager during overlapping morning hours (US Eastern Time), you will help coordinate cross-regional operations, strategic planning, incident management, and infrastructure delivery across Radiant’s global AI and HPC platform. During US business hours, you will lead and mentor the local Infrastructure Operations team, currently consisting of three engineers, while helping scale operational maturity and team capability as the business continues to grow.

The ideal candidate will come from a hyperscale, HPC, or large-scale cloud-native SaaS infrastructure background, with experience operating complex distributed systems at scale. This role requires breadth across datacentre compute, Linux systems, networking, and storage fabrics, with the ability to troubleshoot and lead continuous improvement of our infrastructure. You should be comfortable operating and troubleshooting bare-metal environments, low-latency networking, storage protocols, and core infrastructure technologies underpinning high-performance AI and GPU compute platforms.

This role requires strong operational leadership capabilities, including experience running small engineering teams, participating in ITIL-aligned operational processes, and supporting high-availability production environments through structured incident, change, and problem management practices. You will also participate in an on-call rota to lead major incidents, orchestrating technical resources to quickly resolve large-scale issues.

As Radiant expands its global footprint, your operational leadership, technical expertise, and ability to build high-performing teams will play a critical role in shaping the future of our US infrastructure operations.

What’s in It for You

Join a globally distributed engineering organisation operating cutting-edge GPU, AI, and high-performance compute infrastructure at scale. As the US Infrastructure Operations Technical Lead, you’ll work hands-on with advanced compute and networking technologies powering large-scale AI and machine learning workloads.

This is an opportunity to operate at the forefront of modern infrastructure engineering, helping shape operational standards, automation practices, and reliability engineering across a rapidly scaling global platform. You’ll collaborate with highly skilled engineers across Infrastructure Operations, HPC SRE, Networking, and Platform Engineering teams in an environment that values technical excellence, ownership, and continuous improvement. We move quickly, solve meaningful infrastructure challenges, and provide engineers with the opportunity to influence how next-generation AI infrastructure is designed, operated, and scaled globally.

You can also expect:

- Exposure to industry-leading GPU and AI infrastructure
- Opportunities to help build and scale a growing US operations function
- A collaborative, inclusive, and globally connected engineering culture
- Real ownership and influence across operational strategy and execution
- Work at the intersection of reliability, automation, performance, and scale
- A flexible remote-first working environment with ambitious growth plans

Key Responsibilities

Leadership & Operational Ownership

- Lead a small but high-impact US Infrastructure Operations team, owning both people leadership and technical execution
- Ensure 99.9%+ platform uptime across US-region services
- Act as the senior US operational owner for production infrastructure, accountable for reliability, incident outcomes, and day-to-day operational execution
- Partner tightly with the UK Infrastructure Operations Manager to align priorities, respond to incidents, and execute global infrastructure plans in real time
- Own US-side incident leadership, driving fast and effective resolution of production-impacting infrastructure issues
- Build and reinforce a strong ownership culture built on do, document, automate
- Ensure operational knowledge is captured and shared through lightweight, high-signal documentation rather than process overhead
- Hire, onboard, and develop Infrastructure Operations engineers as the team scales
- Run direct 1:1s and performance conversations focused on raising the technical bar and operational effectiveness
- Ensure disciplined execution of core operational processes (incident, change, problem management) without slowing delivery
- Participate in the on-call rotation and lead from the front during major incidents
- Travel within the US and Europe as required to support infrastructure deployments, data centre work, and cross-regional collaboration (UK-headquartered company)
- Help define how Infrastructure Operations scales globally as the company grows

Technical Day-to-Day

- Stay hands-on and close to the systems while leading the team; this is not a purely managerial role
- Take ownership of real infrastructure problems and actively contribute to debugging, fixing, and improving production systems
- Work across production infrastructure spanning compute, storage, networking, and platform services
- Assist with resolution of deep infrastructure issues across the stack, including:
  - Linux systems (performance, stability, kernel behaviour, resource contention)
  - Networking (routing, switching, DNS, TCP/IP, latency, packet-level troubleshooting)
  - Storage systems (distributed storage performance, consistency, and failure modes)
  - Bare-metal infrastructure (hardware issues, firmware, lifecycle and deployment failures)
- Operate and improve large-scale Linux environments across on-prem, private cloud, and hybrid infrastructure
- Take ownership of infrastructure reliability through automation, configuration management, and system hardening
- Build and improve Infrastructure as Code workflows (Terraform, Ansible or equivalent)
- Drive observability as a first-class requirement: metrics, logs, traces, and actionable alerting
- Lead or directly participate in major incident response, helping drive technical resolution under pressure
- Act as a senior technical problem solver across Infrastructure Ops, Networking, Platform, and SRE teams
- Identify repetitive operational work and eliminate it through automation and system improvements
- Contribute directly to scaling decisions, capacity planning, and reliability improvements
- Participate in the on-call rotation with real escalation authority and accountability

Essential Skills & Experience

- 8+ years of infrastructure engineering, SRE, platform ops, or large-scale production infrastructure experience
- 2+ years in technical leadership or engineering management with direct reports
- Strong experience operating production infrastructure at scale (on-prem, private cloud, or hybrid)
- Deep Linux expertise: performance tuning, debugging, kernel/system behaviour, production troubleshooting
- Hands-on across the stack: bare metal → OS → network → storage → platform
- Strong infrastructure fundamentals: compute, storage, networking in real-world production environments
- Incident-heavy environment experience (24x7 ops, on-call, major incident response, postmortems)
- Strong networking: TCP/IP, routing, switching, DNS, latency, packet-level debugging
- Bare-metal operations experience (Redfish, IPMI, lifecycle management, hardware troubleshooting)
- Strong automation and config management (Ansible preferred)
- Strong scripting (Python, Bash or similar)
- Strong ownership mindset: simplifies, automates, removes operational toil

Highly desirable:

- Experience in HPC, AI/ML, GPU compute, or large-scale high-performance infrastructure environments
- Distributed / parallel storage experience (Lustre, WEKA, or equivalent)
- InfiniBand or other high-performance, low-latency networking experience

Preferred Qualifications

- Bachelor’s or Master’s degree in Computer Science, Information Technology, Engineering, or a related field, or equivalent experience
- NVIDIA NCP-type qualifications
- PMP, ITIL, or equivalent project/operations management certification
- LPI or equivalent Linux certifications

Why should you join us?

What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive. Here are just some of the great things you can expect from us:

- 15 days of annual leave: we value your peace of mind. With 20 days off (excluding public holidays) and access to mental health resources, we make sure you're as strong mentally as you are professionally.
- A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
- Open communication, regular feedback: we value smooth collaboration and direct, actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
- Learning Time: we all have dedicated learning time to focus on new skills, projects or interests that lie outside our day-to-day jobs.
- Health & Wellbeing: we want everyone to feel healthy and happy, so we offer private medical insurance via UnitedHealthcare.
- Participation in the company shares program

Diversity, Equality, Inclusion and Belonging

We are an equal opportunity employer and we strive to reduce unconscious bias throughout our hiring process. All applicants will be considered for employment without regard to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity, or disability status. To ensure our recruitment processes provide an equal opportunity for all applicants to succeed, we encourage you to let us know if there are any adjustments that we can make.
