A New York–based managed service provider with a 30-person network operations team was running into a problem common to a lot of growing NOCs: the information existed, but engineers couldn't get to it fast enough during an incident. Their internal runbooks had accumulated over years from tribal knowledge, vendor documentation, ticket notes, escalation procedures, and one-off fixes, all scattered across multiple systems.
Junior engineers were spending 30–45 minutes digging through documentation, searching old tickets, or messaging senior staff just to determine the next diagnostic step. Escalation became the default path instead of the exception. Tier 1 tickets were routinely being pushed to Tier 3 because engineers lacked confidence they were following the correct process.
During after-hours incidents, that dependency became even more expensive: senior engineers were being pulled into VPN failures, switch outages, routing issues, and monitoring alerts that should have been resolved at the first level. Resolution time for recurring incidents stretched to one to two days once queue delays, escalations, and handoffs were factored in. The real cost was not just engineer time; it was operational drag. Senior staff stopped focusing on architecture and preventative work because they were constantly being interrupted for troubleshooting support.
30–45 minutes just finding the right answer
1–2 days for recurring incidents
Tier 3 escalations for solvable problems
AOtech built a custom AI assistant specifically for the client's network engineering workflow instead of deploying a generic chatbot and calling it "AI." The foundation of the system was a verified knowledge base built from the client's internal runbooks, SOPs, escalation paths, historical fixes, vendor documentation, and troubleshooting standards. The information was chunked, classified, tagged by device type and incident category, and indexed into a custom retrieval pipeline designed for technical accuracy instead of conversational fluff.
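The tag-then-retrieve idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Chunk` schema, the sample runbook excerpts, and the keyword-overlap scoring are hypothetical stand-ins, not the client's actual pipeline, which likely used a more capable ranking method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One runbook excerpt plus the tags used to filter it (hypothetical schema)."""
    source: str
    device_type: str
    incident_category: str
    text: str


def tokenize(text: str) -> set[str]:
    # Lowercased word set with basic punctuation stripped.
    # Keyword overlap is a stand-in; a real pipeline might use embeddings.
    return {word.strip(".,:;()").lower() for word in text.split()} - {""}


def retrieve(chunks, query, device_type, incident_category, top_k=2):
    """Filter by tags first, then rank the survivors by keyword overlap."""
    query_terms = tokenize(query)
    candidates = [
        c for c in chunks
        if c.device_type == device_type
        and c.incident_category == incident_category
    ]
    scored = [(len(query_terms & tokenize(c.text)), c) for c in candidates]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


# Hypothetical sample content, invented for this example.
KNOWLEDGE_BASE = [
    Chunk("runbook-bgp.md", "router", "routing",
          "If a BGP neighbor is stuck in Active state, check reachability "
          "and the configured peer AS."),
    Chunk("runbook-vpn.md", "firewall", "vpn",
          "If the VPN tunnel is down, verify phase 1 proposals match on both peers."),
    Chunk("runbook-switch.md", "switch", "outage",
          "If a switch port is err-disabled, check the log for the cause "
          "before re-enabling."),
]

results = retrieve(KNOWLEDGE_BASE, "BGP neighbor stuck in Active", "router", "routing")
for chunk in results:
    print(chunk.source, "->", chunk.text)
```

Filtering on device type and incident category before ranking keeps a VPN procedure from ever being returned for a routing incident, which is one plausible reading of "designed for technical accuracy."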
The assistant was then connected to live operational data sources — real-time pulls from their RMM platform, SNMP monitoring systems, and device configurations. That architectural decision changed the assistant from a static documentation search tool into a context-aware operational system. Instead of telling an engineer how BGP troubleshooting generally works, the assistant could identify the actual affected router, surface current interface status, reference historical incidents tied to that device, recommend the correct diagnostic commands, and provide the organization's approved escalation path if thresholds were met.
Engineers no longer had to mentally correlate monitoring alerts, configs, and documentation across five systems while under pressure.
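A rough sketch of that correlation step follows, assuming the live data has already been pulled into plain dictionaries. The device name, interface states, incident history, and escalation thresholds are all invented for illustration, and no real RMM or SNMP API is called. The two commands are ordinary Cisco IOS show commands, used here only as examples of "approved" diagnostics.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    device_id: str
    incident_category: str
    minutes_open: int


@dataclass
class IncidentContext:
    device_id: str
    down_interfaces: list
    past_incidents: list
    commands: list
    escalate: bool


# Hypothetical stand-ins for data pulled from RMM, SNMP, and ticket history.
LIVE_INTERFACE_STATUS = {"edge-rtr-01": {"Gi0/0": "up", "Gi0/1": "down"}}
INCIDENT_HISTORY = {"edge-rtr-01": ["Upstream circuit flap on Gi0/1"]}
APPROVED_COMMANDS = {"routing": ["show ip bgp summary", "show ip interface brief"]}
ESCALATION_MINUTES = {"routing": 30}  # assumed threshold per category
DEFAULT_ESCALATION_MINUTES = 60       # assumed fallback


def build_context(alert: Alert) -> IncidentContext:
    """Join one alert with device state, history, commands, and escalation rule."""
    interfaces = LIVE_INTERFACE_STATUS.get(alert.device_id, {})
    down = sorted(name for name, state in interfaces.items() if state != "up")
    threshold = ESCALATION_MINUTES.get(
        alert.incident_category, DEFAULT_ESCALATION_MINUTES
    )
    return IncidentContext(
        device_id=alert.device_id,
        down_interfaces=down,
        past_incidents=INCIDENT_HISTORY.get(alert.device_id, []),
        commands=APPROVED_COMMANDS.get(alert.incident_category, []),
        escalate=alert.minutes_open >= threshold,
    )


context = build_context(Alert("edge-rtr-01", "routing", minutes_open=45))
print(context)
```

The point of the sketch is the single joined record: the engineer sees the affected device, its current interface state, prior incidents, suggested commands, and the escalation decision in one place rather than in separate tools.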
The operational change was immediate. Engineers resolved incidents 60% faster because the first troubleshooting step was usually the correct one instead of a guess. Average resolution time for recurring incidents dropped from between one and two days to under an hour.
Escalations fell because Tier 1 engineers stopped getting stuck at the "what do I do next?" stage. A network engineer responding to an outage at 2 AM could immediately surface likely root causes, validated commands, affected dependencies, and escalation criteria without opening six browser tabs or pulling another engineer off the bench.
Troubleshooting stopped depending on who happened to be working that shift and started depending on a system that preserved operational knowledge at scale.
60% faster resolution across the team
Under an hour for recurring incidents
After-hours incidents resolved without waking senior staff
"The problem was never that we didn't know how to fix things. The problem was that finding the right answer took longer than the fix. That's gone."
Senior Network Engineer · NY-based MSP