Outsourced Network Monitoring: Week-One Checklist

OUTSOURCED NETWORK MONITORING: WEEK-ONE CHECKLIST | ITcare

A network that has been stable for three years is not one we trust yet. Stable often just means nobody has touched the part that is quietly broken.

When we take over monitoring for a new ISP, data center, hosting provider, or WISP, the first week is not about lighting up dashboards. It is about finding the failures that have not happened yet.

Some of what we actually check:

Management plane exposure. SSH and SNMP reachable from the public internet, no VTY ACLs, no control plane policing. Everything works fine right up until it very much does not.

BGP, but not whether the sessions are up. We look at the policies. Are inbound prefixes from peers and customers actually filtered, or is the session accepting whatever shows up? Is there a max-prefix limit on every eBGP session? Do the outbound policies announce only what they should, or is the network one missing filter away from leaking transit between two providers? We check it against MANRS as a baseline, because most setups follow it on paper and not in the config.

NTP. Boring, ignored, and the reason half of incident timelines are useless. If the edge routers and the syslog server disagree on what time it is, every correlation made during an outage is fiction.

Config backups. When was the last one, and does running-config still match startup-config? A box that has not rebooted in two years often will not survive the next reboot.

Redundancy that only exists on paper. Two upstreams is not redundancy if the failover was never tested. We check whether the upstream paths are actually set up to fail over on their own, whether there is internal path diversity through the route reflectors and alternate paths across the network, and whether those alternate paths are genuinely alive and able to carry traffic, not waiting on someone to trigger the switch by hand.

Optical health. DOM readings on every uplink. Marginal transceivers throw CRC errors long before they fail outright, and the errors get blamed on everything except the optic. For WISPs we go further and check whether the wireless links are actually monitored: RSSI, SNR, modulation rate, retransmits, and capacity headroom, so signal degradation is caught as a trend, before it becomes a customer-down ticket.

None of this shows up as red on a status page. That is the point. By the end of the first week we are not monitoring a network. We understand how it breaks.