My employer is in the same situation at the moment. We have been using a combination of a very customised Nagios, Cacti, and Perl syslog parsing scripts for years, and are currently evaluating various free and commercial offerings.

We would like to have a single package that can monitor and graph the figures it gets back from the various pollers or checks without duplicating SNMP gets/walks on every network device, and that can handle SNMP traps and other arbitrary events in some intelligent way.

For example, one of the more annoying things about Nagios is that you can't send it an alarm for something unless you've first defined that alarm. In other words, I can't receive a critical *-1-* message from a Cisco device and pass it on to Nagios intact. At best I have to create a generic "critical cisco event" alarm and submit it there, which can be problematic if I then receive another similar alarm from a different device while the first is already acknowledged. I could create hundreds of passive "critical cisco event" checks, one for each device, and do it that way, but then what if I get more than one critical event for the same device?

I also get very annoyed by the flap detection, which results in us getting a critical (hard) alarm for a device and then never seeing the OK message because flap detection quietly suppresses it. That might be a result of the way we've customised it, though; I'm not sure.
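For anyone who hasn't used passive checks, here is a rough sketch of the generic "critical cisco event" workaround mentioned above. It is not our actual script. It assumes a passive service with that name has already been defined for each host, and that Nagios is reading external commands from a command file. The host name, service name, and command file path are all made up for illustration; only the PROCESS_SERVICE_CHECK_RESULT command format comes from Nagios itself.

```python
import time

def passive_result(host, service, state, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line.

    state is the usual Nagios return code: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    """
    ts = int(time.time() if now is None else now)
    # A newline ends the command and fields are split on ';',
    # so strip both from the message text to be safe.
    clean = output.replace("\n", " ").replace(";", ",")
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{clean}\n"

def submit(command_file, line):
    """Append one command line to the Nagios external command file."""
    # The path is installation-specific; pass in whatever nagios.cfg
    # has configured as command_file.
    with open(command_file, "a") as fh:
        fh.write(line)

if __name__ == "__main__":
    # Hypothetical host and syslog message, printed rather than submitted.
    print(passive_result("core-sw1", "Critical Cisco Event", 2,
                         "%SYS-1-EXAMPLE: something bad happened"), end="")
```

Note that the host and service have to exist in the configuration before Nagios will accept the result, and a second event for the same host/service pair simply updates that one service's state, which is exactly the limitation I'm complaining about.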

However, I would very much like to hear more on this thread about what people are using, and have found to work. Even the commercial packages seem to have serious limitations on what they can do, and run aground when <unknown but critical device that can only be queried via expect scripts> is added to the mix and expected to be monitored and graphed.

On 30/08/2011 10:18 p.m., Jonathan Brewer wrote:
Hi Folks,

If you had it all to do over again, what would you use for network monitoring: Nagios, OpenNMS, or something else entirely?

I care about availability, latency, loss, jitter, and trap handling for interface up/down, loss of power, etc. Sensible behavior in situations where parent routers/links are flapping is also important.

I would very much appreciate input from folks monitoring 1000+ network elements.