My employer is in the same situation at the moment. We have been using a
combination of a heavily customised Nagios, Cacti, and Perl syslog-parsing
scripts for years, and are currently evaluating various free and
commercial packages.
We would like to have a single package that can monitor and graph the
figures it gets back from the various pollers or checks without
duplicating SNMP gets/walks against every network device, and that
can handle SNMP traps and other arbitrary events in some intelligent
way. For example, one of the more annoying things about Nagios is that
you can't send it an alarm for something unless you've first defined
that alarm. In other words, I can't receive a critical *-1-* message
from a Cisco device and pass it on to Nagios intact; at best I have to
create a generic "critical cisco event" alarm and submit it there,
which can be problematic if I then receive another similar alarm from a
different device while the first is already acknowledged. I could create
hundreds of passive critical Cisco event checks, one for each device,
and do it that way, but then what happens if I get more than one critical
event for the same device? I also get very annoyed by the flap detection,
which results in us getting a critical (hard) alarm for a device, and
then never seeing the OK message because flap detection quietly
suppresses it. That might be a result of the way we've customised it,
though; I'm not sure.
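For anyone who hasn't had to do it, the generic-alarm workaround above amounts to writing a PROCESS_SERVICE_CHECK_RESULT line into the Nagios external command file. The Python sketch here is only an illustration under some assumptions: the host and a passive service called "critical cisco event" already exist in the Nagios config (which is exactly the limitation described), the command file sits at a common default path that may differ on your install, and Cisco severities 0-2 are mapped to CRITICAL. The function names are made up for the example.

```python
import re
import time

# Cisco IOS syslog messages carry "%FACILITY-SEVERITY-MNEMONIC: text".
CISCO_MSG = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+):\s*(?P<text>.*)"
)

NAGIOS_CRITICAL = 2  # Nagios return code for CRITICAL


def passive_result(host, syslog_line, service="critical cisco event", now=None):
    """Build a PROCESS_SERVICE_CHECK_RESULT external command, or None.

    Hypothetical helper: only severities 0-2 (emergency, alert, critical)
    are forwarded, all onto one pre-defined passive service per host.
    """
    m = CISCO_MSG.search(syslog_line)
    if m is None or int(m.group("severity")) > 2:
        return None
    ts = int(now if now is not None else time.time())
    output = "%{facility}-{severity}-{mnemonic}: {text}".format(**m.groupdict())
    # Semicolons delimit command fields, so replace them defensively.
    output = output.replace(";", ",")
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{NAGIOS_CRITICAL};{output}\n"


def submit(command, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write one command to the Nagios command pipe (path is an assumed default)."""
    with open(command_file, "w") as f:
        f.write(command)
```

Note that every event from a host lands on the same service, so a second severity-1 message only replaces the plugin output of the first, and if the first was acknowledged the service never leaves CRITICAL, so the new event stays hidden behind that acknowledgement. That is the problem described above in a nutshell.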
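On the flap detection point: as documented, Nagios keeps the last 21 check results for each host or service, works out a percent state change with recent changes weighted more heavily, and uses separate high and low thresholds, so something has to settle well below the trigger level before it stops being considered flapping. Notifications are suppressed while it is flapping. A simplified Python sketch of that idea follows; the linear 0.8 to 1.2 weighting and the 5%/20% thresholds are assumptions for illustration, not necessarily what any given install uses, and the function names are hypothetical.

```python
def percent_state_change(states):
    """Weighted % of transitions in a state history (oldest first)."""
    n = len(states) - 1  # number of possible transitions
    if n < 1:
        return 0.0
    # Assumed linear weights: 0.8 for the oldest transition, 1.2 for the newest.
    weights = [0.8 + 0.4 * i / (n - 1) if n > 1 else 1.0 for i in range(n)]
    changed = sum(w for i, w in enumerate(weights) if states[i] != states[i + 1])
    return 100.0 * changed / sum(weights)


def update_flapping(is_flapping, pct, low=5.0, high=20.0):
    """Hysteresis: start flapping above `high`, stop only once below `low`."""
    if not is_flapping and pct > high:
        return True
    if is_flapping and pct < low:
        return False
    return is_flapping


def should_notify(prev_state, new_state, flapping):
    """State changes (including recoveries) are held back while flapping."""
    return new_state != prev_state and not flapping
```

Fed a burst of alternating OK/CRITICAL results that ends in OK, the percentage stays well above the low threshold for a while after the device settles, so in this sketch the recovery is never notified until the burst ages out of the history. Nagios's real behaviour has more nuance (it sends flapping start/stop notifications, for one), so treat this as a way to reason about the thresholds rather than a model of Nagios itself.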
That said, I would very much like to hear more on this thread about what
people are using and have found to work. Even the commercial packages
seem to have serious limitations on what they can do, and run aground
when <unknown but critical device that can only be queried via expect
scripts> is added to the mix and expected to be monitored and graphed.
On 30/08/2011 10:18 p.m., Jonathan Brewer wrote:
If you had it all to do over again, what would you use for network
monitoring: Nagios, OpenNMS, or something else entirely?
I care about availability, latency, loss, jitter, and trap handling for
interface up/down, loss of power, etc. Sensible behavior in situations
where parent routers/links are flapping is also important.
I would very much appreciate input from folks monitoring 1000+ network