Rev. Jeffrey Paul
2008-06-26 09:22:04 UTC
Hi. I've a (theoretically) simple problem and I'm wondering how others
solve it.
I've recently deployed ~40 Linux instances on ~20 different Dell blades
and PowerEdges (we're big on virtualization), a few 7204s and 3560s, and
assorted switchable PDUs and whatnot.
We need to monitor standard things like CPU, memory, and disk usage on all
OSes. This is straightforward with net-snmp. It would also be cool if
I could monitor more esoteric things, like NTP synchronization status,
I/O statistics, etc.
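
As a rough illustration of the I/O statistics case: on Linux the raw
counters live in /proc/diskstats, and a small parser is enough to turn them
into something a poller could export. This is only a sketch under stated
assumptions: the field names below are labels chosen here for illustration
(the kernel file has no header row), and the function name is hypothetical.
Lines with fewer than 14 fields, which some 2.6 kernels emit for partitions,
are skipped.

```python
from typing import Dict

# Labels (chosen for this sketch) for the first 14 whitespace-separated
# fields of a /proc/diskstats line. Newer kernels append extra fields,
# which are ignored here.
FIELDS = (
    "major", "minor", "name",
    "reads", "reads_merged", "sectors_read", "ms_reading",
    "writes", "writes_merged", "sectors_written", "ms_writing",
    "in_flight", "ms_io", "weighted_ms_io",
)


def parse_diskstats(text: str) -> Dict[str, Dict[str, int]]:
    """Map device name -> counter dict for each full-length line in text."""
    stats: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < len(FIELDS):
            continue  # short (partition-style) or blank line
        row = dict(zip(FIELDS, parts))  # zip drops any extra trailing fields
        name = row.pop("name")
        stats[name] = {key: int(value) for key, value in row.items()}
    return stats


if __name__ == "__main__":
    with open("/proc/diskstats") as handle:
        for device, counters in parse_diskstats(handle.read()).items():
            print(device, counters["sectors_read"], counters["sectors_written"])
```

The values are cumulative counters, so a trending tool would need to store
successive samples and graph the differences.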
Other stuff we really need to keep an eye on is hardware - redundant
PSU status in our 7204s and Dells, temperatures and voltages (one of
our colos in New York peaked at over 40C a few weeks ago, for
instance), and disk array status (I'd like to know of a failed disk
in a hardware RAID5 before I get calls about performance issues). Our
blade chassis have DRACs in them and I think they export this data via
SNMP (I'm trying to avoid the use of SNMP traps), but not all of our
other PowerEdges have the DRACs in them so some of this information may
need to be pulled via IPMI from within the host OS. Presumably the
Cisco gear makes the temperature available via SNMP.
Finally, service checks - standard stuff (DNS, HTTP, HTTPS, SSH, SMTP).
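
For the TCP-based services in that list, the simplest possible check is
"does the port accept a connection within a timeout". The sketch below
shows only that; the function name and the example host are hypothetical,
and it says nothing about whether the service behind the port is answering
correctly (a real HTTP or SMTP check would speak the protocol, and DNS
would normally be tested with an actual query).

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical host name; ports are the usual ones for each service.
    for name, port in (("ssh", 22), ("smtp", 25), ("http", 80), ("https", 443)):
        state = "up" if tcp_check("host.example.com", port) else "DOWN"
        print(name, state)
```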
Now, to the questions.
1) Is SNMP the best way to do this? Obviously some of the data (service
checks) will need to be collected other ways.
2) Is there any good solution that does both logging/trending of this
data and also notification/monitoring/alerting? I've used both Nagios
and Cacti in the past, and, due to the number of individual things being
monitored (3-5 items per OS instance, 5-10 items per physical server,
10-50 things per network device), setting them both up independently
seems like a huge pain. Also, I've never really liked Nagios that much.
I recently entertained the idea of writing a CGI that output all of this
information in a standard format (csv?), distributing and installing it, then
collecting it periodically at a central location and doing all the
rrd/notification myself, but then realized that this problem must've
been solved a million times already.
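
To make the home-grown idea concrete, here is a minimal sketch of the two
halves: something on each host that renders its readings as CSV, and
something central that parses the CSV and flags values over a limit. The
column layout (host, metric, value), the function names, the metric names,
and the host name are all assumptions made up for this example, not an
existing format. Storing the values in RRD and sending notifications are
left out.

```python
import csv
import io
from typing import Dict, List, Tuple


def render_csv(host: str, metrics: Dict[str, float]) -> str:
    """Render one host's readings as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["host", "metric", "value"])
    for name in sorted(metrics):
        writer.writerow([host, name, metrics[name]])
    return buf.getvalue()


def find_alerts(csv_text: str, limits: Dict[str, float]) -> List[Tuple[str, str, float]]:
    """Return (host, metric, value) for every reading above its limit."""
    alerts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = float(row["value"])
        limit = limits.get(row["metric"])
        if limit is not None and value > limit:
            alerts.append((row["host"], row["metric"], value))
    return alerts


if __name__ == "__main__":
    # Hypothetical readings; a CGI on each host would print this text.
    text = render_csv("web01", {"temp_c": 41.5, "disk_pct": 60})
    print(text, end="")
    print(find_alerts(text, {"temp_c": 40.0, "disk_pct": 90.0}))
```

Even at this size it is clear how much is missing (scheduling, history,
escalation, per-host limits), which is the part the existing tools already
cover.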
There's got to be a better way. What do you guys use?
(I'm not opposed to non-free solutions, provided they work better.)
Cheers,
-jp
--
--------------------------------------------------------
Rev. Jeffrey Paul -datavibe- ***@datavibe.net
aim:x736e65616b pgp:0xD9B3C17D phone:1-800-403-1126
9440 0C7F C598 01CA 2F17 D098 0A3A 4B8F D9B3 C17D
"Virtue is its own punishment."
--------------------------------------------------------