I’m reminded of a decidedly un-fun discussion my team had after an outage our monitoring system missed: the fact that monitoring isn’t sending any alerts doesn’t inherently mean everything is OK.
You need to monitor connectivity as well as specific problem states. “No data” is a problem state that often gets overlooked, and these systems don’t always handle it out of the box for every resource type.
And you probably want a heartbeat notification. Yes, it’s noise, but if you hear nothing from monitoring, you have to ask whether monitoring itself is what broke. Having it send a periodic “yes, I am online” message is useful.
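To make the “no data” and heartbeat ideas concrete, here is a rough Python sketch. It isn’t tied to any particular monitoring product; the function names (`stale_sources`, `heartbeat_due`), the source names, and the thresholds are all made up for illustration, and you would wire the results into whatever notifier you already use.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_seen, now, max_age):
    """Return the names of sources that have gone quiet.

    last_seen maps a source name to the timestamp of its newest data
    point, or None if it has never reported. A source counts as stale
    when it has no data at all or its newest point is older than max_age.
    """
    return sorted(
        name
        for name, ts in last_seen.items()
        if ts is None or now - ts > max_age
    )


def heartbeat_due(last_heartbeat, now, interval):
    """Return True when it is time to send another "I am online" message."""
    return last_heartbeat is None or now - last_heartbeat >= interval


# Hypothetical example data: three sources with different freshness.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "nas": now - timedelta(minutes=2),      # fresh
    "router": now - timedelta(minutes=30),  # quiet for too long
    "ups": None,                            # never reported
}

quiet = stale_sources(last_seen, now, max_age=timedelta(minutes=10))
send_heartbeat = heartbeat_due(
    now - timedelta(hours=25), now, interval=timedelta(hours=24)
)
```

With this data, `quiet` contains `router` and `ups`, so “no data” is treated as an alertable state instead of silence. The heartbeat only helps if someone (or some other system) notices when it stops arriving.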
You need monitoring
One alert daily reporting that there are no alerts is probably good for a home lab…
Kubernetes? New Relic?