As more and more things move to the cloud, I am wondering how you sysadmin types handle outages. We usually get a help desk request about some error and investigate it as if it were a problem we have control over. If we can rule that out, we contact the appropriate cloud support team. With AthenaHealth there are probably dozens of clouds, including redundant ones.
I have been using monitoring tools for the clouds specific to our organization. I am still building the setup, and it will always be a work in progress since cloud vendors rarely want to disclose the specifics needed to set up such monitoring. I want to know if the clouds are down so that I don't waste my time looking into an issue over which I have absolutely no control. What I use is an on-prem version of what you would see at downdetector.com and downrightnow.com. When you need to monitor specific things like our Athena clouds, there is nothing comparable. A lot of the time when I open a case with support, they don't even know it is down and tell me everything is OK. That sometimes sends me back to troubleshooting, thinking it is something I am responsible for.
When I recently added some monitoring, specifically plug-ins, Athena support immediately contacted me and asked me to stop checking because I was filling up the log file with "gets". I complied, but I did ask to look into this further (with technical people) and tweak things so that I can know whether a specific cloud resource is up or not. After that initial call, I and others at my place of employment got some really special treatment, even from sales people, as if they were trying to figure out what I was trying to do. They cited a case I had open where I asked for specific server information, even though I had explained multiple times, to multiple people, the need to have such information. They just don't get it. I did hear that there is an internal tool used there, but it doesn't seem to be available to the support team that I talk to, given that they think everything is working fine.
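For illustration only, a low-impact check along these lines could be sketched in Python as follows. It assumes the vendor exposes some URL that answers a HEAD request (the URL below is a placeholder, not a real Athena endpoint), and the names probe and ThrottledCheck are made up for this example. The idea is to send a lightweight HEAD instead of a full GET, and to enforce a minimum interval between requests so the vendor's logs are not flooded.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10):
    """Send a single HEAD request and classify the result."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            return "up"
    except urllib.error.HTTPError as err:
        # The server answered. Treat 5xx as down; a 4xx still means
        # the service is reachable and responding.
        return "down" if err.code >= 500 else "up"
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and so on.
        return "unreachable"


class ThrottledCheck:
    """Cache the last result so the remote side sees at most one
    request per min_interval seconds, however often we ask."""

    def __init__(self, url, min_interval=300, probe_fn=probe,
                 clock=time.monotonic):
        self.url = url
        self.min_interval = min_interval
        self.probe_fn = probe_fn
        self.clock = clock
        self._last_time = None
        self._last_status = None

    def status(self):
        now = self.clock()
        if (self._last_time is None
                or now - self._last_time >= self.min_interval):
            self._last_status = self.probe_fn(self.url)
            self._last_time = now
        return self._last_status


if __name__ == "__main__":
    # Placeholder URL (assumption); substitute an endpoint the vendor
    # has agreed may be polled.
    check = ThrottledCheck("https://status.example.invalid/health")
    print(check.status())
```

The interval of 300 seconds is arbitrary; the right value, and whether HEAD is acceptable at all, would have to be agreed with the vendor's technical people.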
As I write this, Med Management (drfirst.com) has been down for hours (unless it is back up and they haven't told us yet; I have not seen an email telling us it is back up). I spent about 20 minutes trying to figure it out before we received an email, but that is still 20 minutes of work for NOTHING.
How do you all handle things like cloud outages? Is it just a matter of lowering the expectations of your doctors? I haven't tried to do that, but each outage is almost a reflection on our internal IT staff because it always has been that way (before the clouds). Our on-prem uptime has been good to great, as we have hypervisor redundancy, UPS, generator backup, HVAC controls, etc. With each hardware refresh we do a better job, only to end up with so many cloud things which do go down.
I am hoping to start a discussion about what tools you all use and what your experiences have been.
Mike Zavolas
Tallahassee Neurological Clinic