This feature is the eighth in a series focused exclusively on issues impacting higher ed IT administrators, running through the beginning of the annual Educause conference, Oct. 25-28.
Technology is meant to make things simpler and easier, but it remains susceptible to human error, malice and the adverse effects of dated systems. And when things go wrong, they can often go horrifically wrong. In our research for this series, we asked eight higher ed CIOs to share their scariest campus IT horror stories. This is what they had to say.
You know, I think we've been pretty lucky in that we haven't had any huge things blow up, but I remember several years ago, we had faculty who were doing a lot of their own web page work. We had a faculty member who decided that he wanted to have students log in using Social Security numbers. When we discovered that, it was one of those huge, "Oh my god, you can't do this!" moments. So we began bringing down those systems and cleaning up that mess. But luckily we only had three or four people who actually had their Social Security numbers out there in that system.
But we've actually been pretty blessed with no major horror stories. We've had our smattering of things. We've had a couple of instances where students attempted to use keyloggers to steal credentials. It may have been more than that; we've caught a couple of them. We have pretty good logging systems that help us track stuff. But I'd say over the years, we've been pretty good: nothing huge has hit us, and we've never been stung with any kind of a "bad news to the president" scenario.
At a former campus, we had, for that time, a rather large security breach in a service that was not run by central IT. It was sufficiently large that even though the information security folks had an incident response procedure, they didn’t really have one for a major information security breach that required external notification, and that was just a mess.
I wasn't brought in for six weeks. It had been churning for six weeks, and finally I was approached and asked, "Can you clean this up?" Not having a major breach incident response procedure in place, with all the things you need to think about when you respond to a major breach, was just beyond the pale.
So if there’s a lesson learned there, it’s to have a major incident response procedure in conjunction with your normal incident response procedure for information security, because there will be extra elements for a major breach that people don’t usually think about.
I was only a few weeks into a new CIO gig. It was a Friday afternoon, and we had scheduled an expansion of a large NAS (network attached storage) array that housed all of our financial data. We had just closed the fiscal year a couple of weeks before and thought we had our plan of attack well in hand, with a longtime and trusted service provider that managed our servers and backups.
A tech who worked for the service provider skipped several important checkpoints in our work protocols for the expansion and wiped out two terabytes of our fiscal year data with, literally, the press of a button. Poof. Gone.
This was bad enough. But when we went to restore our backups, we found that the NAS backup service had not been checked, and the daemon process that was supposed to be monitoring our backups had not completed for well over 100 hours. The status was still showing green because no errors had (technically) occurred, but neither had anyone noticed that a process that should have run in under a handful of hours had instead been stalled for days. So our most recent recoverable data was the last successful backup, more than 100 hours in the past.
I had just moved my family halfway across the country, and wasn't too keen to have an RGE (resume-generating event) happen so early in my tenure. We only escaped a catastrophic meltdown because the event happened when there was almost zero activity on our financial systems, so soon after the close of the fiscal year in the early summer (a relatively quiet time in higher ed administration). We restored what we could, performed a few manual entries and were back in business.
The service provider tech was fired, and we extracted a sizable penalty from the vendor; new checks, processes and procedures were strictly adhered to thereafter. Life went on. The vendor relationship was severely damaged, but not irreparably so.
This is still my personal "gold standard" when I think, "Well, things could be worse."
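The backup failure in that story is a monitoring lesson as much as a vendor lesson: a stalled backup job raises no errors, so an error-only check stays green. As a minimal, hypothetical sketch (none of the paths or thresholds below come from the CIO's actual environment), a freshness check could alert whenever the last successful backup is older than an agreed limit, assuming the backup job touches a marker file when it completes:

```python
# Illustrative backup freshness check: alert on staleness, not just on errors.
# The marker path and 24-hour threshold are assumptions for the sketch.
import sys
import time
from pathlib import Path

MAX_AGE_HOURS = 24                              # longest acceptable gap between successful backups
MARKER = Path("/var/backups/nas/last_success")  # touched by the backup job when it finishes cleanly


def hours_since_last_success(marker: Path) -> float:
    """Return hours elapsed since the backup job last reported success."""
    if not marker.exists():
        return float("inf")                     # no record of success at all
    return (time.time() - marker.stat().st_mtime) / 3600


def main() -> int:
    age = hours_since_last_success(MARKER)
    if age > MAX_AGE_HOURS:
        # A stalled job produces no errors, so "no errors" must not be read as "healthy."
        print(f"ALERT: last successful NAS backup finished {age:.1f} hours ago")
        return 1
    print(f"OK: last successful NAS backup finished {age:.1f} hours ago")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Run on a schedule from cron or a monitoring agent, a check like this treats silence as a failure rather than as good news.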
I will say that, at a previous institution, there was a data center that had been built before I even joined. By the time I became CIO, we found out the hard way that the repurposed space, which had worked well for many years and is still in the same area, sat right over plumbing that carried waste and sewage. We had a bad day when the sewage backed up under the raised floor, and we had inches of sewage-laden water across the whole data center floor.
I will just say — and I’ll stop there — that those weren’t pleasant days for us at our institution.
About a year ago, we worked with a group of students who were gamers. They had developed a UC League of Legends group, and they were going to host the All Midwest eSports Gaming Event in October. They were expecting about 500 students to come from the Midwest, and it ended up being about 850.
But the horror story is that, very close to the event, they realized they were going to need a lot of help to pull off 1,000 additional data jacks in the venue — that’s about a mile of Ethernet cable — and at least 1,000 temporary power outlets. That was really an example of the IT and facilities folks coming together with the students to pull it off. I remember visiting the event very early on a Saturday morning, and they had been crimping cable and installing electrical outlets until about 5 AM. So it could have been a crisis, but the village came together and pulled it off, and we had a very successful event.
At one of my institutions — not the one that I’m currently at — we had a complete outage of our data center from 2 to 4 AM, which took all of our servers and systems offline. It took our website offline. It was a pretty big deal.
We recovered everything fairly quickly. When you take everything offline without warning, just a hard shutdown, you can have system failures. We did not have that.
As it turns out, the janitors were polishing the floors in the hallway of our building, and the floor polisher tripped a circuit breaker. The IT office was adjacent to that hallway. There were circuit breaker panels in the public part of the office that would have solved the problem, but the janitors actually let themselves into our data center and flipped a bunch of breakers there, which took the entire data center offline.
In small institutions, it’s not uncommon for facilities staff and janitors to have master keys that let them in everywhere, and we had never been able to successfully fight that battle until that day, when we were able to much more tightly limit access to our data center.
We had a phone system at my previous institution that was end of life, end of support, no longer maintained by the company. We were already in the process of replacing it with a different system when it failed. So, just a month or two before we were going to roll out the whole new system, half of our phone system died. We couldn't make any changes, and voicemail failed.
Because it wasn't supported anymore, we couldn't get support from Cisco, the actual vendor. So we had to pay a third-party company x-number of thousands of dollars to fix a system that we were then going to tear out and replace in another month.
... It isn't until you have that horror story that you realize, "We never tracked that," or "We never had time to go back and track it." Of course, we learned our lessons and we're better about that now. It goes back to those legacy situations: we've got to make sure we don't keep running a phone system that's no longer supported. We've got to replace it at least a year before that point, for instance.
We have the same situation with the phone system, by the way, at my current institution. I'm just hoping it doesn't turn into a horror story. It's stable right now.
We did have a major power outage. It was a city issue that affected our campus proper; we also have some extended pieces of campus across a major thruway. Our campus safety office was actually in a building called College Park Hall, which is a block or two away from the main campus. So the main campus lost power, the whole campus, but [campus safety] had power. But at the time they were on a spur — we didn't have them closed on a network loop — so they couldn't get to the servers here on campus to send out notifications. The only thing I could get to with my phone was our Twitter account. I'm madly trying to tweet out, "No power, this is down, that's not down." That lasted a couple of hours.
But we were on the phone; our ISPs were calling us saying, "Hey, we're noticing there's a problem." We're like, "Yeah, we lost power."
I believe at that time, it was a double whammy. We lost power, and then somehow our generator didn't kick in immediately. So the data center was dying on me; it wasn't good. It took us a couple of hours, and we were sitting here trying to figure out, "OK, I don't have power in the data center. I don't have power to my computers. But I could try to get to some system to tweet out." ... I couldn't email. I literally had only Twitter to post alerts to people.
... I took full advantage of that particular crisis to say, "See, this is why we need some redundancy for campus safety." And I got funding to complete a loop for them so that they would not be cut off like that. It was a horror story, but with a good ending for me at least.