Real-Life IT Horror Stories: The Avoidable DNS Disaster

Part 3 of the Real-Life IT Horror Stories Series

Editor’s note: In celebration of Halloween, we’ve asked a few of our instructors to share some of the horror stories from their own consulting careers. This four-part series includes tales of espionage, employee sabotage and website theft. Read on if you dare.

While acting as vice president of a performance-based Internet marketing company, I oversaw the migration to a new infrastructure, an important part of which involved changing over 300 of our DNS records. Imagine my horror when I discovered that we’d misrouted nearly all of those records, taking the entire company down.

DNS Goes Down Taking Revenue Along With It

Performance-based marketers take all the risk, such as buying media and sending emails, and are only reimbursed when, and if, a resulting sale or conversion occurs. Our company ran a strong chance of bleeding cash just by advertising a product that didn’t convert. Days like our DNS outage were tough not only because they took the company offline, but mainly because they were completely avoidable.

What we did wrong is a checklist of all the top no-nos:

  • Too few resources – First, our tiny team was already stretched to the limit, with everyone juggling six to eight mission-critical deliverables every day. We didn’t give this migration the elevated “holy” status it needed to succeed.
  • Not vetting employee qualifications – I assumed the person I’d assigned the task to (let’s call him “Mr. P.”) had DNS experience because of his two decades in software development. I only asked high-level questions like, “Are you OK doing this?” and “Can you do it on Friday?”
  • Bad timing – What was maybe the biggest mistake was performing a major change like this on a Friday, or more specifically, EOD Friday. Too many times at this company, employees sacrificed their weekends because executives insisted on Friday releases.
  • Inadequate TTL settings – I did not personally ensure that the time to live (TTL) settings, which control how long resolvers may cache a DNS record before looking it up again, were set to the lowest value possible. Lower TTLs would have let us roll back any mistakes quickly.
  • A false sense of security – I shoulder surfed the start of Mr. P.’s work on that Friday, witnessing five to seven of the DNS records get correctly reassigned. That made me think, “Of course he’ll do the other 300 correctly.”
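The TTL item in particular is easy to check by machine before any change goes live. The following is a minimal sketch, not what our team actually ran: the `DnsRecord` type, the hostnames, the addresses and the 300-second threshold are all assumptions made for illustration, and in practice the record list would come from an export of whatever DNS provider you use.

```python
from dataclasses import dataclass

# Assumed threshold: five minutes, the value the team believed was in place.
MAX_TTL_SECONDS = 300


@dataclass(frozen=True)
class DnsRecord:
    name: str         # hypothetical hostname
    target: str       # hypothetical address the record points to
    ttl_seconds: int  # how long resolvers may cache the answer


def records_over_ttl(records, max_ttl=MAX_TTL_SECONDS):
    """Return the records whose TTL exceeds max_ttl, longest TTL first."""
    too_long = [r for r in records if r.ttl_seconds > max_ttl]
    return sorted(too_long, key=lambda r: r.ttl_seconds, reverse=True)


def worst_case_rollback_seconds(records):
    """A corrected record can take up to the longest TTL to replace cached copies."""
    return max((r.ttl_seconds for r in records), default=0)


# Illustrative data only; these are documentation-reserved names and addresses.
zone = [
    DnsRecord("mortgage.example.com", "192.0.2.10", 300),
    DnsRecord("acne.example.com", "192.0.2.20", 21600),    # six hours
    DnsRecord("phones.example.com", "192.0.2.30", 43200),  # twelve hours
]

offenders = records_over_ttl(zone)
for record in offenders:
    print(f"{record.name}: TTL {record.ttl_seconds}s exceeds {MAX_TTL_SECONDS}s")

hours = worst_case_rollback_seconds(zone) / 3600
print(f"Worst-case rollback wait: {hours:.1f} hours")
```

Keep in mind that lowering a TTL only helps once the old, longer TTL has expired in resolver caches, so a check like this belongs well before the migration window rather than on the day itself.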

The Nightmare Continues

That next day (Saturday, of course) while enjoying brunch with my family, I got a panicked call from our senior vice president of ad buying:

“All the URLs are going to other websites. The mortgage URL is going to the acne products, the acne products URL brings up the cell phone landing page. I’ve shut off all campaigns.”

I pulled up some URLs, and sure enough, each one was completely misrouted, at the exact time (the weekend) when they were supposed to be making the most money. I tried to call Mr. P., but only got voicemail. I immediately went into the office, only to find that we’d not documented how to access the DNS provider (again, because we were all so busy, and because I’d not made this a priority for Mr. P.).

Lessons Learned

When we were finally able to track Mr. P. down, hours later, he discovered that the TTLs had not been set to five minutes as we’d thought. Most were still at six hours or more, which meant resolvers could keep serving the wrong cached answers for that long even after the records were corrected. Only after much internal stress, completely destroyed weekends, apologies to clients and the loss of over $20,000 did we get our 300 DNS records going back to the correct servers. Looking back, this incident really reinforced the fact that you can never take anything for granted. Just because somebody should know how to complete a task does not mean that they do. My advice to you is to make sure your team has the necessary skills to complete a task before you turn them loose. It’s much easier to address items with your staff than with your clients – trust me.
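Not taking anything for granted can also mean checking every record instead of watching the first handful. Below is a rough sketch of such a check, under stated assumptions: the `find_misrouted` helper, the hostnames and the addresses are hypothetical, and the default lookup uses Python’s standard `socket.gethostbyname`, which asks the local resolver for a single IPv4 address and may itself return a cached answer.

```python
import socket


def find_misrouted(expected, resolve=socket.gethostbyname):
    """Compare every expected hostname -> IP pair against what resolve() returns.

    Returns a dict of hostname -> (expected_ip, actual_ip) for each mismatch.
    A failed lookup is recorded with actual_ip set to None.
    """
    mismatches = {}
    for hostname, expected_ip in expected.items():
        try:
            actual_ip = resolve(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            actual_ip = None
        if actual_ip != expected_ip:
            mismatches[hostname] = (expected_ip, actual_ip)
    return mismatches


# Illustrative data only: what the migration plan says each name should point to.
expected = {
    "mortgage.example.com": "192.0.2.10",
    "acne.example.com": "192.0.2.20",
    "phones.example.com": "192.0.2.30",
}

# A stand-in for live DNS so the sketch runs without network access.
live = {
    "mortgage.example.com": "192.0.2.20",  # wrongly points at the acne server
    "acne.example.com": "192.0.2.30",      # wrongly points at the phones server
    "phones.example.com": "192.0.2.30",
}


def fake_resolve(hostname):
    """Look a name up in the stand-in table, failing the way a resolver would."""
    if hostname not in live:
        raise OSError(f"no record for {hostname}")
    return live[hostname]


problems = find_misrouted(expected, fake_resolve)
for hostname, (want, got) in problems.items():
    print(f"{hostname}: expected {want}, got {got}")
```

Passing the lookup in as a parameter is a deliberate choice: the same comparison can be rehearsed against a table like `live` before the change and then run against real DNS afterward by leaving `resolve` at its default.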

Do you have any stories of your own? If so, please share below or on Twitter and make sure you use #ITHorrorStories.

Related Real-Life IT Horror Stories
When Your Website Goes Dark and Tales of DNS Malfunctions
The Espionage Denial Nightmare
The Day the Logic Bomb Went Off
