Ever look at a screen’s output and get that puckered feeling in the pit of your stomach? If you have been working in this profession for any amount of time, you know the feeling I’m talking about. The feeling that makes you think you would rather be living in Montana making woodcarvings at a roadside stand than being a DBA. In my next two posts, I’ll be taking a somewhat lighthearted look at the perils of our profession and discuss ways to reduce problem occurrences.
The Perils of our Profession
One of the common challenges that all DBAs face, no matter what vendor’s database they work on, is the absolute attention to detail our profession demands. Switch a couple of characters in a script, forget to set your database identifier at the prompt, set the wrong flag at the wrong time and the end result usually isn’t very pretty. Many commands we issue on a regular basis are destructive by their very nature. This is the reason why I have a great respect for all technicians who have selected database administration as their chosen profession.
I know they have all experienced that uncontrolled “eye-twitching” at 2 AM when they are ready to hit the final enter key to execute the command. You know what command I’m talking about too. It’s that one command that you really, really, really hope is going to come back with a successful return code and ultimately end with a database that is finally usable. Whether it’s a recovery, a file fix or corrupt data is immaterial. It’s the wait that we are talking about.
There is no longer a wait in our profession than waiting for the message below after a database recovery:
SQL> Database opened.
Time seems to stand still. The longer the recovery, the messier the recovery, the more critical the database – the longer you wait. It’s the latest and greatest SAN technology that just totally failed wait. You know, the SAN that the disk storage group told you was “totally fault tolerant.” It’s the ritual cross your fingers, spin around three times, face towards Oracle headquarters and pray to everything that is Larry Ellison wait. I don’t care how sure you are of your capabilities, or how much of an Oracle “Ace” you are – you know the anticipation I’m talking about. It happens to all DBAs.
You then either breathe a sigh of relief or recoil in absolute disgust when an error message appears. How about my favorite Oracle message, “File 1 needs more recovery to be consistent,” or “File 2 was not restored from a sufficiently old backup”? Those messages are enough to make anyone cringe. I’m an ex-Oracle instructor. I’ve seen those messages A LOT in class. That was usually after the student said, “Well, I moved the control files and forgot to notify the database, deleted one of the log groups by mistake, and then renamed two of the data files in the operating system. Can you help?”
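If you have been spared this ritual so far, the wait looks something like the session below. This is a sketch, not a transcript from any real system: the file numbers and exact message text vary by release, and the happy path assumes every archived log the recovery asks for is actually on disk.

SQL> STARTUP MOUNT
SQL> RECOVER DATABASE
Media recovery complete.
SQL> ALTER DATABASE OPEN
Database altered.

When a needed log or a recent enough backup is missing, the open fails instead with something like ORA-01194 (“file 1 needs more recovery to be consistent”), and the eye-twitching starts all over again.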
At a previous job, a disaster occurred that required us to recover a multi-terabyte warehouse, and I had to run through 36 hours of tapes to restore it. THAT was the longest wait for a database open message I ever experienced. We restored it successfully because we were prepared: we documented our recovery steps, agreed upon our plan of attack and executed it.
Not only must we try to prevent our own mistakes, we must safeguard our environments against the mistakes of others. Operating system administrators, disk storage technicians and application developers are just like us. We are all part of the human community that makes mistakes from time to time.
If you never make mistakes, send me a resume. I’m always looking for a “Patron Saint of Databases” here at RDX. It will also save us on travel costs because I’m sure you’ll be able to spread your wings and fly here on your own. It is your responsibility as a database professional to identify your weak points and take steps to reduce your chance of causing an issue. It is your job as a database manager to architect an environment that reduces the chance of errors occurring.
But as my old boss Dan Pizzica used to tell me (when I was a VERY junior DBA) “It really doesn’t make a difference who broke the database. You are the technician who is ultimately responsible for fixing it. The buck stops with you. If you can’t protect your environments, you aren’t doing your job.” We all know he’s absolutely correct. That’s the strategy we have implemented from day one at RDX. We assume ownership for all databases we support.
Then there are the software glitches. The problems that pop up out of the blue and make you go:
“WHAT THE? – How did THAT happen? I’ve done this 317 times in a row and it worked every time.”
For you math majors, here’s my calculation for this:
CLOSER YOU ARE TO PRODUCTION TURNOVER
+ THE GREATER THE VISIBILITY OF THE PROJECT
= THE MORE LIKELY A PREVIOUSLY UNKNOWN SOFTWARE GLITCH WILL OCCUR
I don’t care what software you are using, you will run into the “only occurs on this release, on this version of the operating system, using this particular feature on the third Tuesday of the sixth month when it’s cloudy outside” BUG. Be sure to expect management to stop by and ask “well, why didn’t you test this on the third Tuesday of the sixth month when it was cloudy outside?”
The more complex the database ecosystem, the more paranoid I become, which is why I’m not a follower of the “databases are getting so easy – we won’t need DBAs” mantra that mindless industry pundits profess on a seemingly regular basis.
So now we know that our jobs are somewhat unforgiving and we do make mistakes from time to time. What can we do to reduce the chance of an error occurring? Find out next week in Part II of my post on Paranoid DBA Practices.
Thanks for Reading,
Director of Service Delivery