Tuesday, May 27, 2008

Managing a server critical to your business

At some point, you'll end up buying or assembling a server that provides some critical service that your whole business relies on. For a retailer, the Point Of Sale system going out would mean they couldn't (easily) sell merchandise until it was fixed. For a manufacturer or distributor, the inventory and purchasing system going down means no one can record what they are doing. For a payroll company, when the server running the payroll and accounting software fails, they can't calculate or write checks. Similarly, the media or file server for an advertising or architecture firm has to be up and running or their staff can't _do_ anything for their customers.

All of these scenarios point out how important this server is to your business. You may not plan it that way: one particular server may simply grow into a critical role, or you might intentionally load a server with mission-critical software. Either way, the one thing to remember is that you must protect your critical server(s).

Just like a car, these servers require regular maintenance to continue running reliably, they require insurance to cover the times when they crash or fail, and you have to build up a plan for continuing to do business even when the server fails.

Here at Allegro Consultants, we provide hosting, monitoring and maintenance for companies that have mission-critical servers but don't want the cost of a full IT staff to maintain them. Here is a short list of what should be done for these important servers, whether YOU do it or you have a professional support firm like us do it for you:

Firewalls
Your mission-critical server should be protected from outside attacks by a perimeter firewall. This is a firewall between your internal company network and the company (ISP) that provides your connectivity to the Internet. This is a very basic kind of protection.

This server should also be protected from internal attacks, those coming from a PC or server inside your company, so it can survive an attack even after one of your other internal machines is hacked into.

Power
You need to keep the mission-critical servers running even when the power goes out. Many offices have some kind of generator that kicks in after a minute or so of power loss. While it's good to have a generator, that minute of power loss before it kicks in will crash your server.

You'll need a good UPS (uninterruptible power supply) connected to the server to keep it running until the generator has a chance to start providing power. UPSes also condition the incoming electricity so spikes, brownouts and switchovers don't hurt the server.

Spikes and surges
Every line connected to the server can be the path that an electrical surge or spike uses to reach your precious server: the power line, the network cables, the phone line for the fax or modem, the serial cables to dumb terminals or old printers, etc.

Each of these lines must have a surge suppressor that shunts the voltage spike to ground. To make this work you must use three-prong plugs everywhere. Never use a two-prong to three-prong adapter: it removes the ground connection, the "safety valve" a surge suppressor needs to protect your equipment.

Hacks and viruses
You will need to protect your server from viruses, hackers and disgruntled employees. You'll need a server-level anti-virus system, an intrusion detection system that watches for and notifies you if someone tries to break in, and some kind of VPN so people outside your office who are allowed to use this server can't have their communication "snooped" by people watching their connection.

Patches
From the day you install it to the last day you use it, the operating system, each application you run on the server, and every add-on (like anti-virus software) will have patches released to close security holes and add new features.

Applying these patches is sometimes good and sometimes bad. You'll need an automated tool to tell you when important patches are available and a strategy for deciding whether to apply them.

For many mission-critical servers, you only apply security patches and save all new-feature-only patches for planned downtime weekends. It's counterproductive to apply a new-feature patch to a working server only to have that patch crash the server and halt production.
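That policy is simple enough to write down as a rule. The Python sketch below is only an illustration: the `Patch` record, its `kind` field and the `decide` function are hypothetical names, not part of any real patch-management tool, and it assumes your tool can tell you whether a patch is a security fix.

```python
from dataclasses import dataclass


@dataclass
class Patch:
    """Hypothetical patch record; a real tool would supply richer metadata."""
    name: str
    kind: str  # assumed values: "security" or "feature"


def decide(patch: Patch, in_planned_downtime: bool) -> str:
    """Return "apply" or "defer" for one patch, following the policy above."""
    if patch.kind == "security":
        # Security patches are the only ones applied outside planned downtime.
        return "apply"
    # Everything else waits for a planned downtime weekend.
    return "apply" if in_planned_downtime else "defer"


if __name__ == "__main__":
    pending = [Patch("os-security-fix", "security"), Patch("app-new-report", "feature")]
    for p in pending:
        print(p.name, decide(p, in_planned_downtime=False))
```

Anything that is not clearly marked as a security patch is deferred, which errs on the side of leaving a working server alone.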

Backups
You should have nightly backups, held offsite, and you should try to provide some kind of regular, frequent transaction backups during the working day. In the Progress, Oracle, and MS-SQL worlds, these are often called “roll forward logs”. When you institute roll forward logging and transfer those logs offsite every 15 minutes or so, they can be combined with your last full backup so that you lose very little data even under the worst disaster.
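To see how much difference the logs make, here is a small worked example in Python. It is a sketch under stated assumptions: the 1 AM backup time, the 4:40 PM disaster and the function names are made up for illustration, and it assumes each log reaches the offsite location the moment it is shipped.

```python
from datetime import datetime, timedelta


def last_log_shipped(start: datetime, disaster: datetime, interval: timedelta) -> datetime:
    """Time of the most recent log shipment at or before the disaster.

    Assumes logs ship every `interval`, starting at `start` (the full backup).
    """
    if disaster < start:
        return start
    shipments = (disaster - start) // interval  # whole intervals completed
    return start + shipments * interval


def data_lost(last_offsite_copy: datetime, disaster: datetime) -> timedelta:
    """Work done after the last offsite copy is what the disaster destroys."""
    return max(disaster - last_offsite_copy, timedelta(0))


if __name__ == "__main__":
    backup = datetime(2008, 5, 27, 1, 0)      # assumed nightly backup time
    disaster = datetime(2008, 5, 27, 16, 40)  # assumed time of the failure
    shipped = last_log_shipped(backup, disaster, timedelta(minutes=15))
    print("Nightly backup only:", data_lost(backup, disaster))
    print("With 15-minute logs:", data_lost(shipped, disaster))
```

With these assumed times, the nightly backup alone loses 15 hours and 40 minutes of work, while shipping logs every 15 minutes cuts the loss to 10 minutes.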

You should test your backup and roll forward procedures, and the validity of the backups themselves, twice a year.

A virtual image of the server, a bare-metal backup plus exact hardware, or the exact steps and media needed to perform a cold recovery should be held somewhere geographically separate from your primary server. You should create and test a DR (disaster recovery) plan once a year and “operate” that equipment to prove it works.

Physical server management
All enterprise-class x86 servers from “name brands” such as Dell, IBM and HP come with management software. Once installed and configured, this software watches the physical hardware and alerts someone via SMS text messaging or email that a problem is about to occur or has occurred.

For all mission-critical servers, you should use name-brand, enterprise-class hardware and install the server management software that comes with it. This may seem to cost more than the whitebox you can buy from the PC shop around the corner, but the whitebox won't help you when it's 3 AM on a holiday, the day before all your payroll clients expect checks, and your server has crashed with no more explanation than a blinking red light on the faceplate.

Remote control
All enterprise-class servers have an option for out-of-band server management. This is typically a piece of hardware or an add-in card that allows you to diagnose, reboot and “watch the console” even when the machine is not powered on or not fully booted yet. For servers that are not 100% off-the-shelf proven designs (hardware, OS and application) and are hidden away in a lights-out data center, these remote management cards can mean the difference between a 15-minute recovery and a 4-hour recovery.

Hardware warranties and support
You should take advantage of the hardware warranties and support plans available when you purchase these servers. The 24x7, 4-hour onsite hardware diagnosis and repair plan is a very cheap insurance policy, and no one can supply it as well as the vendor, or within the warranty terms.

Server lifespan
You should expect the lifespan of each x86 server you buy to be only as long as the original manufacturer’s warranty. If Dell only offers 5 years of protection, plan to replace that server with a new one, under a new warranty, before those 5 years run out.

Local monitoring
You should also install OS-level monitoring that proactively watches for problems and alerts you before they occur. We use Nagios for customers who buy monitoring from us and it warns us about low disk space, too much use of a CPU, running out of memory, network cards starting to fail, etc. Our Nagios management server, once it gets those “cries for help” or “warnings”, sends us SMS text messages and we ask you what you’d like to do. This usually happens early enough that maintenance to correct the problem can happen well in advance of any failure and with planned downtime.
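As an illustration of what one of those checks looks like, here is a minimal disk-space check in Python, written in the style of a Nagios plugin. Nagios plugins report status through their exit code (0 for OK, 1 for WARNING, 2 for CRITICAL). The 80% and 90% thresholds and the function names are assumptions for this sketch, not what we or Nagios ship by default.

```python
import shutil

# Exit codes that Nagios plugins use to report status.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(used_pct: float, warn_pct: float = 80.0, crit_pct: float = 90.0) -> int:
    """Map a disk usage percentage to a Nagios status code."""
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def disk_used_pct(path: str = "/") -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def main(path: str = "/") -> int:
    """Print a one-line status message and return the status code."""
    pct = disk_used_pct(path)
    status = classify(pct)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    print(f"DISK {label} - {pct:.1f}% used on {path}")
    return status
```

A wrapper script would finish with `sys.exit(main())` so the monitoring server sees the status as the exit code. The point of the WARNING level is the one made above: you hear about the disk at 80% full, well before it fills up and takes the server down.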

Remote monitoring
Lastly, you want some simple checking, from the outside world, that the server is still accessible. This is a service we can offer, and it uses various mechanisms to verify the server is still reachable. This detects problems such as the network going down, the firewall no longer allowing packets through, your software no longer running, etc.
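One of the simplest such mechanisms is to try opening a TCP connection to the port your application listens on. The Python sketch below shows the idea; the function name and the 5-second timeout are assumptions, and a real outside check would also retry and alert someone rather than just return a result.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False
```

Run it from a machine outside your network, for example `is_reachable("server.example.com", 443)`, where the host name and port are placeholders for your own server. A False result does not tell you which part failed (network, firewall or software), only that someone needs to look.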



So, there's a lot to know about "running" a critical server. Bigger companies have dedicated IT staff who are trained and experienced in these areas. You simply count on them to provide steady, reliable service of the applications and they run the servers.

When it's just YOU as the Chief Everything Officer, you may find all of the above too much to handle. But someone has to handle it, because your business relies, literally, on this server working properly.

You can call a managed service provider like Allegro Consultants and have them do all this for you, or you can follow the recommendations above and do most of it on your own.
