Getting folks to work together can be a grueling exercise in itself, as the different people who administer, develop and use a specific server may have trouble agreeing on an appropriate downtime period. This is especially true for enterprise applications that require 24/7 uptime. One option that may come up is a late-night restart (yes, someone has to do it). In most cases, the best time to restart a server is when there will be the least activity on it; this could be lunchtime, right after normal working hours or midnight. Either way, it’s got to be done, but be sure you’re ready to respond if something goes wrong. Another option is to schedule your downtime during a larger-scale planned outage. Just be aware of the system dependencies that your application may rely on, like network connectivity or authentication. If those services are also down during the outage, you can’t verify that your machine is back in a good state, and a window that looked opportune can lead to further havoc.
Now, for proper scheduling to happen, you need a clear and effective Service Level Agreement (SLA) with your clients. Without one in place, no rules are set and you, as the Sys Admin, won’t have any ground to stand on when it comes to working around others’ schedules. Hours of operation need to be defined with those who depend on your system, so that you can easily identify downtime windows for working on a machine.
One effective way to shorten or eliminate downtime during a patch cycle is to configure fail-over partners. Generally, this just means building two machines that run the same app, keeping one server as the primary box and the second as a backup. The primary stays available to the user community while the second server sits in hot fail-over mode, ready to take over if the first should go down. When it comes to patching, the Sys Admin can patch the backup box, confirm that the application runs correctly on it, and then switch the application from the primary server to the secondary server. This can be done using built-in clustering utilities or a manual DNS change. Either way, this helps prevent any long-term downtime, so users can continue their work with minimal or no interruption.
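The patch order described above can be sketched in a few lines of Python. This is only an illustration of the sequence, not a real clustering tool: the function name rolling_patch and the three callables (patch, is_healthy, switch_to) are hypothetical placeholders for whatever patch command, application check and clustering or DNS switch your environment actually uses.

```python
def rolling_patch(primary, backup, patch, is_healthy, switch_to):
    """Patch the backup first, then move users onto it if it checks out.

    patch, is_healthy and switch_to are hypothetical callables supplied
    by the caller; each receives a server name.
    Returns the name of the server that is active afterwards.
    """
    patch(backup)
    if not is_healthy(backup):
        # The patched box failed its check: leave users on the
        # untouched primary and investigate the backup offline.
        return primary
    # Clustering utility or manual DNS change would happen here.
    switch_to(backup)
    return backup
```

If the application check fails, nothing is switched and users stay on the unpatched primary, which is the whole point of patching the backup box first.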
Start Up Scripts
Patches are great, but if you’re not careful and don’t bother to check your machine before you reboot, you may find that your start-up scripts have been rearranged. This is especially likely to happen to third-party and custom applications. A change in the start-up order can hang the machine during boot if the moved item depends on the network daemon being up. In some cases, an application is moved to a position before the network start-up script, and it hangs because there is no networking process for it to start with.
To avoid problems, after applying the last patch, check your start-up scripts in /etc/init.d/rcx.d or /etc/rcx.d (depending on your flavor of Linux; x is the runlevel) and verify that your scripts haven’t been renamed and thereby moved earlier in the start-up process. This will save you the trouble of having to reboot the machine into single-user mode or boot from a rescue disk just to rename the start-up files.
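One way to make that check less error-prone is to record the script names before patching and compare them afterwards. The Python sketch below assumes the common SysV naming convention, where a link such as S10network means "start, position 10, service network" and K links are kill scripts. The helper names and the service name myapp in the usage note are made up for illustration, and your distribution may name its network script differently.

```python
import os
import re

# Assumed SysV-style link names: S or K, two digits, then the service name.
LINK_RE = re.compile(r"^([SK])(\d{2})(.+)$")


def start_order(names):
    """Map service name -> start position for the S* links in names."""
    order = {}
    for name in names:
        m = LINK_RE.match(name)
        if m and m.group(1) == "S":
            order[m.group(3)] = int(m.group(2))
    return order


def moved_services(before, after):
    """Return {service: (old, new)} for services whose start position changed."""
    old, new = start_order(before), start_order(after)
    return {s: (old[s], new[s]) for s in old if s in new and old[s] != new[s]}


def starts_before(names, dependency="network"):
    """List services that start earlier than the given dependency."""
    order = start_order(names)
    if dependency not in order:
        return []
    return sorted(s for s, n in order.items() if n < order[dependency])


def snapshot(rc_dir):
    """Read the link names from one rc directory."""
    return sorted(os.listdir(rc_dir))
```

Call snapshot on the rc directory for your runlevel before patching and again after the last patch, then pass both lists to moved_services. With a hypothetical myapp link that went from S85myapp to S05myapp, starts_before on the second list would report myapp as starting ahead of the network, which is exactly the situation to fix before rebooting.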
Be sure to check your startup scripts after patching your machines. Reorganized RC directories can keep your machine from starting up correctly, especially for applications that require network connectivity.