
Blog Entries in business continuity

Thursday, June 24th, 2010 - 11:37 am EDT

Tech Tip: Common Ways to Tell You Are Not Prepared to Recover from a Disaster

Posted by: Michelle Liro

Today's tip comes to us from author Eric Beehler via our friends at Realtime Publishers.

Disaster recovery is somewhat of a buzzword in the IT industry, and IT professionals have all been exposed to their share of great disaster recovery ideas from business managers. These ideas are often based on the industry buzz and seem to only make more work for you with little gain overall. This is usually because the idea is not backed up with a real plan. The actual implementation of disaster recovery is usually a big chore to undertake correctly, but in the end, it is well worth the trouble.

It's important to be ready to recover your data and systems when a disaster strikes, but it is rarely a top priority in the grand scheme of IT projects when crisis has yet to strike close to home. Unless your company has decided to make disaster recovery a high-level objective, it's usually the front-line administrator who will be saddled with the responsibility of implementing some sort of plan to save the day, and you will likely be shortchanged on the training and resources needed to get the job done.

There are many ways to deal with a disaster, from having a set of cold standby machines to employing a fully redundant hot data center. In reality, as the administrator, your job doesn't change much based on the recovery scenario; your systems have to be up and available to keep your business running. You likely have some kind of plan now, but if you haven't been through the real thing, you really don't know if your plan will hold water. For Windows administrators, several problems tend to expose themselves when it's time to exercise a disaster recovery plan or, worse yet, go through the real thing. Here are some common ways to tell that you are not ready for a disaster.

Plan for an Alternative Site
You are not ready for a disaster if you don't have a place to go, which requires planning for a full on-site disaster in which your site is down or inaccessible. There are several methods to address this issue if you don't have a solution today, from having an alternative site with servers waiting to be loaded up for operation to a warm site that is always ready and waiting to take traffic. These decisions are not usually made by you but by the CIO. All you can often do is consider the solution given to you and how that will impact your ability to recover. A cold site, for example, will allow you to have hardware and connectivity available, but you will need to account for operating systems (OSs), drivers, configuration differences, and data center differences. In a warm site, you have to ensure that changes to configurations and data remain synched across the two sites.

Plan for Downtime
You also have to consider whether the site solution will support the Recovery Time Objective (RTO) required by the applications and business. Simply put, the RTO is the amount of time your users will be without the functions supported by your server, which could be a Web site, a mailbox, or the ability to log on to the domain. You should have this time defined per application or function supported by your server. In a larger disaster recovery effort, this may be defined for you, but don't be surprised if the business people you support have no idea that your server supports the functionality they require. You may need to step in with your own knowledge of how your server functions in order to get this definition correct.

There are generally accepted categories for RTO that fall into tiers, as Figure 1 shows. Use these as a guideline, but feel free to create standards within your own organization to meet your needs. If you need to recover applications within 2, 4, and 8 hours, redefine the tiers so that they make sense to your business through an analysis of the business impact of downtime. Just be sure that you can apply the standards as broadly as possible across the organization.
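As a rough illustration of that tiering exercise, here is a minimal Python sketch. The tier names and the 2-, 4-, and 8-hour limits are assumptions borrowed from the example in the paragraph above, not the values from Figure 1, and `rto_tier` is a hypothetical helper, not part of any product.

```python
# Hypothetical tiers: (name, maximum hours to recover). These are
# assumptions; replace them with the values your own business impact
# analysis produces.
RTO_TIERS = [("Tier 1", 2), ("Tier 2", 4), ("Tier 3", 8)]


def rto_tier(required_hours):
    """Return the least demanding tier that still recovers in time.

    Returns None when even the fastest tier is too slow for the requirement.
    """
    chosen = None
    for name, limit_hours in RTO_TIERS:  # tiers are listed fastest first
        if limit_hours <= required_hours:
            chosen = name
    return chosen


if __name__ == "__main__":
    # Made-up applications and the downtime each can tolerate, in hours.
    for app, hours in [("mail", 5), ("web store", 2), ("file shares", 24)]:
        print(app, "->", rto_tier(hours))
```

An application that lands on None is a signal that either the requirement or the tier definitions need another look.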
 

Plan Your Tolerance to Data Loss
You are not ready for a disaster if you don't know your tolerance for data loss. Let's start with the basic foundation of the backup. Whether you use simple tape backups or an advanced nearline solution, you have to consider that most solutions are put in place to account for day-to-day operational needs. First, the exercise you went through with RTO must be done for the Recovery Point Objective (RPO), which is the amount of data that can be lost. You have to understand what the business can afford to lose; this value is not necessarily tied to an RTO tier. Take, for example, a point of sale system. If the system is down for 5 hours, the business may be able to recover by entering the orders taken while the system was down, but data loss of 5 hours may mean millions of dollars in lost sales.



The gut reaction for your RPO on some of your systems may be that no data loss is acceptable. In other cases, 24 hours of data loss may be acceptable. The goal is to understand what can be tolerated, not what is desired. Everyone will desire no data loss, but put a realistic perspective to the real value of the data. If you define Tier A RPO as no data loss, then you have to put systems in place that allow for that reliability. This means copying transactions as they happen to a backup site, which is an expensive solution that should be used only on your critical business applications, depending on your budget. If you have Tier B systems as defined in Figure 2, you will need some sort of solution that will be separate from your nightly backups, as you cannot count on having your last nightly tape backup at your recovery site.

Considering the Loss of a Backup
You are not ready for disaster if you rely on your daily backup for a recovery scenario. You may have in your head that you can rely on the last tape backup in the event of a disaster. Whether such is the case depends on a key question: can you get your restore process to work offsite? Don't be so quick to answer this one. If you take advantage of offsite storage either through a vendor or your own in-house process, it is an excellent step, but offsite storage doesn't necessarily guarantee you can restore at your disaster recovery site within the specified RPO and RTO.

Tape drive compatibility, backup software, delivery time, drivers, and OSs are all considerations that you must address prior to saying your solution is ready. This is especially true for a third-party backup site that will provide you with "like" hardware. That equipment will not be your equipment, and even if it is, expect aspects of the infrastructure to be different, such as IP address schemes, firmware (which can be a nightmare when working with SANs), and simple access to the hardware.

You also have the issue of archive requirements and the fact that you likely rely on these tapes for your day-to-day restores. If you perform restores for file recovery and other issues, you likely want to keep those tapes close by. If you ship them away for maximum protection, it's going to cost a pretty penny in order to request tapes from your offsite storage vendor.

You also have to consider how those tapes make it to the recovery site. If you make full backups only once a week and you only do offsite storage once a week, you might only get a restore from 2 weeks prior. Why? Even if you are lucky enough to get your tapes offsite a day or two after the full backup, and you get the shipment to your disaster recovery site 4 to 8 hours after it is requested, you can almost bet that Murphy's Law will strike and you will get a bad tape somewhere in the set. Then you have to move back in the chain, and with most full backups run weekly, you might be taking your system back 2 weeks or more if Murphy continues to strike. Now the RPO that you expected to meet with your existing backup plan is not being met.
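The arithmetic behind that "2 weeks" figure can be sketched in a few lines of Python. This is a simplified model, assuming tapes ship offsite a fixed number of days after every full backup and that the disaster strikes just before the newest set leaves the building; the function name and parameters are hypothetical.

```python
def worst_case_restore_age_days(full_interval_days, offsite_lag_days, bad_sets=0):
    """Worst-case age, in days, of the data you end up restoring.

    full_interval_days: days between full backups
    offsite_lag_days:   days between a full backup and its arrival offsite
    bad_sets:           number of newest offsite sets that fail to restore
    """
    # The newest set that is actually offsite can be this old in the worst case.
    newest_offsite = full_interval_days + offsite_lag_days
    # Every unreadable set pushes you back one more full-backup interval.
    return newest_offsite + bad_sets * full_interval_days


if __name__ == "__main__":
    print(worst_case_restore_age_days(7, 2))              # no bad tapes
    print(worst_case_restore_age_days(7, 2, bad_sets=1))  # one bad set
```

With weekly fulls and a 2-day lag, a single bad set already puts the restore point more than 2 weeks back, which is worth comparing against the RPO you promised.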

Even if you do recover your servers with no issues, how long will it take to recover them all? Consider the queuing on the tape drives, with multiple servers waiting for those tapes to be loaded. It could take quite a long time before you even get a chance to try a restore to your server depending on the technology present at the recovery site. What can you do? Well, time to restore will be reduced if you can restore large chunks at one time. Consider putting systems with like RPO and RTO requirements in the same backup set.

Better yet, host them on a LUN or set of LUNs on your SAN or other logical storage method in your situation so that a restore can be done all at once. You might even consider booting from the SAN, which might save you from having to restore the local disk of many servers. If you have a blade server solution, this may even be baked into your infrastructure.
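To get a feel for the tape drive queuing problem described above, the following Python sketch estimates when each restore would finish if jobs are handed to whichever drive frees up first. The server names, restore durations, and drive count are made up for illustration, and the model ignores tape loading and delivery time.

```python
import heapq


def restore_finish_times(jobs, drives):
    """Estimate when each restore finishes.

    jobs:   list of (server_name, restore_hours) in the order you will run them
    drives: number of tape drives available at the recovery site
    Returns a dict of server_name -> hour at which its restore completes.
    """
    free_at = [0.0] * drives  # hour at which each drive becomes free
    heapq.heapify(free_at)
    finish = {}
    for server, hours in jobs:
        start = heapq.heappop(free_at)  # next drive to become available
        end = start + hours
        finish[server] = end
        heapq.heappush(free_at, end)
    return finish


if __name__ == "__main__":
    queue = [("dc01", 1), ("sql01", 4), ("file01", 3), ("web01", 2)]
    for server, done in restore_finish_times(queue, drives=2).items():
        print(f"{server}: restored after {done:.0f} hours")
```

Running the numbers for your own server list and drive count shows quickly whether the last server in the queue can still meet its RTO.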

Using Disk-Based Backup
Let's also consider disk-based backup. This solution has become increasingly popular because of the low cost of hard disks and the ease of backup and restore. In addition, disks often take minutes to back up and restore what used to take hours. The software supported by these systems even has versioning, much more frequent backups, and nifty utilities that make life much easier on the administrator. This is usually all handled by complex backup management software such as Microsoft System Center Data Protection Manager. When using this kind of solution, consider employing these often-integrated features to support data replication of some sort, although vendors name these types of features differently.

You can even copy your live data to your recovery site using a SAN/NAS vendor's Failure Resistant Disk Solution (FRDS). You should, however, consider the fact that this kind of solution will be much more expensive than tapes because it will require duplicate equipment with data replication happening across a wide area network (WAN).

You should refer to your RTO and RPO tiers to determine whether certain servers and data sets could stand to be away from your disk replication and rely more on a tape solution. You should also consider your disaster site and understand whether it can support this kind of solution. You should treat your server restores as a form of triage. You need to know, based upon RTO and RPO, what you are going to recover first and what can wait.
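One simple way to express that triage is to sort the server list by RTO, then by RPO. The sketch below assumes you already have both values per server; the `Server` record and the sample names and numbers are hypothetical.

```python
from typing import NamedTuple


class Server(NamedTuple):
    name: str
    rto_hours: float  # how long the business can be without it
    rpo_hours: float  # how much data the business can afford to lose


def triage_order(servers):
    """Order servers for recovery: tightest RTO first, then tightest RPO."""
    return sorted(servers, key=lambda s: (s.rto_hours, s.rpo_hours))


if __name__ == "__main__":
    inventory = [
        Server("file01", 8, 24),
        Server("pos01", 2, 0),
        Server("mail01", 4, 4),
        Server("web01", 2, 1),
    ]
    for server in triage_order(inventory):
        print(server.name, server.rto_hours, server.rpo_hours)
```

A real ordering would also have to respect dependencies, such as bringing up domain controllers and DNS before the applications that need them.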

Considering Configuration
If you can't identify the full configuration of your servers, you are not ready for a disaster. Realistically, can you keep track of 300 shares on a terabyte SAN served by a load-balanced Windows cluster server? Do you know which shares go to which directories on which LUNs? You have to document configurations. This is true whether you have a basic bare metal restore plan or a full redundant data center. The luxuries of a production environment won't be at your disposal. A normal production environment allows you the opportunity to compare configurations when something goes wrong and work through a problem. A disaster affords you no such luxury.

No matter how familiar you are with your systems, you need to have everything documented that can be changed. For any applications, you should have a guide for their installation in your environment. You should have the servers documented with everything from IP addresses and patches to database connections and configuration files. If you run IIS for Web applications, you should have that configuration documented as well. Some sort of context diagram is often useful to determine how your server interacts with other systems.

Utilize configuration management systems, such as SMS, to do some of the heavy lifting for you. Create reports and keep them up to date in an alternative location, either a paper copy offsite or an electronic one. Configuration problems seem to be a killer when recovering because changes sometimes get applied without strict control. What seems like a small change can kill you in a disaster when it hasn't been documented.

Documenting the infrastructure goes beyond your own servers, but is just as important when it's time to troubleshoot. You can bring your file server back and you can bring your application servers back, but if you don't have proper DNS or connectivity, no one will be connecting to those systems you've recovered. If you have dependencies on other systems, you need to identify them. Know what names should be in DNS, what IP addresses and subnet you are on, and what systems you interact with, such as database servers, other back-end services, the DMZ, or Internet access. When you tell a database administrator that your application is returning SQL errors, you should know what database server, database, port, connection type, and authentication type you are using. You should also know the user name and password being used, if there is one. Does the server break down into pieces? Does it have multiple applications or functions? Document those functions separately.
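If the expected DNS names and addresses are written down, checking them at the recovery site can be partly automated. The following Python sketch compares a documented name-to-address table against what the resolver returns. The host names and addresses are placeholders, and `check_dns` is a hypothetical helper; only `socket.gethostbyname` is a standard library call.

```python
import socket


def check_dns(expected, resolve=socket.gethostbyname):
    """Compare documented name -> IPv4 address pairs with live DNS.

    Returns a dict of name -> address actually returned (None when the
    name does not resolve) for every entry that differs from the document.
    """
    problems = {}
    for name, documented_ip in expected.items():
        try:
            actual_ip = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            actual_ip = None
        if actual_ip != documented_ip:
            problems[name] = actual_ip
    return problems


if __name__ == "__main__":
    # Placeholder entries; use the names and addresses from your own documentation.
    documented = {"sql01.example.internal": "10.0.0.20"}
    print(check_dns(documented))
```

The `resolve` parameter is there so the check can be pointed at a different lookup function, for example during a tabletop test without network access.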

You can't think of a server as a single system if your customers don't see it as a single function. Remember that restoring an infrastructure means putting many pieces back together, and you should not expect those pieces to work as reliably as they do in a production environment. In fact, when you face an issue in production, it usually has a single root cause, but a disaster recovery will usually experience several major issues at the same time. You need to know where you stand in the ecosystem of your environment to understand how to identify and help fix those issues.

Identifying Single Points of Failure
If you have a single point of failure, you are not ready for a disaster. A single point of failure can ruin your nicely laid out plans. Although it is not a requirement for disaster recovery, the 'N + 1' approach often used in disaster recovery planning means that many components are backed up by a single spare component. You can still run into problems using N + 1, especially at a cold site where you have not been exercising your disaster recovery equipment to ensure its health. You might consider having additional servers of a similar capacity available above the minimum number required to recover, just in case you experience a failure at your recovery site.
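The N + 1 idea reduces to simple arithmetic: work out the minimum number of servers that can carry the load, then add the spare. A minimal Python sketch follows, with hypothetical function and parameter names and load expressed in whatever unit you size servers by.

```python
import math


def servers_to_stage(peak_load, capacity_per_server, spares=1):
    """Servers to have at the recovery site: N to carry the load, plus spares."""
    needed = math.ceil(peak_load / capacity_per_server)  # this is N
    return needed + spares


if __name__ == "__main__":
    print(servers_to_stage(10, 4))            # N + 1
    print(servers_to_stage(10, 4, spares=2))  # extra margin for a cold site
```

Raising `spares` above 1 is one way to model the extra margin suggested above for a cold site whose equipment has not been exercised.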

An optimal solution will have redundancy built in to your recovery site the way you have it outfitted at your production site. If you have a failover cluster in one location, you would do the same in the recovery site, even though you could technically get by with a single server, assuming that server functions as expected. You should also consider the interdependencies of your infrastructure, such as the network, when you think about this issue. Single switches, routers, domain controllers, and sources of power can also be points of failure.

Single points of failure don't stop at the system level. You might have that one guy or gal who knows everything about your environment. When you're at that person's desk and something goes wrong with the system or a specific application, he or she always has the answer. This is a good person to have, but when it comes down to it, you can't rely on a single person. When a disaster strikes, the go-to person may not be available during the recovery phase, yourself included. When everyone looks around and throws up their hands because such and such is down, what do you do? You wish you could go back in time and document that ingrained knowledge. This is also true for day-to-day operations, but especially necessary when everything is going wrong because of a disaster. The person who knows it all is not what you need; you need full documentation of the knowledge that person possesses. Your go-to should really be your documentation.

Integrating Disaster Recovery into Daily Life
If you don't integrate disaster recovery into your daily operations, you are not ready for a disaster. Organizations that plan for disaster recovery as a single project with a start and an end will fail. Don't let the hard work go to waste. When you put these plans in motion, get all that documentation done, have recovery solutions in place, and continue to update your documentation and test your systems. If you don't test your disaster recovery process regularly, how do you know it will work? If you don't update your documentation day-to-day when changes are made, your documentation is outdated and may even be detrimental to your recovery efforts. Don't let apathy or a disconnected change management process get you in the end. Not only does integration help your readiness, it also reduces the dedicated time needed to get ready for disaster recovery. Find a way to make what you use in disaster recovery a part of daily life.

Eric Beehler has been working in the IT industry since the mid-90s and has been playing with computer technology well before that. From Help desk technician to solutions provider, he has been involved at many layers of enterprise solutions from the desktop to the network to the server and the SAN. He currently has certifications from CompTIA (A+, N+, Server+), and Microsoft (MCITP: Enterprise Support Technician and Consumer Support Technician, MCTS: Windows Vista Configuration, MCDBA SQL Server 2000, MCSE+I Windows NT 4.0, MCSE Windows 2000, and MCSE Windows 2003). He also holds a Master’s degree in Business Administration from the University of Colorado at Colorado Springs. His experience includes more than nine years with Hewlett-Packard’s Managed Services division, working with Fortune 500 companies to deliver network and server solutions and, most recently, IT experience in the insurance industry working on highly available solutions and disaster recovery. He has co-authored books, including MCITP: Microsoft Windows Vista Desktop Support Enterprise Study Guide (Sybex/Wiley Publishing), authored several white papers, and co-hosts the "CS Techcast" podcast aimed at IT professionals. He provides consulting and training through Consortio Services, LLC.

For additional information about Disaster Recovery and High Availability topics, be sure to check out Marathon's Resource Center, which has an extensive library of white papers, webinars, and eBooks available for download.

 

Disaster Recovery  Availability  Business Continuity  Disaster Tolerance  High Availability



Monday, May 24th, 2010 - 11:58 am EDT

The Changing Dynamics of Data Protection

Posted by: Michelle Liro

Frank Ohlhorst, former Executive Technical Editor for eWeek and award-winning IT expert, was our expert guest speaker this week for the webinar, “Cut Your DR Costs and Get Better Data Protection.” During his presentation, Frank reviewed why he believes that now is the time to rethink traditional approaches to disaster recovery. He explained why the total cost of ownership for disaster recovery solutions is on the rise, and why changing data protection dynamics are making it more economical to focus your time and budget on the prevention of downtime and data loss, rather than recovery.

Below is the summary of the audience questions from the Q&A portion of the webinar.

Q: You talked about how HA can give you a geographic advantage. What do you mean by that?
Frank Ohlhorst: High availability systems are designed to work with multiple servers and there’s no reason why you can’t have those servers located hundreds or thousands of miles apart. You get a geographic advantage because your data centers are in multiple places and regional areas, so if a weather-related or other event occurs, let’s say a blizzard up north with a power outage, your data center down south can pick up the slack without kicking users off the system. The same can be said about a data center located in an area with hurricanes or other natural disasters. The geographic separation gives you added protection.
When high availability is paired with load balancing, it helps to locate the data resources closer to where the users are requesting them. Let’s say you have users in Utah, it’s better performance-wise to have them talk to the data center in Nevada rather than Virginia. It helps on that level also. HA solutions also have the tools for monitoring what is going on with your users and network, to help you plan out how you should assign users to specific data centers for the most efficiency.

Q: I understand how high availability can handle unplanned downtime, but what about planned downtime? Can it help there as well?
Frank Ohlhorst: Yes, the idea there is that because you have multiple active systems to meet the users’ needs, you can take one of those systems down for maintenance and have the users serviced by the active machines while you make the updates and improvements. Then when you are done, just resynchronize with the other systems, move the users over to the updated system, and update the rest of the servers.
Another great benefit of this is for testing upgrades and changes. You can take one system offline and test your upgrades to see if they work properly before you return that system to production.

Q: If I have an HA solution in place, is back-up still necessary?
Frank Ohlhorst: 99% of the time the answer to that question is yes. It depends on what your corporate needs are. There are certain situations where HA might not deal with your catastrophe. Those are usually software-damaging events, like a virus infection, that wind up getting replicated across the system. Of course, that should really be part of your security planning to prevent events like that from even happening. With today’s security technologies, it’s pretty easy to prevent that. But if you did ever have one of those events, you do need something to roll back to, and that’s where the back-up comes into play. Ideally though, you should be preventing that type of event, because you also have the potential to lose active data if that happens. When it comes to compliance or auditing, you have to restore data relevant to that time period to meet the needs of e-discovery, compliance, accounting audits and other similar requirements. So you can’t just say, “I have HA in place, so I don’t need to back-up.”

Q: What about data de-duplication technologies, don’t they help solve this problem of managing large volumes of data?
Frank Ohlhorst: They reduce the data footprint for sure, but what we’re talking about here is availability of the data. They can certainly reduce the size of your data footprint, and you can use de-dup to speed up backups. At the end of the day though, if the system or application is not accessible to the user, then it’s not available and you haven’t met your objectives. It’s a simple matter of business logic that data de-duplication can improve performance and reduce the size of the footprint, but it doesn’t solve the problem of providing access to users during catastrophic events.

Q: Do you see continuous availability and high availability as the same, and if so, how do you differentiate between the two and the costs?
Frank Ohlhorst: There was a time when those technologies were very, very different. That was way back when we relied on expensive hardware-based solutions or appliances that provided continuous availability. High availability at that time was thought of as a method to switch from one server to another using a manual process in the case of an emergency.

High Availability technology has evolved significantly since then. Now, the two are really one and the same from a planning and software point of view. Today’s HA solutions eliminate that step of manual switchover. What you see with the vendors today is automatic HA technology that really delivers continuous availability. And the cost gap today is pretty much zero, since the technology for continuous availability and high availability has evolved to be almost one and the same.

Q: With an SRDF/S-type solution, how can we get around the fact that being geographically more separated to mitigate regional disruptions can mean slower primary system response times due to the need to remain synchronous?
Frank Ohlhorst: Let’s look at this first from the ideology of what we’re trying to do, which is business continuity. So, if you encounter a situation when you lose connectivity to a system and it’s still available at another location, then you’ve met the goal there of providing continuity. And you’re in much better shape than you would be at that point if you had a disaster recovery solution instead of a business continuity solution.

The question you have to ask yourself at that point in time is: Is reduced performance better than no performance at all? For most businesses, the answer is yes. For others, if the performance lag is significant enough it can impact business. In those cases, you’ll have to work out a way to develop geographically dispersed sites that can provide enough performance to the user sets that need access to the data. You also need to make sure that your connectivity has enough bandwidth to support your BC/HA solutions, which means the ability to replicate the data in real time across the wire. You might have to invest in larger pipes for better connectivity to support that. But again, that depends on your particular business and your needs. There is no one correct answer to this question, but the good news is that there are several solutions today that can help you solve this problem and meet the levels of availability that you need for your business.

Disaster Recovery  Availability  Business Continuity  Continuous Availability  Data Replication  Disaster Tolerance  Fault Tolerance  High Availability  Interview  Webcast  Webinar



Monday, June 15th, 2009 - 8:52 am EDT

Business Resilience in Virtual Environments

Posted by: Brian Mullins

One of the promises of server virtualization is improved high availability and more efficient disaster recovery. But according to the Enterprise Strategy Group (ESG), achieving these goals with virtualization requires modifying existing processes. It also requires looking beyond the virtualization vendor's HA and DR products to achieve SLAs for all workloads.

Mark Bowker and Lauren Whitehouse, analysts from ESG, recently put together an exclusive report on this topic called “Business Resilience in Virtual Environments” that explores:

• The advantages and challenges of server virtualization for improved business resiliency

• The limitations of the virtualization platform vendor's offerings for HA and DR

• Misconceptions about common business resiliency terms such as high availability, fault tolerance and disaster recovery

• How to use a combination of third-party solutions and new processes such as V2V, P2V, V2P and V2C to meet your application SLAs and DR objectives

This report is available exclusively from Marathon. You can download a copy here.
 

Virtualization  Business Continuity



Wednesday, December 10th, 2008 - 7:11 am EST

Webinar: Assessing the Impact of Planned and Unplanned Downtime in the Contact Center

Posted by: Brian Mullins

Business continuity planning ranks among the top trends in a recent Dimension Data report on contact center technology. Yet many call centers aren’t equipped to deal with unexpected downtime from a system failure. These centers would lose productivity and sacrifice service levels when mission-critical tools like real-time reporting systems go dark.

Real-time reporting provider Inova Solutions, along with new partner Marathon Technologies, will host a webinar to discuss best practices for business continuity and high availability in the contact center. Presenter Scott Thompson from Marathon Technologies will discuss how to protect your real-time reporting investment from costly downtime and data loss.

Participants can register for the webinar here. Details are below:

What: Webinar: “Assessing the Impact of Planned and Unplanned Downtime in the Contact Center”
When: Wednesday, December 10, 2008, 2:00 pm EST

via Inova Solutions website.

Availability  Business Continuity  Downtime  High Availability  Marathon  Partners  Webinar



Wednesday, October 15th, 2008 - 7:45 am EDT

How Midsize Companies Can Get Practical Business Continuity and Disaster Recovery Using Server Virtualization

Posted by: Brian Mullins

On October 21 at 10:00 a.m. EST, our CTO Jerry Melnick will be a featured presenter at the 2008 NorthEast Disaster Recovery Information X-Change (NEDRIX). Jerry’s presentation, Better Business Continuity and Disaster Recovery through Virtualization, will help attendees learn how and why server virtualization done right can:

• Make disaster recovery planning and execution much easier
• Simplify the notoriously difficult process of high availability maintenance
• Deliver high availability protection tailored for each application

Are any of you currently using virtualization for business continuity or disaster recovery? If so, what have your experiences been like thus far?

This year’s conference will take place from October 20-22 at the Hyatt Goat Island in Newport, RI. For more information about the event and how you can register, please visit the NEDRIX website.

Business Continuity  Disaster Recovery  Events  High Availability  Virtualization