Are RAID disks worth paying for?

keym

9:41 pm on Dec 6, 2006 (gmt 0)

As a small startup development firm, we host the clients whose projects we develop. We don't host sites we haven't developed ourselves, as we are not in the hosting business.

We just started doing client work about a year ago, and we don't have enough clients to justify a large dedicated server costing $300/mo. or more.

We are paying $170/mo. for a small dedicated server: a 2.8GHz Pentium 4 with a single 80GB hard drive and 1500GB bandwidth/mo. It comes with cPanel and the normal hosting stuff.

I know we can get this same thing for under $100/mo., but we are paying for (and getting) decent support.

My question is: are we being unwise not going with a RAID solution? The price is around $100/mo. more for a 3.6GHz CPU with 80GB RAID disks. We take backups offsite using FTP.

I'm guessing the main difference is, in both cases you need to take backups, just that without RAID you have a greater chance of needing to restore the backup... is this correct? Or are there other issues (performance?) of RAID?

Do people usually start out hosting without raid, then move to raid boxes as their client base grows?

MatthewHSE

9:46 pm on Dec 6, 2006 (gmt 0)

If the sites change often or use databases, then RAID can give a lot of peace of mind by providing data redundancy. On the other hand, if the sites are static and don't change often, then the normal FTP backups should be sufficient, provided of course you have adequate redundancy locally at the office.

As far as I understand, RAID on webservers is normally used for data redundancy, not speed, but a striped RAID array (no redundancy) can give better performance.

RonPK

11:17 pm on Dec 6, 2006 (gmt 0)

Me thinks that RAID is not about data redundancy but about keeping your server and sites running when a disk crashes. Without RAID, you need to consider two things:

1. how long will it take your provider to install a new disk?

2. how long will it take you to get your sites running on that new server? Virtual host settings, firewall settings, database settings, email settings, user settings, file permissions, and so on...

If you can afford a few days downtime, you can do without RAID. Otherwise, I'd certainly know what to do...

plumsauce

11:48 pm on Dec 6, 2006 (gmt 0)

The way raid can help with respect to data protection is that the redundant disk(s) will keep you alive while the dead drive is replaced. Of course, this means hotswap drives are required. Don't look for this capability in whitebox servers.

Non-redundant stripe sets (raid 0) can still be beaten on performance by redundant stripe sets (raid 10). Write performance is the same because the write operations are buffered by the raid controller in battery-backed RAM. Read operations can be spread across TWICE as many drives. It's the reads where you gain performance. So, stripe and mirror is the way to go for the ultimate in performance and data safety.

I know of one company, at the price point mentioned, where this is the default configuration for their dedicated servers. I don't know what anyone else does. Make sure the *hardware raid* isn't just a Promise card attached to a pair of IDE drives. Clones of those cards are cheap. Really cheap.

The raid controllers I am talking about are in the thousands new, cheaper on ebay.

A word about backups. Always use a staging server as the file source for the production server. Never edit live on the production server, no matter how trivial the edit. That way, the staging server becomes a live backup at all times. Of course, your local development server can be your staging server. Serves the same purpose. Always make sure that a production file exists somewhere else before taking it live.

jtara

1:23 am on Dec 7, 2006 (gmt 0)

RAID solves a problem that hardly exists any more - failed disk drives.

You are MUCH more likely to have a problem related to human error. RAID doesn't protect you from human error.

bcc1234

1:41 am on Dec 7, 2006 (gmt 0)

RAID solves a problem that hardly exists any more - failed disk drives.

The last server I got had a drive failure on the 5th day of operation. Good thing it was a part of a raid1. Otherwise, I would be in a lot of trouble because it happened just as I moved a lot of my stuff onto that box and the DNS has finally propagated. Not to mention the amount of time I spent setting the whole thing up would have been wasted.

With a raid, I just got an e-mail notification from the box, waited till night, contacted the data center staff, they replaced the drive. I was back up and running in 20 minutes. And it took another couple of hours to sync up the new drive in the background.

So that's 20 minutes of downtime at night, vs. I don't know how much downtime (which means lost search engine ranking among other things), and restoration of the whole system.

[edited by: bcc1234 at 1:42 am (utc) on Dec. 7, 2006]

jomaxx

1:42 am on Dec 7, 2006 (gmt 0)

I had a RAID drive fail just last week. The failover was seamless, it didn't require any downtime (fixing it didn't even seem to require a reboot), and AFAIK my ISP didn't charge me a penny. It certainly saved me many hours of work, and although in this case I wouldn't have lost anything irretrievably, I can imagine that a community-built website would find it a godsend.

The only downside I can see is, and I don't know that much about how they work, but it seems to me they must logically have a failure twice as often as a conventional hard drive system.

bcc1234

1:44 am on Dec 7, 2006 (gmt 0)

You are MUCH more likely to have a problem related to human error. RAID doesn't protect you from human error.

You got a point there. A lot of people think that raid is the same thing as a backup, and get screwed when something non-drive-hardware related goes wrong. Either human error, or ram failure for example.

plumsauce

5:49 am on Dec 7, 2006 (gmt 0)

RAID solves a problem that hardly exists any more - failed disk drives.

Maybe. But when it happens, it will be in the worst way possible.

As for human error, may I suggest mirrored knives? Very sharp ones.

Of course, not specifying raid counts in my book as human error to begin with.

webdoctor

10:35 am on Dec 7, 2006 (gmt 0)

(...)keep you alive while the dead drive is replaced. Of course, this means hotswap drives are required. Don't look for this capability in whitebox servers.

IMHO you don't have to use hotswap drives to benefit from RAID.

In fact, you don't even have to use hardware RAID to benefit from RAID :-)

The "poor man's RAID" story:

I was called in to deal with a drive failure on a server, very much a white box job, no-name components, two IDE drives, running linux, everything mirrored using md devices. One hard drive dead, in fact so dead the server couldn't get past the BIOS screen. Since the md setup was done correctly (i.e. grub was on both drives, the swap was also set up on an md device) it was as simple as pulling the dead drive and booting the server. 24 hours later we shut the server down, added a new drive to restore the mirror, booted from a linux live CD to set up partitions and copy grub across, then booted from the working drive and rebuilt the mirror in the background. Then it was just a matter of sitting watching /proc/mdstat until it reached 100%, with the server working all the time.

If the customer had had a spare hard drive handy in the datacentre the total downtime would have been around 5 minutes to swap the drive out.

I've never seen anyone manage a full restore in 5 minutes....
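The "watching /proc/mdstat" step in the story above can be scripted. What follows is only a minimal sketch: the sample text, device names and numbers are made up for illustration, and the function name rebuild_progress is hypothetical. It assumes the usual "recovery = NN.N%" line that the Linux md driver prints while a mirror is rebuilding.

```python
import re

# Illustrative /proc/mdstat text for a two-disk mirror that is rebuilding.
# Device names and block counts are invented for this example.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[2] sda1[0]
      78148096 blocks [2/1] [U_]
      [===>.................]  recovery = 17.5% (13675904/78148096) finish=42.1min speed=25480K/sec

unused devices: <none>
"""


def rebuild_progress(mdstat_text):
    """Return {array_name: percent} for arrays that are resyncing or recovering."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        # A line such as "md0 : active raid1 ..." starts the block for one array.
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The progress line only appears while a rebuild or resync is running.
        status = re.search(r"(?:recovery|resync)\s*=\s*([\d.]+)%", line)
        if status and current:
            progress[current] = float(status.group(1))
    return progress


print(rebuild_progress(SAMPLE_MDSTAT))  # {'md0': 17.5}
```

On a real server the text would come from reading /proc/mdstat instead of the sample string. An empty result means no array is currently rebuilding.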

The more expensive RAID story:

I've got a backup/test server in my office - Supermicro 1U chassis, four removable SATA drives, 3ware RAID card. Oh, and 900GB of data lives on this system.

A hard drive failed. I shut the server down, pulled the dead drive out, put a new one in, booted back up, everything just worked. This was the nicest hard drive failure I have ever experienced.

plumsauce

10:59 am on Dec 7, 2006 (gmt 0)

webdoctor,

i can see where you're coming from.

the "don't look for this in whitebox..."

was to protect myself from being hung for the statement
"(...)keep you alive while the dead drive is replaced. "

so, raid is nice.

raid + hotswap is nicer :)

this box has hardware raid + hotswap

dead drives are fixed by just sliding out the sled and sliding in a new replacement.

the replacement is autosensed and the system starts
rebuilding the new drive.

if i turn it off during the rebuild, it will continue
the rebuild where it left off the next time the machine
is turned back on.

all very convenient.

been running the same install of NT4/SP6a since 1999
without ever having to reinstall.

i keep changing machines, and drives die,
but it's the same install :)

oh, after re-reading your reply, i would add that
while software raid is better than nothing, where
hardware raid with BATTERY BACKED CACHE is really
great is in the performance gain you get because
the driver releases the OS immediately on a write
as soon as the cache is written *and* the safety
in power outages. the drive can be in mid-write
during a power outage and your data will survive.

on this system, if power goes out during a write,
on the next powerup i get a warning message that
unwritten data exists in the cache and will be
written out automatically.

on these systems, the power cord test is not a
problem.

Romeo

1:41 pm on Dec 7, 2006 (gmt 0)

Keym said

I'm guessing the main difference is, in both cases you need to take backups, just that without RAID you have a greater chance of needing to restore the backup... is this correct? Or are there other issues (performance?) of RAID?
Do people usually start out hosting without raid, then move to raid boxes as their client base grows?

While Jtara says that

RAID solves a problem that hardly exists any more - failed disk drives.

I have, unfortunately, seen that otherwise:
my own statistics show about one crashed disk per 18 months per server.

Restoring content backups is time-consuming and annoying. And to re-configure and re-test all server setups and security stuff is even more time-consuming and annoying, and -- after hard work to get all working again, you get pressure from customers about the long outage. Due to its annoyance level (work, time, customer dissatisfaction), this should be avoided whenever possible.

So the answer (well, not THE answer, but just my opinion based on own experience, YMMV) to the original poster is: yes, while you still need a backup against a total loss of the box and/or its data -- either by human error or a fire in the datacenter -- a RAID-1 disk system (2 disks being simply mirrored) is the most convenient method to overcome the loss of a single disk (which is the highest hardware-imposed risk for your data).
And even if a cheap RAID controller would not support hot-swapping disks: it is a piece of cake to shut the server down, change the disk, and boot again within 3 minutes -- compared to the restore from backups and reconfig scenario above.

Moving from a non-RAID box to a RAID-box is like starting from scratch (the RAID disks array needs to be initialized, so you can't bring in your old disk full of data).

So the math is simple:

-- comparing costs of an outage and complete recovery (your work/time and the customers') due to a disk crash versus costs of the additional RAID installation.

-- comparing the lower costs of starting without RAID versus the costs of moving the whole stuff to another RAID system later.
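The first comparison above can be put into numbers. This is a rough sketch only: the $100/mo. RAID premium and the one-crash-per-18-months figure come from this thread, while the $2,500 outage cost is a purely hypothetical value, and the function names are made up for the example. It ignores the cost of the replacement drive and of migrating to a RAID box later.

```python
def break_even_outage_cost(raid_premium_per_month, months_between_crashes):
    """Outage cost at which RAID pays for itself: the premium paid between two crashes."""
    return raid_premium_per_month * months_between_crashes


def raid_is_cheaper(outage_cost, raid_premium_per_month, months_between_crashes):
    """True if one outage costs more than the RAID premium paid between crashes."""
    return outage_cost > break_even_outage_cost(raid_premium_per_month, months_between_crashes)


premium = 100   # $/month extra for the RAID box, as quoted by keym
interval = 18   # months between disk crashes per server, Romeo's statistic

print(break_even_outage_cost(premium, interval))  # 1800

# Hypothetical: an outage plus full restore costs $2,500 in work and lost goodwill.
print(raid_is_cheaper(2500, premium, interval))  # True
```

With those inputs, RAID pays for itself once a single disk crash would cost more than about $1,800 in recovery work and customer dissatisfaction.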

Kind regards,
R.

lammert

3:51 pm on Dec 7, 2006 (gmt 0)

Moving from a non-RAID box to a RAID-box is like starting from scratch (the RAID disks array needs to be initialized, so you can't bring in your old disk full of data).

Not entirely true. In the past I have moved many Windows NT based servers from non-RAID to RAID with Norton Ghost by simply ghosting all partitions to an external SCSI disc, adding a RAID card as primary controller and then restoring the partitions from the external disc. The only thing I had to take care of was to load the RAID driver as a harddisk driver in the operating system before the move. After moving all partitions with Norton Ghost the servers detected during the restart the RAID controller and automatically loaded the pre-installed drivers for it.

keym

6:14 pm on Dec 7, 2006 (gmt 0)

The consensus seems to be RAID may not be worth it for someone just starting out, or if you are hosting only a small number of non-changing sites that can afford a day or so outage, and there will be no loss of critical data.

But as the number of sites which would need to be restored grows, the impact of a disk crash and the associated recovery effort grows, then the cost of RAID becomes much easier to justify.

With the number of sites we currently have, we are approaching the crossover point soon, me thinks.

Thanks everyone for all your contributions.

webdoctor

6:27 pm on Dec 7, 2006 (gmt 0)

The consensus seems to be RAID may not be worth it for someone just starting out, or if you are hosting only a small number of non-changing sites that can afford a day or so outage, and there will be no loss of critical data.

I'd paraphrase that as

"RAID may not be worth it if you don't really care about your data"

in which case I'd agree 100%.

If you're on linux, the use of md mirroring over two hard drives is so easy (and cost-effective) that IMHO it's simply reckless not to do it on a production system.

Remember the old quote:

If you think safety is expensive try having an accident

Only ONCE have I recommended to a client that they pay a data recovery service to get data back from a "dead" hard drive. Every client since then gets a quick explanation: data recovery fees = $$$$$, setting the kit up correctly in the first place = $$$, which do you want?

jtara

6:46 pm on Dec 7, 2006 (gmt 0)

I think before you consider RAID, you should consider your disk drives.

Unfortunately, too many servers are built today using drives that were never intended for 24/7 commercial service.

MOST SATA drives are unsuitable. There's really only one model currently manufactured that is suitable for use in a web server.

MOST SCSI drives made today are suitable. This isn't because SCSI technology is inherently more reliable - it's because of the market they are made for - high-end servers. (Having been abandoned for cheaper ATA and SATA drives for desktops, workstations, and low-end servers years ago.) Most of these drives are built like tanks.

Yet, most web servers today use SATA drives. What do you think the odds are that most of those servers are using that one model of SATA drive that was designed for servers?

Do you even know what kind of drives are in your server? I fear in most cases, what servers with RAID wind-up with is an array of cheap, unreliable drives.

I think in most cases, more important than RAID is to make sure you use a quality drive that is up to the task, and make sure you have a solid backup and recovery plan. IF you are running the kind of operation that cannot tolerate downtime, THEN consider RAID as well.

webdoctor

8:31 pm on Dec 7, 2006 (gmt 0)

Yet, most web servers today use SATA drives.

As an aside: I think SCSI vs SATA is a dangerously religious topic - just like AMD vs Intel. Given that there are many ways to build reliable servers... would you prefer to put critical data on one SCSI drive or for the same price have it on three SATA drives in a RAID5 array?

jtara

9:51 pm on Dec 7, 2006 (gmt 0)

would you prefer to put critical data on one SCSI drive or for the same price have it on three SATA drives in a RAID5 array?

One SCSI drive, vs. three cheap SATA drives in a RAID array.

You won't get three good SATA drives for the price of one SCSI drive.

While the highest quality SATA drives are cheaper than the highest-quality SCSI drives, this isn't what is being used in most servers.

I think SATA is fine for most servers. It's just that there is not yet a wide variety of high-quality SATA drives made for continuous service, and most servers are being built with inappropriate drives. (IMO, there's only one suitable drive currently on the market, being the WD Raptor ADFDs).

A 150GB SATA drive intended for 24/7 server use goes for about $250. A commodity 150GB SATA drive goes for $50-$100. Which kind do you think most servers are being built with?

Does it make sense to put in three $50 drives that are certain to fail, or one $250 drive that is unlikely to? On the surface, the former would seem to save you $100, but that's only if you use software RAID, and don't include the cost of dealing with a failure.

There are, of course, performance differences as well, which I am waving-away here.

plumsauce

10:42 pm on Dec 7, 2006 (gmt 0)

Yet, most web servers today use SATA drives

take a look around the next time you have the chance to be in a high end data centre. the whitebox boys might have them, but it still begs the question of whether they are in a raid configuration.

anyways the *type* of drive is not important, raid is.

a 1U dual cpu recent vintage compaq dl360 with raid can be had for $700 or less on ebay. if you buy a rackful, they can be had for $400/unit. perfectly adequate for most dedicated servers. not hotswap, but they do have built in raid and usually come with a couple of 36 gig scsi drives.

oh, and redundant hotswap power supplies and fans are pretty standard on these units.

in my experience those are the vulnerable parts. never had anything else fail on me. not memory, system boards, cpu's, peripheral controllers, just the parts that have spinning parts.

anyways, the newest generation of high end servers from compaq have hotswap memory and hot standby memory.

really high end sun midrange servers even had hotswap system boards and hot standby system boards.

got 600 volts handy :)

jtara

11:04 pm on Dec 7, 2006 (gmt 0)

anyways the *type* of drive is not important, raid is.

The type of drive isn't important when it comes to quality. It might be when it comes to performance, but that's not the issue on this thread.

I'll concede that SATAII with NCQ should come close to SCSI performance.

(For the greatest versatility today, go with a SAS controller, which will work with EITHER SATA or Serial-Attached SCSI drives mixed in the same cage...)

In any case, the reliability of the drive IS important. It makes little sense to use RAID as a means of making-up for using poor reliability drives.

Commodity drives are NOT built to withstand 24/7 kerplunking.

a 1U dual cpu recent vintage compaq dl360 with raid can be had for $700 or less on ebay. if you buy a rackful, they can be had for $400/unit. perfectly adequate for most dedicated servers. not hotswap, but they do have built in raid and usually come with a couple of 36 gig scsi drives.

Sounds good to me. Run these in RAID or non-RAID configurations, and you'll be in good shape.

I'm just cautioning against putting lipstick on a pig.

lammert

1:11 am on Dec 8, 2006 (gmt 0)

I'm just cautioning against putting lipstick on a pig.

But that is what RAID was designed for: using inexpensive disks with a high failure rate in such a way that the total cluster becomes reliable.

Using server grade SCSI disks may reduce the failure rate, but it won't reduce the costs to get the data recovered. Using consumer grade disks in a RAID configuration is a better solution than a single SCSI disk IMHO, if you have a limited budget.

jtara

3:13 am on Dec 8, 2006 (gmt 0)

RAID makes perfect sense when your needs go beyond that which can be satisfied by a single drive.

As you increase the number of drives, the probability of failure goes up. If you need, say, 4 drives, now you have a failure rate 4x that of one drive. For the cost of only a single additional drive, you can have an "effective" failure rate better than that of only one drive.

For the typical web server, which uses a single drive, I think you are better served with a single high-quality drive.

bcc1234

3:14 am on Dec 8, 2006 (gmt 0)

you need, say, 4 drives, now you have a failure rate 4x that of one drive.

Only for raid0, which is not meant for reliability, but for performance.

BillyS

3:19 am on Dec 8, 2006 (gmt 0)

With a two drive system, you can do RAID 0 or RAID 1.

Think of RAID 0 as a system where you double your platters and heads. On the bright side, you get a big boost in I/O performance and you doubled your drive space. On the down side, you just doubled your risk of drive failure.

Think of RAID 1 as a system where you've got two hard drives constantly recording what you're doing. When reading from disks, your performance improves because the controller can read from both drives at the same time. When you write to the disks, you take a performance hit because you have to write the same information twice - once to each disk. You don't gain any space with RAID 1 because the second drive mirrors the first.

If you lose a disk with RAID 0 you're toast. And you can get burnt faster because either one of the two drives can take you out.

If you lose a disk with RAID 1 you're still in business because the second drive has the same information. If the server supports it (most do), you can hot swap the bad drive and plug a new one in. You take a performance hit while the mirror rebuilds itself but once complete you're sailing again.
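The two-drive comparison above can be sketched numerically. This is a simplified model, not a measurement: the 10% per-drive failure probability is a hypothetical figure, failures are assumed independent, and the RAID 1 case pessimistically assumes nobody replaces the first failed drive before the second one dies.

```python
def raid0_loss_probability(p, drives=2):
    """Striping loses data if ANY drive in the set fails."""
    return 1 - (1 - p) ** drives


def raid1_loss_probability(p, drives=2):
    """Mirroring loses data only if ALL drives fail, with no replacement in between."""
    return p ** drives


p = 0.10  # hypothetical chance that one drive fails during the period of interest

print(round(raid0_loss_probability(p), 4))  # 0.19
print(round(raid1_loss_probability(p), 4))  # 0.01
```

Under these assumptions the stripe set nearly doubles the risk of a single drive (19% vs. 10%), while the mirror cuts it to about 1%. Swapping the dead drive promptly makes the real RAID 1 figure lower still.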

The question of RAID really comes down to risk. If a drive goes bad is it worth $100 / month to keep everything running or can you afford to wait a couple of hours (check with support on this) until a new drive is in place and restored from back up?

[edited by: BillyS at 3:23 am (utc) on Dec. 8, 2006]

jtara

3:45 am on Dec 8, 2006 (gmt 0)

you need, say, 4 drives, now you have a failure rate 4x that of one drive.
Only for raid0, which is not meant for reliability, but for performance.

Not true.

4 drives have 4x the failure rate of 1 drive. Period.

The RAID configuration, or lack thereof, is irrelevant.

If you have 4 drives, you will have a failure, on average, 4 times as often as you will with one drive.

If you have RAID, and it works the way it is supposed to, you won't have any data loss or downtime. But you will still have the drive failures.

4 times as many.

The failures still come at a cost - the cost of the replacement drive, plus the cost of the labor to replace it.

whoisgregg

4:00 am on Dec 8, 2006 (gmt 0)

4 drives have 4x the failure rate of 1 drive. Period.

Correct. However, the odds of all four failing at the same time are ridiculously lower than the odds of one drive failing.

Added: We have a few 5 drive RAIDs on our in-house servers and we run a 2 drive RAID on our web server. I sleep better knowing that the chance of me losing the web server due to hard drive failure and needing to rebuild it from scratch is effectively nil. But, we pay quite a bit to have that feature and, if our sites weren't critical to our business, I wouldn't worry so much about ensuring that type of redundancy.
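Both points in this exchange can hold at once: the number of drive replacements grows with the drive count, while the chance of every drive dying together shrinks very quickly. A small sketch, assuming a hypothetical 5% annual failure rate per drive (not a figure from this thread) and fully independent failures; the function names are made up for the example.

```python
def expected_failures_per_year(drives, annual_failure_rate):
    """Average number of drive replacements per year; grows linearly with drive count."""
    return drives * annual_failure_rate


def all_drives_fail_probability(drives, annual_failure_rate):
    """Chance that every drive fails within the same year, assuming independent failures."""
    return annual_failure_rate ** drives


rate = 0.05  # hypothetical 5% annual failure rate per drive

for n in (1, 4):
    replacements = expected_failures_per_year(n, rate)
    total_loss = all_drives_fail_probability(n, rate)
    print(n, replacements, f"{total_loss:.2e}")
```

With four drives the expected replacements go from 0.05 to 0.2 per year, but the chance of all four failing in the same year is roughly 6 in a million. Independence is an optimistic assumption, since drives from the same batch, or hit by the same power event, can fail together.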

bcc1234

4:09 am on Dec 8, 2006 (gmt 0)

But you will still have the drive failures.

Well, of course. That's kind of obvious.

The idea is what happens to the system and its ability to function, not the individual physical drives.

plumsauce

4:14 am on Dec 8, 2006 (gmt 0)

jtara,

failure rate, let's call it by its real name, MTBF.

purely an engineering estimate of the likely lifespan of a product.

now, if you have 4 times the number of drives, your MTBF is MTBF/4, or thereabouts.

raid haters can keep on singing their song,

for the rest, i will point this out:

with raid you *may* have a higher *predicted* rate of a *single* drive failure.

BUT NO *SINGLE* DRIVE FAILURE WILL BE FATAL.

(note: does not apply to pure striping, which should not even be called raid, because there is nothing *redundant* about it.)

so, my raid might have a drive failure.

my uptime is still better than the guy who stays shutdown until he can find his backups, buy a new drive, wait for shipping, change the drive out, restore, reboot.

in my systems, not only will the mirrored drive stay operational, the *online* hot spare will immediately be activated by the raid controller and start rebuilding the mirror, giving me time to fix the failed drive.

these drives have been in service so long that the standard fix procedure is to pull out the failed drive and wipe off the contacts with a t-shirt to clean off the corrosion. pop it back in and away she goes. happens about once a year to one of fourteen drives in the array.

downtime? zero.

peace of mind? priceless

plumsauce

4:17 am on Dec 8, 2006 (gmt 0)

okay, last word.

anyone been able to convince a bank to run without raid?

would you do business with them?

thought so.

lammert

10:26 am on Dec 8, 2006 (gmt 0)

4 drives have 4x the failure rate of 1 drive. Period.

Math for MTBF, failure rate and system reliability is a little bit more complicated than this. You have to take into account the amount of time you want to use the system. Let's say you have a disk with an MTBF of 100,000 hours. The probability that this disk will not fail in a timespan T is given by the formula:

R(T) = exp(-T/MTBF)

If you plan to use your system for a period of 3 years this becomes:

R(26,280) = exp(-26,280/100,000) = 77%
i.e. 23% chance the disk will crash

Let's now replace this one drive with two others with the same failure rate each. The failure rate of the pair doubles, and the MTBF--which is by definition the reciprocal of the failure rate--becomes 50,000 hours. For the pair, the probability that they will survive 3 years without a crash is:

R(26,280) = exp(-26,280/50,000) = 59%
i.e. 41% chance at least one disk will crash

So the failure rate doubles because of the two disks, but the chance a crash occurs in the estimated period of operation doesn't double: 41% vs. 23%.

Now replace the single harddrive with a high quality one with an MTBF of 500,000 hours.

R(26,280) = exp(-26,280/500,000) = 95%
Still 5% chance that the disk won't survive the first three years of operation.

The highest MTBF of disks currently available that I know of is 1,400,000 hours. For a three year operation period this gives us:

R(26,280) = exp(-26,280/1,400,000) = 98%
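The figures above can be reproduced with a few lines of Python. This is a sketch of the same formula, R(T) = exp(-T/MTBF), under the same constant-failure-rate assumption. The function name and the drives parameter are illustrative; n identical drives are treated as one unit with MTBF/n, as in the two-drive example.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours * 365 days


def survival_probability(hours, mtbf_hours, drives=1):
    """R(T) = exp(-T / MTBF); n identical drives behave like one unit with MTBF / n."""
    return math.exp(-hours * drives / mtbf_hours)


T = 3 * HOURS_PER_YEAR  # 26,280 hours, the three-year period used above

cases = [
    ("one drive, MTBF 100,000 h", 100_000, 1),
    ("two drives, MTBF 100,000 h each", 100_000, 2),
    ("one drive, MTBF 500,000 h", 500_000, 1),
    ("one drive, MTBF 1,400,000 h", 1_400_000, 1),
]

for label, mtbf, n in cases:
    r = survival_probability(T, mtbf, n)
    print(f"{label}: {r:.0%} no failure, {1 - r:.0%} at least one failure")
```

This prints 77%/23%, 59%/41%, 95%/5% and 98%/2%, matching the numbers worked out by hand.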

If you go for basic hosting with an IDE or SATA based disksystem, chances that the system will fail are fairly high and RAID will pay itself back rapidly by reducing costs for downtime and recovery.

For a high end SCSI disk, the chance of a failure is still between 2% and 5% for a three year period. Add to this that many SCSI disk based servers have more than one disk on board, and the chance that at least one disk will fail in that three year period is still significant.

My conclusion is that if uptime of the webserver is important for you and downtime causes considerable costs or problems, RAID is the way to go, as disks will fail eventually, even high quality SCSI disks.