Tuning Software RAID (mdadm) under Linux


Had an issue with our iSCSI server showing a lot of wait states, so I began researching how to tune the underlying software RAID set. I found a bunch of information on this, but especially enticing was a script written by alfonso78 in 2007. You can see the post at http://ubuntuforums.org/showthread.php?p=11646960. It is an older script, written in BASH, and had some limitations I wanted to work around. To say my BASH scripting is limited is an understatement, so I rewrote it in Perl and made some modifications. The main one is the ability to pass the RAID set (md0, md1, mdwhatever) on the command line, along with its members, so the script doesn't have to guess. The script must be run as root, and the target mount directory must have 30 GB of free disk space. The target mount directory defaults to /mnt, but you can change the value of $testMount around line 73 of the script.

tunemdadm.pl md [members]

An example would be (on an HP ProLiant DL380 G5):

./tunemdadm.pl md1 cciss/c0d0 cciss/c0d1 cciss/c0d2
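
Before kicking off a run, it is worth checking that the box actually meets the script's assumptions. A minimal sketch, using /mnt and md0 as example values:

whoami              # the script must be run as root
df -h /mnt          # the test mount point needs at least 30 GB free
cat /proc/mdstat    # confirm the array name and its members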

Warning

I have totally locked up a machine (kernel panic) while running this: an HP DL380 G6 with 8x146 GB 15k HDDs, all exported as single-drive RAID-0 volumes (the only way to get JBOD out of that controller). Do not do this on any production machine. This should be part of the initial setup.

How to use the script

First, run the script, remembering that the output goes to STDOUT, so you must either screen scrape it when it is done or pipe the output to a file. I prefer to nohup it and run it in the background so I can just come back later and look at it. If I'm at a console, I generally tail -f nohup.out so I can get an idea of when it will be done.

nohup ./tunemdadm.pl md0 &
tail -f nohup.out

When the run is through, I generally copy the output to a workstation that has a spreadsheet program on it. I then open the file and grab the table it created (around line 525 of the output). The table looks similar to this (only a few rows are shown; there are 36 data rows plus a header).

Run	disk max_sectors_kb	disks readahead	md readahead	md stripe_cache_size	read	write
8192:4096:8192:4096	8192	4096	8192	4096	134.67	139.33
128:128:256:128	128	128	256	128	147.33	45.30
64:256:64:256	64	256	64	256	133.00	62.70
1024:512:1024:512	1024	512	1024	512	125.33	102.03

This table is tab-delimited, so I can just copy and paste it into a spreadsheet. I have a sample spreadsheet attached (podkayne.ods), set up so you can paste the table in starting at row 2 (don't copy the header if you do this) and it will automatically calculate a bunch of things for you. Then sort by column J (Average). The "best" combination is found at the bottom or the top, depending on whether your sort was descending or ascending.

This may not be the best for you. If you will be doing nothing but reads, find the largest value in column H (Read Max) that doesn't kill column I (Write Max). Or whatever fits your workload. However, you will almost always find some combination that is better than the default (row 24, run 512:256:4096:256, in the example sheet). In my case, since I figure writes are more important than reads, but only barely, I took the highest average, i.e. the best balance between read and write speed.
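
If you'd rather skip the spreadsheet, you can pull the table out of nohup.out and rank the combinations from the shell. A rough sketch, assuming you save the header plus the 36 data rows to a file called results.txt (the filename and line numbers are made up for this example; read is column 6 and write is column 7 in the tab-delimited output):

grep -n max_sectors_kb nohup.out           # find the header line
sed -n '525,561p' nohup.out > results.txt  # grab the header plus the 36 data rows
awk -F'\t' 'NR>1 {printf "%s\t%.2f\n", $1, ($6+$7)/2}' results.txt | sort -t$'\t' -k2 -n | tail -5

The last command prints the five combinations with the best read/write average; the final line is the winner.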

Near the bottom of the output file, you'll see its "recommendations" for highest read speed and highest write speed. They look like this:

For maximum write, add the following lines to your /etc/rc.local
echo 16384 > /sys/block/md0/md/stripe_cache_size
blockdev --setra 16384 /dev/sda /dev/sdd /dev/sdc /dev/sdb
blockdev --setra 16384 /dev/md0
echo 16384 > /sys/block/sda/queue/max_sectors_kb
echo 16384 > /sys/block/sdd/queue/max_sectors_kb
echo 16384 > /sys/block/sdc/queue/max_sectors_kb
echo 16384 > /sys/block/sdb/queue/max_sectors_kb

Modify it as you like, based on your spreadsheet, then insert those lines in /etc/rc.local (this is on Debian Wheezy; I'm not sure what you'd do on something running the abortion named systemd).

You can run /etc/rc.local to set your drives now, and on your next reboot the values will be set automatically.
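
After running it, you can confirm the kernel actually picked up the values. For the md0/sda example above:

cat /sys/block/md0/md/stripe_cache_size
blockdev --getra /dev/md0
blockdev --getra /dev/sda
cat /sys/block/sda/queue/max_sectors_kb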

Notes

There are many other things you can do to tune software RAID; changing the chunk (block) size is the primary one I've read about, and that has to be decided when the array is created. But after you have the array built, you can run this script to fine-tune a few things in the disk subsystem as well.
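
For example, the chunk size is normally set at array creation time. A sketch only; the device names, RAID level, and chunk size here are placeholders, and creating an array destroys the data on its members:

mdadm --create /dev/md0 --level=6 --raid-devices=4 --chunk=256 /dev/sda /dev/sdb /dev/sdc /dev/sdd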

The script is rather limited. To really tune your drives, to find the absolute best combination, you would have to try every combination possible. Four parameters, each capable of taking nine values, is 9^4, or 6,561 combinations, if I remember my combinatorics correctly. Each combination requires writing three 30 GB files and then reading them back, so at the pace of the 36-combination run below, an exhaustive search would take somewhere around 1,300 to 2,200 hours, or two to three months of nonstop benchmarking. That is not going to happen.

Instead, we only test 36 combinations. That still takes 7-12 hours, and on one really messed-up machine it took 2.5 days. So this is not perfection, but it is "good enough" for me. On one machine, by letting the server search those 36 combinations, I was able to speed up reads by 4% while simultaneously speeding up writes by 70%.

Here are two systems I was able to work on, just so you have some comparisons:

Machine            RAID Level  Num Drives  Run Time      Write Speedup  Read Speedup
HP DL-380 G6       6           5           14 hr 53 min  68%            -1%
Supermicro X7DCU   6           4           7 hr 00 min   70%            4%


You can see that on the first one I decided to accept a 1% loss in maximum read speed in exchange for a 68% gain in write speed.

The HP is kind of weird. It has a hardware RAID controller, but I really do not like being locked into those, so I had to export each drive as a single-drive RAID-0 (the controller's closest thing to JBOD), then take the resulting drives and build the software RAID on top of them. They are nice machines, but the drive controllers truly cause issues. The Supermicro is a much older machine, but it does not have a built-in RAID controller, which makes it perfect for software RAID.

Attached files: tunemdadm.pl, podkayne.txt, podkayne.ods

Tags: mdadm, raid, software raid, tune, tuning
Last update: 2015-10-18 06:15
Author: Rod
Revision: 1.3

