Split RAID: Proposal for archival RAID using incremental batch checksum

Mon Nov 24 08:19:19 EST 2014

On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal <anshuman.aggarwal at gmail.com> wrote:
>Sandeep,
> This isn't exactly RAID4 (only thing in common is a single parity
>disk but the data is not striped at all). I did bring it up on the
>linux-raid mailing list and have had a short conversation with Neil.
>He wasn't too excited about device mapper but didn't indicate why or
>why not.

If that was early in your proposal, he may simply not have understood it.

The delayed writes to the parity disk you described would have been tough for device mapper to manage.  It doesn't typically maintain its own longer-term buffers, so that might have given him concern.  The only reason you provided was reduced wear and tear for the parity drive.

Reduced wear and tear in this case is a red herring.  The kernel already buffers writes to the data disk, so there is no need to separately buffer parity writes.

>I would like to have this as a layer for each block device on top of
>the original block devices (intercepting write requests to the block
>devices and updating the parity disk). Is device mapper the right
>interface?

I think yes, but dm and md are actually separate.  I think of dm as a subset of md, but if you are really going to do this, you will need to learn the details better than I know them:

https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt

You will need to add code to both the dm and md kernel code.

I assume you know that both mdraid (mdadm) and lvm userspace tools are used to manage device mapper, so you would have to add userspace support to mdraid/lvm as well.

> What are the others? 

Well, btrfs, as an example, incorporates a lot of raid capability into the filesystem.  Thus btrfs is a monolithic driver that has consumed much of the dm/md layer.  I can't speak to why they are doing that, but I find it troubling.  Monolithic subsystems like that are something the Linux kernel has always tried to avoid.

> Also if I don't store the metadata on
>the block device itself (to allow the block device to be unaware of
>the RAID4 on top...how would the kernel be informed of which devices
>together form the Split RAID.

I don't understand the question.

I haven't thought through the process, but with mdraid/lvm you would identify the physical drives as under dm control.  (mdadm for md, pvcreate for dm). Then configure the split raid setup.

Have you gone through the process of creating a raid5 with mdadm?  If not, at least read a howto about it.

https://raid.wiki.kernel.org/index.php/RAID_setup

I assume you would have mdadm form your multi-disk split raid volume composed of all the physical disks, then use lvm commands to define the block range on the first drive as an lv (logical volume).  Same for the other data drives.

Then use mkfs to put a filesystem on each lv.
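
To illustrate what "define the block range on each drive as an lv" could look like, here is a rough Python sketch.  It assumes the split raid volume exposes the data disks back to back with no striping; that layout, the function name lv_block_ranges, and the disk sizes are all hypothetical, not anything mdadm or lvm does today.

```python
from typing import List, Tuple


def lv_block_ranges(disk_blocks: List[int]) -> List[Tuple[int, int]]:
    """Return (first_block, last_block) for each data disk.

    Assumption: the data disks appear one after another (unstriped)
    inside the combined volume, so each lv covers exactly one disk.
    """
    ranges = []
    start = 0
    for n in disk_blocks:
        if n <= 0:
            raise ValueError("each disk must have at least one block")
        ranges.append((start, start + n - 1))
        start += n
    return ranges


# Three data disks of different sizes (in blocks); values are made up.
ranges = lv_block_ranges([1000, 2000, 1500])
for i, (first, last) in enumerate(ranges):
    print(f"lv{i}: blocks {first}..{last}")
```

Each range would then be handed to lvm as one logical volume, so a filesystem on lv0 only ever touches the first physical drive.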

The filesystem has no knowledge there is a split raid below it.  It simply reads/writes to the logical volume; device mapper is layered below it and triggers the required i/o calls.

I.e., for a read, it is a straight passthrough.  For a write, the old data and old parity have to be read in, modified, and written out.  Device mapper does this now for raid 4/5/6, so most of the code is in place.
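
As a rough illustration of that read-modify-write step, here is a small Python sketch using plain XOR parity (as in raid 4/5).  It is not kernel code; the helper names xor_blocks and update_parity and the sample bytes are made up for the example.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Read-modify-write: new_parity = old_parity XOR old_data XOR new_data."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


# Three unstriped data disks, one block each, plus one parity block.
disks = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(xor_blocks(disks[0], disks[1]), disks[2])

# Overwrite the block on disk 1: only disk 1 and the parity disk are touched.
new_block = b"\x0f\xf0"
parity = update_parity(parity, disks[1], new_block)
disks[1] = new_block

# If disk 1 is lost, its block can be rebuilt from the survivors plus parity.
rebuilt = xor_blocks(xor_blocks(parity, disks[0]), disks[2])
```

The point is that a write only needs the old data block and the old parity block, not the contents of the other data disks, so those disks can stay idle.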

>Appreciate the help.
>
>Thanks,
>Anshuman

I just realized I replied to a top post.

Seriously, don't do that on kernel lists if you want to be taken seriously.  It immediately identifies you as unfamiliar with the kernel mailing list netiquette.

Greg