Split RAID: Proposal for archival RAID using incremental batch checksum

Anshuman Aggarwal anshuman.aggarwal at gmail.com
Mon Nov 24 12:28:08 EST 2014


On 24 November 2014 at 18:49, Greg Freemyer <greg.freemyer at gmail.com> wrote:
>
>
> On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal <anshuman.aggarwal at gmail.com> wrote:
>>Sandeep,
>> This isn't exactly RAID4 (the only thing in common is a single parity
>>disk; the data is not striped at all). I did bring it up on the
>>linux-raid mailing list and have had a short conversation with Neil.
>>He wasn't too excited about device mapper but didn't indicate why or
>>why not.
>
> If it was early in your proposal it may simply be he didn't understand it.
>
> The delayed writes to the parity disk you described would have been tough for device mapper to manage.  It doesn't typically maintain its own longer term buffers, so that would have been something that might have given him concern.  The only reason you provided was reduced wear and tear for the parity drive.
>
> Reduced wear and tear in this case is a red herring.  The kernel already buffers writes to the data disk, so no need to separately buffer parity writes.

Fair enough; the delayed buffering of parity writes is an independent
issue that can easily be deferred.

>
>>I would like to have this as a layer for each block device on top of
>>the original block devices (intercepting write requests to the block
>>devices and updating the parity disk). Is device mapper the right
>>interface?
>
> I think yes, but dm and md are actually separate.  I think of dm as a subset of md, but if you are going to really do this you will need to learn the details better than I know them:
>
> https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt
>
> You will need to add code to both the dm and md kernel code.
>
> I assume you know that both mdraid (mdadm) and lvm userspace tools are used to manage device mapper, so you would have to add user space support to mdraid/lvm as well.
>
>> What are the others?
>
> Well btrfs as an example incorporates a lot of raid capability into the filesystem.  Thus btrfs is a monolithic driver that has consumed much of the dm/md layer.  I can't speak to why they are doing that, but I find it troubling.  Having monolithic aspects to the kernel has always been something the Linux kernel avoided.
>
>> Also, if I don't store the metadata on
>>the block device itself (to allow the block device to be unaware of
>>the RAID4 on top)...how would the kernel be informed of which devices
>>together form the Split RAID?
>
> I don't understand the question.

mdadm typically stores a metadata superblock on the block device which
identifies it as part of the RAID and usually prevents it from being
recognized directly by filesystem code. I was wondering whether Split
RAID block devices could be left unaware of the RAID scheme on top of
them and remain fully mountable and usable without the RAID drivers
(of course invalidating the parity if any of them are written to
directly). That would allow a parity disk to be added to existing
block devices without having to set up a superblock on the underlying
devices.

Hope that is clear now?
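
To make the idea concrete, here is a rough user-space sketch (my own
illustration, not kernel code) of the parity relationship Split RAID
would rely on, assuming plain RAID4-style XOR parity across whole,
equally sized devices: each data disk carries its own independent
filesystem, the parity disk holds the byte-wise XOR of all data disks,
and any single missing disk can be rebuilt from the rest. Device paths
and the chunk size are made up for illustration:

    # Sketch only: rebuild one failed data device from the surviving
    # data devices plus the parity device (missing = parity XOR rest).
    # Assumes all member devices are the same size.
    from functools import reduce

    CHUNK = 1 << 20  # process 1 MiB at a time

    def xor_chunks(chunks):
        # Byte-wise XOR of equally sized byte strings.
        return bytes(reduce(lambda a, b: a ^ b, col)
                     for col in zip(*chunks))

    def rebuild(survivors, parity, out):
        srcs = [open(p, "rb") for p in survivors + [parity]]
        try:
            with open(out, "wb") as dst:
                while True:
                    chunks = [f.read(CHUNK) for f in srcs]
                    if not chunks[0]:
                        break
                    dst.write(xor_chunks(chunks))
        finally:
            for f in srcs:
                f.close()

    # e.g. rebuild(["/dev/sdb", "/dev/sdd"], "/dev/sde", "/dev/sdc")

The point is that the surviving data disks stay directly mountable the
whole time; only the reconstruction step needs to know the group
membership.
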
>
> I haven't thought through the process, but with mdraid/lvm you would identify the physical drives as under dm control.  (mdadm for md, pvcreate for dm). Then configure the split raid setup.
>
> Have you gone through the process of creating a RAID5 with mdadm?  If not, at least read a howto about it.
>
> https://raid.wiki.kernel.org/index.php/RAID_setup

Actually, I have maintained 6-disk RAID5 and RAID6 arrays with mdadm
for more than a few years and handled multiple failures, so I am
reasonably familiar with md reconstruction too. It is the
performance-oriented but disk-intensive nature of mdadm that I would
like to move away from for a home media server.

>
> I assume you would have mdadm form your multi-disk split raid volume composed of all the physical disks, then use lvm commands to define the block range on the first drive as an lv (logical volume).  Same for the other data drives.
>
> Then use mkfs to put a filesystem on each lv.

Maybe it can also be done with md RAID by creating a partitionable
array where each partition corresponds to one underlying block device,
without any striping.
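
Just to illustrate the difference, here is a toy sketch (my own, not
md's code) of how a logical block number maps to a member device with
plain concatenation versus RAID0-style striping:

    # Toy sketch: concatenated (non-striped) mapping vs. striping,
    # in logical block numbers.

    def linear_map(block, member_sizes):
        # member_sizes: number of blocks on each member, in order.
        for i, size in enumerate(member_sizes):
            if block < size:
                return i, block   # (member index, offset on member)
            block -= size
        raise ValueError("block beyond end of array")

    def striped_map(block, members, chunk):
        c, within = divmod(block, chunk)
        return c % members, (c // members) * chunk + within

    # linear_map(5, [4, 4, 4]) -> (1, 1): neighbouring blocks stay on
    # one device; striped_map(5, 3, 2) -> (2, 1): they spread out.

With the concatenated layout, I/O for one filesystem only ever touches
one data disk (plus the parity disk on writes), which is the behaviour
Split RAID is after.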

>
> The filesystem has no knowledge that there is a split raid below it.  It simply reads/writes to the overall volume; device mapper is layered below it and issues the required I/O calls.
>
> I.e. for a read, it is a straight passthrough.  For a write, the old data and old parity have to be read in, modified, and written out.  Device mapper does this now for RAID 4/5/6, so most of the code is in place.

Exactly. Reads are passthrough, while writes trigger the corresponding
parity write. My only remaining concern is that the md superblock
requires the block devices to be initialized with mdadm. That may be
acceptable, I suppose, but an ideal solution would use the existing
block devices untouched, put a passthrough block device on top of
each, and manage the parity updates on the parity block device. The
information about which block devices comprise the array could be
stored in a config file, since this scheme does not need a superblock
as badly as a regular RAID setup does.
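
For the write path itself, the usual RAID4/5 read-modify-write
identity is all that is needed: new_parity = old_parity XOR old_data
XOR new_data. A hypothetical user-space sketch, with the member list
coming from a config file rather than a superblock (device paths and
the config format are invented for illustration):

    # Sketch of the Split RAID I/O path: reads pass straight through,
    # writes additionally update parity at the same offset.
    SPLIT_RAID = {
        "data": ["/dev/sdb", "/dev/sdc", "/dev/sdd"],  # plain, mountable
        "parity": "/dev/sde",
    }

    def _xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def read_block(dev, offset, length):
        # Pure passthrough: parity is never consulted on reads.
        with open(dev, "rb") as d:
            d.seek(offset)
            return d.read(length)

    def write_block(dev, offset, new_data):
        # new_parity = old_parity XOR old_data XOR new_data
        with open(dev, "r+b") as d, \
             open(SPLIT_RAID["parity"], "r+b") as p:
            d.seek(offset); old_data = d.read(len(new_data))
            p.seek(offset); old_parity = p.read(len(new_data))
            d.seek(offset); d.write(new_data)
            p.seek(offset)
            p.write(_xor(old_parity, _xor(old_data, new_data)))

Writing to a member directly, outside this path, is exactly what would
invalidate the parity at those offsets, which matches the trade-off
described above.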

>
>>Appreciate the help.
>>
>>Thanks,
>>Anshuman
>
> I just realized I replied to a top post.
>
> Seriously, don't do that on kernel lists if you want to be taken seriously.  It immediately identifies you as unfamiliar with the kernel mailing list netiquette.
>
> Greg
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Sorry, I'm still getting used to the kernel mailing lists, and most
mail clients default to putting the reply at the top.  Thanks for
replying and for the reminder.

Anshuman


