Split RAID: Proposal for archival RAID using incremental batch checksum

Greg Freemyer greg.freemyer at gmail.com
Mon Nov 24 23:56:04 EST 2014



On November 24, 2014 12:28:08 PM EST, Anshuman Aggarwal <anshuman.aggarwal at gmail.com> wrote:
>On 24 November 2014 at 18:49, Greg Freemyer <greg.freemyer at gmail.com>
>wrote:
>>
>>
>> On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal
><anshuman.aggarwal at gmail.com> wrote:
>>>Sandeep,
>>> This isn't exactly RAID4 (the only thing in common is the single
>>>parity disk; the data is not striped at all). I did bring it up on
>>>the linux-raid mailing list and have had a short conversation with
>>>Neil. He wasn't too excited about device mapper but didn't indicate
>>>why or why not.
>>
>> If it was early in your proposal, it may simply be that he didn't
>> understand it.
>>
>> The delayed writes to the parity disk you described would have been
>> tough for device mapper to manage.  It doesn't typically maintain its
>> own longer term buffers, so that would have been something that might
>> have given him concern.  The only reason you provided was reduced wear
>> and tear for the parity drive.
>>
>> Reduced wear and tear in this case is a red herring.  The kernel
>> already buffers writes to the data disk, so no need to separately
>> buffer parity writes.
>
>Fair enough, the delay in buffering for the parity writes is an
>independent issue which can be deferred easily.
>
>>
>>>I would like to have this as a layer for each block device on top of
>>>the original block devices (intercepting write requests to the block
>>>devices and updating the parity disk). Is device mapper the right
>>>interface?
>>
>> I think yes, but dm and md are actually separate.  I think of dm as a
>> subset of md, but if you are going to really do this you will need to
>> learn the details better than I know them:
>>
>> https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt
>>
>> You will need to add code to both the dm and md kernel code.
>>
>> I assume you know that both mdraid (mdadm) and lvm userspace tools
>> are used to manage device mapper, so you would have to add user space
>> support to mdraid/lvm as well.
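
As a minimal sketch of how a passthrough layer can sit on top of an
existing disk (device and target names here are just examples), a
linear dm target can be set up entirely from user space with dmsetup:

    # expose /dev/sdb unchanged through device mapper
    SIZE=$(blockdev --getsz /dev/sdb)   # device size in 512-byte sectors
    echo "0 $SIZE linear /dev/sdb 0" | dmsetup create splitraid-d0

A split raid target would presumably look similar, except that writes
would additionally trigger the parity update on the parity disk.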
>>
>>> What are the others?
>>
>> Well btrfs as an example incorporates a lot of raid capability into
>> the filesystem.  Thus btrfs is a monolithic driver that has consumed
>> much of the dm/md layer.  I can't speak to why they are doing that, but
>> I find it troubling.  Having monolithic aspects to the kernel has
>> always been something the Linux kernel avoided.
>>
>>> Also, if I don't store the metadata on
>>>the block device itself (to allow the block device to be unaware of
>>>the RAID4 on top), how would the kernel be informed of which devices
>>>together form the Split RAID?
>>
>> I don't understand the question.
>
>mdadm typically has a metadata superblock stored on the block device
>which identifies the block device as part of the RAID and typically
>prevents it from being directly recognized by filesystem code. I was
>wondering if Split RAID block devices can be left unaware of the RAID
>scheme on top and be fully mountable and usable without the raid
>drivers (of course invalidating the parity if any of them are written
>to). This allows a parity disk to be added to existing block devices
>without having to set up a superblock on the underlying devices.
>
>Hope that is clear now?

Thank you, I knew about the superblock, but didn't realize that was what you were talking about.

Does this address your concern?

https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#mdadm_v3.0_--_Adding_the_Concept_of_User-Space_Managed_External_Metadata_Formats

FYI: I'm ignorant of any real details and I have not used the above new feature, but it seems to be what you are asking for.
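
As a quick illustration of the difference (device name is just an
example), an mdadm member carries a superblock that can be reported,
while a plain filesystem device has none and can be mounted directly:

    mdadm --examine /dev/sdb   # prints superblock info for a member;
                               # reports no md superblock otherwise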

>>
>> I haven't thought through the process, but with mdraid/lvm you would
>> identify the physical drives as under dm control.  (mdadm for md,
>> pvcreate for dm). Then configure the split raid setup.
>>
>> Have you gone through the process of creating a raid5 with mdadm?  If
>> not, at least read a howto about it.
>>
>> https://raid.wiki.kernel.org/index.php/RAID_setup
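
For reference, creating a raid5 is essentially a one-liner (device
names are just examples):

    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        /dev/sdb /dev/sdc /dev/sdd
    mkfs.ext4 /dev/md0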
>
>Actually, I have maintained a 6-disk RAID5/RAID6 cluster with mdadm
>for more than a few years and handled multiple failures. I am
>reasonably familiar with md reconstruction too. It is the
>performance-oriented but disk-intensive nature of mdadm that I would
>like to move away from for a home media server.
>
>>
>> I assume you would have mdadm form your multi-disk split raid volume
>> composed of all the physical disks, then use lvm commands to define the
>> block range on the first drive as an lv (logical volume).  Same for
>> the other data drives.
>>
>> Then use mkfs to put a filesystem on each lv.
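
Roughly like this sketch (names and sizes are hypothetical):

    pvcreate /dev/md0                     # the split raid volume
    vgcreate vg_split /dev/md0
    lvcreate -L 4T -n lv_disk0 vg_split   # one lv per data disk's range
    mkfs.ext4 /dev/vg_split/lv_disk0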
>
>Maybe it can also be done via md raid by creating a partitionable
>array where each partition corresponds to an underlying block device
>without any striping.
>

I think I agree.
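
A linear (concatenation, no striping) md array might be the closest
existing building block; something like this sketch (device names are
examples):

    mdadm --create /dev/md0 --level=linear --raid-devices=2 \
        /dev/sdb /dev/sdc
    # then partition /dev/md0 so each partition maps onto one disk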

>>
>> The filesystem has no knowledge there is a split raid below it.  It
>> simply reads/writes to the overall device; device mapper is layered
>> below it and issues the required i/o calls.
>>
>> I.e. for a read, it is a straight passthrough.  For a write, the old
>> data and old parity have to be read in, modified, and written out.
>> Device mapper does this now for raid 4/5/6, so most of the code is in
>> place.
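
For reference, the standard read-modify-write parity update is:

    P_new = P_old XOR D_old XOR D_new

so a small write to one data disk costs two reads (old data, old
parity) and two writes (new data, new parity).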
>
>Exactly. Reads are passthrough, writes lead to the parity write being
>triggered. The only remaining concern for me is that the md superblock
>will require the block devices to be initialized using mdadm. That can
>be acceptable I suppose, but an ideal solution would be able to use
>existing block devices (which would be untouched), put passthrough
>block devices on top of them, and manage the parity updates on the
>parity block device. The information about which block devices
>comprise the array can be stored in a config file and does not need a
>superblock as badly as a striped raid setup does.

Hopefully the new user space feature does just that.
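
If the member list lives in user space, I'd imagine something as
simple as this hypothetical config file (format is illustrative only):

    # /etc/splitraid.conf
    parity = /dev/sdd
    data   = /dev/sda /dev/sdb /dev/sdc

with a userspace tool reading it and assembling the passthrough
devices at boot.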

Greg

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


