Split RAID: Proposal for archival RAID using incremental batch checksum

Thu Nov 27 12:50:53 EST 2014

On 25 November 2014 at 10:26, Greg Freemyer <greg.freemyer at gmail.com> wrote:
>
>
> On November 24, 2014 12:28:08 PM EST, Anshuman Aggarwal <anshuman.aggarwal at gmail.com> wrote:
>>On 24 November 2014 at 18:49, Greg Freemyer <greg.freemyer at gmail.com>
>>wrote:
>>>
>>>
>>> On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal <anshuman.aggarwal at gmail.com> wrote:
>>>>Sandeep,
>>>> This isn't exactly RAID4 (only thing in common is a single parity
>>>>disk but the data is not striped at all). I did bring it up on the
>>>>linux-raid mailing list and have had a short conversation with Neil.
>>>>He wasn't too excited about device mapper but didn't indicate why or
>>>>why not.
>>>
>>> If it was early in your proposal, it may simply be that he didn't
>>> understand it.
>>>
>>> The delayed writes to the parity disk you described would have been
>>> tough for device mapper to manage.  It doesn't typically maintain its
>>> own longer term buffers, so that would have been something that might
>>> have given him concern.  The only reason you provided was reduced wear
>>> and tear for the parity drive.
>>>
>>> Reduced wear and tear in this case is a red herring.  The kernel
>>> already buffers writes to the data disk, so there is no need to
>>> separately buffer parity writes.
>>
>>Fair enough, the delay in buffering for the parity writes is an
>>independent issue which can be deferred easily.
>>
>>>
>>>>I would like to have this as a layer for each block device on top of
>>>>the original block devices (intercepting write requests to the block
>>>>devices and updating the parity disk). Is device mapper the right
>>>>interface?
>>>
>>> I think yes, but dm and md are actually separate.  I think of dm as a
>>> subset of md, but if you are going to really do this you will need to
>>> learn the details better than I know them:
>>>
>>> https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt
>>>
>>> You will need to add code to both the dm and md kernel code.
>>>
>>> I assume you know that both mdraid (mdadm) and lvm userspace tools
>>> are used to manage device mapper, so you would have to add user space
>>> support to mdraid/lvm as well.
>>>
>>>> What are the others?
>>>
>>> Well, btrfs as an example incorporates a lot of raid capability into
>>> the filesystem.  Thus btrfs is a monolithic driver that has consumed
>>> much of the dm/md layer.  I can't speak to why they are doing that, but
>>> I find it troubling.  Having monolithic aspects to the kernel has
>>> always been something the Linux kernel avoided.
>>>
>>>> Also, if I don't store the metadata on
>>>>the block device itself (to allow the block device to be unaware of
>>>>the RAID4 on top), how would the kernel be informed of which devices
>>>>together form the Split RAID?
>>>
>>> I don't understand the question.
>>
>>mdadm typically has a metadata superblock stored on the block device
>>which identifies the block device as part of the RAID and typically
>>prevents it from being directly recognized by file system code. I was
>>wondering if Split RAID block devices can be made unaware of the
>>RAID scheme on top and be fully mountable and usable without the raid
>>drivers (of course invalidating the parity if any of them are written
>>to). This allows a parity disk to be added to existing block devices
>>without having to set up the superblock on the underlying devices.
>>
>>Hope that is clear now?
>
> Thank you, I knew about the superblock, but didn't realize that was what you were talking about.
>
> Does this address your desire?
>
> https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#mdadm_v3.0_--_Adding_the_Concept_of_User-Space_Managed_External_Metadata_Formats
>
> Fyi: I'm ignorant of any real details and I have not used the above new feature, but it seems to be what you are asking for.
>

It doesn't seem to, because it appears that the unified container would
still need to be created before putting any data on the device.
Ideally, the split raid can be added as an afterthought by just
adding a parity disk (block device) to an existing set of disks (block
devices).
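
As a rough illustration of the "afterthought" idea (not part of the
original proposal), the initial parity for an existing set of devices
could be computed as below. This assumes plain XOR parity as in RAID-4
and treats shorter devices as zero-padded up to the size of the largest
one; the name build_parity and the in-memory byte strings standing in
for block devices are hypothetical.

```python
def build_parity(devices: list[bytes]) -> bytes:
    """XOR all devices together; shorter devices count as zero-padded.

    Assumption: single XOR parity, no striping. Each element of
    `devices` stands in for the raw contents of one existing disk.
    """
    size = max((len(dev) for dev in devices), default=0)
    parity = bytearray(size)  # starts as all zeros
    for dev in devices:
        for offset, byte in enumerate(dev):
            parity[offset] ^= byte
    return bytes(parity)


if __name__ == "__main__":
    # Two "disks" of different sizes; neither is modified.
    disks = [b"\x01\x02\x03", b"\x10\x20"]
    print(build_parity(disks).hex())
```

The data devices are only read, never written, which is what would let
them stay mountable on their own. The parity device would need to be at
least as large as the largest data device.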

>>>
>>> I haven't thought through the process, but with mdraid/lvm you would
>>> identify the physical drives as under dm control.  (mdadm for md,
>>> pvcreate for dm). Then configure the split raid setup.
>>>
>>> Have you gone through the process of creating a raid5 with mdadm?  If
>>> not, at least read a howto about it.
>>>
>>> https://raid.wiki.kernel.org/index.php/RAID_setup
>>
>>Actually, I have maintained 6-disk RAID5 and RAID6 arrays with mdadm
>>for more than a few years and handled multiple failures. I am
>>reasonably familiar with md reconstruction too. It is the
>>performance-oriented but disk-intensive nature of mdadm that I would
>>like to move away from for a home media server.
>>
>>>
>>> I assume you would have mdadm form your multi-disk split raid volume
>>> composed of all the physical disks, then use lvm commands to define the
>>> block range on the first drive as a lv (logical volume).  Same for
>>> the other data drives.
>>>
>>> Then use mkfs to put a filesystem on each lv.
>>
>>Maybe it can also be done via md raid creating a partitionable array
>>where each partition corresponds to an underlying block device without
>>any striping.
>>
>
> I think I agree.
>
>>>
>>> The filesystem has no knowledge there is a split raid below it.  It
>>> simply reads/writes to the overall device; device mapper is layered
>>> below it and triggers the required i/o calls.
>>>
>>> I.e., for a read, it is a straight passthrough.  For a write, the old
>>> data and old parity have to be read in, modified, and written out.
>>> Device mapper does this now for raid 4/5/6, so most of the code is in
>>> place.
>>
>>Exactly. Reads are passthrough, writes lead to the parity write being
>>triggered. The only remaining concern for me is that the md superblock
>>will require each block device to be initialized using mdadm. That can
>>be acceptable I suppose, but an ideal solution would be able to use
>>existing block devices (which would be untouched), put a passthrough
>>block device on top of them, and manage the parity updates on the
>>parity block device. The information about which block devices
>>comprise the array can be stored in a config file etc. and does not
>>need a superblock as badly as a regular raid setup.
>
> Hopefully the new user space feature does just that.
>
> Greg

Although the user space feature doesn't seem to, Neil has suggested a
way to try using RAID-4 to create a split-raid-like array. I will post
on this mailing list if it succeeds.
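
For reference, the read-modify-write parity update discussed above, and
the recovery of a lost data device from the survivors plus parity, can
be sketched as follows. This is only an illustration assuming XOR
parity; update_parity and reconstruct are hypothetical names and the
byte strings stand in for equal-sized blocks at the same offset on each
device. It is not the md/dm code.

```python
def update_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Read-modify-write: new_parity = old_parity XOR old_data XOR new_data.

    Only the written data device and the parity device are touched;
    the other data devices are not read at all.
    """
    if not (len(old_data) == len(new_data) == len(old_parity)):
        raise ValueError("blocks must have equal length")
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))


def reconstruct(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing block by XOR-ing parity with all survivors."""
    block = bytearray(parity)
    for data in surviving:
        for offset, byte in enumerate(data):
            block[offset] ^= byte
    return bytes(block)


if __name__ == "__main__":
    d0, d1 = b"\x0f\xf0", b"\x33\x55"
    parity = bytes(a ^ b for a, b in zip(d0, d1))

    new_d0 = b"\xaa\xbb"  # a write arrives for device 0
    parity = update_parity(d0, new_d0, parity)

    # Device 0 "fails": recover its block from device 1 and parity.
    assert reconstruct([d1], parity) == new_d0
```

Because the data is not striped, a write spins up only one data disk
plus the parity disk, and losing more disks than the parity can cover
still leaves every surviving data disk readable on its own.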