Split RAID: Proposal for archival RAID using incremental batch checksum
Anshuman Aggarwal
anshuman.aggarwal at gmail.com
Sat Nov 22 08:22:49 EST 2014
On 22 November 2014 at 18:47, Greg Freemyer <greg.freemyer at gmail.com> wrote:
> Top posting is strongly discouraged on all kernel related mailing lists including this one. I've moved your reply to the bottom and then replied after that. In future I will ignore replies that are top posted.
>
>
>>On 21 November 2014 17:11, Greg Freemyer <greg.freemyer at gmail.com>
>>wrote:
>>>
>>>
>>> On November 21, 2014 5:15:43 AM EST, Anshuman Aggarwal
>><anshuman.aggarwal at gmail.com> wrote:
>>>>I'd appreciate any help/pointers in implementing the proposal below,
>>>>including the right path to get this into the kernel itself.
>>>>----------------------------------
>>>>I'm outlining below a proposal for a RAID device mapper virtual block
>>>>device for the kernel which adds "split raid" functionality on an
>>>>incremental batch basis for a home media server/archived content
>>>>which is rarely accessed.
>>>>
>>>>Given a set of N+X block devices (of the same size, but the smallest
>>>>common size wins),
>>>>
>>>>the SplitRAID device mapper device generates virtual devices which are
>>>>passthrough for the N devices and write a batched/delayed checksum
>>>>into the X devices, so as to allow offline recovery of blocks on the N
>>>>devices in case of a single disk failure.
>>>>
>>>>Advantages over conventional RAID:
>>>>
>>>>- Disks can be spun down, reducing wear and tear compared to MD RAID
>>>>levels (such as 1, 10, 5, 6) in the case of rarely accessed archival
>>>>content.
>>>>
>>>>- Prevents catastrophic data loss on multiple device failure, since
>>>>each block device is independent and hence, unlike MD RAID, data is
>>>>only lost incrementally.
>>>>
>>>>- Performance degradation for writes can be avoided by keeping the
>>>>checksum update asynchronous and delaying the fsync to the checksum
>>>>block device.
>>>>
>>>>In the event of improper shutdown the checksum may not have all the
>>>>updated data but will be mostly up to date, which is often acceptable
>>>>for home media server requirements. A flag can be set if the checksum
>>>>block device was shut down properly, indicating that a full checksum
>>>>rebuild is not required.
>>>>
>>>>Existing solutions considered:
>>>>
>>>>- SnapRAID (http://snapraid.sourceforge.net/), which is a
>>>>snapshot-based scheme. (Its advantages are that it's in user space
>>>>and has cross-platform support, but it has the huge disadvantage of
>>>>every checksum being done from scratch, slowing the system, causing
>>>>immense wear and tear on every snapshot, and also losing any
>>>>information updates up to the snapshot point, etc.)
>>>>
>>>>I'd like to get opinions on the pros and cons of this proposal from
>>>>more experienced people on the list, and to be redirected suitably on
>>>>the following questions:
>>>>
>>>>- Maybe this can already be done using the block devices available in
>>>>the kernel?
>>>>
>>>>- If not, is device mapper the right API to use? (I think so)
>>>>
>>>>- What would be the best block device code to look at to implement this?
>>>>
>>>>
>>>>Regards,
>>>>
>>>>Anshuman Aggarwal
>>>>
>>>
>>> I think I understand the proposal.
>>>
>>> You say N pass-through drives. I assume concatenated?
>>>
>>> If the N drives were instead in a Raid-0 stripe set and your X drives
>>> were just a single parity drive, then you would have described Raid-4.
>>>
>>> There are use cases for raid 4 and you have described a good one
>>> (rarely used data where random write performance is not key).
>>>
>>> I don't know if mdraid supports raid-4 or not. If not, adding raid-4
>>> support is something else you might want to look at.
>>>
>>> Anyway, at a minimum add raid-4 to the existing solutions considered
>>> section.
>>>
>>> Greg
> On November 21, 2014 1:48:57 PM EST, Anshuman Aggarwal <anshuman.aggarwal at gmail.com> wrote:
>>N pass-through, but with their own filesystems. Concatenation is via
>>some kind of union filesystem solution, not at the block level. Data is
>>not supposed to be striped (this is critical, so that not all drives
>>have to be accessed for consecutive data).
>
> I'm ignorant of how unionfs works, so I can offer no feedback about it.
>
> I see no real issue doing it with a block level solution with device mapper (dm) as the implementation. I'm going to ignore implementation for the rest of this email and discuss the goal.
>
> Can you detail what you see a single page write to D1 doing?
>
> You talked about batching / delaying the checksum writes, but I didn't understand how that made things more efficient, nor the reason for the delay.
>
> I assume you know raid 4 and 5 work like this:
>
> Read D1old
> Read Pold
> Pnew=(Pold^D1old)^D1new
> Write Pnew
> Write D1new
>
> I.e. 2 page reads and 2 page writes to update a single page.
>
> The 2 reads and the 2 writes take place in parallel, so if the disks are otherwise idle, then the time involved is one disk seek and 2 disk rotations. Let's say 25 msecs for the seek and 12 msecs per rotation. That is 49 msecs total. I think that is about right for a low performance rotating drive, but I didn't pull out any specs to double check my memory.
>
> While that is a lot of i/o overhead (4x), it is how raid 4 and 5 work and I assume your split raid would have to do something similar. With a normal non raided disk a single block write requires a seek and a rotation, so 37 msecs, thus very little clock time overhead for raid 4 or 5 for small random i/o block writes.
>
> Is that also true of your split raid? The delayed checksum writes confuse me.
> ---
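(For illustration only: a minimal userspace sketch in C of the
read-modify-write parity update Greg describes above. The function and
buffer names are made up for the example and this is not taken from the
MD code; it just shows the XOR identity over one 4 KiB page.)

#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096

/* RAID-4/5 single-page update: Pnew = (Pold ^ D1old) ^ D1new.
 * d1_old and p_old are the pages read back from the disks, d1_new is
 * the page being written, p_new receives the updated parity. */
void rmw_parity(const uint8_t *d1_old, const uint8_t *d1_new,
                const uint8_t *p_old, uint8_t *p_new)
{
        size_t i;

        for (i = 0; i < PAGE_BYTES; i++)
                p_new[i] = (p_old[i] ^ d1_old[i]) ^ d1_new[i];
}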
>
> Where I'm concerned about your solution for performance is with a full stride write. Let's look at how a 4 disk raid 4 would write a full stride:
>
> Pnew = D1new ^ D2new ^ D3new
> Write D1
> Write D2
> Write D3
> Write P
>
> So only 4 writes to write 3 data blocks. Even better, they all take place in parallel, so you can accomplish 3x the data writes to disk that a single non-raided disk can.
>
> Thus for streaming writes, raid 4 or 5 see a performance boost over a single drive.
>
> I see nothing similar in your split raid.
>
> The same is true of streaming reads, raid 4 and 5 get performance gains from reading from the drives in parallel. I don't see any ability for that same gain in your split raid.
>
> In the real world raid 4 is rarely used, because having a static parity drive offers no advantage I know of over distributing the parity across the drives as raid 5 does.
>
> ===
> Thus, if your split raid were in the kernel and I were setting up a streaming media server, the choice would be between raid 5 and your split raid. Raid 5, I believe, would have superior performance, but split raid would have a less catastrophic failure mode if 2 drives failed at once.
>
> Do I have that right?
>
> Greg
>
>>Idea is that each drive can work independently and the last drive
>>stores parity to save data in case of failure of any one drive.
>>
>>Any suggestions from anyone on where to start with such a driver? It
>>seems like a block driver for the parity drive, but one which depends
>>on intercepting the writes to the other drives.
>
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
You have the motivation and goal quite opposite to what is intended.
In a home media server, the RAID6 mdadm setup that I currently have
keeps all the disks spinning and running for writes which could be
done to just the last disk while the others are in sleep mode (heads
parked, etc.).
It's not about performance at all; it's about the longevity of the
HDDs. The entire proposal is focused on extending the life of the
drives.
By not using stripes, we restrict writes to just one drive plus the
XOR output to the parity drive, which explains the delayed and
batched checksum (resulting in fewer writes to the parity drive). The
intention is that if a drive fails we maybe lose 1 or 2 movies, but
the rest is restorable from parity.
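(To make the batched checksum concrete, here is a rough userspace
sketch in C. All names are hypothetical, there is no locking, overflow
or error handling, and it is not device-mapper code; it only shows the
idea of queuing XOR deltas and flushing them to the parity disk in one
go, so the parity disk can stay spun down in between.)

#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 4096
#define BATCH_SLOTS 64            /* arbitrary batch size for the sketch */

struct pending_update {
        uint64_t block_nr;              /* block number on the data disk */
        uint8_t delta[BLOCK_BYTES];     /* old data XOR new data */
};

static struct pending_update batch[BATCH_SLOTS];
static int batch_len;

/* Called on each data write, after the block has already gone to its
 * own disk.  Only the XOR delta is remembered; the parity disk is not
 * touched yet (no overflow check in this sketch). */
void queue_parity_update(uint64_t block_nr,
                         const uint8_t *old_data, const uint8_t *new_data)
{
        struct pending_update *u = &batch[batch_len++];
        size_t i;

        u->block_nr = block_nr;
        for (i = 0; i < BLOCK_BYTES; i++)
                u->delta[i] = old_data[i] ^ new_data[i];
}

/* Called periodically or when the batch is full: wake the parity disk
 * once, fold every queued delta into its parity block, write it back.
 * read_parity/write_parity stand in for real block I/O plus the final
 * fsync of the parity device. */
void flush_parity_batch(void (*read_parity)(uint64_t, uint8_t *),
                        void (*write_parity)(uint64_t, const uint8_t *))
{
        uint8_t parity[BLOCK_BYTES];
        size_t j;
        int i;

        for (i = 0; i < batch_len; i++) {
                read_parity(batch[i].block_nr, parity);
                for (j = 0; j < BLOCK_BYTES; j++)
                        parity[j] ^= batch[i].delta[j];
                write_parity(batch[i].block_nr, parity);
        }
        batch_len = 0;
}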
Another advantage over RAID5 or RAID6 is that in the event of a
multiple drive failure we only lose the content on the failed drives,
not the whole cluster/RAID.
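(For completeness, rebuilding a lost block is then just the XOR of the
same block from the parity drive and every surviving data drive; a
minimal sketch in C, again with made-up names.)

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 4096

/* Rebuild one block of the failed drive by XOR-ing the corresponding
 * block from every survivor: the remaining data drives plus the
 * parity drive.  'survivors' holds n_survivors pointers to those
 * blocks, 'out' receives the reconstructed block. */
void rebuild_block(uint8_t *out, const uint8_t *const survivors[],
                   int n_survivors)
{
        size_t j;
        int i;

        memset(out, 0, BLOCK_BYTES);
        for (i = 0; i < n_survivors; i++)
                for (j = 0; j < BLOCK_BYTES; j++)
                        out[j] ^= survivors[i][j];
}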
Did I clarify better this time around?