Metadata snapshot issues

Greg Freemyer greg.freemyer at gmail.com
Fri Dec 23 09:08:56 EST 2011


Swapnil Gaikwad <swapnilgaik72 at gmail.com> wrote:

>If we give each file a new inode during a metadata snapshot, are
>there any conflicts that can happen? What techniques help with this?
>Does anyone have source code for it?
>

Swapnil,

There are at least three different approaches to filesystem snapshots in the Linux kernel:

- device mapper snapshots
- btrfs snapshots
- next3 / next4 snapshots

The first two are in the vanilla kernel; the last is available as a patch that builds a module.  You can get the source at the link I posted before.

Device mapper is the simplest to understand and test.  If you don't understand its copy-on-write (COW) technology, please study it first.  It is the least efficient of the three, because a write to a virgin data block causes the old data to be read from the primary volume and written to the snapshot volume, then the snapshot pointers are updated, and only then is the new data written to the primary volume.  A toy sketch of that sequence follows.
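
Here's a small, in-memory C sketch of that write path.  Everything in it (the arrays standing in for volumes, the "exception table") is invented for illustration; it is not the dm-snapshot source, but it shows why the first write to a block pays the extra copy while later writes to the same block don't:

    /* cow_demo.c - toy sketch of a COW snapshot's write path */
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS    8
    #define BLOCK_SIZE 16                      /* tiny blocks for the demo */

    static char primary[NBLOCKS][BLOCK_SIZE];  /* the origin volume */
    static char snapshot[NBLOCKS][BLOCK_SIZE]; /* the COW store */
    static int  copied[NBLOCKS];               /* "exception table": saved yet? */

    /* A write to a virgin block triggers the read/copy/update dance. */
    static void cow_write(int blk, const char *new_data)
    {
        if (!copied[blk]) {
            /* 1. read old data from the primary volume, 2. write it to
             * the snapshot volume, 3. update the snapshot's pointers. */
            memcpy(snapshot[blk], primary[blk], BLOCK_SIZE);
            copied[blk] = 1;
        }
        /* 4. only now write the new data to the primary volume. */
        strncpy(primary[blk], new_data, BLOCK_SIZE - 1);
    }

    /* Reading the snapshot: saved blocks come from the COW store,
     * untouched blocks fall through to the live primary volume. */
    static const char *snap_read(int blk)
    {
        return copied[blk] ? snapshot[blk] : primary[blk];
    }

    int main(void)
    {
        strncpy(primary[3], "original", BLOCK_SIZE - 1);

        cow_write(3, "updated");     /* first write pays the extra copy */
        cow_write(3, "updated2");    /* later writes to this block don't */

        printf("live:     %s\n", primary[3]);    /* updated2 */
        printf("snapshot: %s\n", snap_read(3));  /* original */
        return 0;
    }

Run it and the live volume shows the latest data while the snapshot still reads back the original.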

Device mapper snapshots have no filesystem knowledge, so inode blocks are handled exactly like any other volume block.  COW is a standard technology that is implemented in lots of external storage solutions as well; i.e., NAS/SAN devices often offer snapshots, and most of those use copy-on-write.

My first exposure to snapshots maintained by the filesystem itself, without the use of COW, was with Windows Server 2003.  It came with "shadow copy" technology.  Since the filesystem knows the details of what's really happening with the data, it can be more efficient.  MS allocates a new $MFT record (like an inode) when a file is updated.  Any replaced $MFT records, pointer blocks, and data blocks are left in place physically, but logically moved to a large file that holds the "shadow copy".

So think about a database file that has 1% of its data replaced by overwriting.  The filesystem allocates a new $MFT record and new data blocks for the 1% of new data.  The old (original) $MFT record and data blocks are reallocated to the single large shadow copy file.  (Note: if there are 5 simultaneous shadow copies, then there are 5 of these shadow copy files, but only the most recent is active, and all newly replaced data blocks are logically moved to just that one large shadow copy file.)

So if you read an old version of a file like a database that is spread across the snapshots, you get the $MFT record from the oldest snapshot file.  It will have references to the physical blocks where the data lives.  Since the data blocks were never moved, those references are all still valid.  The blocks pointed at will be spread across the various shadow copy files and the live/active data blocks.  A toy sketch of this bookkeeping follows.
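
Here's a matching toy C sketch of that idea.  The names (a run-list struct standing in for an $MFT record, the trivial allocator) are all invented; this is not NTFS/VSS code, but it shows why the saved record's block pointers stay valid and why the replaced block is never rewritten, only re-owned:

    /* shadow_demo.c - toy sketch of the "logical move" idea */
    #include <stdio.h>
    #include <string.h>

    #define NDISK      16
    #define BLOCK_SIZE 16
    #define FILE_BLKS  4

    static char disk[NDISK][BLOCK_SIZE];    /* physical blocks never move */
    static int  next_free;                  /* trivial block allocator */

    /* Stand-in for an $MFT record: just a run list of block numbers. */
    struct mft_record { int blk[FILE_BLKS]; };

    static struct mft_record live;          /* current version of the file */
    static struct mft_record snap;          /* record saved at snapshot time */
    static int shadow_blks[NDISK];          /* blocks owned by the shadow file */
    static int nshadow;

    static void overwrite(int idx, const char *data)
    {
        /* The replaced block stays in place physically but is
         * logically moved into the shadow copy file... */
        shadow_blks[nshadow++] = live.blk[idx];

        /* ...and the new data goes into a freshly allocated block. */
        live.blk[idx] = next_free++;
        strncpy(disk[live.blk[idx]], data, BLOCK_SIZE - 1);
    }

    int main(void)
    {
        for (int i = 0; i < FILE_BLKS; i++) {
            live.blk[i] = next_free++;
            snprintf(disk[live.blk[i]], BLOCK_SIZE, "orig %d", i);
        }
        snap = live;            /* take the snapshot: save the record */

        overwrite(2, "new 2");  /* replace 25% of the file */

        /* The saved record's pointers are still valid, because the
         * replaced block was never physically moved: */
        printf("old  blk2: %s\n", disk[snap.blk[2]]);  /* orig 2 */
        printf("live blk2: %s\n", disk[live.blk[2]]);  /* new 2 */

        /* Deleting the snapshot here would just mean freeing
         * shadow_blks[] and snap, i.e. deleting that one file. */
        return 0;
    }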

When the oldest snapshot is deleted, it is a simple matter of deleting that single shadow copy file.

Note how efficient the above is.  There is very little extra disk write activity involved, and maintaining 20 shadow copies is no more overhead than maintaining one (assuming you have plenty of disk space).

I don't know whether next3/next4 and btrfs use similar solutions; i.e., I haven't read their design docs.

Greg



-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


