vfs_cache_pressure - time to reevaluate the defaults?

Fri Jul 28 21:52:51 EDT 2023

Hello,

(I'm new to this mailing list - this seemed the most appropriate place
to ask about this; apologies if it is not.)

short version:

There is a parameter called 'vfs_cache_pressure', which can be tweaked
at runtime via /proc/sys/vm/vfs_cache_pressure.
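
For reference, the value can be read and written like any other procfs
tunable. This is a minimal sketch: the helper names and the optional
path argument are hypothetical (the argument only exists so the helpers
can be tried against an ordinary file), and writing to the real file
requires root.

```shell
#!/bin/sh
# Hypothetical helpers; the default path is the real tunable.
VFS_KNOB=/proc/sys/vm/vfs_cache_pressure

# Print the current value (the kernel default is 100).
get_pressure() {
    cat "${1:-$VFS_KNOB}"
}

# Write a new value; needs root when pointed at the real /proc file.
set_pressure() {
    printf '%s\n' "$1" > "${2:-$VFS_KNOB}"
}

# Usage (as root): remember the old value so it can be restored.
#   old=$(get_pressure)
#   set_pressure 0
#   ... run the experiment ...
#   set_pressure "$old"
```

The setting does not survive a reboot, so saving and restoring the old
value is only needed within one session.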

By default, file/directory metadata (dentries and inodes) does not stay
cached for long, so recursive operations (e.g. 'du -sh', 'find', ...)
are deathly slow on spinning disks.

Using vfs_cache_pressure, it is possible to force pinning metadata of an
entire filesystem in RAM, making these operations just as fast on
spinning disk as they are on SSD.

I am wondering whether this should be a new default for slow disks on
systems with enough RAM, and some other possibilities.

long version:

I discovered the vfs_cache_pressure parameter today while searching for
solutions to a very old problem of mine: namely, "how does every
filesystem know immediately exactly how much free space it has, even on
spinning disks, but if I ask it how big this folder is it takes 15
minutes of thrashing? And why does it take so long to find files that
tools like 'locate' were developed?"

I had toyed with the idea of trying to write a filesystem that is
better at these problems, and later, of trying to change an existing
filesystem driver to favour large sequential reads to snarf up
metadata without so many seeks (even at the cost of reading a bunch of
data it doesn't need), and still later, "what if I just rewrote it to
read all the metadata immediately and pin it in RAM?", which brought me
to, "aha, someone has already done this!"

According to

https://www.kernel.org/doc/Documentation/sysctl/vm.txt

the default is to drop cached dentries/inodes at a "fair" rate relative
to other stuff, and setting vfs_cache_pressure=0 will effectively do
what I want. It then has this caution against doing that:

> "When vfs_cache_pressure=0, the kernel will never reclaim dentries and
> inodes due to memory pressure and this can easily lead to
> out-of-memory conditions."

I thought to myself, "that doesn't seem right; how much RAM could it
possibly use?" and grabbed one of my big, slow disks and tried:

echo 0 > /proc/sys/vm/vfs_cache_pressure
find /mnt/bigdisk > /dev/null

And after a few minutes of thrashing, I checked how much RAM was
gone... practically none (less than 1 GB on a system with 16 GB
total).
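
One way to check where that memory goes is to look at reclaimable slab
memory, which is where dentries and inodes are accounted (along with
other reclaimable kernel caches). This is a rough sketch, assuming the
standard SReclaimable field in /proc/meminfo; the function name and the
optional file argument are hypothetical.

```shell
#!/bin/sh
# Print the reclaimable slab size in MiB from a meminfo-format file.
# /proc/meminfo reports the value in kB, hence the division by 1024.
reclaimable_slab_mib() {
    awk '$1 == "SReclaimable:" { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Usage: compare the value before and after the scan.
#   before=$(reclaimable_slab_mib)
#   find /mnt/bigdisk > /dev/null
#   after=$(reclaimable_slab_mib)
#   echo "slab grew by $((after - before)) MiB"
```

The difference is only an estimate, since anything else that allocates
reclaimable slab memory during the scan is counted too.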

How about performance? Well, instead of 15 minutes of thrashing (it's
a cheap consumer SMR disk, so extra-extra slow):

du -sh bigfolder/

now finishes in 3 seconds (about 0.3% of the time, or a 300x speedup). So
do things like "find -name ...". For the cost (about 800 MB of RAM for
~450,000 files), this seems like a really good thing to do for any
filesystem that isn't on an SSD.
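
The cold/warm comparison can be reproduced with a small timing helper.
This is a sketch only: the function name is hypothetical, and it
measures whole seconds, which is enough to tell 15 minutes from 3
seconds (900 s / 3 s = 300).

```shell
#!/bin/sh
# Run a command, discard its output, and print the elapsed wall-clock
# time in whole seconds.
time_seconds() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Usage (dropping caches requires root; done here with the default
# vfs_cache_pressure, before lowering it):
#   sync; echo 2 > /proc/sys/vm/drop_caches   # free dentries and inodes
#   cold=$(time_seconds du -sh bigfolder/)
#   warm=$(time_seconds du -sh bigfolder/)
#   echo "cold: ${cold}s, warm: ${warm}s"
```

How drop_caches behaves once vfs_cache_pressure is set to 0 is not
covered by this sketch, so the cold run is best taken first.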

(Experiment results for a larger filesystem: 3.75 million files take 3
minutes for an initial scan, then occupy 1.7 GB of RAM, and subsequent
operations take 9 seconds (a 20x speedup). 9 million files use up all
available RAM and crash the system. That isn't proportionate to the
other results, so one should still be a bit careful with this.)
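
To put the two successful measurements side by side, the average memory
cost per file can be worked out from the figures above. This is a small
sketch; the function name is hypothetical and decimal units are assumed
(1 MB = 10^6 bytes, 1 GB = 10^9 bytes).

```shell
#!/bin/sh
# Average cached bytes per file, given total bytes and file count
# (truncated to a whole number of bytes).
bytes_per_file() {
    awk -v bytes="$1" -v files="$2" 'BEGIN { printf "%d\n", bytes / files }'
}

bytes_per_file 800000000 450000     # ~450,000 files in about 800 MB
bytes_per_file 1700000000 3750000   # 3.75 million files in 1.7 GB
```

This prints 1777 and 453, so the two runs differ by roughly a factor of
four per file. Memory use clearly does not scale linearly with file
count, and extrapolating from either run to 9 million files is
unreliable.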

Setting vfs_cache_pressure to 0 might be a bit extreme, but I wondered
why the default would be so eager to drop caches when RAM is plentiful
now, so I looked at the dates on that vm.txt document: 1998, 1999, and
2008.

Is it possible that this parameter just hasn't been reconsidered since
then? The main kernel mailing list archives mention it only 8 times
overall, and almost all of the discussion I've been able to find is
about *increasing* the parameter in attempts to solve low-memory
situations.

Some other possibilities occur to me:

- Could dentry/inode entries be cached in compressed form (e.g.
  with a fast compression method like zstd)? This would make it much
  more feasible to cache entire filesystems, even larger ones. (Looks
  like zram/zswap might be able to do this, but probably not easy to
  single out just the dentry/inode data?)

- Could the vfs_cache_pressure parameter be made per-backing-device so
  that, for example, one could set up rules to aggressively cache
  filesystems on spinning disks and network mounts, and aggressively
  drop caches for those residing on SSDs?

- Could the dentry/inode cache also be assigned its own swappiness
  parameter, to allow something silly like making a poor-man's SSD
  cache by putting a swapfile on an SSD, caching inode data from a
  spinning disk, then forcing it out to that swapfile? Still faster
  than seeking the heads on the spinning disk. (And yes I think this is
  a silly idea; the only advantage over things like bcachefs,
  lvm/device mapper caching is that you could set it up on an
  already-mounted filesystem, and I think data integrity could be a
  problem...)

- It looks like low non-zero values of vfs_cache_pressure have broken
  behaviour - a value of 1 acts identically to a value of 0 as far as
  I can tell. The system freezes, OOM-killer is invoked, and it starts
  killing many processes while the cache memory is left alone instead
  of being dropped. I would expect a value of 1 to mean "only drop this
  cache as a last resort, but definitely drop it before you even think
  about killing processes to regain RAM" - is there other documentation
  besides vm.txt that explains this?
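
On the per-device idea in the list above: the kernel already reports
whether a block device is rotational through sysfs, so a userspace rule
could at least decide when lowering the global value is worthwhile.
This is a sketch, assuming the standard /sys/block/<dev>/queue/rotational
attribute; the function name and the sysfs-root argument are
hypothetical. It does not make the parameter per-device, which remains
a single global knob.

```shell
#!/bin/sh
# List block devices that report themselves as rotational
# (1 = spinning disk, 0 = non-rotational such as an SSD).
rotational_disks() {
    root="${1:-/sys/block}"
    for dev in "$root"/*; do
        flag="$dev/queue/rotational"
        [ -r "$flag" ] || continue
        if [ "$(cat "$flag")" = "1" ]; then
            basename "$dev"
        fi
    done
}

# Usage:
#   rotational_disks
```

The flag may not be accurate for every device (virtual devices and some
USB enclosures, for example), so the output is a hint rather than a
guarantee.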

I would like to put in some time to learn about how some of this might
be implemented or tweaked - does that seem worthwhile, or are there
reasons/problems/existing fixes that I'm not seeing?

~Felix.