Work (really slow directory access on ext4)

Arlie Stephens arlie at worldash.org
Thu Jul 31 19:36:01 EDT 2014


Hi Nick,

[Context - directory ls taking 4-15 seconds; directory large, with
long filenames, but nowhere near as huge as Valdis' mail directory.]

I've now discovered a really bizarre pattern, and I'm inclined to stop
blaming the file system until some clarity develops. If I ever get to
the point where I can produce a high-quality bug report, with or
without a patch, I will do so, but what I have now is anything but
clear and high quality.

On Jul 30 2014, Nick Krause wrote:
> On Wed, Jul 30, 2014 at 3:48 PM,  <Valdis.Kletnieks at vt.edu> wrote:
> > On Wed, 30 Jul 2014 10:38:13 -0700, Arlie Stephens said:
> >
> >> On the good side, Valdis' observations of his mail directory have been
> >> a great help.
> >
> > And remember, that's on a single laptop-class hard drive, no fancy raid or
> > anything. (Though it *is* a hybrid, with 32G of flash cache on the front end).
> >
> > You throw some *real* hardware at it, it of course would go even faster.
> 
> Just send me the logs and anything else you think may help me.
> Please note cc the ext4 mailing list as this will also let the other
> ext4 developers and maintainers known about your problem.
> Cheers Nick

I'm now in a state of complete bafflement.  

It turns out we have a whole collection of misbehaving directories, 
making this testable without waiting for caches to clear. 

I have a couple of straces of fast ls runs, and a function ftrace that
captured about half of a 7-second ls. (The latter is huge, and
probably not suitable for posting.)

I also have a really bizarre observation, the kind that makes you
wonder whether you are actually dreaming. It appears that the
misbehaviour is strongly influenced by the choice of "time" function. 
The problem only occurs when using the shell built-in. /usr/bin/time 
always produces a fast response. 

Stranger still - flat out impossible, I'd have said before seeing it - 
a "fast" ls, run with /usr/bin/time, can be followed *immediately*
by a slow "ls", run with bash's time. It's as if the first one doesn't
warm the cache, which is completely absurd - except I've been able to
make this happen 5 times in a row, first with strace and then
without. 

# with /usr/bin/time the ls is fast
$ /usr/bin/time -p ls bad_dir
...
real 0.21
user 0.00
sys 0.00


# with the builtin time, right *after* the strace run, the time can be 
# horrible. 
$ time -p ls bad_dir
...
real 5.60
user 0.00
sys 0.17

# run it again, and the directory is in cache as expected.
$ time -p ls bad_dir
...
real 0.11
user 0.00
sys 0.02


This is not an artefact of one or the other time command reporting
incorrectly - I'm noticing a long pause before output occurs, but only
on the middle test of the three.

I can't imagine any sane way for this to be happening, short of
coincidence or user error - and I've now seen this sequence 5 times in
a row, on 5 different directories created and populated by the same
app. (Three times with strace, twice without.) 
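
For anyone who wants to try the same sequence on their own directories,
here is a rough sketch of the three steps as a shell function. The name
time_ls_sequence is made up for illustration, and wrapping the keyword
runs in "bash -c" is my assumption for making sure bash's own time is
used whatever shell calls the function; it is not exactly what was typed
in the transcript above. It also prints how bash resolves "time", since
the keyword and /usr/bin/time are easy to confuse.

```shell
#!/bin/bash
# Sketch only: repeat the fast/slow/fast sequence on one directory.
# time_ls_sequence is a hypothetical helper name, not an existing tool.
time_ls_sequence() {
    dir=$1

    # Show how bash resolves "time" (shell keyword, external binary, or both).
    bash -c 'type -a time'

    echo "== 1: /usr/bin/time =="
    if [ -x /usr/bin/time ]; then
        /usr/bin/time -p ls "$dir" > /dev/null
    else
        echo "(no /usr/bin/time on this system, skipping)"
    fi

    # Run via "bash -c" so the bash keyword is used even if the caller is not bash.
    echo "== 2: bash time keyword, immediately afterwards =="
    bash -c 'time -p ls "$1" > /dev/null' _ "$dir"

    echo "== 3: bash time keyword again =="
    bash -c 'time -p ls "$1" > /dev/null' _ "$dir"
}
```

Calling it as "time_ls_sequence bad_dir" prints the timings on stderr;
the listing itself is discarded, so terminal output speed is not part of
the measurement.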


-- 
Arlie

(Arlie Stephens					arlie at worldash.org)


