Memory reclaim mechanisms freezing system
Yeongjin Kwon
yeongjinkwon at gmail.com
Sun Jun 2 21:48:39 EDT 2024
Hello,
I have been using GNU/Linux as my daily driver for 3 years. During
that whole time, I have experienced this problem where my system would
often freeze up and become unresponsive for a few seconds here and
there when it is under load, which has been a big pain. After
experimenting and trying to find the cause of the problem, I now
suspect the root cause is that the kernel memory reclaim mechanisms
(kswapd, direct reclaim, etc.) use too much CPU time, not allowing my
desktop environment processes to run.
My computer is a Surface Pro (1st Gen). Here is a link to its specifications:
https://support.microsoft.com/en-us/surface/surface-pro-1st-gen-specifications-f0e31ddb-b03b-e450-bf83-0e23cf6cbdce
Notably, my computer has 4 GB of RAM and 8 GB of swap.
I am currently using the Sway tiling window manager on Arch Linux, but
have also experienced the freezing problem on KDE Plasma, GNOME, and
other Linux distros.
I changed some configuration from the defaults to improve
performance. I turned zswap off, and set the vm.page-cluster sysctl to
3 (which is the documented default anyway). There are several
reasons I did this, but I will not expand on them here.
My system generally freezes when it has low free memory, is using
swap, and I switch to an application that has not been recently
active. The problem gets worse when more swap is being used. When my
system is using around 3-4 GB or more of swap, it even stalls for tens
of seconds. So I believe the problem occurs when my system is
swapping heavily.
I thought the freezing was due to my desktop processes getting swapped
out, so I used cgroups and systemd to allocate memory, CPU, and I/O
bandwidth protections to my desktop/system processes so that they would
not get swapped out and would have enough resources even under load.
This significantly reduced the freezing, but the problem remained.
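For reference, here is a minimal sketch of how the protections that are
actually in effect could be read back, assuming cgroup v2 is mounted at
/sys/fs/cgroup. The file names (memory.min, memory.low, cpu.weight,
io.weight) are the standard cgroup v2 interface files; the slice path
in the example is only a placeholder, not the real layout of this
system.

```python
from pathlib import Path

# cgroup v2 interface files that carry the protections described above.
PROTECTION_FILES = ("memory.min", "memory.low", "cpu.weight", "io.weight")


def read_protections(cgroup_dir):
    """Return {file name: contents or None} for one cgroup directory."""
    result = {}
    for name in PROTECTION_FILES:
        path = Path(cgroup_dir) / name
        try:
            result[name] = path.read_text().strip()
        except OSError:
            # Controller not enabled for this cgroup, or no such cgroup.
            result[name] = None
    return result


if __name__ == "__main__":
    # Placeholder path (assumption): substitute the slice or scope that
    # actually holds the desktop session.
    for name, value in read_protections("/sys/fs/cgroup/user.slice").items():
        print(f"{name}: {value}")
```

A value of None means the file is missing, which usually indicates that
the corresponding controller is not enabled for that cgroup.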
As I understand it, two procedures are involved in swapping when
free memory is low: reading data into memory from swap when page
faults occur (swap-in), and writing old data out of memory onto disk
(reclaiming memory) to make room for the data being swapped in.
Furthermore, I have observed the system freezing when it exhausts
available RAM and starts swapping memory out onto disk, even when
no swap is in use at the time. This leads me to believe the
freezing is caused by the latter of the two procedures, memory
reclamation. In addition, I did not observe the problem when
the system was swapping data in from disk while there was plenty of
free memory.
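The distinction between swap-in and reclaim could be checked with the
counters in /proc/vmstat. The following is a minimal sketch, not a
definitive tool: it prints per-second deltas of pswpin and pswpout
(pages swapped in and out) and of the pgscan/pgsteal counters for
kswapd and direct reclaim. The selection of counters is my own choice,
and the helper names are made up for illustration.

```python
import time

# Counters of interest: swap traffic, and pages scanned/reclaimed by
# kswapd versus direct reclaim.
WATCHED = (
    "pswpin", "pswpout",
    "pgscan_kswapd", "pgscan_direct",
    "pgsteal_kswapd", "pgsteal_direct",
)


def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def delta(before, after, names=WATCHED):
    """Return how much each watched counter grew between two samples."""
    return {n: after.get(n, 0) - before.get(n, 0) for n in names}


def read_vmstat(path="/proc/vmstat"):
    with open(path) as f:
        return parse_vmstat(f.read())


if __name__ == "__main__":
    previous = read_vmstat()
    while True:
        time.sleep(1)
        current = read_vmstat()
        print(delta(previous, current))
        previous = current
```

If a freeze lines up with large pgscan_direct or pswpout deltas while
pswpin stays small, that would support the reclaim explanation over
the swap-in one.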
I also recently tried increasing the vm.watermark_scale_factor sysctl
parameter. When I did, the freezing problem became worse. According to
the documentation, this sysctl parameter is used to tune the
aggressiveness of kswapd. Assuming a higher value makes kswapd more
aggressive and active, this suggests that kswapd activity is
positively correlated with the severity of the problem.
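To put numbers on this: the kernel documentation gives the unit of
vm.watermark_scale_factor as fractions of 10,000, with the default of
10 meaning the watermarks are 0.1% of available memory apart. The
sketch below works out that distance for a 4 GiB machine. It is only an
approximation of the documented rule; as far as I understand, the
kernel computes the watermarks per zone and also applies a floor
derived from min_free_kbytes, both of which are ignored here.

```python
def watermark_gap_bytes(total_bytes, scale_factor=10):
    """Approximate distance between watermarks, per the documented rule.

    scale_factor is in units of 1/10,000 of memory, so 10 means 0.1%.
    Simplification (assumption): treats all memory as one zone and
    ignores the min_free_kbytes floor.
    """
    return total_bytes * scale_factor // 10_000


if __name__ == "__main__":
    ram = 4 * 1024**3  # 4 GiB, as on the machine described above
    for factor in (10, 100, 500):  # example values, not recommendations
        gap = watermark_gap_bytes(ram, factor)
        print(f"watermark_scale_factor={factor}: "
              f"about {gap / 2**20:.1f} MiB between watermarks")
```

With 4 GiB this gives roughly 4 MiB at the default of 10 and roughly
41 MiB at 100. A larger gap means kswapd is woken earlier and keeps
reclaiming until more memory is free.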
I also tried using the linux-zen kernel, but that did not seem to
alleviate the problem, and I believe it actually made performance
worse.
So I now think it is possible that the kernel memory reclaim
mechanisms are using up too much CPU and ignoring the CPU protections
(cpu.weight parameter) that I set for my desktop environment
processes. That would explain why my system freezes even though I
allocated protections to my desktop environment: the kernel could
simply bypass them all.
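One way to test the CPU part of this hypothesis might be to measure how
much CPU time the kswapd threads consume during a freeze. The sketch
below sums the utime and stime fields (fields 14 and 15 of
/proc/<pid>/stat, in clock ticks) over all threads whose name starts
with "kswapd" and reports the difference over a 5-second window. The
window length and function names are arbitrary choices for
illustration. Note that this covers only kswapd; direct reclaim runs in
the context of the allocating process and is charged to that process
instead.

```python
import os
import time


def parse_stat(text):
    """Return (comm, utime_ticks, stime_ticks) from a /proc/<pid>/stat line."""
    # comm is enclosed in parentheses and may itself contain spaces or ')'.
    comm = text[text.index("(") + 1 : text.rindex(")")]
    rest = text[text.rindex(")") + 1 :].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return comm, int(rest[11]), int(rest[12])


def kswapd_ticks():
    """Sum user+system clock ticks over all threads named kswapd*."""
    total = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                comm, utime, stime = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or unexpected format
        if comm.startswith("kswapd"):
            total += utime + stime
    return total


if __name__ == "__main__":
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    before = kswapd_ticks()
    time.sleep(5)
    after = kswapd_ticks()
    print(f"kswapd used {(after - before) / hz:.2f} s of CPU in 5 s")
```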
I know all the details I mentioned above may be vague and unreliable
as evidence. I suppose it might be faster and more accurate to do some
profiling, but I unfortunately do not know how to go about profiling
my system.
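A lighter-weight alternative to full profiling might be the kernel's
pressure stall information (PSI), exposed under /proc/pressure/ on
kernels built with PSI support and with it enabled. The sketch below
parses those files and prints the 10-second averages once per second.
In each file, "some" is the share of time in which at least one task
was stalled on the resource, and "full" the share in which all non-idle
tasks were stalled at once. The parsing helper is my own and assumes
the usual "kind key=value ..." line format.

```python
import time


def parse_pressure(text):
    """Parse a /proc/pressure/* file into {kind: {key: float}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        result[kind] = {k: float(v) for k, v in (p.split("=") for p in pairs)}
    return result


def read_pressure(resource):
    """Read one of 'cpu', 'memory', or 'io' from /proc/pressure/."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_pressure(f.read())


if __name__ == "__main__":
    while True:
        for resource in ("cpu", "memory", "io"):
            data = read_pressure(resource)
            # avg10 is a percentage over the last 10 seconds.
            summary = ", ".join(
                f"{kind} avg10={values['avg10']:.2f}%"
                for kind, values in data.items()
            )
            print(f"{resource}: {summary}")
        time.sleep(1)
```

If the memory "full" value climbs during a freeze while the cpu value
stays low, that would point at stalls on memory (reclaim or swap I/O)
rather than at plain CPU starvation.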
Is my suspicion correct?
If so, is there any way to restrict the CPU time these reclaim
mechanisms get? Will doing that be likely to improve performance?
Perhaps this occurs because the reclaim mechanisms do not yield the
CPU at all until the respective watermark is reached. If this is the
case, is anyone working to make these mechanisms more preemptible and
less CPU hogging?
I would love to try implementing these changes myself, but at the
moment, I unfortunately have no experience in kernel hacking. I
thought it might be a good idea to post a question and see if anyone
else has some more insight into the issue, before attempting further
experimentation.
Thank you for your time.
Yeongjin Kwon