Discussion:
13-STABLE high idprio load gives poor responsiveness and excessive CPU time per task
Mark Millard
2024-02-27 05:48:29 UTC
Questions include (a generic list for reference,
even if some have already been specified):


For /boot/loader.conf (for example) :

What value of sysctl vm.pageout_oom_seq is in use?

This indirectly adjusts the delay before sustained
low free RAM leads to killing processes. The default
is 12, but 120 is what I use across a wide variety of
systems. More is possible.
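
As an example only (the 120 is simply the value I use, not a
tuned recommendation for any particular workload), the
/boot/loader.conf entry would look like:

# Tolerate many more back-to-back low-free-RAM pageout passes
# before the kernel starts killing processes (default is 12):
vm.pageout_oom_seq=120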


For /etc/sysctl.conf :

What values of sysctl vm.swap_enabled and
sysctl vm.swap_idle_enabled are in use? (They work as
a pair.)

Together they can avoid kernel stacks being swapped out.
(Processes can still page out inactive pages, but not
their kernel stacks.) Processes with their kernel stacks
swapped out to storage media do not run until the
kernel stacks are swapped back in. Avoiding such for the
kernel stacks of processes involved in interacting with
the system can be important to maintaining control. This
is a big hammer and is not limited to such processes.
Both being 0 is what leads to kernel stacks not being
swapped out.
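
For example, an /etc/sysctl.conf fragment that keeps kernel
stacks resident (both must be 0 for that effect) would be:

# Do not swap out kernel stacks; inactive user pages can still
# be paged out as usual:
vm.swap_enabled=0
vm.swap_idle_enabled=0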

For /usr/local/etc/poudriere.conf :

What values of the following are in use?


NO_ZFS
USE_TMPFS
PARALLEL_JOBS
ALLOW_MAKE_JOBS
MAX_EXECUTION_TIME
NOHANG_TIME
MAX_EXECUTION_TIME_EXTRACT
MAX_EXECUTION_TIME_INSTALL
MAX_EXECUTION_TIME_PACKAGE
MAX_EXECUTION_TIME_DEINSTALL

(Some, of course, may still have the default
value so the default value would be the answer
in such cases.)
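
Purely as an illustration of the form (the values below are
hypothetical, not defaults and not recommendations), such a
poudriere.conf fragment might read:

# Hypothetical illustration only -- report your actual values:
NO_ZFS=yes
USE_TMPFS=data
PARALLEL_JOBS=4
ALLOW_MAKE_JOBS=yes
MAX_EXECUTION_TIME=86400
NOHANG_TIME=7200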

Also: Other system tmpfs use outside poudriere?

ZFS in use on the system even if poudriere has NO_ZFS
set? (Such is likely uncommon but is possible.)

(Other contexts than poudriere could have some
analogous questions.)


For /usr/local/etc/poudriere.d/make.conf (for example) :

What value of the likes of MAKE_JOBS_NUMBER is
in use?


Note: PARALLEL_JOBS, ALLOW_MAKE_JOBS, and the
likes of MAKE_JOBS_NUMBER need to be judged against
the number of hardware threads in the system. The
3 load averages (over different time frames)
vs. the hardware thread count for the system are
relevant information.
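
As a made-up illustration of the arithmetic involved:

PARALLEL_JOBS=4        # builders running at once (hypothetical)
MAKE_JOBS_NUMBER=8     # make -j jobs per builder (hypothetical)
# => up to roughly 4 * 8 = 32 compile processes competing for
#    the hardware threads; on a 16-thread machine that tends to
#    show up as load averages well above 16.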


Note: with various examples of package builds that
use 25+ GiBytes of temporary file space, USE_TMPFS
can be highly relevant, as are the RAM size, the SWAP
size, and the resultant RAM+SWAP total. But just
the file I/O can be relevant, even if there is
no tmpfs use.
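
For example, a quick way to see the RAM, swap, and tmpfs
situation while a big port is building (observation commands,
not settings):

sysctl hw.physmem hw.usermem   # installed vs. non-kernel-wired RAM
swapinfo -m                    # swap devices: used vs. free (MiB)
df -m -t tmpfs                 # current tmpfs mounts and their use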


There are questions like: Spinning rust media
usage? (An over-specific but suggestive example
from the more general subject area.)


Does a serial console show the responsiveness problem?
A simple ssh session over local Ethernet? Only if
there is a GUI present, even if it is not being
actively used? Do only GUI interactions show a
responsiveness problem?


Going in another direction . . .

I'm no ZFS tuning expert but I had performance
problems that I described on the lists and the
person that had increased
vfs.zfs.per_txg_dirty_frees_percent had me try
setting it back to
vfs.zfs.per_txg_dirty_frees_percent=5 . In my
context, the change was very helpful --but, to
me, it was pure magic. My point is more that you
may need judgments from someone with appropriate
internal ZFS knowledge if you are to explore
tuning ZFS. I've no evidence that the specific
setting would be helpful.
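
For completeness, the figure in question is a live sysctl, so
trying it (only on appropriate advice for your workload) is as
simple as:

# The value I was pointed back to; not a general recommendation:
sysctl vfs.zfs.per_txg_dirty_frees_percent=5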

There has been an effort to deal with arc_prune
problems/overhead. See:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

===
Mark Millard
marklmi at yahoo.com



Eugene Grosbein
2024-02-27 20:03:11 UTC
27.02.2024 7:24, Edward Sanford Sutton, III wrote:

[skip]
More recently I looked and see top (showing threads + system processes) shows
I have one core getting 100% CPU for kernel{arc_prune}, which has 21.2 hours over a 2 hour 23 minute uptime.
Against this bug, try these two:

sysctl vfs.zfs.arc.meta_strategy=0
sysctl vfs.zfs.arc.meta_limit_percent=25

Note that ZFS changed a lot in 14.0 and no longer has the sysctl vfs.zfs.arc.meta_strategy;
it also did not exist in FreeBSD 12.x.
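
If they help, the same pair can be made persistent across reboots
on 13.x via /etc/sysctl.conf (shown only as the obvious follow-up,
with the same values as above):

vfs.zfs.arc.meta_strategy=0
vfs.zfs.arc.meta_limit_percent=25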

Eugene Grosbein



Mark Millard
2024-02-29 16:02:42 UTC
Peter 'PMc' Much <pmc_at_citylink.dinoex.sub.org> wrote on
More recently I looked and see top (showing threads + system processes)
shows I have one core getting 100% CPU for kernel{arc_prune}, which has
21.2 hours over a 2 hour 23 minute uptime.
Ack.
I started looking to see if
https://www.freebsd.org/security/advisories/FreeBSD-EN-23:18.openzfs.asc
was available as a fix for 13 but it is not (and doesn't quite sound
like it was supposed to apply to this issue). Would a kernel thread time
at 100% cpu for only 1 core explain the system becoming unusually
unresponsive?
That depends. This arc_prune issue does usually go alongside with some
other kernel thread (vm-whatever) also blocking, so you have two cores
busy. How many remain?
There is an updated patch in the PR 275594 (5 pieces), that works for
13.3; I have it installed, and only with that I am able to build gcc12
- otherwise the system would just OOM-crash (vm.pageout_oom_seq=5120
does not help with this).
The kernel has multiple, distinct OOM messages. Which type are you
seeing? :

"failed to reclaim memory"
"a thread waited too long to allocate a page"
"swblk or swpctrie zone exhausted"
"unknown OOM reason %d"

Also, but only for boot verbose:

"proc %d (%s) failed to alloc page on fault, starting OOM\n"



vm.pageout_oom_seq is specific to delaying just:
"failed to reclaim memory"


===
Mark Millard
marklmi at yahoo.com



Mark Millard
2024-02-29 16:06:40 UTC
[I had grabbed locally modified text for one of those messages.]
Post by Mark Millard
Peter 'PMc' Much <pmc_at_citylink.dinoex.sub.org> wrote on
More recently I looked and see top (showing threads + system processes)
shows I have one core getting 100% CPU for kernel{arc_prune}, which has
21.2 hours over a 2 hour 23 minute uptime.
Ack.
I started looking to see if
https://www.freebsd.org/security/advisories/FreeBSD-EN-23:18.openzfs.asc
was available as a fix for 13 but it is not (and doesn't quite sound
like it was supposed to apply to this issue). Would a kernel thread time
at 100% cpu for only 1 core explain the system becoming unusually
unresponsive?
That depends. This arc_prune issue does usually go alongside with some
other kernel thread (vm-whatever) also blocking, so you have two cores
busy. How many remain?
There is an updated patch in the PR 275594 (5 pieces), that works for
13.3; I have it installed, and only with that I am able to build gcc12
- otherwise the system would just OOM-crash (vm.pageout_oom_seq=5120
does not help with this).
The kernel has multiple, distinct OOM messages. Which type are you
"failed to reclaim memory"
"a thread waited too long to allocate a page"
"swblk or swpctrie zone exhausted"
Should have been:

"out of swap space"
Post by Mark Millard
"unknown OOM reason %d"
"proc %d (%s) failed to alloc page on fault, starting OOM\n"
"failed to reclaim memory"
===
Mark Millard
marklmi at yahoo.com



Peter
2024-02-29 16:21:21 UTC
On Thu, Feb 29, 2024 at 08:02:42AM -0800, Mark Millard wrote:
! Peter 'PMc' Much <pmc_at_citylink.dinoex.sub.org>wrote on
! Date: Thu, 29 Feb 2024 13:40:05 UTC :
!
! > There is an updated patch in the PR 275594 (5 pieces), that works for
! > 13.3; I have it installed, and only with that I am able to build gcc12
! > - otherwise the system would just OOM-crash (vm.pageout_oom_seq=5120
! > does not help with this).
!
! The kernel has multiple, distinct OOM messages. Which type are you
! seeing? :
!
! "a thread waited too long to allocate a page"

That one.


Mark Millard
2024-02-29 17:40:39 UTC
Post by Peter
! Peter 'PMc' Much <pmc_at_citylink.dinoex.sub.org>wrote on
!
! > There is an updated patch in the PR 275594 (5 pieces), that works for
! > 13.3; I have it installed, and only with that I am able to build gcc12
! > - otherwise the system would just OOM-crash (vm.pageout_oom_seq=5120
! > does not help with this).
!
! The kernel has multiple, distinct OOM messages. Which type are you
!
! "a thread waited too long to allocate a page"
That one.
That explains why vm.pageout_oom_seq=5120 did not make a
notable difference in the time frame.

If you cause a verbose boot, the code:

if (bootverbose)
printf(
"proc %d (%s) failed to alloc page on fault, starting OOM\n",
curproc->p_pid, curproc->p_comm);

will likely report which process failed to get a
page in a timely manner.
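
A verbose boot can be had one time from the loader menu or by
typing "boot -v" at the loader prompt, or on every boot via
/boot/loader.conf:

# Boot verbose so bootverbose-gated messages like the above appear:
boot_verbose="YES"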

There is also control over the criteria for this, but it
is more complicated. In /boot/loader.conf (I'm using the
defaults):

#
# For plenty of swap/paging space (will not
# run out), avoid pageout delays leading to
# Out Of Memory killing of processes:
#vm.pfault_oom_attempts=-1
#
# For possibly insufficient swap/paging space
# (might run out), increase the pageout delay
# that leads to Out Of Memory killing of
# processes (showing defaults at the time):
#vm.pfault_oom_attempts= 3
#vm.pfault_oom_wait= 10
# (The multiplication is the total but there
# are other potential tradeoffs in the factors
# multiplied, even for nearly the same total.)

If you can be sure of not running out of swap/paging
space, you might try vm.pfault_oom_attempts=-1 .
If you do run out of swap/paging space, it would
deadlock, as I understand it. So, if you can tolerate
that, the -1 might be an option even if you do run
out of swap/paging space.

I do not have specific suggestions for alternatives
to 3 and 10. It would be exploratory for me if I had
to try such.
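
To make the arithmetic concrete (figures purely illustrative,
not a suggestion): the defaults allow roughly 3 * 10 = 30 seconds
of retrying before OOM handling starts, while

#vm.pfault_oom_attempts=6
#vm.pfault_oom_wait=5

would give about the same 6 * 5 = 30 second total, but with more
frequent re-checks for free pages.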

For reference:

# sysctl -Td vm.pfault_oom_attempts vm.pfault_oom_wait
vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling
vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler


===
Mark Millard
marklmi at yahoo.com



Mark Millard
2024-02-29 18:18:28 UTC
Post by Mark Millard
Post by Peter
! Peter 'PMc' Much <pmc_at_citylink.dinoex.sub.org>wrote on
!
! > There is an updated patch in the PR 275594 (5 pieces), that works for
! > 13.3; I have it installed, and only with that I am able to build gcc12
! > - otherwise the system would just OOM-crash (vm.pageout_oom_seq=5120
! > does not help with this).
!
! The kernel has multiple, distinct OOM messages. Which type are you
!
! "a thread waited too long to allocate a page"
That one.
That explains why vm.pageout_oom_seq=5120 did not make a
notable difference in the time frame.
if (bootverbose)
printf(
"proc %d (%s) failed to alloc page on fault, starting OOM\n",
curproc->p_pid, curproc->p_comm);
will likely report which process failed to get a
page in a timely manner.
There is also control over the criteria for this, but it
is more complicated. In /boot/loader.conf (I'm using
#
# For plenty of swap/paging space (will not
# run out), avoid pageout delays leading to
#vm.pfault_oom_attempts=-1
#
# For possibly insufficient swap/paging space
# (might run out), increase the pageout delay
# that leads to Out Of Memory killing of
#vm.pfault_oom_attempts= 3
#vm.pfault_oom_wait= 10
# (The multiplication is the total but there
# are other potential tradeoffs in the factors
# multiplied, even for nearly the same total.)
If you can be sure of not running out of swap/paging
space, you might try vm.pfault_oom_attempts=-1 .
If you do run out of swap/paging space, it would
deadlock, as I understand it. So, if you can tolerate
that, the -1 might be an option even if you do run
out of swap/paging space.
I do not have specific suggestions for alternatives
to 3 and 10. It would be exploratory for me if I had
to try such.
# sysctl -Td vm.pfault_oom_attempts vm.pfault_oom_wait
vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling
vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler
I'll note that vm.pageout_oom_seq , vm.pfault_oom_attempts , and
vm.pfault_oom_wait are all live writable, not just boot-time
tunables. In other words, all show a line of output in:

# sysctl -Wd vm.pageout_oom_seq vm.pfault_oom_attempts vm.pfault_oom_wait
vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling
vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler

Not just in:

# sysctl -Td vm.pageout_oom_seq vm.pfault_oom_attempts vm.pfault_oom_wait
vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling
vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler

(To see the values, do not use the "d".)
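
So, for example, a value discussed earlier in this thread can be
applied on a running system (shown only as an illustration of the
mechanism, not as a recommendation):

sysctl vm.pageout_oom_seq=120
# The same assignment, without the leading "sysctl ", also works
# as a line in /etc/sysctl.conf or /boot/loader.conf.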


===
Mark Millard
marklmi at yahoo.com



Peter
2024-02-29 18:56:10 UTC
First off: the honorable Sir Edward, III had the decency to have me
notified that they prefer to censor my messages (reasons were not given).

I for my part consider it rather pointless to publicly ask questions,
only to then inform those who bother to answer that one declines to
receive their answers.

So much for that. Now for the call for popcorn:

On Thu, Feb 29, 2024 at 09:40:39AM -0800, Mark Millard wrote:
! > ! The kernel has multiple, distinct OOM messages. Which type are you
! > ! seeing? :
! > !
! > ! "a thread waited too long to allocate a page"
! >
! > That one.
!
! That explains why vm.pageout_oom_seq=5120 did not make a
! notable difference in the time frame.

Good. Glad it explains something.

! If you cause a verbose boot the code:
!
! if (bootverbose)
! printf(
! "proc %d (%s) failed to alloc page on fault, starting OOM\n",
! curproc->p_pid, curproc->p_comm);
!
! likely will report what process had failed to get a
! page in a timely manor.

These are ad-hoc bhyve guests which are only created for the purpose of
compiling some ports. So there is zero interest in /which/ process
fails, because any failing process will just fail the build.
The essential point is rather: the very same sizing of resources works
when booting a 13.2, and crashes reproducibly with 13.3.

! # run out), avoid pageout delays leading to
! # Out Of Memory killing of processes:
! #vm.pfault_oom_attempts=-1

Yes, I already got that far, and that doesn't help: if the system is
neither allowed to OOM-kill nor to crash, it freezes and waits for the
reset button.

As this is an endless loop in the kernel, it is not resource
exhaustion, but rather the inability of the kernel to adjust resources
accordingly due to being busy with other things (i.e. running an
endless loop).

But then, discussion about this is futile, because there exists a
patch that evidently anticipates and nicely fixes the mentioned behaviour.

