Assistance Needed for Kernel-Mode Driver Soft Lockup Issue

Sat Oct 19 15:09:47 EDT 2024

Dear Linux Kernel Developers,

I am encountering a soft lockup issue in my system related to the
continuous while loop in the empty_rx_fifo() function. Below is the
relevant code:

#include <linux/io.h>    // For readw()
#include <linux/types.h> // For u16, u32

#define FIFO_STATUS   0x0014
#define FIFO_MAN_READ 0x0015
#define RX_FIFO_EMPTY 0x01 // Assuming RX_FIFO_EMPTY is defined as 0x01

static inline u16 read16_shifted(void __iomem *addr, u32 offset)
{
    // Left shift the offset by 1 and add it to the base address
    void __iomem *target_addr = addr + (offset << 1);

    // Read the 16-bit value from the calculated address
    return readw(target_addr);
}

void empty_rx_fifo(void __iomem *addr)
{
    // Keep reading from the FIFO until the status register reports it empty
    while (!(read16_shifted(addr, FIFO_STATUS) & RX_FIFO_EMPTY))
        read16_shifted(addr, FIFO_MAN_READ);
}

Explanation:
- read16_shifted() reads a 16-bit register at a shifted offset: it
  shifts the offset left by 1 (offset << 1), adds the result to the base
  address, and reads the 16-bit value at the resulting address.

The empty_rx_fifo() function is meant to drain the RX FIFO by reading
FIFO_MAN_READ until FIFO_STATUS reports RX_FIFO_EMPTY, but it triggers
soft lockups: the kernel log shows repeated soft lockup messages roughly
28 seconds apart (based on the log timestamps). Here is an example:

watchdog: BUG: soft lockup - CPU#0 stuck for 23s!

In all cases, the RIP points to:
RIP: 0010:read16_shifted+0x11/0x20

Analysis:
The soft lockup appears to be caused by the while loop in
empty_rx_fifo(), which busy-waits on the status register without ever
yielding the CPU. The RX FIFO can take a long time to empty, sometimes
up to 1000 seconds. From the first soft lockup report onward, the
message repeats roughly every 28 seconds for the whole 1000 seconds.
Once the FIFO is empty, the system resumes normal operation.

Questions:
1. What is the best way to handle this? Even if the hardware takes a
   long time to drain, I would like advice on how to prevent these
   lockups.
2. Is it normal for a soft lockup to recover on its own like this?
   Should I treat it as a serious problem, or can it be ignored?

I would appreciate any guidance on how to resolve or mitigate this problem.

-- 
Thanks,
Sekhar