Help Needed: Debugging Memory Corruption results in GPF

Thu Dec 19 14:33:31 EST 2024

Can you run this in a KVM guest?

My go-to is virtme-ng, where I can run my hacks on my laptop,
in its own VM, on a copy of my whole system,
with the tools I'm familiar with.

Then you can attach gdb to the VM.

Then I'd try a watchpoint on the ea buffer, so that gdb stops on
whatever writes to it while you sit in wait_for_completion_timeout().
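
A watchpoint is easier to place if you first know which byte of ea
changes. Here is a rough userspace sketch of a snapshot-and-compare
helper. The ea_snapshot_* names and EA_LEN are made up for illustration
and are not part of the driver; I am assuming 6 bytes only because
msm_set_laddr() reads ea[0] through ea[5].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: msm_set_laddr() indexes ea[0]..ea[5], so track 6 bytes. */
#define EA_LEN 6

/* Hypothetical helper type: remembers where the buffer lives and what
 * it contained when the snapshot was taken. */
struct ea_snapshot {
    const unsigned char *addr;
    unsigned char copy[EA_LEN];
};

/* Record the current contents of the buffer. */
static void ea_snapshot_take(struct ea_snapshot *s, const unsigned char *ea)
{
    assert(s != NULL && ea != NULL);
    s->addr = ea;
    memcpy(s->copy, ea, EA_LEN);
}

/* Return the index of the first byte that differs from the snapshot,
 * or -1 if the buffer is unchanged. */
static int ea_snapshot_first_diff(const struct ea_snapshot *s)
{
    assert(s != NULL && s->addr != NULL);
    for (int i = 0; i < EA_LEN; i++) {
        if (s->addr[i] != s->copy[i])
            return i;
    }
    return -1;
}
```

In the driver you would do the equivalent with memcpy() before the wait
and a compare plus pr_info() after it, logging the address of the first
changed byte. That address is what you would then hand to the gdb
watchpoint.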

On Fri, Nov 15, 2024 at 11:19 AM Muni Sekhar <munisekharrms at gmail.com> wrote:
>
> Hi all,
>
> I am encountering a memory corruption issue in the function
> msm_set_laddr() from the Slimbus MSM Controller driver source code.
> https://android.googlesource.com/kernel/msm/+/refs/heads/android-msm-sunfish-4.14-android12/drivers/slimbus/slim-msm-ctrl.c
>
> In msm_set_laddr(), one of the arguments is ea (enumeration address),
> which is a pointer to constant data. While testing, I observed strange
> behavior:
>
> The contents of the ea buffer get corrupted during a timeout scenario
> in the call to:
>
> timeout = wait_for_completion_timeout(&done, HZ);
>
> Specifically, the ea buffer's contents differ before and after the
> wait_for_completion_timeout() call, even though it's declared as a
> pointer to constant data (const u8 *ea).
> To debug this issue, I enabled KASAN, but it didn't reveal any memory
> corruption. After the buffer corruption, random memory allocations in
> other parts of the kernel occasionally result in a GPF crash.
>
> Here is the relevant part of the code:
>
> static int msm_set_laddr(struct slim_controller *ctrl, const u8 *ea,
>                          u8 elen, u8 laddr)
> {
>     struct msm_slim_ctrl *dev = slim_get_ctrldata(ctrl);
>     struct completion done;
>     int timeout, ret, retries = 0;
>     u32 *buf;
> retry_laddr:
>     init_completion(&done);
>     mutex_lock(&dev->tx_lock);
>     buf = msm_get_msg_buf(dev, 9, &done);
>     if (buf == NULL)
>         return -ENOMEM;
>     buf[0] = SLIM_MSG_ASM_FIRST_WORD(9, SLIM_MSG_MT_CORE,
>                                      SLIM_MSG_MC_ASSIGN_LOGICAL_ADDRESS,
>                                      SLIM_MSG_DEST_LOGICALADDR,
>                                      ea[5] | ea[4] << 8);
>     buf[1] = ea[3] | (ea[2] << 8) | (ea[1] << 16) | (ea[0] << 24);
>     buf[2] = laddr;
>     ret = msm_send_msg_buf(dev, buf, 9, MGR_TX_MSG);
>     timeout = wait_for_completion_timeout(&done, HZ);
>     if (!timeout)
>         dev->err = -ETIMEDOUT;
>     if (dev->err) {
>         ret = dev->err;
>         dev->err = 0;
>     }
>     mutex_unlock(&dev->tx_lock);
>     if (ret) {
>         pr_err("set LADDR:0x%x failed:ret:%d, retrying", laddr, ret);
>         if (retries < INIT_MX_RETRIES) {
>             msm_slim_wait_retry(dev);
>             retries++;
>             goto retry_laddr;
>         } else {
>             pr_err("set LADDR failed after retrying:ret:%d", ret);
>         }
>     }
>     return ret;
> }
>
> What I've Tried:
> KASAN: Enabled it but couldn't identify the source of the corruption.
> Debugging Logs: Added logs to print the ea contents before and after
> the wait_for_completion_timeout() call. The logs show a mismatch in
> the data.
>
> Question:
> How can I efficiently trace the source of the memory corruption in
> this scenario?
> Could wait_for_completion_timeout() or a related function cause
> unintended side effects?
> Are there additional tools or techniques (e.g., dynamic debugging or
> specific kernel config options) that can help identify this
> corruption?
> Any insights or suggestions would be greatly appreciated!
>
>
>
> --
> Thanks,
> Sekhar