Help Needed: Debugging Memory Corruption results GPF
jim.cromie at gmail.com
Thu Dec 19 14:33:31 EST 2024
Can you run this in a KVM guest?
My go-to is virtme-ng, which lets me run my hacks on my laptop
in their own VM, on a copy of my whole system,
with the tools I'm familiar with.
Then you can attach gdb to the VM,
and I'd try a watchpoint on the memory.
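
Here is a rough sketch of the gdb side. It assumes the guest was started with QEMU's gdb stub listening on TCP port 1234 (QEMU's -s option), that vmlinux was built with debug info, and that the guest booted with nokaslr so addresses match the symbols. How the stub gets enabled depends on how you launch the VM. The file name watch_ea.gdb is just a placeholder. The script only writes the gdb command file; it does not start gdb.

```shell
#!/bin/sh
# Write a gdb command file that stops in msm_set_laddr() and then
# watches the start of the ea buffer for writes.
# Assumptions: gdb stub on localhost:1234, vmlinux with debug info, nokaslr.
cat > watch_ea.gdb <<'EOF'
target remote :1234
# hardware breakpoint on the function from the original post
hbreak msm_set_laddr
continue
# stopped at function entry: watch the byte ea points to, not the variable ea
watch -location ea[0]
continue
# a write to the watched byte stops here; show who did it
bt
EOF

echo "then run: gdb ./vmlinux -x watch_ea.gdb"
```

This watches only the first byte of ea. More watchpoints can be added for the other bytes, up to however many hardware debug registers the CPU provides.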
On Fri, Nov 15, 2024 at 11:19 AM Muni Sekhar <munisekharrms at gmail.com> wrote:
>
> Hi all,
>
> I am encountering a memory corruption issue in the function
> msm_set_laddr() from the Slimbus MSM Controller driver source code.
> https://android.googlesource.com/kernel/msm/+/refs/heads/android-msm-sunfish-4.14-android12/drivers/slimbus/slim-msm-ctrl.c
>
> In msm_set_laddr(), one of the arguments is ea (enumeration address),
> which is a pointer to constant data. While testing, I observed strange
> behavior:
>
> The contents of the ea buffer get corrupted during a timeout scenario
> in the call to:
>
> timeout = wait_for_completion_timeout(&done, HZ);
>
> Specifically, the ea buffer's contents differ before and after the
> wait_for_completion_timeout() call, even though it's declared as a
> pointer to constant data (const u8 *ea).
> To debug this issue, I enabled KASAN, but it didn't reveal any memory
> corruption. After the buffer corruption, random memory allocations in
> other parts of the kernel occasionally result in a GPF crash.
>
> Here is the relevant part of the code:
>
> static int msm_set_laddr(struct slim_controller *ctrl, const u8 *ea,
> 				u8 elen, u8 laddr)
> {
> 	struct msm_slim_ctrl *dev = slim_get_ctrldata(ctrl);
> 	struct completion done;
> 	int timeout, ret, retries = 0;
> 	u32 *buf;
> retry_laddr:
> 	init_completion(&done);
> 	mutex_lock(&dev->tx_lock);
> 	buf = msm_get_msg_buf(dev, 9, &done);
> 	if (buf == NULL)
> 		return -ENOMEM;
> 	buf[0] = SLIM_MSG_ASM_FIRST_WORD(9, SLIM_MSG_MT_CORE,
> 					SLIM_MSG_MC_ASSIGN_LOGICAL_ADDRESS,
> 					SLIM_MSG_DEST_LOGICALADDR,
> 					ea[5] | ea[4] << 8);
> 	buf[1] = ea[3] | (ea[2] << 8) | (ea[1] << 16) | (ea[0] << 24);
> 	buf[2] = laddr;
> 	ret = msm_send_msg_buf(dev, buf, 9, MGR_TX_MSG);
> 	timeout = wait_for_completion_timeout(&done, HZ);
> 	if (!timeout)
> 		dev->err = -ETIMEDOUT;
> 	if (dev->err) {
> 		ret = dev->err;
> 		dev->err = 0;
> 	}
> 	mutex_unlock(&dev->tx_lock);
> 	if (ret) {
> 		pr_err("set LADDR:0x%x failed:ret:%d, retrying", laddr, ret);
> 		if (retries < INIT_MX_RETRIES) {
> 			msm_slim_wait_retry(dev);
> 			retries++;
> 			goto retry_laddr;
> 		} else {
> 			pr_err("set LADDR failed after retrying:ret:%d", ret);
> 		}
> 	}
> 	return ret;
> }
>
> What I've Tried:
> KASAN: Enabled it but couldn't identify the source of the corruption.
> Debugging Logs: Added logs to print the ea contents before and after
> the wait_for_completion_timeout() call. The logs show a mismatch in
> the data.
>
> Question:
> How can I efficiently trace the source of the memory corruption in
> this scenario?
> Could wait_for_completion_timeout() or a related function cause
> unintended side effects?
> Are there additional tools or techniques (e.g., dynamic debugging or
> specific kernel config options) that can help identify this
> corruption?
> Any insights or suggestions would be greatly appreciated!
>
>
>
> --
> Thanks,
> Sekhar