From munisekharrms at gmail.com Thu Feb 6 11:21:19 2025 From: munisekharrms at gmail.com (Muni Sekhar) Date: Thu, 6 Feb 2025 21:51:19 +0530 Subject: soundwire : LnK SoundWire Protocol Analyzer on Linux Message-ID: Hi all, I hope this message finds you well. I am planning to use the LnK SoundWire Protocol Analyzer on a Linux platform. However, I am uncertain whether the current Linux kernel includes the necessary SoundWire drivers to support this. I would appreciate any guidance on how to use the LnK SoundWire Protocol Analyzer with the Linux kernel, as well as information on the specific device drivers required. Could you please provide details on: The compatibility of the LnK SoundWire Protocol Analyzer with the Linux kernel. The steps to set up and use the analyzer on a Linux platform. The necessary device drivers and any additional configurations needed. Thank you for your assistance. -- Thanks, Sekhar From dan.oelke at teradyne.com Wed Feb 12 15:12:33 2025 From: dan.oelke at teradyne.com (Dan Oelke) Date: Wed, 12 Feb 2025 20:12:33 +0000 Subject: SCSI Generic with NUMA setup Message-ID: I am attempting to modify the kernel to get it to work for my somewhat weird system. I have a flash device and that device is attached to a separate bank of RAM that the data for a SCSI Read/Write will be directed to. This is a UFS flash device and I have the UFS device driver working at a basic level at least. Now I am attempting to do a read or write SCSI command. I am doing this using ioctl(fd, SG_IO, sg_io_hdr). I am familiar with SCSI so I have the CDB all set, no problem. In that io_hdr I want to set the dxferp to an address in that separate bank of memory. This would be an address that is from the point of view of the UFS/HCI controller and not an address that is relevant to the CPU. I've been reading through the scsi_lib.c and sg_ioctl code as well as the ufshcd_queuecommand.
The problem is that this all centers on going through the block interface and setting up sglist there. However, I don't want those sglists. The application is managing that bank of memory and has all the knowledge of what is in there. And it isn't virtual addresses from the kernel's point of view. Any pointers on where I should look for help or maybe where would be a better place to ask for help?? This is for an embedded application so I can make things work a bit different from a general purpose system if needed. As for me: I am relatively a newbie to Linux kernel code but quite experienced in C & embedded. Thank you, Dan ________________________________ Please note that this message may contain confidential information. If you have received this message by mistake, please inform the sender of the mistake, then delete the message from your system without making, distributing or retaining any copies of it. Although we believe that the message and any attachments are free from viruses and other errors that might affect the computer or IT system where it is received and read, the recipient opens the message at his or her own risk. We assume no responsibility for any loss or damage arising from the receipt or use of this message. When we process personal data relating to physical persons, such processing will meet the requirements of applicable data protection legislation and will be in accordance with our Privacy Policy. From greg at kroah.com Thu Feb 13 05:04:29 2025 From: greg at kroah.com (Greg KH) Date: Thu, 13 Feb 2025 11:04:29 +0100 Subject: SCSI Generic with NUMA setup In-Reply-To: References: Message-ID: <2025021318-unsaid-floss-5c80@gregkh> On Wed, Feb 12, 2025 at 08:12:33PM +0000, Dan Oelke wrote: > Please note that this message may contain confidential information.
If you have received this message by mistake, please inform the sender of the mistake, then delete the message from your system without making, distributing or retaining any copies of it. Although we believe that the message and any attachments are free from viruses and other errors that might affect the computer or IT system where it is received and read, the recipient opens the message at his or her own risk. We assume no responsibility for any loss or damage arising from the receipt or use of this message. When we process personal data relating to physical persons, such processing will meet the requirements of applicable data protection legislation and will be in accordance with our Privacy Policy. Now deleted. From tanure at linux.com Mon Feb 17 13:43:15 2025 From: tanure at linux.com (Lucas Tanure) Date: Mon, 17 Feb 2025 18:43:15 +0000 Subject: crypto: fscrypt: crypto_create_tfm_node memory leak Message-ID: Hi, I am working with Android 13 and V5.15 kernel. During our development, I found a memory leak using kmemleak. Steps I did to find the memleak: mount -t debugfs debugfs /sys/kernel/debug echo scan=5 > /sys/kernel/debug/kmemleak cat /sys/kernel/debug/kmemleak Stack I got (hundreds of them): unreferenced object 0xffffff8101d31000 (size 1024): comm "binder:1357_2", pid 1357, jiffies 4294899464 (age 394.468s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
backtrace: [] crypto_create_tfm_node+0x64/0x228 [] fscrypt_prepare_key+0xbc/0x230 [] fscrypt_setup_v1_file_key+0x48c/0x510 [] fscrypt_setup_encryption_info+0x210/0x43c [] fscrypt_prepare_new_inode+0x128/0x1a4 [] f2fs_new_inode+0x27c/0x89c [] f2fs_mkdir+0x78/0x278 [] vfs_mkdir+0x138/0x204 [] do_mkdirat+0x88/0x204 [] __arm64_sys_mkdirat+0x40/0x58 [] invoke_syscall+0x60/0x150 [] el0_svc_common+0xc8/0x114 [] do_el0_svc+0x28/0x98 [] el0_svc+0x28/0x90 [] el0t_64_sync_handler+0x88/0xec [] el0t_64_sync+0x1b8/0x1bc After checking upstream, I came up with the following: cff805b1518f fscrypt: fix keyring memory leak on mount failure But my kernel has this patch. So I continued to dig around this and saw the function fscrypt_prepare_key in fs/crypto/keysetup.c for V5.15. I can't see the pointer tfm being used anywhere or saved, and smp_store_release doesn't kfree it. Is smp_store_release doing something with that pointer that makes this memory leak a false positive? Any help with this issue would be much appreciated. Thanks Lucas Tanure From tanure at linux.com Mon Feb 17 14:08:25 2025 From: tanure at linux.com (Lucas Tanure) Date: Mon, 17 Feb 2025 19:08:25 +0000 Subject: crypto: fscrypt: crypto_create_tfm_node memory leak In-Reply-To: <20250217185000.GC1258@sol.localdomain> References: <20250217185000.GC1258@sol.localdomain> Message-ID: On Mon, Feb 17, 2025 at 6:50 PM Eric Biggers wrote: > > On Mon, Feb 17, 2025 at 06:43:15PM +0000, Lucas Tanure wrote: > > Hi, > > > > I am working with Android 13 and V5.15 kernel. During our development, > > I found a memory leak using kmemleak.
> > > > Steps I did to find the memleak: > > mount -t debugfs debugfs /sys/kernel/debug > > echo scan=5 > /sys/kernel/debug/kmemleak > > cat /sys/kernel/debug/kmemleak > > > > Stack I got (hundreds of them): > > unreferenced object 0xffffff8101d31000 (size 1024): > > comm "binder:1357_2", pid 1357, jiffies 4294899464 (age 394.468s) > > hex dump (first 32 bytes): > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > backtrace: > > [] crypto_create_tfm_node+0x64/0x228 > > [] fscrypt_prepare_key+0xbc/0x230 > > [] fscrypt_setup_v1_file_key+0x48c/0x510 > > [] fscrypt_setup_encryption_info+0x210/0x43c > > [] fscrypt_prepare_new_inode+0x128/0x1a4 > > [] f2fs_new_inode+0x27c/0x89c > > [] f2fs_mkdir+0x78/0x278 > > [] vfs_mkdir+0x138/0x204 > > [] do_mkdirat+0x88/0x204 > > [] __arm64_sys_mkdirat+0x40/0x58 > > [] invoke_syscall+0x60/0x150 > > [] el0_svc_common+0xc8/0x114 > > [] do_el0_svc+0x28/0x98 > > [] el0_svc+0x28/0x90 > > [] el0t_64_sync_handler+0x88/0xec > > [] el0t_64_sync+0x1b8/0x1bc > > > > After checking upstream, I came up with the following: > > cff805b1518f fscrypt: fix keyring memory leak on mount failure > > > > But my kernel has this patch. So I continued to dig around this and > > saw the function fscrypt_prepare_key in fs/crypto/keysetup.c for > > V5.15. > > I can't see the pointer tfm being used anywhere or saved, and > > smp_store_release doesn't kfree it. > > Is smp_store_release doing something with that pointer that makes this > > memory leak a false positive? > > > > Any help with this issue would be much appreciated. > > Thanks > > The pointer to the crypto_skcipher 'tfm' is stored in the fscrypt_inode_info > (previously fscrypt_info) which is stored in inode::i_crypt_info. It gets freed > when the inode is evicted. I don't know why you're getting a kmemleak warning. > Perhaps f2fs in that version of the kernel has a bug that is leaking inodes. > Thanks. 
How do you check for leaking inodes? Do you have any start point (function) to look at? > smp_store_release is just a fancy way of doing a store that includes a memory > barrier. > > - Eric Lucas From helgaas at kernel.org Wed Feb 19 12:06:40 2025 From: helgaas at kernel.org (Bjorn Helgaas) Date: Wed, 19 Feb 2025 11:06:40 -0600 Subject: PCI: hotplug_event: PCIe PLDA Device BAR Reset In-Reply-To: Message-ID: <20250219170640.GA219612@bhelgaas> [+cc linux-acpi] On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote: > Hi all, > > I am writing to seek assistance with an issue we are experiencing with > a PCIe device (PLDA Device 5555) connected through PCI Express Root > Port 1 to the host bridge. > > We have observed that after booting the system, the Base Address > Register (BAR0) memory of this device gets reset to 0x0 after > approximately one hour or more (the timing is inconsistent). This was > verified using the lspci output and the setpci -s 01:00.0 > BASE_ADDRESS_0 command. > > To diagnose the issue, we checked the dmesg log, but it did not > provide any relevant information. I then enabled dynamic debugging for > the PCI subsystem (drivers/pci/*) and noticed the following messages > related ACPI hotplug in the dmesg log: > > [ 0.465144] pci 0000:01:00.0: reg 0x10: [mem 0xb0400000-0xb07fffff] > ... 
> [ 6710.000355] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event() > [ 7916.250868] perf: interrupt took too long (4072 > 3601), lowering > kernel.perf_event_max_sample_rate to 49000 > [ 7984.719647] perf: interrupt took too long (5378 > 5090), lowering > kernel.perf_event_max_sample_rate to 37000 > [11051.409115] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event() > [11755.388727] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event() > [12223.885715] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event() > [14303.465636] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event() > After these messages appear, reading the device BAR memory results in > 0x0 instead of the expected value. > > I would like to understand the following: > > 1. What could be causing these hotplug_event debug messages? This is an ACPI Notify event. Basically the platform is telling us to re-enumerate the hierarchy below RP01 because a device might have been added or removed. Unfortunately the only real information we get is the ACPI device (RP01) and the notification value (ACPI_NOTIFY_BUS_CHECK). You could instrument acpiphp_check_bridge() to see what path we take. The main paths look like enable_slot() or disable_slot(), but those both include a pr_debug() that you apparently don't see. A remove followed by add would definitely reset the device, including its BARs. But you would normally see some messages related to enumerating a new device. If this doesn't help, try to reproduce the problem with a recent kernel, e.g., v6.13, and post the complete dmesg log. > 2. Why does this result in the BAR memory being reset? > 3. How can we resolve this issue? > > I have verified that the issue occurs even without loading the driver > for the PLDA Device 5555, so it does not appear to be related to the > device driver. > > Any help or guidance on debugging this issue would be greatly appreciated. > > Thank you for your assistance.
> > Best regards, > Naveen From helgaas at kernel.org Mon Feb 24 12:33:17 2025 From: helgaas at kernel.org (Bjorn Helgaas) Date: Mon, 24 Feb 2025 11:33:17 -0600 Subject: PCI: hotplug_event: PCIe PLDA Device BAR Reset In-Reply-To: Message-ID: <20250224173317.GA466030@bhelgaas> On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote: > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas wrote: > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote: > > > Hi all, > > > > > > I am writing to seek assistance with an issue we are experiencing with > > > a PCIe device (PLDA Device 5555) connected through PCI Express Root > > > Port 1 to the host bridge. > > > > > > We have observed that after booting the system, the Base Address > > > Register (BAR0) memory of this device gets reset to 0x0 after > > > approximately one hour or more (the timing is inconsistent). This was > > > verified using the lspci output and the setpci -s 01:00.0 > > > BASE_ADDRESS_0 command. > > > > > > To diagnose the issue, we checked the dmesg log, but it did not > > > provide any relevant information. I then enabled dynamic debugging for > > > the PCI subsystem (drivers/pci/*) and noticed the following messages > > > related ACPI hotplug in the dmesg log: > > > > > > [ 0.465144] pci 0000:01:00.0: reg 0x10: [mem 0xb0400000-0xb07fffff] > > > ...
> > > [ 6710.000355] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event() > > > [ 7916.250868] perf: interrupt took too long (4072 > 3601), lowering > > > kernel.perf_event_max_sample_rate to 49000 > > > [ 7984.719647] perf: interrupt took too long (5378 > 5090), lowering > > > kernel.perf_event_max_sample_rate to 37000 > > > [11051.409115] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event() > > > [11755.388727] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event() > > > [12223.885715] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event() > > > [14303.465636] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event() > > > After these messages appear, reading the device BAR memory results in > > > 0x0 instead of the expected value. > > > > > > I would like to understand the following: > > > > > > 1. What could be causing these hotplug_event debug messages? > > > > This is an ACPI Notify event. Basically the platform is telling us to > > re-enumerate the hierarchy below RP01 because a device might have been > > added or removed. > > Thank you for your response regarding the PCI BAR reset issue we are > experiencing with the PLDA Device 5555. I have a few follow-up > questions and additional information to share. > > 1. Clarification on "Platform": > > Does the term "platform" refer to the BIOS/ACPI subsystem in this context? Yes, "platform" refers to the BIOS/ACPI subsystem. > Can the platform signal to re-enumerate the hierarchy below RP01 > without an actual device being removed or added? In our case, the PCI > PLDA device is neither physically removed nor connected to the bus on > the fly. Yes, I think a Bus Check notification is just a request for the OS to re-enumerate starting at the point in the device tree where it is notified. It's possible that no add or remove has occurred. 
ACPI r6.5, sec 5.6.6, includes the example of hardware that can't detect device changes during a system sleep state, so it issues a Bus Check on wake. > 2. System Configuration: > > We are currently using an x86_64 system with Ubuntu 20.04.6 LTS > (kernel version: 5.4.0-148-generic). > I have enabled dynamic debug logs for all files in the PCI and ACPI > subsystems and rebooted the system with the following parameters: > $ cat /proc/cmdline > BOOT_IMAGE=/vmlinuz-5.4.0-148-generic root=/dev/mapper/vg00-rootvol ro > quiet libata.force=noncq pci=nomsi pcie_aspm=off pcie_ports=on > "dyndbg=file drivers/pci/* +p; file drivers/acpi/* +p" > > > 3. Observations: > > After rebooting with more debug logs, I noticed the issue after 1 day, > 11:48 hours. > A snippet of the dmesg log is mentioned below (complete dmesg log is > attached to this email): > > [128845.248503] ACPI: GPE event 0x01 > [128845.356866] ACPI: \_SB_.PCI0.RP01: ACPI_NOTIFY_BUS_CHECK event > [128845.357343] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in > hotplug_event() If you could add more debug in hotplug_event() and the things it calls, we might get more clues about what's happening. > 4. BAR Reset Issue: > > I filtered the lspci output to show the contents of the configuration > space starting at offset 0x10 for getting BASE_ADDRESS_0 by running > sudo lspci -xxx -s 01:00.0 | grep "10:". > Prior to the BAR reset issue, the lspci output was: > $ sudo lspci -xxx -s 01:00.0 | grep "10:" > 10: 00 00 40 b0 00 00 00 00 00 00 00 00 00 00 00 00 > > During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially > showed all FF's, and then the next run of the same command showed > BASE_ADDRESS_0 reset to zero: > $ sudo lspci -xxx -s 01:00.0 | grep "10:" > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Looks like the device isn't responding at all here. Could happen if the device is reset or powered down. What is this device? What driver is bound to it? 
I don't see anything in dmesg that identifies a driver. > $ sudo lspci -xxx -s 01:00.0 | grep "10:" > 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > I am not sure why lspci initially showed all FF's and then the next > run showed BAR0 reset. > Complete sudo lspci -xxx -s 01:00.0 output is captured in the attached > dmesg_log_pci_bar_reset.txt file. > > /sys/firmware/acpi/interrupts/gpe01: 1 EN enabled unmasked > /sys/firmware/acpi/interrupts/gpe02: 1 EN enabled unmasked > > > 5. Debugging Steps: > > Instrumenting acpiphp_check_bridge() will indicate whether we are > enabling or disabling a slot (enable_slot() or disable_slot()). Based > on the dmesg log, there is only one ACPI_NOTIFY_BUS_CHECK event, and > it is most likely for disable_slot(). However, does instrumenting > acpiphp_check_bridge() will explain why this is happening without > actually removing the PCI PLDA device? No, it won't explain that. But if there was no add/remove event, re-enumeration should be harmless. The objective of instrumentation would be to figure out why it isn't harmless in this case. > 6. Reproduction and Additional Information: > > We do not see any clear pattern or procedure to reproduce this issue. > Once the issue occurs, rebooting the machine resolves it, but it > reoccurs after an unpredictable time. > We have another identical hardware setup with an older kernel (Ubuntu > 16.04.4 LTS, kernel version: 4.4.0-66-generic), and this issue has not > been observed so far on that machine. > Any additional pointers or suggestions on how to proceed to the root > cause of this issue would be greatly appreciated. You're seeing the problem on v5.4 (Nov 2019), which is much newer than v4.4 (Jan 2016). But v5.4 is still really too old to spend a lot of time on unless the problem still happens on a current kernel. 
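The three `lspci -xxx` rows quoted in this thread (BAR0 assigned, all FF's, all zeros) can be told apart mechanically when logging the device state over a long soak test. The following is only a minimal sketch: the function names are hypothetical, and it assumes the row format shown above (an offset, a colon, then sixteen hex bytes), with BAR0 held in the first four bytes of the row at offset 0x10 in little-endian order.

```python
def parse_config_row(row: str) -> bytes:
    """Parse one `lspci -xxx` row such as '10: 00 00 40 b0 ...' into raw bytes."""
    _offset, _, hex_part = row.partition(":")
    return bytes(int(tok, 16) for tok in hex_part.split())


def classify_bar0(row: str) -> str:
    """Classify the row at config offset 0x10; BAR0 is its first four bytes."""
    data = parse_config_row(row)
    if len(data) < 4:
        raise ValueError("row too short to contain BAR0")
    if all(b == 0xFF for b in data):
        # All ones: the read was not answered by the device at all.
        return "no response (all ones)"
    bar0 = int.from_bytes(data[:4], "little")  # raw register value, flag bits included
    if bar0 == 0:
        return "BAR0 unassigned (0x00000000)"
    return f"BAR0 = 0x{bar0:08x}"
```

Fed the first row quoted above, this reports `BAR0 = 0xb0400000`, which matches the `reg 0x10: [mem 0xb0400000-0xb07fffff]` line from the boot log.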
Bjorn From helgaas at kernel.org Mon Feb 24 14:54:23 2025 From: helgaas at kernel.org (Bjorn Helgaas) Date: Mon, 24 Feb 2025 13:54:23 -0600 Subject: PCI: hotplug_event: PCIe PLDA Device BAR Reset In-Reply-To: Message-ID: <20250224195423.GA473540@bhelgaas> On Tue, Feb 25, 2025 at 12:29:00AM +0530, Naveen Kumar P wrote: > On Mon, Feb 24, 2025 at 11:03 PM Bjorn Helgaas wrote: > > On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote: > > > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas wrote: > > > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote: > > > > > Hi all, > > > > > > > > > > I am writing to seek assistance with an issue we are experiencing with > > > > > a PCIe device (PLDA Device 5555) connected through PCI Express Root > > > > > Port 1 to the host bridge. > > > > > > > > > > We have observed that after booting the system, the Base Address > > > > > Register (BAR0) memory of this device gets reset to 0x0 after > > > > > approximately one hour or more (the timing is inconsistent). This was > > > > > verified using the lspci output and the setpci -s 01:00.0 > > > > > BASE_ADDRESS_0 command. > ... > I booted with the pcie_aspm=off kernel parameter, which means that > PCIe Active State Power Management (ASPM) is disabled. Given this > context, should I consider removing this setting to see if it affects > the occurrence of the Bus Check notifications and the BAR0 reset > issue? Doesn't seem likely to be related. Once configured, ASPM operates without any software intervention. But note that "pcie_aspm=off" means the kernel doesn't touch ASPM configuration at all, and any configuration done by firmware remains in effect. You can tell whether ASPM has been enabled by firmware with "sudo lspci -vv" before the problem occurs.
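For anyone who wants to script that check, here is a rough sketch that pulls the ASPM field out of saved `lspci -vv` output. The helper name is hypothetical, and the regular expression assumes the `LnkCtl: ASPM ...;` layout that appears in the lspci output quoted later in this thread; other lspci versions may print the field differently.

```python
import re


def aspm_states(lspci_vv_text: str) -> list[str]:
    """Return the ASPM field of every 'LnkCtl:' line in `lspci -vv` output."""
    states = []
    for line in lspci_vv_text.splitlines():
        # 'LnkCtl2:' lines do not match because the colon must follow 'LnkCtl'.
        match = re.search(r"LnkCtl:\s*ASPM\s+([^;]+);", line)
        if match:
            states.append(match.group(1).strip())
    return states
```

A typical use would be to capture `sudo lspci -vv -s 01:00.0` to a file shortly after boot and pass its contents to `aspm_states()`; a result of `["Disabled"]` would indicate that firmware left ASPM off on that link.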
> > > During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially > > > showed all FF's, and then the next run of the same command showed > > > BASE_ADDRESS_0 reset to zero: > > > $ sudo lspci -xxx -s 01:00.0 | grep "10:" > > > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > > > > Looks like the device isn't responding at all here. Could happen if > > the device is reset or powered down. > > From the kernel driver or user space tools, is it possible to > determine whether the device has been reset or powered down? Are > there any power management settings or configurations that could be > causing the device to reset or power down unexpectedly? Not really. By "powered down", I meant D3cold, where the main power is removed. Config space is readable in all other power states. > > What is this device? What driver is bound to it? I don't see > > anything in dmesg that identifies a driver. > > The PCIe device in question is a Xilinx FPGA endpoint, which is > flashed with RTL code to expose several host interfaces to the system > via the PCIe link. > > We have an out-of-tree driver for this device, but to eliminate the > driver's role in this issue, I renamed the driver to prevent it from > loading automatically after rebooting the machine. Despite not using > the driver, the issue still occurred. Oh, right, I forgot that you mentioned this before. > > You're seeing the problem on v5.4 (Nov 2019), which is much newer than > > v4.4 (Jan 2016). But v5.4 is still really too old to spend a lot of > > time on unless the problem still happens on a current kernel. This part is important. We don't want to spend a lot of time debugging an issue that may have already been fixed upstream. 
Bjorn From helgaas at kernel.org Tue Feb 25 15:38:18 2025 From: helgaas at kernel.org (Bjorn Helgaas) Date: Tue, 25 Feb 2025 14:38:18 -0600 Subject: PCI: hotplug_event: PCIe PLDA Device BAR Reset In-Reply-To: Message-ID: <20250225203818.GA516645@bhelgaas> On Tue, Feb 25, 2025 at 06:46:02PM +0530, Naveen Kumar P wrote: > On Tue, Feb 25, 2025 at 1:24 AM Bjorn Helgaas wrote: > > On Tue, Feb 25, 2025 at 12:29:00AM +0530, Naveen Kumar P wrote: > > > On Mon, Feb 24, 2025 at 11:03 PM Bjorn Helgaas wrote: > > > > On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote: > > > > > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas wrote: > > > > > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote: > > > > > > > Hi all, > > > > > > > > > > > > > > I am writing to seek assistance with an issue we are experiencing with > > > > > > > a PCIe device (PLDA Device 5555) connected through PCI Express Root > > > > > > > Port 1 to the host bridge. > > > > > > > > > > > > > > We have observed that after booting the system, the Base Address > > > > > > > Register (BAR0) memory of this device gets reset to 0x0 after > > > > > > > approximately one hour or more (the timing is inconsistent). This was > > > > > > > verified using the lspci output and the setpci -s 01:00.0 > > > > > > > BASE_ADDRESS_0 command. > > > > > ... > > > I booted with the pcie_aspm=off kernel parameter, which means that > > > PCIe Active State Power Management (ASPM) is disabled. Given this > > > context, should I consider removing this setting to see if it affects > > > the occurrence of the Bus Check notifications and the BAR0 reset > > > issue? > > > > Doesn't seem likely to be related. Once configured, ASPM operates > > without any software intervention. But note that "pcie_aspm=off" > > means the kernel doesn't touch ASPM configuration at all, and any > > configuration done by firmware remains in effect.
> > > > You can tell whether ASPM has been enabled by firmware with "sudo > > lspci -vv" before the problem occurs. > > > > > > > During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially > > > > > showed all FF's, and then the next run of the same command showed > > > > > BASE_ADDRESS_0 reset to zero: > > > > > $ sudo lspci -xxx -s 01:00.0 | grep "10:" > > > > > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > > > > > > > > Looks like the device isn't responding at all here. Could happen if > > > > the device is reset or powered down. > > > > > > From the kernel driver or user space tools, is it possible to > > > determine whether the device has been reset or powered down? Are > > > there any power management settings or configurations that could be > > > causing the device to reset or power down unexpectedly? > > > > Not really. By "powered down", I meant D3cold, where the main power > > is removed. Config space is readable in all other power states. > > > > > > What is this device? What driver is bound to it? I don't see > > > > anything in dmesg that identifies a driver. > > > > > > The PCIe device in question is a Xilinx FPGA endpoint, which is > > > flashed with RTL code to expose several host interfaces to the system > > > via the PCIe link. > > > > > > We have an out-of-tree driver for this device, but to eliminate the > > > driver's role in this issue, I renamed the driver to prevent it from > > > loading automatically after rebooting the machine. Despite not using > > > the driver, the issue still occurred. > > > > Oh, right, I forgot that you mentioned this before. > > > > > > You're seeing the problem on v5.4 (Nov 2019), which is much newer than > > > > v4.4 (Jan 2016). But v5.4 is still really too old to spend a lot of > > > > time on unless the problem still happens on a current kernel. > > > > This part is important. We don't want to spend a lot of time > > debugging an issue that may have already been fixed upstream. 
> > Sure, I started building the 6.13 kernel and will post more > information if I notice the issue on the 6.13 kernel. > > Regarding the CommClk- (Common Clock Configuration) bit, it indicates > whether the common clock configuration is enabled or disabled. When it > is set to CommClk-, it means that the common clock configuration is > disabled. > > LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk- > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > > For my device, I noticed that the common clock configuration is > disabled. Could this be causing the BAR reset issue? Not to my knowledge. > How is the CommClk bit determined(to set or clear)? and is it okay to > enable this bit after booting the kernel? It is somewhere in drivers/pci/pcie/aspm.c, i.e., https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/aspm.c?id=v6.13#n383 From munisekharrms at gmail.com Thu Feb 27 08:55:32 2025 From: munisekharrms at gmail.com (Muni Sekhar) Date: Thu, 27 Feb 2025 19:25:32 +0530 Subject: pci: acpi: Query on ACPI Device Tree Representation and Enumeration for Xilinx FPGA PCIe Endpoint functions Message-ID: Hi all, I am currently working on a project involving a Xilinx FPGA connected to an x86 CPU via a PCIe root port. The Xilinx FPGA functions as a PCIe endpoint with single function capability and is programmed to emulate the Soundwire Master controller. It can be dynamically reprogrammed to emulate other interfaces as needed. Essentially, the FPGA emulates an interface and connects to the CPU via the PCIe bus. Given this setup, the BIOS does not have prior knowledge of the function implemented in the Xilinx FPGA PCIe endpoint. I have a couple of questions regarding this configuration: Is it possible to define an ACPI Device Tree representation for this type of hardware setup? Can we achieve ACPI-based device enumeration with this configuration? 
I would greatly appreciate any guidance or references to documentation that could help us achieve this. Thank you for your time and assistance. -- Thanks, Sekhar From helgaas at kernel.org Thu Feb 27 11:04:48 2025 From: helgaas at kernel.org (Bjorn Helgaas) Date: Thu, 27 Feb 2025 10:04:48 -0600 Subject: pci: acpi: Query on ACPI Device Tree Representation and Enumeration for Xilinx FPGA PCIe Endpoint functions In-Reply-To: Message-ID: <20250227160448.GA597390@bhelgaas> On Thu, Feb 27, 2025 at 07:25:32PM +0530, Muni Sekhar wrote: > Hi all, > > I am currently working on a project involving a Xilinx FPGA connected > to an x86 CPU via a PCIe root port. The Xilinx FPGA functions as a > PCIe endpoint with single function capability and is programmed to > emulate the Soundwire Master controller. It can be dynamically > reprogrammed to emulate other interfaces as needed. Essentially, the > FPGA emulates an interface and connects to the CPU via the PCIe bus. > > Given this setup, the BIOS does not have prior knowledge of the > function implemented in the Xilinx FPGA PCIe endpoint. I have a couple > of questions regarding this configuration: > > Is it possible to define an ACPI Device Tree representation for this > type of hardware setup? > Can we achieve ACPI-based device enumeration with this configuration? If the FPGA is programmed before BIOS enumerates PCI devices, the FPGA would look just like any other PCI device, and BIOS would be able to read the Vendor ID and Device ID and would be able to size and program the BARs. So I assume the FPGA is not programmed before BIOS enumeration, the FPGA doesn't respond at all when BIOS or Linux reads the Vendor ID, and you want to program the FPGA later and make Linux enumerate to find it. From Linux's point of view, this is basically a hot-add of a PCI device.
If the Root Port supports hotplug and you have pciehp enabled (CONFIG_HOTPLUG_PCI_PCIE=y) and if the FPGA comes out of reset and brings up the PCIe link after being programmed, it all might "just work." You can also force a complete re-enumeration by writing a non-zero value to /sys/bus/pci/rescan. I'm not sure why you would need ACPI or a device tree to be involved. ACPI and device tree are ways to tell the OS about devices that do not have a native enumeration protocol. PCI devices (like the programmed FPGA) do support native enumeration, so generally we don't need ACPI or device tree descriptions of them. PCI host bridges have a CPU-specific bus on the upstream side and a PCI bus on the downstream side, so they are not themselves PCI devices, and we do need ACPI or device tree descriptions for them. If you have something that doesn't work like you expect, can you post a complete dmesg log and any user commands you're using to program the FPGA? Bjorn
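The manual re-enumeration mentioned above (writing a non-zero value to /sys/bus/pci/rescan) could be wrapped in a small helper that runs after the FPGA has been programmed. This is only a sketch with hypothetical function names: it assumes the usual sysfs layout where each enumerated function appears as a directory under /sys/bus/pci/devices, it needs root on a real system, and the `0000:01:00.0` address in the usage note is just the example address used elsewhere in this archive.

```python
from pathlib import Path

PCI_RESCAN = Path("/sys/bus/pci/rescan")
PCI_DEVICES = Path("/sys/bus/pci/devices")


def rescan_pci_bus(rescan_file: Path = PCI_RESCAN) -> None:
    """Ask the kernel to re-enumerate PCI by writing a non-zero value."""
    rescan_file.write_text("1")


def device_present(bdf: str, devices_dir: Path = PCI_DEVICES) -> bool:
    """Return True if a device with this domain:bus:dev.fn address is enumerated."""
    return (devices_dir / bdf).exists()
```

After programming the FPGA, one might call `rescan_pci_bus()` and then check `device_present("0000:01:00.0")`; the paths are parameters so the helpers can be exercised against a scratch directory without touching sysfs.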