pcie dma transfer

Christoph Böhmwalder christoph at boehmwalder.at
Mon Jun 4 07:12:48 EDT 2018


Hi,

I'm not sure how on-topic this is on this list, but I have a question
regarding a device driver design issue.

For our Bachelor's project, my team and I are tasked with optimizing an
existing hardware design. The design uses an FPGA for various tasks,
including a Triple Speed Ethernet controller that is connected to the
CPU via PCI Express. The current implementation is fairly naive: the
driver does byte-by-byte reads directly from a FIFO on the FPGA. This
is, of course, quite resource intensive and hogs the CPU almost
completely, so throughput peaks at around 10 Mbit/s.
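
For reference, the hot path in the current driver is essentially a loop
like the following (just a rough sketch; the register pointer and the
helper name are made up):

/* Roughly what the current naive receive path looks like (names are
 * made up): one MMIO read per byte, so the CPU spends all its time in
 * ioread8() and throughput tops out around 10 Mbit/s. */
static void fifo_read_packet(void __iomem *fifo_reg, u8 *dst, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                dst[i] = ioread8(fifo_reg);   /* one PCIe read per byte */
}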

Our plan to solve this problem is as follows:

* Keep a buffer on the FPGA that accumulates a number of Ethernet packets.
* Once a certain threshold is reached (or a period of time, e.g. 5 ms,
  elapses), the buffer is flushed and its contents are sent directly to
  RAM via DMA.
* When the buffer has been flushed and the data is in RAM, accessible by
  the CPU, the device raises an interrupt to signal the CPU to read the
  data.
* In the interrupt handler, we `memcpy` the individual packets into
  another buffer and hand them to the upper layer of the network stack
  (see the rough sketch below).
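
To make the last two steps concrete, the interrupt handler we have in
mind would look roughly like this. This is only a sketch: the buffer
layout, the `fpga_eth_priv` structure and the FPGA_BUF_ACK register are
assumptions of ours, and a real driver would probably use NAPI rather
than calling netif_rx() straight from the handler.

/* Sketch of the IRQ handler for the buffered approach.  We assume the
 * FPGA has DMA'd a packet count followed by length-prefixed packets
 * into a coherent buffer at priv->dma_buf; that layout is made up. */
static irqreturn_t fpga_eth_irq(int irq, void *dev_id)
{
        struct net_device *ndev = dev_id;
        struct fpga_eth_priv *priv = netdev_priv(ndev);
        u8 *p = priv->dma_buf;
        u32 npkts = *(u32 *)p;
        u32 i;

        p += sizeof(u32);

        for (i = 0; i < npkts; i++) {
                u16 len = *(u16 *)p;
                struct sk_buff *skb;

                p += sizeof(u16);

                skb = netdev_alloc_skb(ndev, len);
                if (skb) {
                        skb_put_data(skb, p, len);   /* the memcpy step */
                        skb->protocol = eth_type_trans(skb, ndev);
                        netif_rx(skb);               /* hand to the stack */
                }
                p += len;
        }

        /* let the FPGA reuse the buffer (register name is made up) */
        iowrite32(1, priv->regs + FPGA_BUF_ACK);

        return IRQ_HANDLED;
}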

Our rationale for keeping a buffer of packets rather than transmitting a
single packet at a time is to maximize the amount of data sent with each
PCIe transaction (and in turn minimize the per-transaction overhead).

However, upon reading a relevant LDD chapter [1] (which, admittedly, we
should have done in the first place), we found that the authors of the
book take a different approach:

> The second case comes about when DMA is used asynchronously. This happens,
> for example, with data acquisition devices that go on pushing data even if
> nobody is reading them. In this case, the driver should maintain a buffer so
> that a subsequent read call will return all the accumulated data to user
> space. The steps involved in this kind of transfer are slightly different:
>   1. The hardware raises an interrupt to announce that new data has arrived.
>   2. The interrupt handler allocates a buffer and tells the hardware where to transfer
>      its data.
>   3. The peripheral device writes the data to the buffer and raises another interrupt
>      when it’s done.
>   4. The handler dispatches the new data, wakes any relevant process, and takes care
>      of housekeeping.
> A variant of the asynchronous approach is often seen with network cards. These
> cards often expect to see a circular buffer (often called a DMA ring buffer) established
> in memory shared with the processor; each incoming packet is placed in the
> next available buffer in the ring, and an interrupt is signaled. The driver then passes
> the network packets to the rest of the kernel and places a new DMA buffer in the
> ring.
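
For comparison, the driver side of the ring-buffer scheme the book
describes would look roughly like the following (again only a sketch
with invented structure names): the driver pre-maps a ring of receive
buffers, the device DMAs each incoming packet into the next free slot,
and the handler hands the filled skb to the stack and puts a freshly
mapped buffer back into the ring, so no extra memcpy is needed.

/* Sketch of an LDD3-style receive ring: one streaming DMA mapping per
 * slot; the IRQ handler would replace a consumed slot with a fresh skb.
 * The fpga_eth_priv / rx_slot structures are invented for this sketch,
 * and error unwinding is omitted for brevity. */
#define RX_RING_SIZE    64
#define RX_BUF_SIZE     2048

struct rx_slot {
        struct sk_buff *skb;
        dma_addr_t      dma;
};

static int fpga_eth_fill_ring(struct net_device *ndev)
{
        struct fpga_eth_priv *priv = netdev_priv(ndev);
        int i;

        for (i = 0; i < RX_RING_SIZE; i++) {
                struct sk_buff *skb = netdev_alloc_skb(ndev, RX_BUF_SIZE);

                if (!skb)
                        return -ENOMEM;

                priv->ring[i].skb = skb;
                priv->ring[i].dma = dma_map_single(&priv->pdev->dev,
                                                   skb->data, RX_BUF_SIZE,
                                                   DMA_FROM_DEVICE);
                if (dma_mapping_error(&priv->pdev->dev, priv->ring[i].dma))
                        return -ENOMEM;

                /* write priv->ring[i].dma into the device's descriptor i */
        }
        return 0;
}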

Now, there are some obvious advantages to this method (not the least of
which is that it's much easier to implement), but I can't help but feel
that it would be a little less efficient.

So here's my question: is our solution to this problem sane? Do you
think it is viable, or would it create more issues than it solves?
Should we go the LDD route instead and allocate a new buffer every time
an interrupt is raised?

Thanks for your help!

--
Regards,
Christoph

[1] https://static.lwn.net/images/pdf/LDD3/ch15.pdf (page 30)


