About bio_endio

Wed Jun 25 01:01:11 EDT 2014

Hello Pranay,

 Thanks for your very helpful insights! I hope you don’t mind if I continue
with more questions on block layer :-)

On Jun 25, 2014 2:09 AM, "Pranay Srivastava" <pranjas at gmail.com> wrote:
>
> Hello Alvin,
>
> On Tue, Jun 24, 2014 at 9:53 PM, Alvin Abitria <abitria.alvin at gmail.com>
wrote:
> > Hello Pranay!
> >
> > Thanks for your reply.  I apologize for my very late reply, I was very
> > preoccupied earlier at work.
> >
> >
> > On Tue, Jun 24, 2014 at 1:07 PM, Pranay Srivastava <pranjas at gmail.com>
> > wrote:
> >>
> >> Hello Alvin,
> >>
> >> On Mon, Jun 23, 2014 at 10:39 PM, Alvin Abitria <
abitria.alvin at gmail.com>
> >> wrote:
> >> > Hello,
> >> >
> >> > I'm developing a block driver using the make_request method,
effectively
> >> > bypassing existing scsi or request stack in block layer.  So that
means
> >> > im
> >> > directly working with bios.  As prescribed in linux documentation and
> >> > from
> >> > referring to similar drivers in kernel, you close a session with a
bio
> >> > with
> >> > the bio_endio function.
> >>
> >> So it means you are just passing on the bios without the request
> >> structure if I'm correct?
> >> I don't know how you are handling blk_finish_plug without having
> >> request or request queue,
> >> I maybe wrong in understanding how you are handling it.
> >>
> > Yes, I'm working on bio's level.  No struct requests, and I haven't used
> > blk_finish_plug yet.
> > The block driver method I'm implementing is somewhat along the same line
> > with nvme and mtip2xxx
> > drivers in drivers/block directory (but differing in hardware specific
level
> > of course).
>
> Ok I got it now. But why not use request structures? You can use at
> least one for chaining bio(s). See below what I mean..
>
> >>
> >> >
> >> > I usually invoke bio_endio during successful I/O completion, meaning
> >> > with an
> >> > error code of zero.  But there are cases that this is not fulfilled
or
> >> > there
> >> > are error cases.  My question is, what are the valid error codes that
> >> > can be
> >> > used with it?  My initial impression is that other than zero as error
> >> > code,
> >> -EIO is the one that you should use I think,
> >> > bio_endio will fail.  I've read somewhere that -EBUSY is not
recognized,
> >> > and
> >> > I tried -EIO but my driver crashed.  I got a panic in some dio_xxx
> >> > function
> >> > leading from bio_endio(bio,-EIO). I would like to block subsequent
bios
> >> > sent
> >>
> >> If it's okay for you to post the error then can you do that? I was
> >> seeing the code for
> >> dio_end_io but it would be good if you can post the exact crash
> >> backtrace if you've got that.
> >
> >
> > Here you go:
> >
> > BUG: unable to handle kernel NULL pointer dereference at (null)
> > IP: [<ffffffff811b9a80>] bio_check_pages_dirty+0x50/0xe0
> > PGD 42e5e4067 PUD 42e6e7067 PMD 0
> > Oops: 0000 [#1] SMP
> > last sysfs file:
> > /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
> > CPU 7
> > Modules linked in: block_module(U) fuse ip6table_filter ip6_tables
> > ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4
> > nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM
iptable_mangle
> > iptable_filter ip_tables bridge autofs4 sunrpc 8021q garp stp llc
> > cpufreq_ondemand freq_table pcc_cpufreq ipv6 vhost_net macvtap macvlan
tun
> > kvm_intel kvm uinput power_meter hpilo hpwdt sg tg3 microcode serio_raw
> > iTCO_wdt iTCO_vendor_support ioatdma dca shpchp ext4 mbcache jbd2 sd_mod
> > crc_t10dif hpsa pata_acpi ata_generic ata_piix dm_mirror dm_region_hash
> > dm_log dm_mod [last unloaded: scsi_wait_scan]
> >
> > Pid: 3740, comm: fio Not tainted 2.6.32-358.el6.x86_64 #1 HP ProLiant
DL380p
> > Gen8
> > RIP: 0010:[<ffffffff811b9a80>]  [<ffffffff811b9a80>]
> > bio_check_pages_dirty+0x50/0xe0
> > RSP: 0018:ffff8804191618c8  EFLAGS: 00010046
> > RAX: 2000000000000000 RBX: ffff88041909f0c0 RCX: 00000000000011ae
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > RBP: ffff8804191618f8 R08: ffffffff81c07728 R09: 0000000000000040
> > R10: 0000000000000002 R11: 0000000000000002 R12: 0000000000000000
> > R13: ffff8804191b9b80 R14: ffff8804191b9b80 R15: 0000000000000000
> > FS:  00007fcd43e2d720(0000) GS:ffff8800366e0000(0000)
knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000000000000 CR3: 000000041e69f000 CR4: 00000000000407e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process fio (pid: 3740, threadinfo ffff880419160000, task
ffff88043341f500)
> > Stack:
> >  0000000000000000 ffff88043433f400 ffff88043433f520 ffff88041909f0c0
> > <d> ffff88041909f0c0 ffff8804191b9b80 ffff880419161948 ffffffff811bdc38
> > <d> ffff880419161968 0000000034236400 00000000fffffffb ffff88043433f400
> > Call Trace:
> >  [<ffffffff811bdc38>] dio_bio_complete+0xc8/0xd0
>
> You are using a different kernel then mine. See if the offset within
> this function leads you to
>
>                 spin_lock_irqsave(&bio_dirty_lock, flags);
>                 bio->bi_private = bio_dirty_list; /*This is the line
> number I got from the offset*/
>                 bio_dirty_list = bio;
>                 spin_unlock_irqrestore(&bio_dirty_lock, flags);
>

Here are the offsets in my kernel:
(gdb) list *(dio_bio_complete+0xc8)

0xffffffff811bdc38 is in dio_bio_complete (fs/direct-io.c:440).

435         int page_no;

436

437         if (!uptodate)

438               dio->io_error = -EIO;

439

440         if (dio->is_async && dio->rw == READ) {

441               bio_check_pages_dirty(bio);   /* transfers ownership */

442         } else {

443               for (page_no = 0; page_no < bio->bi_vcnt; page_no++) {

444                     struct page *page = bvec[page_no].bv_page;

(gdb)

445

446                     if (dio->rw == READ && !PageCompound(page))

447                           set_page_dirty_lock(page);

448                     page_cache_release(page);

449               }

450               bio_put(bio);

451         }

452         return uptodate ? 0 : -EIO;

453   }

(gdb) list *(bio_check_pages_dirty+0x50)

0xffffffff811b9a80 is in bio_check_pages_dirty
(/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/arch/x86/include/asm/bitops.h:359).

354   #endif

355

356   static __always_inline int constant_test_bit(unsigned int nr, const
volatile unsigned long *addr)

357   {

358         return ((1UL << (nr % BITS_PER_LONG)) &

359               (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;

360   }

361

362   static inline int variable_test_bit(int nr, volatile const unsigned
long *addr)

363   {

(gdb)

364         int oldbit;

365

366         asm volatile("bt %2,%1\n\t"

367                    "sbb %0,%0"

368                    : "=r" (oldbit)

369                    : "m" (*(unsigned long *)addr), "Ir" (nr));

370

371         return oldbit;

372   }

373

> You are not doing bio_put in the driver code right? Assuming you are
> not doing bio_get also. On which filesystem you are testing this?

No, I’m not using either the bio_put nor the bio_get kernel functions.
Also, there is no filesystem yet on the drive under test; it’s raw.  This
is because  I’m using fio benchmarking tool to send large amount of I/O
traffic to test how the driver behaves if pushed to its queue depth limits.

Can
> you see if it happens on a simpler one like ext2 since it uses the
> generic direct_IO call. So we can further narrow it down.

>
> >  [<ffffffff811bef4f>] dio_bio_end_aio+0x2f/0xd0
> >  [<ffffffff811b920d>] bio_endio+0x1d/0x40
> >  [<ffffffffa02c15a1>] block_make_request+0xe1/0x150 [block_module]
> >  [<ffffffff8125ccce>] generic_make_request+0x25e/0x530
> >  [<ffffffff811bae72>] ? bvec_alloc_bs+0x62/0x110
> >  [<ffffffff8125d02d>] submit_bio+0x8d/0x120
> >  [<ffffffff811bdf6c>] dio_bio_submit+0xbc/0xc0
> >  [<ffffffff811be951>] __blockdev_direct_IO_newtrunc+0x631/0xb30
> >  [<ffffffff8111afe3>] ? filemap_fault+0xd3/0x500
> >  [<ffffffff811beeae>] __blockdev_direct_IO+0x5e/0xd0
> >  [<ffffffff811bb280>] ? blkdev_get_blocks+0x0/0xc0
> >  [<ffffffff811bc347>] blkdev_direct_IO+0x57/0x60
> >  [<ffffffff811bb280>] ? blkdev_get_blocks+0x0/0xc0
> >  [<ffffffff8111bb8b>] generic_file_aio_read+0x6bb/0x700
> >  [<ffffffff81166a2a>] ? kmem_getpages+0xba/0x170
> >  [<ffffffff81166f87>] ? cache_grow+0x217/0x320
> >  [<ffffffff811bb893>] blkdev_aio_read+0x53/0xc0
> >  [<ffffffff8111c633>] ? mempool_alloc+0x63/0x140
> >  [<ffffffff811bb840>] ? blkdev_aio_read+0x0/0xc0
> >  [<ffffffff811cadc4>] aio_rw_vect_retry+0x84/0x200
> >  [<ffffffff811cc784>] aio_run_iocb+0x64/0x170
> >  [<ffffffff811cdbb1>] do_io_submit+0x291/0x920
> >  [<ffffffff811ce250>] sys_io_submit+0x10/0x20
> >  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
> > Code: 31 e4 45 31 ff eb 15 0f 1f 40 00 0f b7 43 28 41 83 c4 01 41 83 c7
01
> > 44 39 e0 7e 36 4d 63 ec 49 c1 e5 04 4f 8d 2c 2e 49 8b 7d 00 <48> 8b 07
a8 10
> > 75 06 66 a9 00 c0 74 d3 e8 5e 59 f7 ff 49 c7 45
> > RIP  [<ffffffff811b9a80>] bio_check_pages_dirty+0x50/0xe0
> >  RSP <ffff8804191618c8>
> > CR2: 0000000000000000
> >
> > Basically, after receiving the bio from generic_make_request, I checked
and
> > found that
> > I already have the maximum outstanding pending I/O (bios) and couldn't
> > accomodate
> > this bio for now.  Rather than just exiting the make_request() routine,
I
> > decided to close
> > this bio or return it to kernel via bio_endio(bio, -EIO).  No
processing or
> > modification
> >  was done to the bio, I just used the callback fn above.
> >
> >>
> >> > to me after reaching my queue depth and with no tags left, and so I
want
> >> > to
> >> > use bio_endio with an error code.
> >>
> >> If you have a request queue then you could call blk_stop_queue and
> >> blk_start_queue but I don't know if this is
> >> relevant to your case.
> >
> >
> > I do have a request queue allocated using blk_alloc_queue, but this is
more
> > of a dummy queue
> > in my case - I don't use it as much as the struct requests would use it.
> > Its function that I used is
> > for me to register my make_request function, register the maximum I/O
size
> > and max number of
> > buffer segments that can be sent to my driver.
>
> See request generation is not in your hands. You can't tell a file
> system don't send in more requests but what your driver can do is
> choose to delay their processing when it is able to do so. So I don't
> think you should return error from make_request at least hold on to
> those bios so you can process them later. We don't want to lie to
> applications that an I/O error occurred right?
>
> For this purpose I thought you might use one static request structure
> and since you are already overriding the default one just put those
> bios in that request, chain them. I guess you'll have to do a bit more
> if you work this way so that you don't mess with the request code
> anywhere else, but if putting those bios on a list is easier go that
> way then.
>
I get your point, that there is no control over bio generation from the
upper layers.  I’m curious in your suggestion to hold on the excess bios
for later processing.  Say I do have a queue 256 bio entries deep, the 256
bios that make it within the queue gets processed, the 257th bio etc. will
not be returned but instead put on a list (a struct bio_list?) and
processed later on.  Is that correct?  Is there a possibility that if
traffic is really high and that lasts for indefinite amount of time that
the bio list will grow bigger and bigger and overflow ‘til it hurts the
kernel?  I’m thinking of the worst case and how to handle that, since upper
layers can’t be stopped from sending I/O to my driver anyway.

I also have a related question from the generic block layer.  I see that
it’s tagging outstanding requests via the blk_queue_start_tag() if the
request is within the block layer queue depth.  By tagging requests it can
process them.  That is similar to what I want to do.  I want to handle
properly those bios over the queue, so I first thought of returning it to
kernel hoping it will retry sending it til I can accomodate it, or be able
to tell it im busy and so don't send me bios for a while til driver has
free slots in queue.  What does the block layer do with a request that
failed to get a tag (because it’s over the queue depth).  Where is it
placed?

> >
> > Thanks for the suggestion, I haven't tried blk_stop_queue and
> > blk_start_queue but I hope it will
> > work.  I also like the possibility that you can prevent the upper layers
> > from sending me further bio
> > when my queue depth is full, then letting them know once my driver can
> > accomodate new bios -
> > more of telling them I'm busy right now, don't send me further bios,
etc.
> > Is that the general idea behind
> > the two functions you mentioned?
>
> Yup that's what those functions are for. But they just stop processing
> part not the generation part of request. I hope you got the point I'm
> trying to make.
>
> >
> > I also notice that the two kernel APIs you gave either set/clear a
certain
> > request queue flag.  I'd like to
> > think that upper layers (generic_make_request etc) check that flag
first to
> > decide if it can dispatch bios
> > to my driver, is that right?
> Right that's for the queue, See blk_queue_stopped, if that's set then
> processing won't be done right now i.e. request_function won't be
> called. But nothing on the make_request function since that's not in
> your hands. Let the requests be queued but process them at your
> discretion.
> >>
> >>
> >> >
> >> > What are those error codes, and will they work for my intended
function?
> >> > Thanks!
>
> I thought -EIOCBQUEUED might help but I don't think it would. You can
> try it though.
>
Thanks! No harm in trying :-) Let me know if you have questions regarding
my replies.
> >>
> >> -EIO should work, but first let's find out why you got the crash.
> >>
> > Thanks.  I hope we get to the bottom why it failed and crashed.  Let me
know
> > if you have questions.
> >
> >>
> >> > _______________________________________________
> >> > Kernelnewbies mailing list
> >> > Kernelnewbies at kernelnewbies.org
> >> > http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
> >> >
> >>
> >>
> >>
> >> --
> >>         ---P.K.S
> >
> > Alvin
>
>
>
> --
>         ---P.K.S

Alvin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.kernelnewbies.org/pipermail/kernelnewbies/attachments/20140625/bf1eae18/attachment-0001.html