About bio_endio

Wed Jun 25 05:58:39 EDT 2014

On Wed, Jun 25, 2014 at 10:31 AM, Alvin Abitria <abitria.alvin at gmail.com> wrote:
> Hello Pranay,
>
>  Thanks for your very helpful insights! I hope you don’t mind if I continue
> with more questions on block layer :-)

Surely.

>
> On Jun 25, 2014 2:09 AM, "Pranay Srivastava" <pranjas at gmail.com> wrote:
>>
>> Hello Alvin,
>>
>> On Tue, Jun 24, 2014 at 9:53 PM, Alvin Abitria <abitria.alvin at gmail.com>
>> wrote:
>> > Hello Pranay!
>> >
>> > Thanks for your reply.  I apologize for my very late reply, I was very
>> > preoccupied earlier at work.
>> >
>> >
>> > On Tue, Jun 24, 2014 at 1:07 PM, Pranay Srivastava <pranjas at gmail.com>
>> > wrote:
>> >>
>> >> Hello Alvin,
>> >>
>> >> On Mon, Jun 23, 2014 at 10:39 PM, Alvin Abitria
>> >> <abitria.alvin at gmail.com>
>> >> wrote:
>> >> > Hello,
>> >> >
>> >> > I'm developing a block driver using the make_request method,
>> >> > effectively
>> >> > bypassing existing scsi or request stack in block layer.  So that
>> >> > means
>> >> > im
>> >> > directly working with bios.  As prescribed in linux documentation and
>> >> > from
>> >> > referring to similar drivers in kernel, you close a session with a
>> >> > bio
>> >> > with
>> >> > the bio_endio function.
>> >>
>> >> So it means you are just passing on the bios without the request
>> >> structure if I'm correct?
>> >> I don't know how you are handling blk_finish_plug without having
>> >> request or request queue,
>> >> I maybe wrong in understanding how you are handling it.
>> >>
>> > Yes, I'm working on bio's level.  No struct requests, and I haven't used
>> > blk_finish_plug yet.
>> > The block driver method I'm implementing is somewhat along the same line
>> > with nvme and mtip2xxx
>> > drivers in drivers/block directory (but differing in hardware specific
>> > level
>> > of course).
>>
>> Ok I got it now. But why not use request structures? You can use at
>> least one for chaining bio(s). See below what I mean..
>>
>> >>
>> >> >
>> >> > I usually invoke bio_endio during successful I/O completion, meaning
>> >> > with an
>> >> > error code of zero.  But there are cases that this is not fulfilled
>> >> > or
>> >> > there
>> >> > are error cases.  My question is, what are the valid error codes that
>> >> > can be
>> >> > used with it?  My initial impression is that other than zero as error
>> >> > code,
>> >> -EIO is the one that you should use I think,
>> >> > bio_endio will fail.  I've read somewhere that -EBUSY is not
>> >> > recognized,
>> >> > and
>> >> > I tried -EIO but my driver crashed.  I got a panic in some dio_xxx
>> >> > function
>> >> > leading from bio_endio(bio,-EIO). I would like to block subsequent
>> >> > bios
>> >> > sent
>> >>
>> >> If it's okay for you to post the error then can you do that? I was
>> >> seeing the code for
>> >> dio_end_io but it would be good if you can post the exact crash
>> >> backtrace if you've got that.
>> >
>> >
>> > Here you go:
>> >
>> > BUG: unable to handle kernel NULL pointer dereference at (null)
>> > IP: [<ffffffff811b9a80>] bio_check_pages_dirty+0x50/0xe0
>> > PGD 42e5e4067 PUD 42e6e7067 PMD 0
>> > Oops: 0000 [#1] SMP
>> > last sysfs file:
>> > /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
>> > CPU 7
>> > Modules linked in: block_module(U) fuse ip6table_filter ip6_tables
>> > ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4
>> > nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM
>> > iptable_mangle
>> > iptable_filter ip_tables bridge autofs4 sunrpc 8021q garp stp llc
>> > cpufreq_ondemand freq_table pcc_cpufreq ipv6 vhost_net macvtap macvlan
>> > tun
>> > kvm_intel kvm uinput power_meter hpilo hpwdt sg tg3 microcode serio_raw
>> > iTCO_wdt iTCO_vendor_support ioatdma dca shpchp ext4 mbcache jbd2 sd_mod
>> > crc_t10dif hpsa pata_acpi ata_generic ata_piix dm_mirror dm_region_hash
>> > dm_log dm_mod [last unloaded: scsi_wait_scan]
>> >
>> > Pid: 3740, comm: fio Not tainted 2.6.32-358.el6.x86_64 #1 HP ProLiant
>> > DL380p
>> > Gen8
>> > RIP: 0010:[<ffffffff811b9a80>]  [<ffffffff811b9a80>]
>> > bio_check_pages_dirty+0x50/0xe0
>> > RSP: 0018:ffff8804191618c8  EFLAGS: 00010046
>> > RAX: 2000000000000000 RBX: ffff88041909f0c0 RCX: 00000000000011ae
>> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>> > RBP: ffff8804191618f8 R08: ffffffff81c07728 R09: 0000000000000040
>> > R10: 0000000000000002 R11: 0000000000000002 R12: 0000000000000000
>> > R13: ffff8804191b9b80 R14: ffff8804191b9b80 R15: 0000000000000000
>> > FS:  00007fcd43e2d720(0000) GS:ffff8800366e0000(0000)
>> > knlGS:0000000000000000
>> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > CR2: 0000000000000000 CR3: 000000041e69f000 CR4: 00000000000407e0
>> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > Process fio (pid: 3740, threadinfo ffff880419160000, task
>> > ffff88043341f500)
>> > Stack:
>> >  0000000000000000 ffff88043433f400 ffff88043433f520 ffff88041909f0c0
>> > <d> ffff88041909f0c0 ffff8804191b9b80 ffff880419161948 ffffffff811bdc38
>> > <d> ffff880419161968 0000000034236400 00000000fffffffb ffff88043433f400
>> > Call Trace:
>> >  [<ffffffff811bdc38>] dio_bio_complete+0xc8/0xd0
>>
>> You are using a different kernel then mine. See if the offset within
>> this function leads you to
>>
>>                 spin_lock_irqsave(&bio_dirty_lock, flags);
>>                 bio->bi_private = bio_dirty_list; /*This is the line
>> number I got from the offset*/
>>                 bio_dirty_list = bio;
>>                 spin_unlock_irqrestore(&bio_dirty_lock, flags);
>>
>
> Here are the offsets in my kernel:
> (gdb) list *(dio_bio_complete+0xc8)
>
> 0xffffffff811bdc38 is in dio_bio_complete (fs/direct-io.c:440).
>
> 435         int page_no;
>
> 436
>
> 437         if (!uptodate)
>
> 438               dio->io_error = -EIO;
>
> 439
>
> 440         if (dio->is_async && dio->rw == READ) {
>
> 441               bio_check_pages_dirty(bio);   /* transfers ownership */
>
> 442         } else {
>
> 443               for (page_no = 0; page_no < bio->bi_vcnt; page_no++) {
>
> 444                     struct page *page = bvec[page_no].bv_page;
>
> (gdb)
>
> 445
>
> 446                     if (dio->rw == READ && !PageCompound(page))
>
> 447                           set_page_dirty_lock(page);
>
> 448                     page_cache_release(page);
>
> 449               }
>
> 450               bio_put(bio);
>
> 451         }
>
> 452         return uptodate ? 0 : -EIO;
>
> 453   }
>
>
>
> (gdb) list *(bio_check_pages_dirty+0x50)
>
> 0xffffffff811b9a80 is in bio_check_pages_dirty
> (/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/arch/x86/include/asm/bitops.h:359).
>
> 354   #endif
>
> 355
>
> 356   static __always_inline int constant_test_bit(unsigned int nr, const
> volatile unsigned long *addr)
>
> 357   {
>
> 358         return ((1UL << (nr % BITS_PER_LONG)) &
>
> 359               (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
>
> 360   }
>
> 361
>
> 362   static inline int variable_test_bit(int nr, volatile const unsigned
> long *addr)
>
> 363   {
>
> (gdb)
>
> 364         int oldbit;
>
> 365
>
> 366         asm volatile("bt %2,%1\n\t"
>
> 367                    "sbb %0,%0"
>
> 368                    : "=r" (oldbit)
>
> 369                    : "m" (*(unsigned long *)addr), "Ir" (nr));
>
> 370
>
> 371         return oldbit;
>
> 372   }
>
> 373
>

Ok I think there's a problem with page you are having. So in your make
request function can you just see if the pages are present in the
bvecs of the bio? They should be there though.

>> You are not doing bio_put in the driver code right? Assuming you are
>> not doing bio_get also. On which filesystem you are testing this?
>
> No, I’m not using either the bio_put nor the bio_get kernel functions.
> Also, there is no filesystem yet on the drive under test; it’s raw.  This is
> because  I’m using fio benchmarking tool to send large amount of I/O traffic
> to test how the driver behaves if pushed to its queue depth limits.

>
> Can
>> you see if it happens on a simpler one like ext2 since it uses the
>> generic direct_IO call. So we can further narrow it down.
>
>>
>> >  [<ffffffff811bef4f>] dio_bio_end_aio+0x2f/0xd0
>> >  [<ffffffff811b920d>] bio_endio+0x1d/0x40
>> >  [<ffffffffa02c15a1>] block_make_request+0xe1/0x150 [block_module]
>> >  [<ffffffff8125ccce>] generic_make_request+0x25e/0x530
>> >  [<ffffffff811bae72>] ? bvec_alloc_bs+0x62/0x110
>> >  [<ffffffff8125d02d>] submit_bio+0x8d/0x120
>> >  [<ffffffff811bdf6c>] dio_bio_submit+0xbc/0xc0
>> >  [<ffffffff811be951>] __blockdev_direct_IO_newtrunc+0x631/0xb30
>> >  [<ffffffff8111afe3>] ? filemap_fault+0xd3/0x500
>> >  [<ffffffff811beeae>] __blockdev_direct_IO+0x5e/0xd0
>> >  [<ffffffff811bb280>] ? blkdev_get_blocks+0x0/0xc0
>> >  [<ffffffff811bc347>] blkdev_direct_IO+0x57/0x60
>> >  [<ffffffff811bb280>] ? blkdev_get_blocks+0x0/0xc0
>> >  [<ffffffff8111bb8b>] generic_file_aio_read+0x6bb/0x700
>> >  [<ffffffff81166a2a>] ? kmem_getpages+0xba/0x170
>> >  [<ffffffff81166f87>] ? cache_grow+0x217/0x320
>> >  [<ffffffff811bb893>] blkdev_aio_read+0x53/0xc0
>> >  [<ffffffff8111c633>] ? mempool_alloc+0x63/0x140
>> >  [<ffffffff811bb840>] ? blkdev_aio_read+0x0/0xc0
>> >  [<ffffffff811cadc4>] aio_rw_vect_retry+0x84/0x200
>> >  [<ffffffff811cc784>] aio_run_iocb+0x64/0x170
>> >  [<ffffffff811cdbb1>] do_io_submit+0x291/0x920
>> >  [<ffffffff811ce250>] sys_io_submit+0x10/0x20
>> >  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
>> > Code: 31 e4 45 31 ff eb 15 0f 1f 40 00 0f b7 43 28 41 83 c4 01 41 83 c7
>> > 01
>> > 44 39 e0 7e 36 4d 63 ec 49 c1 e5 04 4f 8d 2c 2e 49 8b 7d 00 <48> 8b 07
>> > a8 10
>> > 75 06 66 a9 00 c0 74 d3 e8 5e 59 f7 ff 49 c7 45
>> > RIP  [<ffffffff811b9a80>] bio_check_pages_dirty+0x50/0xe0
>> >  RSP <ffff8804191618c8>
>> > CR2: 0000000000000000
>> >
>> > Basically, after receiving the bio from generic_make_request, I checked
>> > and
>> > found that
>> > I already have the maximum outstanding pending I/O (bios) and couldn't
>> > accomodate
>> > this bio for now.  Rather than just exiting the make_request() routine,
>> > I
>> > decided to close
>> > this bio or return it to kernel via bio_endio(bio, -EIO).  No processing
>> > or
>> > modification
>> >  was done to the bio, I just used the callback fn above.
>> >
>> >>
>> >> > to me after reaching my queue depth and with no tags left, and so I
>> >> > want
>> >> > to
>> >> > use bio_endio with an error code.
>> >>
>> >> If you have a request queue then you could call blk_stop_queue and
>> >> blk_start_queue but I don't know if this is
>> >> relevant to your case.
>> >
>> >
>> > I do have a request queue allocated using blk_alloc_queue, but this is
>> > more
>> > of a dummy queue
>> > in my case - I don't use it as much as the struct requests would use it.
>> > Its function that I used is
>> > for me to register my make_request function, register the maximum I/O
>> > size
>> > and max number of
>> > buffer segments that can be sent to my driver.
>>
>> See request generation is not in your hands. You can't tell a file
>> system don't send in more requests but what your driver can do is
>> choose to delay their processing when it is able to do so. So I don't
>> think you should return error from make_request at least hold on to
>> those bios so you can process them later. We don't want to lie to
>> applications that an I/O error occurred right?
>>
>> For this purpose I thought you might use one static request structure
>> and since you are already overriding the default one just put those
>> bios in that request, chain them. I guess you'll have to do a bit more
>> if you work this way so that you don't mess with the request code
>> anywhere else, but if putting those bios on a list is easier go that
>> way then.
>>
> I get your point, that there is no control over bio generation from the
> upper layers.  I’m curious in your suggestion to hold on the excess bios for
> later processing.  Say I do have a queue 256 bio entries deep, the 256 bios
> that make it within the queue gets processed, the 257th bio etc. will not be
> returned but instead put on a list (a struct bio_list?) and processed later
> on.  Is that correct?  Is there a possibility that if traffic is really high
> and that lasts for indefinite amount of time that the bio list will grow
> bigger and bigger and overflow ‘til it hurts the kernel?  I’m thinking of
> the worst case and how to handle that, since upper layers can’t be stopped
> from sending I/O to my driver anyway.
>
>
>
> I also have a related question from the generic block layer.  I see that
> it’s tagging outstanding requests via the blk_queue_start_tag() if the
> request is within the block layer queue depth.  By tagging requests it can
> process them.  That is similar to what I want to do.  I want to handle
> properly those bios over the queue, so I first thought of returning it to

I don't  think this will solve the problem. If you see start_tag it's
actually getting the request off the request_queue. I think you would
need to do this in your request_function but that's the thing we want
to stop. We don't want to process the request. You can see it though I
couldn't make much of it like how to use it in this case.

> kernel hoping it will retry sending it til I can accomodate it, or be able
> to tell it im busy and so don't send me bios for a while til driver has free
> slots in queue.  What does the block layer do with a request that failed to
> get a tag (because it’s over the queue depth).  Where is it placed?

I think you aren't using a request_function and only using a
make_request function. In that case you can pass on the bios directly
to you processing function without troubling block layer any further.
Yes bio_list is what I was talking about. You'll need to have your
make_request function take the queue_lock while manipulating that
list.

Ok so let's take a look at problem, if is more then what we can handle
then we must have a way of bailing out. You can't tell upper layers to
stop but then you can give them error that enough is enough :-P. I
think you can maintain 2 queues one being your processing queue other
place holder queue. If you have added items to place holder queue then
you can start a timer and if it expires then call bio_endio with error
code(-EIO) on the items in place holder queue. I say 2 queues so that
while you are deleting bios from one queue you can add them to other
one for processing if there's space in processing queue.

>
>> >
>> > Thanks for the suggestion, I haven't tried blk_stop_queue and
>> > blk_start_queue but I hope it will
>> > work.  I also like the possibility that you can prevent the upper layers
>> > from sending me further bio
>> > when my queue depth is full, then letting them know once my driver can
>> > accomodate new bios -
>> > more of telling them I'm busy right now, don't send me further bios,
>> > etc.
>> > Is that the general idea behind
>> > the two functions you mentioned?

Yes but I don't think you need those. Since you are not involving the
request structures.

>>
>> Yup that's what those functions are for. But they just stop processing
>> part not the generation part of request. I hope you got the point I'm
>> trying to make.
>>
>> >
>> > I also notice that the two kernel APIs you gave either set/clear a
>> > certain
>> > request queue flag.  I'd like to
>> > think that upper layers (generic_make_request etc) check that flag first
>> > to
>> > decide if it can dispatch bios
>> > to my driver, is that right?
>> Right that's for the queue, See blk_queue_stopped, if that's set then
>> processing won't be done right now i.e. request_function won't be
>> called. But nothing on the make_request function since that's not in
>> your hands. Let the requests be queued but process them at your
>> discretion.
>> >>
>> >>
>> >> >
>> >> > What are those error codes, and will they work for my intended
>> >> > function?
>> >> > Thanks!
>>
>> I thought -EIOCBQUEUED might help but I don't think it would. You can
>> try it though.
>>
> Thanks! No harm in trying :-) Let me know if you have questions regarding my
> replies.
>
>> >>
>> >> -EIO should work, but first let's find out why you got the crash.
>> >>
>> > Thanks.  I hope we get to the bottom why it failed and crashed.  Let me
>> > know
>> > if you have questions.
>> >
>> >>
>> >> > _______________________________________________
>> >> > Kernelnewbies mailing list
>> >> > Kernelnewbies at kernelnewbies.org
>> >> > http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >>         ---P.K.S
>> >
>> > Alvin
>>
>>
>>
>> --
>>         ---P.K.S
>
> Alvin

-- 
        ---P.K.S