This is a question I've posted on the #btrfs IRC channel today.
hyperair adviced me to contact with Josef Bacik or Chris Mason.
So, I post my question to this maling list.
Here are my post on the IRC:
Actually, I want to remove BUG_ON(ret) around the Btrfs code.
The motivation is to make the Btrfs code more robust.
First of all, is this meaningless?
For example, there are code like the following:
struct btrfs_path *path;
path = btrfs_alloc_path();
BUG_ON(!path);
This is a frequenty used pattern of current Btrfs code.
A btrfs_alloc_path()'s caller has to deal with the allocation failure
instead of using BUG_ON. However, (this is what most interesting
thing for me) can the caller do any proper error handlings here?
I mean, is this a critical situation where we cannot recover from?
--
Yoshinori Sano <yoshinori.sano@gmail.com>
--
No we're just lazy ;). Tho making sure the caller can recover from getting -ENOMEM is very important, which is why in some of these paths we just do BUG_ON since fixing the callers is tricky. A good strategy for things like this is to do something like static int foo = 1; path = btrfs_alloc_path(); if (!path || !(foo % 1000)) return -ENOMEM; foo++; that way you can catch all the callers and make sure we're handling the error all the way up the chain properly. Thanks, Josef --
Hahaha, I love it. Yeah, I suspect this approach will be a bit confusing though. I believe that it will be more effective, although time consuming, to work through the call tree function by function. Although, as I have said, the problem is working out what needs to be done to recover, rather than working out what the callers are. I'm not at all sure yet but I also suspect that it may not be possible to recover in some cases, which will likely lead to serious rework of some subsystems (but, hey, who am I to say, I really don't have any clue yet). Ian --
So we talked about this at plumbers. First thing we need is a way to flip the filesystem read only, that way we can deal with the simple corruption cases. And then we can start looking at these harder cases where it's really unclear about how to recover. Thankfully because we're COW we really shouldn't have any cases that we have to unwind anything, we just fail the operation and go on our happy merry way. The only tricky thing is where we get ENOMEM when say inserting the metadata for data after writing out the data, since that will leave data just sitting around. Probably should look at what NFS does with dirty pages when the server hangs up. Thanks, Josef --
OK, that's a though for me to focus on while I'm trying to work out what's going on ... mmm. Indeed, a large proportion of these are handling ENOMEM. I somehow suspect your heavily focused on disk io itself when I'm still back thinking about house keeping of operations, in the process of being queued and those currently being processed, the later being the difficult case. But I'll eventually get to worrying about io as part of that process. It's also worth mentioning that my scope is also quite narrow at this stage, focusing largely on the transaction subsystem, although that tends to pull in a fair amount of other code too. Seems to me that if the house keeping is solid then potential corruption issues should be easier to spot and possibly avoid. Ian --
So the transaction stuff should be relatively simple since we shouldn't have too much to clean up if the transaction fails to allocate. Maybe point out some places where you are having trouble and I can frame up what we'd want to do to give you an idea of where to go? Thanks, Josef --
Thanks, I will when I have something to discuss or maybe I'll start with the couple I have, when I get a chance to get back to it anyway. Ian --
How about we discuss this patch to start with?
The places where I have put "/* TODO: What to do here? */" are the
places where I couldn't work out what to do or if some sort of recovery
is needed or what effect a failure might have on continued operations.
btrfs - resolve bug_on()s in btrfs_record_root_in_trans()
From: Ian Kent <raven@themaw.net>
---
fs/btrfs/ctree.c | 3 ++-
fs/btrfs/disk-io.c | 3 ++-
fs/btrfs/extent-tree.c | 4 +++-
fs/btrfs/inode.c | 7 +++++--
fs/btrfs/ioctl.c | 8 ++++++--
fs/btrfs/relocation.c | 30 +++++++++++++++++++++++-------
fs/btrfs/root-tree.c | 4 +++-
fs/btrfs/transaction.c | 15 ++++++++++++---
fs/btrfs/tree-log.c | 1 +
9 files changed, 57 insertions(+), 18 deletions(-)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 9ac1715..a1f46fa 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -3832,7 +3832,8 @@ int btrfs_insert_item(struct btrfs_trans_handle *trans, struct btrfs_root
unsigned long ptr;
path = btrfs_alloc_path();
- BUG_ON(!path);
+ if (!path)
+ return -ENOMEM;
ret = btrfs_insert_empty_item(trans, root, path, cpu_key, data_size);
if (!ret) {
leaf = path->nodes[0];
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b40dfe4..066af87 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1105,7 +1105,8 @@ struct btrfs_root *btrfs_read_fs_root_no_radix(struct btrfs_root *tree_root,
root, fs_info, location->objectid);
path = btrfs_alloc_path();
- BUG_ON(!path);
+ if (!path)
+ return ERR_PTR(-ENOMEM);
ret = btrfs_search_slot(NULL, tree_root, location, path, 0, 0);
if (ret == 0) {
l = path->nodes[0];
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a541bc8..d737cea6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7821,8 +7821,10 @@ static noinline int relocate_one_extent(struct btrfs_root *extent_root,
}
...We should probably just check for a return value and pass it up the stack and
let the caller deal with it, there doesn't appear to be any cleanup stuff thats
This is probably one of those cases where we want to flip read only, because
really we are screwed if we cannot setup a new root. In this case we'd just
want to clear the radix tag and return the error and let the caller deal with
it. This can be called by the snapshot stuff so it's not catastrophic if we
cant create a snapshot. It's also called by the transaction stuff so all these
Ok here we want to do something like
ret = record_root_in_trans(h, root);
if (ret) {
if (type != TRANS_JOIN_NOLOCK)
mutex_unlock(&root->fs_info->trans_mutex);
btrfs_end_transaction(h, root);
return ERR_PTR(ret);
Here we should be ok with
ret = record_root_in_trans(trans, parent_root);
if (ret) {
pending->error = ret;
goto fail;
This is a little trickier since we've created the dir item, so this
ret = record_root_in_trans(trans, root);
if (ret) {
struct btrfs_dir_item *di;
struct btrfs_path *path;
pending->error = ret;
path = btrfs_alloc_path();
if (!path) {
/*
* We're really screwed here, at this point we're just going to
* leave a dangling dir item that fsck will have to fix.
*/
pending->error = -ENOMEM;
goto fail;
}
di = btrfs_lookup_dir_item(trans, parent_root, path,
parent_inode->i_ino, dentry->d_name.name,
dentry->d_name.len, -1);
if (IS_ERR(di)) {
pending->error = PTR_ERR(di);
btrfs_free_path(path);
goto fail;
}
ret = btrfs_delete_one_dir_name(trans, parent_root, path, di);
btrfs_free_path(path);
if (ret) {
pending->error = PTR_ERR(ret);
goto fail;
}
goto fail;
This would be bad because it means we can't recover the log. What we would want
to do is make a fail: label at the end of the function right before the commit
transaction and just goto fail. You'll want to move the btrfs_free_path() down
so that gets called too. This is in case ...Is making the filesystem read only triggered by something like ext3_abort (fs/ext3/super.c)? We might get ideas from Ext3 in addition to NFS. (Just an idea, I don't have confidence at all...) Here is the comment located at ext3_abort. The situation described here is partly similar to what we talked about in this thread. So, I think this is perhaps useful information for us. /* * ext3_abort is a much stronger failure handler than ext3_error. The * abort function may be used to deal with unrecoverable failures such * as journal IO errors or ENOMEM at a critical moment in log management. * * We unconditionally force the filesystem into an ABORT|READONLY state, * unless the error response on the fs has been set to panic in which * case we take the easy way out and panic immediately. */ Thank you, -- Yoshinori Sano <yoshinori.sano@gmail.com> --
To reduce the number of such BUG_ON usages, the code will be more robust, which results in increasing the number of Btrfs production use. (This is one of the way, off course.) Is this a debugging idiom? I cannot understand why this idiom can be used to catch all the callers. Would you explain more about it? Thank you, -- Yoshinori Sano <yoshinori.sano@gmail.com> --
So this forces us to return the error case every 1000 times the function is called. So you can run various tests that stress the FS in different ways so you can catch alot of the possible ways you can end up in this function. Then if any of the callers have problems getting ENOMEM then the box will panic or something like that and you can fix the callers. Then when the system stops panicing/blowing up after a while you know you are pretty OK with returning ENOMEM in this case and you can post the patch and move on to the next one. But as Ian says it gets a little confusing in the more complicated cases, so you are probably better off walking up the possible callchains by hand and fixing any problems you see, and then doing the above to validate your work. Thanks, Josef --
Thank you for your explanation. You mean, this is a validation technique that emulates memory allocation failure repeatedly and is very useful when the paths of an interesting function's caller are spread (unclear). Thank you, -- Yoshinori Sano <yoshinori.sano@gmail.com> --
No, it isn't meaningless it's essential that it be fixed, although it is quite difficult. The main problem is that while this remains it makes it almost impossible (at least much more difficult) for those implementing Yes, this is quite common but the fact is that there are many other If you work through the function call tree quite a good number of the dead end BUG_ON()s can be handled (at least it looks like that to me). But there are also quite a good number of cases where the recovery needed isn't yet clear to me. I hope that as I study the code some of these cases will become apparent to me, failing that I plan to post some patches with specific questions to see what others have to say. Ian --
