When replying, please rewrite the Subject: appropriately and attempt to cc the
relevant developers and mailing lists, thanks.consolidate-ptrace_detach.patch
slow-down-printk-during-boot.patch
slow-down-printk-during-boot-fix-2.patch
slow-down-printk-during-boot-fix-3.patch
slow-down-printk-during-boot-fix-4.patch
clockevents-fix-bogus-next_event-reset-for-oneshot-broadcast-devices.patchMerge
exit-acpi-processor-module-gracefully-if-acpi-is-disabled.patch
acpi-enable-c3-power-state-on-dell-inspiron-8200.patch
acpi-add-reboot-mechanism.patch
hibernation-make-sure-that-acpi-is-enabled-in-acpi_hibernation_finish.patch
acpi-clean-up-acpi_enter_sleep_state_prep.patch
acpi-sbs-fix-sbs-add-alarm-patch.patch
acpi-suppress-uninitialized-var-warning.patch
acpi-fix-bdc-handling-in-drivers-acpi-sleep-procc.patchSend to lenb
sound-snd_register_device_for_dev-fix.patch
Send to perex & tiwai
working-3d-dri-intel-agpko-resume-for-i815-chip.patch
fix-use-after-free--double-free-bug-in-amd_create_gatt_pages--amd_free_gatt_pages.patch
generic-ac97-mixer-modem-oss-use-list_for_each_entry.patchSend to airlied
documentation-arm-00-index-add-missing-entries.patch
at91-remove-at91_lcdch.patch
make-power-supply-class-available-for-arm-architecture.patchSend to rmk
fix-auditscc-kernel-doc.patch
Send to viro
fs-cifs-connectc-kmalloc-memset-conversion-to-kzalloc.patch
Send to sfrench
cpufreq-move-policys-governor-initialisation-out-of-low-level-drivers-into-cpufreq-core.patch
cpufreq-allow-ondemand-and-conservative-cpufreq-governors-to-be-used-as-default.patch
allow-ondemand-and-conservative-cpufreq-governors-to-be-used-as-default-kconfig-fix.patch
cpufreq-mark-hotplug-notifier-callback-as-__cpuinit.patch
cpufreq-implement-config_cpu_freq-stub-for.patch
cpufreq_stats-misc-cpuinit-section-annotations.patchSend to davej
git-powerpc.patch
powerpc-vdso-install-unstripped-copies-on-disk.patch
powerpc-vdso-install-unstripped-copies-on-disk-up...
Good, fine by me; but forces me to confess, with abject shame,
that I still haven't sent you some shmem/tmpfs fixes/cleanups
(currently intermingled with some other stuff in my tree, I'm
still disentangling). Nothing so bad as to mess up a bisection,
but my loop-over-tmpfs tests hang without passing gfp_mask down
and down to add_to_swap_cache; and a few other bits. I'll get
back on to it.Hugh
-
I agree. I spent a while last week bisecting down to see why my heavily
swapping loads take 30%-60% longer with -mm than mainline, and it was
here that they went bad. Trying to keep higher orders free is costly.On the other hand, hasn't SLUB efficiency been built on the expectation
that higher orders can be used? And it would be a twisted shame for
high performance to be held back by some idiot's swapping load.Hugh
-
IMO it's a bad idea to create all these dependencies like this.
If SLUB can get _more_ performance out of using higher order allocations,
then fine. If it is starting off at a disadvantage at the same order, then it
that should be fixed first, right?-
Very interesting. I had agreed with these patches being pulled but it
was simply on the grounds that there was no agreement it was the right
thing to do. It was best to have mainline and -mm behave the same from a
fragmentation perspective and revisit this idea from scratch. That itMy belief is that SLUB can still use the higher orders if configured to
do so at boot-time. The loss of these patches means it won't try and do
it automatically. Christoph will chime in I'm sure.--
Mel Gorman-
The larger order allocations may cause excessive reclaim under certain
circumstances. Reclaim will continue to evict pages until a larger order
page can be coalesced. And it seems that this eviction is not that wellYou can still manually configure those at boot time via slub_max_order
etc.I think Mel and I have to rethink how to do these efficiently. Mel has
some ideas and there is some talk about using the vmalloc fallback to
insure that things always work. Probably we may have to tune things so
that fallback is chosen if reclaim cannot get us the larger order page
with reasonable effort.The maximum order of allocation used by SLUB may have to depend on the
number of page structs in the system since small systems (128M was the
case that Peter found) can easier get into trouble. SLAB has similar
measures to avoid order 1 allocations for small systems below 32M.-
A patch like this? This is based on the number of page structs on the
system. Maybe it needs to be based on the number of MAX_ORDER blocks
for antifrag?SLUB: Determine slub_max_order depending on the number of pages available
Determine the maximum order to be used for slabs and the mininum
desired number of objects in a slab from the amount of pages that
a system has available (like SLAB does for the order 1/0 distinction).For systems with less than 128M only use order 0 allocations (SLAB does
that for <32M only). The order 0 config is useful for small systems to
minimize the memory used. Memory easily fragments since we have less than
32k pages to play with. Order 0 insures that higher order allocations are
minimized (Larger orders must still be used for objects that do not fit
into order 0 pages).Then step up to order 1 for systems < 256000 pages (1G)
Order 2 limit to systems < 1000000 page structs (4G)
Order 3 for systems larger than that.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/slub.c | 49 +++++++++++++++++++++++++------------------------
1 file changed, 25 insertions(+), 24 deletions(-)Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-10-02 09:26:16.000000000 -0700
+++ linux-2.6/mm/slub.c 2007-10-02 16:40:22.000000000 -0700
@@ -153,25 +153,6 @@ static inline void ClearSlabDebug(struct
/* Enable to test recovery from slab corruption on boot */
#undef SLUB_RESILIENCY_TEST-#if PAGE_SHIFT <= 12
-
-/*
- * Small page size. Make sure that we do not fragment memory
- */
-#define DEFAULT_MAX_ORDER 1
-#define DEFAULT_MIN_OBJECTS 4
-
-#else
-
-/*
- * Large page machines are customarily able to handle larger
- * page orders.
- */
-#define DEFAULT_MAX_ORDER 2
-#define DEFAULT_MIN_OBJECTS 8
-
-#endif
-
/*
* Mininum number of partial slabs. These will be left on the partial
* lists even if they are empty. kmem_cache...
These are fine to me, but should not all go through my tree
Did you resend it? I have nothing pending currently. I rejected
also quite a few of these.The clockevents patches are not included in this; but given the recent
+ * For each page in the address space, this file contains one long
+ * representing the corresponding physical page frame number (PFN) orThis required changes from review I think. And the previous patch is useless
without a boot protocol.-Andi
-
I assume you're referring to just
You did? I'd have dropped them if you had.
Oh well, I was planning on a maintainer patch-bombing tomorrow - let's go
hm, well, I hope you and Thomas are on the same page regarding precisely
How come? Memoryless node can and do occur in real-world machines. Kernel
Well that would be bad.
What's the issue here? Both 32-bit and 64-bit userspace will see 64-bit
data. So the problem is that 32-bit applications on 32-bit kernels will
see 32-bit data but 32-bit applications on 64-bit kernels will see 64-bit
data?If so, that might be OK - the app just needs a reliable way of working out
So should I drop them?
-
i'm curious, which "recent trouble" do you refer to? (The NMI watchdog
bug [which is off by default] was fixed quickly. The C1E bug was foundi'd like to see the 64-bit clockevents (& dynticks) patches merged.
Demand from users and distros is high: the 64-bit CE patches are merged
into the Fedora 8 and Ubuntu kernels already, it's been in -mm and in
-rt too for a long time and powertop users demand it as well. It also
makes it obviously easier to unify the 64-bit and 32-bit code. So
there's multiple good reasons to do it.Ingo
-
C1e and now the misrouted irq 0s Thomas reported.
Also i'm a little worried about the missing C1e check; it looks
like it needs a re-review to make sure not other infrastructure wasI'm aware of that. No argument on that it will eventually go in.
-Andi
-
I had completely forgotten about the C1E problem, which we debugged
half a year ago on 32bit. I went through the other pitfalls we had in
32bit carefully again and they are all covered on 64 bit too. C1E was
the only one I missed.The irq0 problem is not a real one. The clock events code has no irq0
bound to cpuX assumption at all. The only affected part is nmi_watchdog
and I have a fix ready to handle this even for the irq#0 not on cpu#0
case.tglx
-
Done now (adding ccs)
x86_64-sparsemem_vmemmap-2m-page-size-support.patch
x86_64-sparsemem_vmemmap-vmemmap-x86_64-convert-to-new-helper-based-initialisation.patch
Look like these two should be merged togetherAlso I'm concerned about a third variant of memmappery. Can we agree
to only merge that when the old sparsemem support is removed from x86-64?That would be ugly and a little error prone (would this case really be
Yes for now please.
e.g. we at least need a patch to actually check the version number
of the boot protocol.-Andi
-
sparsemem vmemmap is a sparsemem variant. By that I mean that it uses all
the same infrastructure as sparsemem. That sparsemem code is generic code
and shared with the other architectures. There essentially is no code to
remove which is not generic and currently in use by other architectures.
The patches as they stand select the vmemmap variant unconditionally
when sparsemem is selected, we are not adding a new option for x86_64
overall -- in that sense classic sparsemem is already removed for x86_64
by these patches.The longer plan is to pull out the other memory models where they are
no longer beneficial with a view to ending up with only one. A good
example is the private virtual memory map implemented on ia64, which
is an early target. As discussed at VM summit, we are also looking atI thought that a node was a unit of numa locality. Cirtainly some
machines seem to express themselves as memory only nodes and cpu only
nodes; in the past I am sure we have also heard of IO only nodes
representing "io drawers" and the like.-apw
-
Don't think so. A node is a lump of circuitry which can have zero or more
CPUs, IO and memory.It may initially have been conceived as a memory-only concept in the Linux
kernel, but that doesn't fully map onto reality (does it?)There was a real-world need for this, I think from the Fujitsu guys. That
I guess it wouldn't be too hard for a 64-bit kernel to fake up 32-bit data
for 32-bit userspace. For each architecture :( But let's see what Matt-
On Tue, 2 Oct 2007 00:18:09 -0700
Yes, Fujitsu and HP guys really need this memory-less-node support.
Thanks,
-Kame-
The SGI guys also need this support in the future.
-
Anton's post (http://marc.info/?l=linux-mm&m=118133042025995&w=2) (and
my subsequent reposts) may have helped prompt this full series, which
then was picked up and shown to be useful to other folks. NUMA systems
with memoryless nodes exist in the wild and Linux did not do the right
thing there. Admittedly, Anton's case is hugetlb-specific, but the fix
I've been proposing (and hope to repost soon) depends on Christoph's
patches, especially the THISNODE fix.Thanks,
Nish
-
For what reason, please?
-
On Tue, 2 Oct 2007 00:43:24 -0700
For fujitsu, problem is called "empty" node.
When ACPI's SRAT table includes "possible nodes", ia64 bootstrap(acpi_numa_init)
creates nodes, which includes no memory, no cpu.I tried to remove empty-node in past, but that was denied.
It was because we can hot-add cpu to the empty node.
(node-hotplug triggered by cpu is not implemented now. and it will be ugly.)For HP, (Lee can comment on this later), they have memory-less-node.
As far as I hear, HP's machine can have following configration.(example)
Node0: CPU0 memory AAA MB
Node1: CPU1 memory AAA MB
Node2: CPU2 memory AAA MB
Node3: CPU3 memory AAA MB
Node4: Memory XXX GBAAA is very small value (below 16MB) and will be omitted by ia64 bootstrap.
After boot, only Node 4 has valid memory (but have no cpu.)Maybe this is memory-interleave by firmware config.
Thanks,
-Kame-
Future SGI platforms (actually also current one can have but nothing like
that is deployed to my knowledge) have nodes with only cpus. Current SGI
platforms have nodes with just I/O that we so far cannot manage in the
core. So the arch code maps them to the nearest memory node.
-
From memory-hotplug view, memory-less node is very helpful.
It can define and arrange some "halfway conditions" of node hot-plug.
I guess that node unpluging code will be simpler by it.Bye.
--
Yasunori Goto-
For the HP platforms, we can configure each cell with from 0% to 100%
"cell local memory". When we configure with <100% CLM, the "missing
percentages" are interleaved by hardware on a cache-line granularity to
improve bandwidth at the expense of latency for numa-challenged
applications [and OSes, but not our problem ;-)]. When we boot Linux on
such a config, all of the real nodes have no memory--it all resides in a
single interleaved pseudo-node.When we boot Linux on a 100% CLM configuration [== NUMA], we still have
the interleaved pseudo-node. It contains a few hundred MB stolen from
the real nodes to contain the DMA zone. [Interleaved memory resides at
phys addr 0]. The memoryless-nodes patches, along with the zoneorder
patches, support this config as well.Also, when we boot a NUMA config with the "mem=" command line,
specifying less memory than actually exists, Linux takes the excluded
memory "off the top" rather than distributing it across the nodes. This
can result in memoryless nodes, as well.Lee
-
Agreed!
Lee
-
Grumble. The options are:
a) export it in the kernel's native size and have userspace figure it
out
b) add a header
c) lie to 32-bit apps on 64-bit kernels
d) always export 32 bits
e) always export 64 bitsI started with (a), switched to (b), and then Alan and Dave convinced
me to switch back to (a). I don't think (c) is desireable, especially
as it means having two code paths. (d) would work until memory got
large enough that PFNs didn't fit in 32 bits. (e) would be ok all
around, except for the extra overhead. Ho hum.--
Mathematics is the supreme nostalgia of our time.
-
If the overhead of (e) is ok for 64bit it should be ok for 32bit too.
-Andi
-
Needs more work to fix problems raised by viro. I am cooking up a
patch but it won't be ready for 2.6.24.
-
Andrew-
Could you please add the trace patches to the merge list?
These patches have been very well reviewed on lkml. I believe they are
ready to be merged. The patches can be found here:
http://lkml.org/lkml/2007/10/2/236
http://lkml.org/lkml/2007/10/2/237
http://lkml.org/lkml/2007/10/2/238
http://lkml.org/lkml/2007/10/2/239Dave Wilder
-
No, hold it please, until v4l extension will be developped and bayer->rgb
thanks,
--
Jiri Slaby (jirislaby@gmail.com)
Faculty of Informatics, Masaryk University
-
Hi, Andrew,
I mostly agree with your decision. I am a little concerned however
that as we develop and add more features (a.k.a better statistics/
forced reclaim), which are very important; the code base gets larger,
the review takes longer :)I was hopeful of getting the bare minimal infrastructure for memory
control in mainline, so that review is easy and additional changes
can be well reviewed as well.Here are the pros and cons of merging the memory controller
Pros
1. Smaller size, easy to review and merge
2. Incremental development, makes it easier to maintain the
codeCons
1. Needs more review like you said
2. Although the UI is stable, it's a good chance to review
it once more before merging the code into mainlineHaving said that, I'll continue testing the patches and make the
solution more complete and usable.--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
I agree with putting the memory controller stuff on hold from 2.6.24.
Sorry, Balbir, I've failed to get back to you, still attending to
priorities. Let me briefly summarize my issue with the mem controller:
you've not yet given enough attention to swap.I accept that full swap control is something you're intending to add
incrementally later; but the current state doesn't make sense to me.The problems are swapoff and swapin readahead. These pull pages into
the swap cache, which are assigned to the cgroup (or the whatever-we-
call-the-remainder-outside-all-the-cgroups) which is running swapoff
or faulting in its own page; yet they very clearly don't (in general)
belong to that cgroup, but to other cgroups which will be discovered
later.I did try removing the cgroup mods to mm/swap_state.c, so swap pages
get assigned to a cgroup only once it's really known; but that's not
enough by itself, because cgroup RSS reclaim doesn't touch those
pages, so the cgroup can easily OOM much too soon. I was thinking
that you need a "limbo" cgroup for these pages, which can be attacked
for reclaim along with any cgroup being reclaimed, but from which
pages are readily migrated to their real cgroup once that's known.But I had to switch over to other work before trying that out:
perhaps the idea doesn't really fly at all. And it might well
be no longer needed once full mem+swap control is there.So in the current memory controller, that unuse_pte mem charge I was
originally worried about failing (I hadn't at that point delved in
to see how it tries to reclaim) actually never fails (and never
does anything): the page is already assigned to some cgroup-or-
whatever and is never charged to vma->vm_mm at that point.And small point: once that is sorted out and the page is properly
assigned in unuse_pte, you'll be needing to pte_unmap_unlock and
pte_offset_map_lock around the mem_cgroup_charge call there -
you're right to call it with GFP_KERNEL, but cannot do so while
holding the pa...
One comment on swap - ideally it should be a separate subsystem from
the memory controller. That way people who are using cpusets to
provide memory isolation (rather than using the page-based memory
controller) can also get swap isolation.Paul
-
I am open to suggestions and ways and means of making swap control
I understand what your trying to say, but with several approaches that
we tried in the past, we found caches the hardest to most accurately
account. IIRC, with readahead, we don't even know if all the pages
readahead will be used, that's why we charge everything to the cgroup--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
Well, swap control is another subject. I guess for that you'll need
to track which cgroup each swap page belongs to (rather more expensive
than the current swap_map of unsigned shorts). And I doubt it'll be
swap control as such that's required, but control of rss+swap.But here I'm just worrying about how the existence of swap makes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^Yes, readahead is anyway problematic. My guess is that in the file
cache case, you'll tend not to go too far wrong by charging to the
one that added - though we're all aware that's fairly unsatisfactory.My point is that in the swap cache case, it's badly wrong: there's
no page more obviously owned by a cgroup than its anonymous pages
(forgetting for a moment that minority shared between cgroups
until copy-on-write), so it's very wrong for swapin readahead
or swapoff to go charging those to another or to no cgroup.Imagine a cgroup at its rss limit, with more out on swap. Then
another cgroup does some swap readahead, bringing pages private
to the first into cache. Or runs swapoff which actually plugs
them into the rss of the first cgroup, so it goes over limit.Those are pages we'd want to swap out when the first cgroup
faults to go further over its limit; but they're now not evenMy answer is definitely yes. I'm not suggesting that you need
general migration between cgroups at this stage (something for
later quite likely); but I am suggesting you need one pseudo-cgroup
to hold these cases temporarily, and that you cannot properly trackUmm, please explain what's excellent about that.
Hugh
-
I see what you mean now, other people have recommending a per cgroup
Ideally, pages would not reside for too long in swap cache (unless
I've misunderstood swap cache or there are special cases for tmpfs/
ramfs). Once pages have been swapped back in, they get assigned
back to their respective cgroup's in do_swap_page() (where we charge
them back to the cgroup).The swap cache pages will be the first ones to go, once the cgroup
exceeds its limit.There might be gaps in my understanding or I might be missing a use
In the past people have used names like default cgroup, we could use
If what I understand and discussed earlier is, then we don't need
to go this route. But I think the idea of having a pseduo cgroupNothing really, I was glad that we dont fail, even though we might
assign pages to some other cgroup. Not really exciting, but not
failing was a relief :-) In summary, there's nothing excellent--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
Sounds too inflexible, and too many swap areas to me. Perhaps the
right answer will fall in between: assign clusters of swap pages toThinking particularly of those brought in by swapoff or swap readahead:
some will get attached to mms once accessed, others will simply get
freed when tasks exit or munmap, others will hang around until they
reach the bottom of the LRU and are reclaimed again by memory pressure.But as your code stands, that'll be total memory pressure: in-cgroup
memory pressure will tend to miss them, since typically they're
assigned to the wrong cgroup; until then their presence is liableramfs pages are always in RAM, never go out to swap, no need to
worry about them in this regard. But tmpfs pages can indeed go
out to swap, so whatever we come up with needs to make sense
with them too, yes. I don't think its swapoff/readahead issues
are any harder to handle than the anonymous mapped page case,That's where it should happen, yes; but my point is that it very
often does not. Because the swap cache page (read in as part of
the readaround cluster of some other cgroup, or in swapoff by some
other cgroup) is already assigned to that other cgroup (by the
mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
page_cgroup exists and the page has already been accounted" route
when mem_cgroup_charge is called from do_swap_page. Doesn't it?Are we misunderstanding each other, because I'm assuming
MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
though I can't see that _MAPPED and _CACHED are actually supported,
there being no reference to them outside the enum that defines them.Or are you deceived by that ifdef NUMA code in swapin_readahead,
which propagates the fantasy that swap allocation follows vma layout?
That nonsense has been around too long, I'll soon be sending a patchOkay, thanks.
Hugh
-
Yes, depending on the number of cgroups, we'll need to share swap
in-cgroup pressure will not affect them, since they are in different
cgroups. If there is pressure in the cgroup to which they are wronglyYou are right, at this point I am beginning to wonder if I should
account for the swap cache at all? We account for the pages in RSS
and when the page comes back into the page table(s) via do_swap_page.
If we believe that the swap cache is transitional and the current
expected working behaviour does not seem right or hard to fix,
it might be easy to ignore unuse_pte() and add/remove_from_swap_cache()
for accounting and control.The expected working behaviour of the memory controller is that
currently, as you point out several pages get accounted to the
cgroup that initiates swapin readahead or swapoff. On
cgroup pressure (the one that initiated swapin or swapoff), the
cgroup would discard these pages first. These pages are discarded
from the cgroup, but still live on the global LRU.When the original cgroup is under pressure, these pages might not
be effected as they belong to a different cgroup, which might notI am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
discussion. The accounting is split into mem_cgroup_charge() and
mem_cgroup_cache_charge(). While charging the caches is when weThe swapin readahead code under #ifdef NUMA is very confusing. I also
noticed another confusing thing during my test, swap cache does not
drop to 0, even though I've disabled all swap using swapoff. May be
those are tmpfs pages. The other interesting thing I tried was running
swapoff after a cgroup went over it's limit, the swapoff succeeded,
but I see strange numbers for free swap. I'll start another threadI meant for the wrong cgroup, in the wrong cgroup, these will be the
first set of pages to be reclaimed.--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
It would be wrong to ignore the unuse_pte() case: what it's intending
to do is correct, it's just being prevented by the swapcache issue
from doing what it intends at present.(Though I'm not thrilled with the idea of it causing an admin's
swapoff to fail because of a cgroup reaching mem limit there, I do
agree with your earlier argument that that's the right thing to happen,
and it's up to the admin to fix things up - my original objection came
from not realizing that normally the cgroup will reclaim from itself
to free its mem. Hmm, would the charge fail or the mm get OOM'ed?)Ignoring add_to/remove_from swap cache is what I've tried before,
and again today. It's not enough: if you trying run a memhog
(something that allocates and touches more memory than the cgroup
is allowed, relying on pushing out to swap to complete), then that
works well with the present accounting in add_to/remove_from swap
cache, but it OOMs once I remove the memcontrol mods from
mm/swap_state.c. I keep going back to investigate why, keep on
thinking I understand it, then later realize I don't. Please
give it a try, I hope you've got better mental models than I have.And I don't think it will be enough to handle shmem/tmpfs either;
but won't worry about that until we've properly understood whyIt checks MEM_CGROUP_TYPE_ALL there, yes; but I can't find anything
checking for either MEM_CGROUP_TYPE_MAPPED or MEM_CGROUP_TYPE_CACHED.
(Or is it hidden in one of those preprocesor ## things which frustrateThose indeed are strange behaviours (if the swapoff really has
succeeded, rather than lying), I not seen such and don't have an
explanation. tmpfs doesn't add any weirdness there: when there's
no swap, there can be no swap cache. Or is the swapoff still in
progress? While it's busy, we keep /proc/meminfo looking sensible,
but <Alt><SysRq>m can show negative free swap (IIRC).I'll be interested to hear what your investigation shows.
Hugh
-
I'm glad we have that sorted out.
I will try it. Another way to try it, is to set memory.control_type
to 1, that removes charging of cache pages (both swap cache
and page cache). I just did a quick small test on the memory
controller with swap cache changes disabled and it worked fine
for me on my UML image (without OOMing). I'll try the same test
on a bigger box. Disabling swap does usually cause an
OOM for workloads using anonymous pages if the cgroup goesMEM_CGROUP_TYPE_ALL is defined to be (MEM_CGROUP_TYPE_CACHED |
MEM_CGROUP_TYPE_MAPPED). I'll make that more explicit with a patch.With the new OOM killer changes, I see negative swap. When I run swapoff
with a memory hogger workload, I see (after swapoff succeeds)....
Swap cache: add 473215, delete 473214, find 31744/36688, race 0+0
Free swap = 18446744073709105092kB
Total swap = 0kB
Free swap: -446524kB--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
On Tue, 02 Oct 2007 09:51:11 +0530
I am not yet convinced that the way the memory controller code and
lumpy reclaim have been merged is correct. I am combing through the
code now and will send in a patch when I figure out if/what is wrong.I ran into this because I'm trying to merge the split VM code up to
the latest -mm...--
All Rights Reversed
-
Hi, Rik,
Do you mean the way the memory controller and lumpy reclaim work
together? The reclaim in memory controller (on hitting the limit) is not
lumpy. Would you like to see that change?Interesting, I'll see if I can find some spare test cycles to help test
this code.--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
On Mon, Oct 01, 2007 at 02:22:22PM -0700, Andrew Morton wrote:
I have 4 more patches on writeback, 3 of them fix *new* problems that
was introduced by the above patch. I'd recommend to merge them as a
whole - either for 2.6.24 or for 2.6.25.I'll post them right now.
-
On Mon, Oct 01, 2007 at 02:22:22PM -0700, Andrew Morton wrote:
can you please add http://lkml.org/lkml/2007/7/30/98 also to the misc-queue for
the warning still persists and the patch is good to go as is (against current git
v2.6.23-2840-g752097c, for example) albeit with a little fuzziness.
--
Regards/Gruß,
Boris.
-
I got completely fed up with maintaining that patch against ongoing churn
frenzy in Greg's trees so I dropped it.If/when things settle down in that area someone will need to redo the
patch.-
/me wondering: what if you pass it on upstream to Linus, since:
1. it applies cleanly now
2. is pretty trivial
and forget it about it forever :). It seems the place this
patch touches is the only place where kobject_* and sysfs_create_* etc. error
codes are not being handled in contrast to all those functions which have been declared
__must_check?--
Regards/Gruß,
Boris.
-
I guess we need some more time for this patchset to mature, yes. But
the four patches I've quoted about are just small preparator cleanups
that should go into 2.6.24.-
I don't know if Linus actually disliked the patch itself, or disliked
my (maybe confusingly worded) rationale?To clarify: it is not zero_page that fundamentally causes a problem,
but it is a problem that was exposed when I rationalised the page
refcounting in the kernel (and mapcounting in the mm).I see about 4 things we can do:
1. Nothing
2. Remove zero_page
3. Reintroduce some refcount special-casing for the zero page
4. zero_page per-node or per-cpu or whatever1 and 2 kind of imply that nothing much sane should use the zero_page
much (the former also implies that we don't care much about those who
do, but in that case, why not go for code removal?).3 and 4 are if we think there are valid heavy users of zero page, or we
are worried about hurting badly written apps by removing it. If the former,
I'd love to hear about them; if the latter, then it definitely is a valid
concern and I have a patch to avoid refcounting (but if this is the caseCare to give it one more round through -mm? Is it easy enough to
keep? I haven't had a chance to review it, which I'd like to do at some
point (and I don't think it would hurt to have a bit more testing).
-
Yes. I'd happily accept the patch, but I'd want it clarified and made
obvious what the problem was - and it wasn't the zero page itself, it was
a regression in the VM that made it less palatable.I also thought that there were potentially better solutions, namely to
simply avoid the VM regression, but I also acknowledge that they may not
be worth it - I just want them to be on the table.In short: the real cost of the zero page was the reference counting on the
page that we do these days. For example, I really do believe that the
problem could fairly easily be fixed by simply not considering zero_page
to be a "vm_normal_page()". We already *do* have support for pages not
getting ref-counted (since we need it for other things), and I think that
zero_page very naturally falls into exactly that situation.So in many ways, I would think that turning zero-page into a nonrefcounted
page (the same way we really do have to do for other things anyway) would
be the much more *direct* solution, and in many ways the obvious one.HOWEVER - if people think that it's easier to remove zero_page, and want
to do it for other reasons, *AND* can hopefully even back up the claim
that it never matters with numbers (ie that the extra pagefaults just make
the whole zero-page thing pointless), then I'd certainly accept the patch.I'd just want the patch *description* to then also be correct, and blame
the right situation, instead of blaming zero-page itself.Linus
-
That was my first approach. It isn't completely trivial, but
vm_normal_page() does play a part (but we end up needing a
vm_normal_page() variant -- IIRC vm_normal_or_zero_page()). But taken
as a whole, non-refcounted zero_page is obviously a lot more work thanI have done some tests which indicate a couple of very basic common tools
don't do much zero-page activity (ie. kbuild). And also combined with some
logical arguments to say that a "sane" app wouldn't be using zero_page much.
(basically -- if the app cares about memory or cache footprint and is using
many pages of zeroes, then it should have a more compressed representation
of zeroes anyway).However there is a window for some "insane" code to regress without the
zero_page. I'm not arguing that we don't care about those, however I have
no way to guarantee they don't exist. I hope we wouldn't get a potentially
useless complexity like this stuck in the VM forever just because we don't
_know_ whether it's useful to anybody...How about something like this?
---
From: Nick Piggin <npiggin@suse.de>The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note
A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
(and thus mapcounted and count towards shared rss). These writes to
the struct page could cause excessive cacheline bouncing on big
systems. There are a number of ways this could be addressed if it is
an issue.And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu Z...
Sorry, I've no useful arguments to add (and my testing was too much
like yours to add any value), but I do want to go on record as still
a strong supporter of approach 3 and your patch.Hugh
-
One of the things that zero-page has been used for is absolutely *huge*
(but sparse) arrays in Fortan programs.At least in traditional fortran, it was very hard to do dynamic
allocations, so people would allocate the *maximum* array statically, and
then not necessarily use everything. I don't know if the pages ever even
got paged in, but this is the kind of usage which is *not* insane.Linus
-
In which case, they would not be using the ZERO_PAGE?
If they were paging in (ie. reading) huge reams of zeroes,
then maybe their algorithms aren't so good anyway? (I don'tYeah, that's why I use the double quotes... I wonder how to
find out, though. I guess I could ask SGI if they could ask
around -- but that still comes back to the problem of not being
able to ever conclusively show that there are no real users of
the ZERO_PAGE.Where do you suggest I go from here? Is there any way I can
convince you to try it? Make it a config option? (just kidding)
-
No, I'll take the damn patch, but quite frankly, I think your arguments
suck.I've told you so before, and asked for numbers, and all you do is
handwave. And this is like the *third*time*, and you don't even seem to
admit that you're handwaving.So let's do it, but dammit:
- make sure there aren't any invalid statements like this in the final
commit message.
- if somebody shows that you were wrong, and points to a real load,
please never *ever* make excuses for this again, ok?Is that a deal? I hope we'll never need to hear about this again, but I
really object to the way you've tried to "sell" this thing, by basically
starting out dishonest about what the problem was, and even now I've yet
to see a *single* performance number even though I've asked for them
(except for the problem case, which was introduced by *you*)Linus
-
I gave 2 other numbers. After that, it really doesn't matter if I give
you 2 numbers or 200, because it wouldn't change the fact that thereI think I've always admitted I'm handwaving in my assertion that programs
would not be using the zero page. My handwaving is an attempt to show that
I have some vaguely reasonable reasons to think it will be OK to remove it.The dishonesty in the changelog is more of an oversight than an attempt
to get it merged. It never even crossed my mind that you would be fooled
by it ;) To prove my point: the *first* approach I posted to fix this
problem was exactly a patch to special-case the zero_page refcounting
which was removed with my PageReserved patch. Neither Hugh nor yourself
liked it one bit!So I have no particular bias against the zero page or problem admitting
I introduced the issue. I do just think this could be a nice opportunityBasically: I don't know what else to show you! I expect it would be
relatively difficult to find a measurable difference between no zero-page
and zero-page with no refcounting problem. Precisely because I can't
find anything that really makes use of it. Again: what numbers can I
get for you that would make you feel happier about it?Anyway, before you change your mind: it's a deal! If somebody screams
then I'll have a patch for you to reintroduce the zero page minus
refcounting the next day.
-
True (speaking for me; I forget whether Linus ever got to see it).
I apologize to you, Nick, for getting you into this position of
fighting for something which wasn't your choice in the first place.If I thought we'd have a better kernel by dropping this patch and
going back to one that just avoids the refcounting, I'd say do it.
No, I still think it's worth trying this one first.But best have your avoid-the-refcounting patch ready and reviewed
for emergency use if regression does show up somewhere.Thanks,
Hugh[My mails out are at present getting randomly delayed by six hours or
so, which makes it extra hard for me to engage usefully in any thread.]
-
The problem is, those first "remove ref-counting" patches were ugly
*regardless* of ZERO_PAGE.We (yes, largely I) fixed up the mess since. The whole vm_normal_page()
and the magic PFN_REMAP thing got rid of a lot of the problems.And I bet that we could do something very similar wrt the zero page too.
Basically, the ZERO page could act pretty much exactly like a PFN_REMAP
page: the VM would not touch it. No rmap, no page refcounting, no nothing.This following patch is not meant to be even half-way correct (it's not
even _remotely_ tested), but is just meant to be a rough "grep for
ZERO_PAGE in the VM, and see what happens if you don't ref-count it".Would something like the below work? I dunno. But I suspect it would. I
doubt anybody has the energy to actually try to actually follow through on
it, which is why I'm not pushing on it any more, and why I'll accept
Nick's patch to just remove ZERO_PAGE, but I really *am* very unhappy
about this.The "page refcounting cleanups" in the VM back when were really painful.
And dammit, I felt like I was the one who had to clean them up after you
guys. Which makes me really testy on this subject.And yes, I also admit that the vm_normal_page() and the PFN_REMAP thing
ended up really improving the VM, and we're pretty much certainly better
off now than we were before - but I also think that ZERO_PAGE etc could
easily be handled with the same model. After all, if we can make
"mmap(/dev/mem)" work with COW and everything, I'd argue that ZERO_PAGE
really is just a very very small special case of that!Totally half-assed untested patch to follow, not meant for anything but a
"I think this kind of approach should have worked too" comment.So I'm not pushing the patch below, I'm just fighting for people realizing
that- the kernel has *always* (since pretty much day 1) done that ZERO_PAGE
thing. This means that I would not be at all surprised if some
application basically depends on it. I've writte...
Sure it will work. It's not completely trivial like your patch,
though. The VM has to know about ZERO_PAGE if you also want it
to do the "optimised" wp (what you have won't work because it
will break all other "not normal" pages which are non-zero I think).And your follow_page_page path is not going to do the right thing
Sure they have.
http://marc.info/?l=linux-mm&m=117515508009729&w=2
OK, this patch was open coding the tests rather than putting them in
vm_normal_page, but vm_normal_page doesn't magically make it a whole
lot cleaner (a _little_ bit cleaner, I agree, but in my current patchOK, but in this case we'll not have a big hard-to-revert set of
changes that fundamentally alter assumptions throughout the vm.
It will be more a case of "if somebody screams, put the zero pageThat's the main question. Maybe my wording was a little strong, but
I simply personally couldn't think of sane uses of zero page. I'm
not prepared to argue that none could possibly exist.It just seems like now might be a good time to just _try_ removing
the zero page, because of this peripheral problem caused by my
refcounting patch. If it doesn't work out, then at least we'll be
wiser for it, we can document why the zero page is needed, and addOK, maybe this is where we are not on the same page.
There are 2 issues really. Firstly, performance problem of
refcounting the zero-page -- we've established that it causes
this livelock and that we should stop refcounting it, right?Second issue is the performance difference between removing the
zero page completely, and de-refcounting it (it's obviously
incorrect to argue for zero page removal for performance reasons
if the performance improvement is simply coming from avoiding
the refcounting). The problem with that is I simply don't know
any tests that use the ZERO_PAGE significantly enough to measure
a difference. The 1000 COW faults vs < 1 unmap per second thing
was simply to show that, on the micro level, performance won't...
Yes. Let's do your patch immediately after the x86 merge, and just see if
anybody screams.It might take a while, because I certainly agree that whoever would be
Well, even if it's a "when you don't get into the bad behaviour,
performance difference is not measurable", and give a before-and-after
number for some random but interesting load. Even if it's just a kernel
compile..Linus
-
You gave me no timings what-so-ever. Yes, you said "1000 page faults", but
no, I have yet to see a *single* actual performance number.Maybe I missed it? Or maybe you just never did them.
Was it really so non-obvious that I actually wanted *performance* numbers,
not just some random numbers about how many page faults you have? Or did
you post them somewhere else? I don't have any memory of having seen any
performance numbers what-so-ever, but I admittedly get too much email.Here's three numbers of my own: 8, 17 and 975.
So I gave you "numbers", but what do they _mean_?
So let me try one more time:
- I don't want any excuses about how bad PAGE_ZERO is. You made it bad,
it wasn't bad before.
- I want numbers. I want the commit message to tell us *why* this is
done. The numbers I want is performance numbers, not handwave numbers.
Both for the bad case that it's supposed to fix, *and* for "normal
load".
- I want you to just say that if it turns out that there are people who
use ZERO_PAGE, you stop calling them crazy, and promise to look at the
alternatives.How much clearer can I be? I have said several times that I think this
patch is kind of sad, and the reason I think it's sad is that you (and
Hugh) convinced me to take the patch that made it sad in the first place.
It didn't *use* to be bad. And I've use ZERO_PAGE myself for timing, I've
had nice test-programs that knew that it could ignore cache effects and
get pure TLB effects when it just allocated memory and didn't write to it.That's why I don't like the lack of numbers. That's why I didn't like the
original commit message that tried to blame the wrong part. That's why I
didn't like this patch to begin with.But I'm perfectly ready to take it, and see if anybody complains.
Hopefully nobody ever will.But by now I absolutely *detest* this patch because of its history, and
how I *told* you guys what the reserved bit did, and how you totally
ignored it, and then...
Sure.
-
Oh dear ... after looking at the following to figure out what
a wibble is, I wonder which one Andrew had in mind:http://www.urbandictionary.com/define.php?term=wibble
The insanity, the rubbish, being overwhelmed, ... ?
<grin>
If one of Nick or I can knock some sense into the others head,
then this saga should come to a close soon.--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
In the meantime, that patch should be merged though, shouldn't it?
cpusets is currently telling the scheduler to do the wrong thing WRT
the user interface definition of cpusets, right?
-
> In the meantime, that patch should be merged though, shouldn't it?
Which patch do you refer to:
1) the year old patch to disconnect cpusets and sched domains:
cpuset-remove-sched-domain-hooks-from-cpusets.patch
2) my patch of a few days ago to add a 'sched_load_balance' flag:
cpuset and sched domains: sched_load_balance flagI can't push one without the other, because some real time folks are
depending on the sched domain hooks that (1) would remove, so need some
alternative, such as in (2). Even though (1) is rather broken, as you
note, it still provides a way that the real time folks can disable load
balancing at runtime on selected CPUs, so is essential to their work.I can't delay any more resolving this, because the cgroup (aka
container) code is tangled up with (1), and Andrew needs a clear path
to send cgroups to Linus real soon now.In my last message to you, a couple of days ago, I asked what I thought
were a couple of key and simple questions -- can sched domains overlap,
and what does it mean for user space if they overlap? A further
question comes to mind now -- if sched domains can overlap, does this
provide some capability to user space that is important to provide?Could you take a minute, Nick, to consider these questions? Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
If code isn't ready to go, it doesn't need to rush, it can just be untangled
Yeah, it arrived after I had a 24 hour flight. I just see it now.
-
True ... though we seem to be going in circles now. I doubt
taking longer will help much; we should strive to resolve this
now, if we can.--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Please, work out what you want to do from a design perspective, then cook
up a patch against rc8-mm2.
-
I'm not sure polluting /sys/block/<foo>/queue/ like that is The Right
Thing. These patches sure were handy when debugging this, but not sure
they want to move to maineline.Maybe we want /sys/bdi/<foo>/ or maybe /debug/bdi/<foo>/
Opinions?
-
I'd vote for /sys/bdi/.
It will be more than debug variables. It's good to expose per-bdi
tunable parameters and allow one to view bdi states.It would also allow one to tune things like NFS readahead :-)
-
Hi Peter,
my only opinion is that it is great to see that stuff moving into
mainline. If it really goes in, there will be one more very interested
rc-tester :-)Cheers
Martin------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
-
If you think we'll need this stuff for support/debug during 2.6.24-rcX then
sure - we can always take it out prior to 2.6.24-final.otoh, if we're going to take that approach we might as well leave things in
/sys/block/<foo>/queue.-
People seem to have tested the code in various demanding scenarios (both
-mm and the back-port to .22), so I'm fairly confident that this part
works well (famous last words,.. I know I'll regret having said this).Also the NFS issue bothers me to no end.
I'll try and come up with a /sys/bdi/<foo> thing. SLUB should have some
sysfs code I can copy from - last time I looked at doing sysfs I just
gave up. God awful mess that is.-
What would be the point in another top-level tree for device
information? All devices you are exporting information for, are
already in the sysfs tree, right?Thanks,
Kay
-
Never did find NFS mounts/servers/superblocks or whatever constitutes a
BDI for NFS in there. Same goes for all other networked filesystems for
that matter.-
How about adding this information to the tree then, instead of
creating a new top-level hack, just because something that you think
you need doesn't exist.You called sysfs a mess, seems you work on that topic too. :)
Kay
-
So you suggest adding all the various network filesystems in there
(where?), and adding the concept of a BDI, and ensuring all are properlyI called the in-kernel API to create sysfs files a mess. Not that I have
another opinion on the content of /sys though, always takes to damn long
to find anything in there.-
No, I propose to add only sane new userspace interfaces. That you miss
infrastructure to use, should never be the reason to add conceptually
broken new interfaces. A new device related top-level directory in
/sys is not going to happen. Better don't add new stuff, if you can'tSure, it's a mess, we are trying to clean that up, it's hard, and
therefore we can not accept stuff like you propose, that makes it even
harder.Use debugfs, if you can't add a sane interface. There are no rules,
you can do whatever you want there. If over time the stuff you need
gets added, you can always switch over.Or add an attribute group to the existing devices, and leave the ones
out, which are not in sysfs right now, until they are added.Or use a device class and use the existing devices as parents. If you
don't have a parent, they will show up in "virtual" until someone adds
the right parent devices. (If you are going to do this, make sure
nothing depends on a device being "virtual", it must be allowed to
re-parent these device with any future kernel release.)Thanks,
Kay
-
Would something fit better under /sys/fs/? At least filesystems are
already an existing concept to userspace.
-
Sounds at least less messy than an new top-level directory.
But again, if it's "device" releated, like the name suggests, it should
be reachable from the device tree.
Which userspace tool is supposed to set these values, and at what time?
An init-script, something at device discovery/setup? If that is is ever
going to be used in a hotplug setup, you really don't want to go look
for directories with magic device names in another disconnected tree.Kay
-
Filesystems don't really map to BDIs either. One can have multiple FSs
per BDI.'Normally' a BDI relates to a block device, but networked (and other
non-block device) filesystems have to create a BDI too. So these need to
be represented some place as well.The typical usage would indeed be init scripts. The typical example
would be setting the read-ahead window. Currently that cannot be done
for NFS mounts.-
What kind of context for a non-block based fs will get the bdi controls
added? Is there a generic place, or does every non-block based
filesystem needs to be adapted individually to use it?Kay
-
---
Subject: bdi: debugfs interfaceExpose the BDI stats (and readahead window) in /debug/bdi/
I'm still thinking it should go into /sys somewhere, however I just noticed
not all block devices that have a queue have a /queue directory. Noticeably
those that use make_request_fn() as opposed to request_fn(). And then of
course there are the non-block/non-queue BDIs.A BDI is basically the object that represents the 'thing' you dirty pages
against. For block devices that is related to the block device (and is
typically embedded in the queue object), for NFS mounts its the remote server
object of the client. For FUSE, yet again something else.I appreciate the sysfs people their opinion that /sys/bdi/ might not be the
best from their POV, however I'm not seeing where to hook the BDI object from
so that it all makes sense, a few of the things are currently not exposed in
sysfs at all, like the NFS and FUSE things.So, for now, I've exposed the thing in debugfs. Please suggest a better
alternative.Miklos, Trond: could you suggest a better fmt for the bdi_init_fmt() for your
respective filesystems?Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
block/genhd.c | 2
block/ll_rw_blk.c | 1
drivers/block/loop.c | 7 ++
drivers/md/dm.c | 2
drivers/md/md.c | 2
fs/fuse/inode.c | 2
fs/nfs/client.c | 2
include/linux/backing-dev.h | 15 ++++
include/linux/debugfs.h | 11 +++
include/linux/writeback.h | 3
mm/backing-dev.c | 153 ++++++++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 2
12 files changed, 199 insertions(+), 3 deletions(-)Index: linux-2.6-2/fs/fuse/inode.c
===================================================================
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2.6-2/fs/fuse/in...
In our debugging printks, we usually use the superblock->s_id, but that
only gets initialised later.I'd suggest passing the 'dev_name' from *_get_sb() into
*_create_server().Cheers
Trond-
I just realised that such a name would contain '/' and I'm quite sure
that makes filesystems unhappy.Neil suggested using device numbers which would work, however I think
those might not be human friendly. While its easy to find the device
number of a given path (eg.: stat -c %d /), its rather hard to find the
path belonging to a given device number.--
Ram Pai had a patch which added the device number (among other things)
to /proc/mounts:Subject: [RFC2 PATCH 1/1] VFS: Augment /proc/mount with subroot and
shared-subtree
From: Ram Pai <linuxram@us.ibm.com>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@ftp.linux.org.uk>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-fsdevel@vger.kernel.org, util-linux-ng@vger.kernel.org
In-Reply-To: <20070625214640.GC29058@ram.us.ibm.com>
Content-Type: text/plain
Date: Mon, 16 Jul 2007 11:46:48 -0700
Content-Transfer-Encoding: 7bit
Sender: linux-fsdevel-owner@vger.kernel.org
X-Mailing-List: linux-fsdevel@vger.kernel.org/proc/mounts in its current state fail to disambiguate bind mounts, especially
when the bind mount is subrooted. Also it does not capture propagation state of
the mounts(shared-subtree). The following patch addresses the problem.The following additional fields to /proc/mounts are added.
propagation-type in the form of <propagation_flag>[:<mntid>][,...]
note: 'shared' flag is followed by the mntid of its peer mount
'slave' flag is followed by the mntid of its master mount
'private' flag stands by itself
'unbindable' flag stands by itselfmntid -- is a unique identifier of the mount
major:minor -- is the major minor number of the device hosting the filesystem
dir -- the subdir in the filesystem which forms the root of this mount
parent -- the id of the parent mountHere is a sample cat /proc/mounts after execution the following commands:
mount --bind /mnt /mnt
mount --make-shared /mnt
mount --bind /mnt/1 /var
mount --make-slave /var
mount --make-shared /var
mount --bind /var/abc /tmp
mount --make-unbindable /procrootfs / rootfs rw 0 0 private 2 0:1 / 2
/dev/root / ext2 rw 0 0 private 16 98:0 / 2
/proc /proc proc rw 0 0 unbindable 17 0:3 / 16
devpts /dev/pts devpts rw 0 0 private 18 0:10 / 16
/dev/root /mnt ext2 rw 0 0 shared:19 19 98:0 /mnt 16
/de...
OK, I guess that would work.
Ram Pai, what is the current status of that work?
--
For fuse:
err = bdi_init_fmt(&fc->bdi, "fuse-%llu", (unsigned long long) fc->id);
This would match the connection ID in /sys/fs/fuse/connections/*
Miklos
-
What happended to the idea to create a "bdi" class, and have the
existing devices as parents, and for stuff that is not (not now, or
never) in sysfs, no parent is set.Thanks,
Kay-
Must have forgotten about that, mainly because I'm not sure I fully
understand it.So we create a class, create these objects, which are all called bdi and
have children with these attributes in it.Now, I supposed there is a directory that lists all unparented thingies,
how do I locate the one that matches my nfs mount?-
Yes, "struct device" objects, assigned to the "bdi" class. (Don't use
Probably not. You can name it how you want, you can inherit the name of
the parent, or prefix it with whatever fits, they just need to be
unique. Things like the "fuse-%llu" name would work just fine. I guessYou look for the name (prefix), try: "ls /sys/class/sound/", it's the
same model all over the place.Kay
-
Ok, will try that. Is there a 'simple uncluttered' example I could look
at to copy from?-
drivers/firmware/dmi-id.c
It has only a single device created in the init routine, but it shows
what to do with the class.Until the block subsystem is converted from using raw kobjects to
devices, you need to set the parent kobject of the bdi device to the
blockdev:
bdidev->dev.kobj.parent = &disk->kobjKay
-
This crashes and burns on bootup, but I'm too tired to figure out what I
did wrong... will give it another try tomorrow..Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
block/genhd.c | 2
fs/fuse/inode.c | 2
fs/nfs/client.c | 2
include/linux/backing-dev.h | 33 ++++++++++++
include/linux/writeback.h | 3 +
mm/backing-dev.c | 121 ++++++++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 2
7 files changed, 162 insertions(+), 3 deletions(-)Index: linux-2.6-2/fs/fuse/inode.c
===================================================================
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2.6-2/fs/fuse/inode.c
@@ -467,7 +467,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(&fc->num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
- err = bdi_init(&fc->bdi);
+ err = bdi_init_fmt(&fc->bdi, "fuse-%llu", (unsigned long long)fc->id);
if (err) {
kfree(fc);
fc = NULL;
Index: linux-2.6-2/fs/nfs/client.c
===================================================================
--- linux-2.6-2.orig/fs/nfs/client.c
+++ linux-2.6-2/fs/nfs/client.c
@@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
goto out_error;nfs_server_set_fsinfo(server, &fsinfo);
- error = bdi_init(&server->backing_dev_info);
+ error = bdi_init_fmt(&server->backing_dev_info, "nfs-%s-%p", clp->cl_hostname, server);
if (error)
goto out_error;Index: linux-2.6-2/include/linux/backing-dev.h
===================================================================
--- linux-2.6-2.orig/include/linux/backing-dev.h
+++ linux-2.6-2/include/linux/backing-dev.h
@@ -11,6 +11,8 @@
#include <linux/percpu_counter.h>
#include <linux/log2.h>
#include <linux/proportions.h>
+#include <linux/kernel.h>
+#include <linux/dev...
Ok, can't sleep.. took a look. I have several problems here.
The thing that makes it go *boom* is the __ATTR_NULL. Removing that
makes it boot. Albeit it then warns me of multiple duplicate sysfs
objects, all named "bdi".For some obscure reason this device interface insists on using the
bus_id as name (?!), and further reduces usability by limiting that to
20 odd characters.This makes it quite useless. I tried fudging around that limit by using
device_rename and kobject_rename, but to no avail.Really, it should not be this hard to use, trying to expose a handfull
of simple integers to userspace should not take 8h+ and still not work.Peter, who thinks sysfs is contorted mess beyond his skill. I'll stick
to VM and scheduler code, that actually makes sense.-
Heh, that's funny :)
I'll look at this and see what I can come up with. Would you just like
a whole new patch, or one against this one?thanks,
greg k-h
-
Sorry for the grumpy note, I get that way at 3.30 am. Maybe I ought not
have mailed :-/This is the code I had at that time.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
block/genhd.c | 2
fs/fuse/inode.c | 2
fs/nfs/client.c | 2
include/linux/backing-dev.h | 21 ++++++
include/linux/string.h | 4 +
include/linux/writeback.h | 3
mm/backing-dev.c | 144 ++++++++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 2
mm/util.c | 42 ++++++++++++
9 files changed, 219 insertions(+), 3 deletions(-)Index: linux-2.6-2/fs/fuse/inode.c
===================================================================
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2.6-2/fs/fuse/inode.c
@@ -467,7 +467,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(&fc->num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
- err = bdi_init(&fc->bdi);
+ err = bdi_init_fmt(&fc->bdi, "bdi-fuse-%llu", (unsigned long long)fc->id);
if (err) {
kfree(fc);
fc = NULL;
Index: linux-2.6-2/fs/nfs/client.c
===================================================================
--- linux-2.6-2.orig/fs/nfs/client.c
+++ linux-2.6-2/fs/nfs/client.c
@@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
goto out_error;nfs_server_set_fsinfo(server, &fsinfo);
- error = bdi_init(&server->backing_dev_info);
+ error = bdi_init_fmt(&server->backing_dev_info, "bdi-nfs-%s-%p", clp->cl_hostname, server);
if (error)
goto out_error;Index: linux-2.6-2/include/linux/backing-dev.h
===================================================================
--- linux-2.6-2.orig/include/linux/backing-dev.h
+++ linux-2.6-2/include/linux/backing-dev.h
@@ -11,6 +11,8 @@
#include <linux/percpu_counter.h>
#include <linux/log2.h>
...
Ah, I see a few problems. Here, try this version instead. It's
compile-tested only, and should be a lot simpler.Note, we still are not setting the parent to the new bdi structure
properly, so the devices will show up in /sys/devices/virtual/ instead
of in their proper location. To do this, we need the parent of the
device, which I'm not so sure what it should be (block device? block
device controller?)Let me know if this works better, I'm off to a kids birthday party for
the day, but will be around this evening...thanks,
greg k-h
---
block/genhd.c | 2
fs/fuse/inode.c | 2
fs/nfs/client.c | 2
include/linux/backing-dev.h | 19 +++++++
include/linux/string.h | 4 +
include/linux/writeback.h | 3 +
mm/backing-dev.c | 110 ++++++++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 2
mm/util.c | 42 ++++++++++++++++
9 files changed, 183 insertions(+), 3 deletions(-)--- a/block/genhd.c
+++ b/block/genhd.c
@@ -182,6 +182,7 @@ void add_disk(struct gendisk *disk)
disk->minors, NULL, exact_match, exact_lock, disk);
register_disk(disk);
blk_register_queue(disk);
+ bdi_register(&disk->queue->backing_dev_info, "bdi-%s", disk->disk_name);
}EXPORT_SYMBOL(add_disk);
@@ -190,6 +191,7 @@ EXPORT_SYMBOL(del_gendisk); /* in partit
void unlink_gendisk(struct gendisk *disk)
{
blk_unregister_queue(disk);
+ bdi_unregister(&disk->queue->backing_dev_info);
blk_unregister_region(MKDEV(disk->major, disk->first_minor),
disk->minors);
}
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -467,7 +467,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(&fc->num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
- err = bdi_init(&fc->bdi);
+ err = bdi_init_fmt(&fc->bdi, "bdi-fuse-%llu", ...
Assigning a parent device will only work with the upcoming conversion of
the raw kobjects in the block subsystem to "struct device".Why is that here? I don't think we need this when we use the existing:
This function should accept a: "struct device *parent" and all callers
All this open-coded attribute stuff should go away and be replaced by:
bdi_class->dev_attrs = bdi_dev_attrs;
Otherwise at event time the attributes are not created and stuff hooking
into the events will not be able to set values. Also, the core will do
proper add/remove and error handling then.Thanks,
Kay-
Hi,
Thanks for the help so far, however we're still not quite there.
The below patch still has the funny 20 character name limit. Is there a
good reason its a char array like this, and not just a char * to a kstr?
The code does kstrdup all over the place, I can't imagine why suddenly
limiting it to 20 chars seems like a good idea.Mounting NFS filesystems: mount: 192.168.0.1:/mnt/ying failed, reason given by server: Permission denied
mount: 192.168.0.1:/mnt/yang failed, reason given by server: Permission denied
[ 52.705052] sysfs: duplicate filename 'bdi-nfs-192.168.0.1' can not be created
[ 52.712517] WARNING: at /mnt/md0/src/linux-2.6-2/fs/sysfs/dir.c:424 sysfs_add_one()
[ 52.720243]
[ 52.720244] Call Trace:
[ 52.724311] [<ffffffff80304f7c>] sysfs_add_one+0xac/0xe0
[ 52.729708] [<ffffffff80305613>] create_dir+0x63/0xc0
[ 52.734832] [<ffffffff803056a4>] sysfs_create_dir+0x34/0x50
[ 52.740489] [<ffffffff804d5710>] _spin_unlock+0x30/0x60
[ 52.745792] [<ffffffff8036958e>] kobject_add+0xbe/0x200
[ 52.751094] [<ffffffff803dfc40>] device_add+0xc0/0x680
[ 52.756305] [<ffffffff803e0219>] device_register+0x19/0x20
[ 52.761877] [<ffffffff803e0697>] device_create+0xe7/0x120
[ 52.767360] [<ffffffff8036e0bc>] vsnprintf+0x2bc/0x690
[ 52.772585] [<ffffffff80370380>] kvasprintf+0x70/0x90
[ 52.777724] [<ffffffff80295bbb>] bdi_register+0x9b/0xe0
[ 52.783037] [<ffffffff8037e039>] percpu_counter_init_irq+0x39/0x50
[ 52.789299] [<ffffffff8036abcc>] prop_local_init_percpu+0x3c/0x50
[ 52.795462] [<ffffffff80295a41>] bdi_init+0x61/0xb0
[ 52.800411] [<ffffffff80308069>] nfs_probe_fsinfo+0x4d9/0x640
[ 52.806226] [<ffffffff80308f9a>] nfs_create_server+0x1ea/0x560
[ 52.812123] [<ffffffff80263526>] lock_release_holdtime+0x66/0x80
[ 52.818217] [<ffffffff802ca936>] __d_lookup+0x106/0x1d0
[ 52.823529] [<ffffffff802b0fd5>] __kmall...
You are absolutely right, it doesn't make any sense. The 20 char limit
is bad, but really,
having the name duplicated in the device structure, while the name is
already in the
embedded kobject, is really bad.Greg recently got rid of the 20 chars in the kobject, now we need to fix the
devices to completely get rid of the static bus_id string array, and just set
the kobject name directly.
It's all long overdue to fix things like this in the driver core -
it's such a mess. After the
kset cleanup Greg and I are doing currently, we will remove that silly limit.Hmm, regardless of the limit, isn't there a better device name than a memory
address of a kernel structure. :) If there is no better data,
shouldn't we get something
like an atomic counter somewhere in the nfs code, which just increases
with every
instance, and we could use that number as a connection id?Kay
-
One more question,
I currently prefix the names with "bdi-", is that needed?
That is, if I give the bdi object a parent, how will it look?
Would a bdi device with name "sda" with a block device called "sda" as
parent look like: /sys/block/sda/sda? Or would if be
called /sys/block/sda/bdi:sda or just /sys/block/sda/bdi?-
The class devices as childs get their own subdirectory at the parent. So
it would be /sys/block/sda/bdi/sdaIt's currently only implemented for class devices which get a bus device
as a parent, but that will be for all parents, so that we prevent
clashing names if devices from multiple subsystems get the same parent.
See here for netdevs there is a "net" "glue directory":$ ls -l /sys/class/net/eth0
/sys/class/net/eth0 -> ../../devices/pci0000:00/0000:00:1c.0/0000:02:00.0/net/eth0We needed this, because people complained that they can't name their
netif "irq", because there is already an attribute at the parent with
that name. :)When we get the "block as devices" patch merged, we can do the proper
parent logic for bdi, and I'll add the same logic as we have for bus
device parents today.Kay
-
Thanks.
Here is the 'pretty' patch :-)
Since it relies on the removal of the device name length limit, this
should not yet be applied.---
Subject: mm: sysfs: expose the BDI object in sysfsProvide a place in sysfs for the backing_dev_info object.
This allows us to see and set the various BDI specific variables.In particular this properly exposes the read-ahead window for all
relevant users and /sys/block/<block>/queue/read_ahead_kb should be
deprecated.With patient help from Kay Sievers and Greg KH
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
block/genhd.c | 3 +
fs/fuse/inode.c | 3 -
fs/nfs/client.c | 24 +++++----
fs/nfs/internal.h | 10 ++--
fs/nfs/super.c | 10 ++--
include/linux/backing-dev.h | 19 +++++++
include/linux/writeback.h | 3 +
lib/percpu_counter.c | 1
mm/backing-dev.c | 109 ++++++++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 2
10 files changed, 163 insertions(+), 21 deletions(-)Index: linux-2.6-2/block/genhd.c
===================================================================
--- linux-2.6-2.orig/block/genhd.c
+++ linux-2.6-2/block/genhd.c
@@ -182,6 +182,8 @@ void add_disk(struct gendisk *disk)
disk->minors, NULL, exact_match, exact_lock, disk);
register_disk(disk);
blk_register_queue(disk);
+ bdi_register(&disk->queue->backing_dev_info, NULL,
+ "%s", disk->disk_name);
}EXPORT_SYMBOL(add_disk);
@@ -190,6 +192,7 @@ EXPORT_SYMBOL(del_gendisk); /* in partit
void unlink_gendisk(struct gendisk *disk)
{
blk_unregister_queue(disk);
+ bdi_unregister(&disk->queue->backing_dev_info);
blk_unregister_region(MKDEV(disk->major, disk->first_minor),
disk->minors);
}
Index: linux-2.6-2/fs/fuse/inode.c
===================================================================
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2...
Ok, I'll take a look now, and see what's needed here. Would be good to
get that into the next kernel, now that you are waiting for the crap to
get fixed. :)Thanks,
Kay-
Ok, great!
Could I ask you to nudge me awake once those patches hit a git tree
Yes there is, Trond already suggested a proper replacement, however so
far I've been just trying to get it to work before trying to make it
pretty.Will implement Trond's suggestion while you and Greg eradicate this 20
byte thing :-)-
Ignorance of the existance of said function. Thanks for pointing it out.
Yeah, you're right, but I wanted to just get something working before
ok, that's good to know. someone ought to write a book on how to use all
this... really, even the functions are bare of documentation or
comments.-
Yes, I know, sorry :(
I'm working on cleaning up the apis a lot right now, so hopefully, in a
few months things will have settled down. I'm trying to document things
as I go, but right now I'm stuck at a much lower level (ksets and
friends...)thanks,
greg k-h
-
The problem is that not every bdi has a sysfs represented parent, hence
the class suggestion. For block devices it is indeed the block device
itself, but for example the NFS client's server descriptor does not haveHehe, do enjoy! Thanks.
-
Not sure what the other non-block FSs do, but NFS puts it in its
superblock. So that would roughly be per mount.-
And loop/md/dm devices...
-
Hmm, /sys/block/mdX, /sys/block/loopX, /sys/block/dm-X are all there today.
Kay
-
Yes, but they have no /queue/ subdir, which (for sda etc.) exports both
elevator and bdi variables. Now we want to access these bdi variables...-
| Andrew Morton | -mm merge plans for 2.6.23 |
| Greg Kroah-Hartman | [PATCH 005/196] Chinese: add translation of SubmittingDrivers |
| david | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Roland Dreier | Re: Integration of SCST in the mainstream Linux kernel |
git: | |
| Gerrit Renker | [PATCH 15/37] dccp: Set per-connection CCIDs via socket options |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| David Miller | [GIT]: Networking |
| Jan Kara | Re: [BUG] New Kernel Bugs |
