Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier

Previous thread: [PATCH?][arch/parisc/kernel/pci-dma.c] pcxl_dma_ops.alloc_noncoherent = pa11_dma_alloc_consistent? by Roel Kluin on Monday, February 11, 2008 - 9:23 am. (10 messages)

Next thread: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1 by Aron Stansvik on Monday, February 11, 2008 - 9:44 am. (2 messages)
From: Jan-Bernd Themann
Date: Monday, February 11, 2008 - 9:24 am

Drivers like eHEA need memory notifiers in order to 
update their internal DMA memory map when memory is added
to or removed from the system.

Patch for eHEA memory hotplug support that uses these functions:
http://www.spinics.net/lists/netdev/msg54484.html

Signed-off-by: Jan-Bernd Themann <themann@de.ibm.com>


---
 Hi,

 the eHEA patch belongs to a patchset that is usually
 added by Jeff Garzik once this dependency (EXPORTS)
 is resolved.

 Regards,
 Jan-Bernd


 drivers/base/memory.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 7ae413f..f5a0bf1 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -52,11 +52,13 @@ int register_memory_notifier(struct notifier_block *nb)
 {
         return blocking_notifier_chain_register(&memory_chain, nb);
 }
+EXPORT_SYMBOL_GPL(register_memory_notifier);
 
 void unregister_memory_notifier(struct notifier_block *nb)
 {
         blocking_notifier_chain_unregister(&memory_chain, nb);
 }
+EXPORT_SYMBOL_GPL(unregister_memory_notifier);
 
 /*
  * register_memory - Setup a sysfs device for a memory block
-- 
1.5.2

--

From: Dave Hansen
Date: Monday, February 11, 2008 - 9:47 am

I know that's already in mainline but, man, that code is nasty.  It has
stuff indented 7 levels or so and is virtually impossible to read.


is also completely bogus for arch-independent code.  If someone ever
wants to do ppc without sparsemem (or redoes internal sparsemem
interfaces), this code will break.  

Also, keeping your own mapping of all of memory is really nasty.  With
SPARSEMEM_EXTREME/VMEMMAP you can have extremely sparse physical memory,

could theoretically consume all of memory if the sections are very sparse.

Could you please try to explain what the heck this driver is doing?
We'll try to fix up the generic interfaces so that you can deal with
things in a more generic fashion, but it can't stay this way.

Also, just ripping down and completely re-doing the entire mass of cards
every time a 16MB area of memory is added or removed seems like an
awfully big sledgehammer to me.  I would *HATE* to see anybody else
using this driver as an example to work off of?  Can't you just keep
track of which areas the driver is actually *USING* and only worry about
changing mappings if that intersects with an area having hotplug done on
it?

Can you please give it some CodingStyle love and make it so that it
doesn't indent so much?  Something like this:

diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index c051c7e..72c5652 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -2638,38 +2638,40 @@ static void ehea_rereg_mrs(struct work_struct *work)
 	down(&dlpar_mem_lock);
 	ehea_info("LPAR memory enlarged - re-initializing driver");
 
-	list_for_each_entry(adapter, &adapter_list, list)
-		if (adapter->active_ports) {
-			/* Shutdown all ports */
-			for (i = 0; i < EHEA_MAX_PORTS; i++) {
-				struct ehea_port *port = adapter->port[i];
-
-				if (port) {
-					struct net_device *dev = port->netdev;
-
-					if (dev->flags & IFF_UP) {
-						down(&port->port_lock);
-						netif_stop_queue(dev);
-						ret = ...
From: Jan-Bernd Themann
Date: Wednesday, February 13, 2008 - 8:17 am

Hi Dave,



to form a base for the eHEA memory add / remove concept discussion:

Explanation of the current eHEA memory add / remove concept:

Constraints imposed by HW / FW:
- eHEA has own MMU
- eHEA  Memory Regions (MRs) are used by the eHEA MMU  to translate virtual
  addresses to absolute addresses (like DMA mapped memory on a PCI bus)
- The number of MRs is limited (not enough to have one MR per packet)
- Registration of MRs is comparativley slow as done via slow firmware call
(H_CALL)
- MRs can have a maximum size of the memory available under linux
- MRs cover a contiguous virtual memory block (no holes)

Because of this there is just one big MR that covers entire kernel memory.
We also need a mapping table from kernel addresses to this
contiguous "virtual memory IO space" (here called ehea_bmap).

- When memory is added / removed to LPAR (and linux), the MR has to be updated.
  This can only be done by destroying and recreating the MR. There is no H_CALL
  to modify MR size. To find holes in the linux kernel memory layout we have to
  iterate over the memory sections for recreating a ehea_bmap
  (otherwise MR would be bigger then available memory causing the
  registration to fail)

- DLPAR userspace tools, kernel, driver, firmware and HMC are involved in that
  process on System p

Memory add: version without a external memory notifier call
- new memory used in a transfer_xmit will result in a "ehea_bmap
  translation miss", which triggers a rebuild and reregistration
  of the ehea_bmap based on the current kernel memory setup.
- advantage: the number of MR rebuilds is reduced significantly compared to
  a rebuild for each 16MB chunk of memory added.

Memory add: version with external notifier call:
- We still need a ehea_bmap (whatever structure it has)

Memory remove with notifier:
- We have to rebuild the ehea_bmap instantly to remove the pages that are
  no longer available. Without doing that, the firmware (pHYP) cannot remove
  that memory from the ...
From: Dave Hansen
Date: Wednesday, February 13, 2008 - 10:05 am

You're wrong ;).  In mainline, this is true.  There was a version of the
SUSE kernel that did otherwise.  But you can not and should not depend
on this never changing.  But, someone is perfectly free to go out an
implement something better than sparsemem for memory hotplug.  If they

True, and true. (There might be exceptions to the whole sections one,

You can export and use (un)register_memory_notifier.  You just need to
do it in a reasonable way that compiles for randconfig on your
architecture.  Believe me, we don't want to start teaching drivers about

Look at kernel/resource.c

But, I'm really not convinced that you can actually keep this map
yourselves.  It's not as simple as you think.  What happens if you get
on an LPAR with two sections, one 256MB@0x0 and another
16MB@0x1000000000000000.  That's quite possible.  I think your vmalloc'd
array will eat all of memory.  

That's why we have SPARSEMEM_EXTREME and SPARSEMEM_VMEMMAP implemented
in the core, so that we can deal with these kinds of problems, once and

Basically, you can't use anything related to sections outside of the
core code.  You can use things like pfn_valid(), or you can create new

It isn't the act of exporting that's the problem.  It's making sure that
the exports won't be prone to abuse and that people are using them
properly.  You should assume that you can export and use
walk_memory_resource().

Do you know what other operating systems do with this hardware?

In the future, please make an effort to get review from knowledgeable
people about these kinds of things before using them in your driver.
Your company has many, many resources available, and all you need to do
is ask.  All that you have to do is look to the tops of the files of the
functions you are calling.

-- Dave

--

From: Christoph Raisch
Date: Thursday, February 14, 2008 - 1:46 am

Unfortunately this won't work. This was one of our first ideas we tossed
out,

We understand that the add/remove area is not as
settled in the kernel like for example f_ops ;-)
Are there already base working assumptions which are very unlikely to

I'm a little confused here....
...the existing add/remove code depends on sparse mem.
Other pieces on the POWER6 version of the architecture do as well.
So we could either chose to disable add/remove if sparsemem is not there,
I'm glad you mention this part. There are many algorithms out there to
handle this problem,
hashes/trees/... all of these trade speed for smaller memory footprint.
We based the table decission on the existing implementations of the
architecture.
Do you see such a case coming along for the next generation POWER systems?
I would guess these drastic changes would also require changes in base
kernel.

Will you provide a generic mapping system with a contiguous virtual address
space
like the ehea_bmap we can query? This would need to be a "stable" part of
the implementation,
including translation functions from kernel to nextgen_ehea_generic_bmap

We picked sections instead of PFNs because this keeps the ehea_bmap in a
reasonable range
on the existing systems.
But if you provide a abstract method handling exactly the problem we
mention

So this seems to come down to a basic question:
New hardware seems to have a tendency to get "private MMUs",
which need private mappings from the kernel address space into a
"HW defined address space with potentially unique characteristics"
RDMA in Openfabrics with global MR is the most prominent example heading

We're not aware of another open source Operating system trying to address

So we're glad we finally found the right person who takes responsibility
Gruss / Regards
Christoph Raisch + Jan-Bernd Themann


--

From: Dave Hansen
Date: Thursday, February 14, 2008 - 10:12 am

Can you give a ballpark of how many there are to work with? 10? 100?

If you use good interfaces, and someone changes them, they'll likely
also fix your driver.

If you use bad interfaces, people may not even notice when they break.
As I showed you with those compile failures, you're using bad interfaces

Technically, you can do this.  But, it's not a sign of a professionally
written driver that is going to get its patches accepted into mainline.
Technically, you can also use lots of magic numbers and not obey
CodingStyle.  But, you'll probably get review comments asking you to

Dude.  It exists *TODAY*.  Go take a machine, add tens of gigabytes of
memory to it.  Then, remove all of the sections of memory in the middle.
You'll be left with a very sparse memory configuration that we *DO*
handle today in the core VM.  We handle it quite well, actually.

The hypervisor does not shrink memory from the top down.  It pulls
things out of the middle and shuffles things around.  In fact, a NUMA
node's memory isn't even contiguous.

Your code will OOM the machine in this case.  I consider the ehea driver


Yes, that's a real possibility, especially if some other users for it
come forward.  We could definitely add something like that to the
generic code.  But, you'll have to be convincing that what we have now
is insufficient.

Does this requirement:
"- MRs cover a contiguous virtual memory block (no holes)"
come from the hardware?

Is that *EACH* MR?  OR all MRs?

Where does EHEA_BUSMAP_START come from?  Is that defined in the
hardware?  Have you checked to ensure that no other users might want a
chunk of memory in that area?

Can you query the existing MRs?  Not change them in place, but can you

One thing you can guarantee today is that things are contiguous up to
MAX_ORDER_NR_PAGES.  That's a symbol that is unlikely to change and is
much more appropriate than using sparsemem.  We could also give you a
nice new #define like MINIMUM_CONTIGUOUS_PAGES or something.  I ...
From: Badari Pulavarty
Date: Thursday, February 14, 2008 - 10:36 am

On Thu, 2008-02-14 at 09:12 -0800, Dave Hansen wrote:

I am not sure what you are trying to do with walk_memory_resource(). The
behavior is different on ppc64. Hotplug memory usage assumes that all
the memory resources (all system memory, not just IOMEM) are represented
in /proc/iomem. Its the case with i386 and ia64. But on ppc64 is
contains ONLY iomem related. Paulus didn't want to export all the system
memory into /proc/iomem on ppc64. So I had to workaround by providing
arch-specific walk_memory_resource() function for ppc64.

Thanks,
Badari

--

From: Dave Hansen
Date: Thursday, February 14, 2008 - 10:38 am

OK, let's use that one.

-- Dave

--

From: Christoph Raisch
Date: Friday, February 15, 2008 - 6:22 am

It depends on HMC configuration, but in worst case the upper limit is in

Your comment indicates that the upper limit for memory to be set on HMC
does not influence
the upper limit of the partition physical address space.
So our base assumption we discussed internally is wrong here.
EHEA_BUSMAP_START is a value which has to match between the wqe
virtual addresses and the MR used in them.
Fortunately there's a simple answer on that one. Each MR has a own address
space,
so there's no need to check.
A HEA MR actually has exactly the same attributes as a Infiniband MR with
this hardware.

That's definitely the right direction.

From this mail thread I would conclude....
memory space can have holes, and drivers shouldn't make any assumption when
where and how.

A translation from kernel to ehea_bmap space should be fast and predictable
(ruling out hashes).
If a driver doesn't know anything else about the mapping structure,
the normal solution in kernel for this type of problem is a multi level
look up table
like pgd->pud->pmd->pte
This doesn't sound right to be implemented in a device driver.

We didn't see from the existing code that such a mapping to a contiguous
space already exists.
Maybe we've missed it.

If the mapping is less random, the translation gets much simpler.
MAX_ORDER_NR_PAGES helps here, is there more like that?


Gruss / Regards
Christoph Raisch + Jan-Bernd Themann

--

From: Dave Hansen
Date: Friday, February 15, 2008 - 9:55 am

I've been thinking about that, and I don't think you really *need* to
keep a comprehensive map like that.  

When the memory is in a particular configuration (range of memory
present along with unique set of holes) you get a unique ehea_bmap
configuration.  That layout is completely predictable.

So, if at any time you want to figure out what the ehea_bmap address for
a particular *Linux* virtual address is, you just need to pretend that
you're creating the entire ehea_bmap, use the same algorithm and figure
out host you would have placed things, and use that result.

Now, that's going to be a slow, crappy linear search (but maybe not as
slow as recreating the silly thing).  So, you might eventually run into
some scalability problems with a lot of packets going around.  But, I'd
be curious if you do in practice.

The other idea is that you create a mapping that is precisely 1:1 with
kernel memory.  Let's say you have two sections present, 0 and 100.  You
have a high_section_index of 100, and you vmalloc() a 100 entry array.

You need to create a *CONTIGUOUS* ehea map?  Create one like this:

EHEA_VADDR->Linux Section
0->0
1->0
2->0
3->0
...
100->100

It's contiguous.  Each area points to a valid Linux memory address.
It's also discernable in O(1) to what EHEA address a given Linux address
is mapped.  You just have a couple of duplicate entries.  

-- Dave

--

From: Jan-Bernd Themann
Date: Monday, February 18, 2008 - 3:00 am

switching to proper mail client...


Up to 14 addresses translation per packet (sg_list) might be required on 
the transmit side. On receive side it is only 1. Most packets require only 
very few translations (1 or sometimes more)  translations. However, 
with more then 700.000 packets per second this approach does not seem 
reasonable from performance perspective when memory is fragmented as you

This has a serious issues with constraint I mentions in the previous mail: 

"- MRs can have a maximum size of the memory available under linux"

The requirement is not met that the memory region must not be 
larger then the available memory for that partition. The "create MR" 
H_CALL will fails (we tried this and discussed with FW development)


Regards,
Jan-Bernd & Christoph
--

From: Dave Hansen
Date: Wednesday, February 20, 2008 - 11:14 am

OK, but let's see the data.  *SHOW* me that it's slow. If the algorithm
works, then perhaps we can simply speed it up with a little caching and
*MUCH* less memory overhead.

-- Dave

--

From: Dave Hansen
Date: Monday, February 11, 2008 - 9:50 am

How about you fix up the driver, first?  Then, this can go in.

-- Dave

--

From: Dave Hansen
Date: Tuesday, February 12, 2008 - 11:04 am

This driver is broken pretty horribly.  It won't even compile for a
plain ppc64 kernel:

http://sr71.net/~dave/linux/ehea-is-broken.config

I know it's used for very specific hardware, but this is the symptom of
it not using the proper abstracted interfaces to the VM.

In file included from /home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_main.c:42:
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.h:44:14: warning: "SECTION_SIZE_BITS" is not defined
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.h:45:2: error: #error eHEA module cannot work if kernel sectionsize < ehea sectionsize
  CC      drivers/net/mii.o
make[4]: *** [drivers/net/ehea/ehea_main.o] Error 1
make[4]: *** Waiting for unfinished jobs....
  CC      drivers/net/ixgb/ixgb_param.o
In file included from /home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:32:
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.h:44:14: warning: "SECTION_SIZE_BITS" is not defined
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.h:45:2: error: #error eHEA module cannot work if kernel sectionsize < ehea sectionsize
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c: In function 'ehea_create_busmap':
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:574: error: 'NR_MEM_SECTIONS' undeclared (first use in this function)
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:574: error: (Each undeclared identifier is reported only once
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:574: error: for each function it appears in.)
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:575: error: implicit declaration of function 'valid_section_nr'
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c: In function 'ehea_map_vaddr':
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:606: error: ...
Previous thread: [PATCH?][arch/parisc/kernel/pci-dma.c] pcxl_dma_ops.alloc_noncoherent = pa11_dma_alloc_consistent? by Roel Kluin on Monday, February 11, 2008 - 9:23 am. (10 messages)

Next thread: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1 by Aron Stansvik on Monday, February 11, 2008 - 9:44 am. (2 messages)