Re: [RFC] mm: generic adaptive large memory allocation APIs

Previous thread: [GIT PULL] Additional RCU fixes for 2.6.34 by Paul E. McKenney on Wednesday, May 5, 2010 - 5:28 pm. (7 messages)

Next thread: Your Webmail Quota Has Exceeded The Set Quota Please Upgrade by Marcia A. Shobe on Wednesday, May 5, 2010 - 5:26 pm. (1 message)
From: Changli Gao
Date: Wednesday, May 5, 2010 - 5:30 pm

kvmalloc() will try to allocate physically contiguous memory first, and try 
vmalloc to allocate virtually contiguous memory when the former allocation
fails.

kvfree() is used to free the memory allocated by kvmalloc(). It can't be used
in atomic context. If the callers are in atomic contex, they can use
kvfree_inatomic() instead.

There is much duplicate code to do such things in kernel, so I generate the
above APIs.

Thank Eric Dumazet for the "kv" prefix. :)

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/interrupt.h>

void *kvmalloc(size_t size)
{
	void *ptr;

	if (size < PAGE_SIZE)
		return kmalloc(PAGE_SIZE, GFP_KERNEL);
	ptr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN);
	if (ptr != NULL)
		return ptr;

	return vmalloc(size);
}
EXPORT_SYMBOL(kvmalloc);

void kvfree(void *ptr, size_t size)
{
	if (size < PAGE_SIZE)
		kfree(ptr);
	else if (is_vmalloc_addr(ptr))
		vfree(ptr);
	else
		free_pages_exact(ptr, size);
}
EXPORT_SYMBOL(kvfree);

struct kvfree_work_struct {
	struct work_struct	work;
	void			*head;
	void			**ptail;
};

DEFINE_PER_CPU(struct kvfree_work_struct, kvfree_work_struct);

static void kvfree_work(struct work_struct *_work)
{
	struct kvfree_work_struct *work;
	void *head, *tmp;
	
	work = container_of(_work, struct kvfree_work_struct, work);
	local_bh_disable();
	head = work->head;
	work->head = NULL;
	work->ptail = &work->head;
	local_bh_enable();

	while (head) {
		tmp = head;
		head = *(void **)head;
		vfree(tmp);
	}
}

void kvfree_inatomic(void *ptr, size_t size)
{
	if (size < PAGE_SIZE) {
		kfree(ptr);
	} else if (is_vmalloc_addr(ptr)) {
		struct kvfree_work_struct *work;

		*(void **)ptr = NULL;
		local_irq_disable();
		work = this_cpu_ptr(&kvfree_work_struct);
		*(work->ptail) = ptr;
		work->ptail = (void**)ptr;
		schedule_work(&work->work);
		local_irq_enable();
	} else ...
From: Changli Gao
Date: Wednesday, May 5, 2010 - 5:37 pm

local_bh_disable should be local_irq_disable(), and local_bh_enable()



-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
From: Changli Gao
Date: Wednesday, May 5, 2010 - 6:25 pm

typo mistake, should be kmalloc(size, GFP_KERNEL), thank Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp>.



-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Tetsuo Handa
Date: Wednesday, May 5, 2010 - 8:12 pm

I wonder why "struct kvfree_work_struct" is needed.
According to http://kernel.ubuntu.com/git?p=jj/ubuntu-lucid.git;a=blobdiff;f=security/apparmor/matc... ,

static void kvfree_work(struct work_struct *work)
{
	vfree(work);
}

void kvfree_inatomic(void *ptr, size_t size)
{
	if (size < PAGE_SIZE) {
		kfree(ptr);
	} else if (is_vmalloc_addr(ptr)) {
		/*
		 * We can embed "struct work_struct" inside *ptr
		 * because size >= PAGE_SIZE.
		 */
		struct work_struct *work = ptr;
		BUILD_BUG_ON(sizeof(struct work_struct) > PAGE_SIZE);
		INIT_WORK(&work, kvfree_work);
		schedule_work(&work);
	} else {
		free_pages_exact(ptr, size);
	}
}

should do what you want. (Though, I didn't test it.)
--

From: Changli Gao
Date: Wednesday, May 5, 2010 - 8:22 pm

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Jamie Lokier
Date: Thursday, May 6, 2010 - 8:35 am

Note that converting users from vmalloc() to kvmalloc() may increase
fragmentation problems for other parts of the kernel, because it will
tend to use up more of the available large blocks.  Especially users
who allocate large blocks and often.  That's worth a mention
somewhere.

On the other hand, this API could make it easier to convert some kmalloc()
calls to kvmalloc(), reducing fragmentation problems. :-)

Since the caller is indicating they don't mind which happens, then
anti-fragmentation heuristics (such as checking watermarks) could be
added to kvmalloc() at some future time, if needed.

-- Jamie
--

From: Changli Gao
Date: Thursday, May 6, 2010 - 9:32 pm

I found a way to eliminate the size argument of kvfree() and
kvfree_inatomic(). We can check where the page_head is slab or
compound, as pages owned by slab  subsystem always have Slab flag set
or are compound.

                page = virt_to_head_page(ptr);
                if (PageSlab(page) || PageCompound(page))
                        kfree(ptr);
                else
                        free_pages_exact(ptr, page->private);

And we can save the size argument in page->private after
alloc_pages_eact() returns.

        ptr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN);
        if (ptr != NULL) {
                virt_to_head_page(ptr)->private = size;
                return ptr;
        }

here is the test code:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/interrupt.h>

#define kmalloc(size, gfp) ({ \
        void *ptr = kmalloc(size, gfp); \
        if (ptr != NULL) \
                printk("kmalloc: %p(%u)\n", ptr, size); \
        ptr; })

#define alloc_pages_exact(size, gfp) ({ \
        void *ptr = alloc_pages_exact(size, gfp); \
        if (ptr != NULL) \
                printk("alloc_pages_exact: %p(%u)\n", ptr, size); \
        ptr; })

#define vmalloc(size) ({ \
        void *ptr = vmalloc(size); \
        if (ptr != NULL) \
                printk("vmalloc: %p(%u)\n", ptr, size); \
        ptr; })

#define kfree(ptr) do { printk("kfree: %p\n", ptr); kfree(ptr); } while (0)
#define vfree(ptr) do { printk("vfree: %p\n", ptr); vfree(ptr); } while (0)
#define free_pages_exact(ptr, size) do { \
                printk("free_pages_exact: %p(%u)\n", ptr, size); \
                free_pages_exact(ptr, size); \
        } while (0)

void *kvmalloc(size_t size)
{
        void *ptr;

        if (size < PAGE_SIZE)
                return kmalloc(size, GFP_KERNEL);
        ptr = alloc_pages_exact(size, GFP_KERNEL | ...
From: Tetsuo Handa
Date: Friday, May 7, 2010 - 5:42 am

inatomic might be confusing because what vfree() checks is

By the way, is in_interrupt() a heavy operation?

  register unsigned long current_stack_pointer asm("esp") __used;
  static inline struct thread_info *current_thread_info(void)
  {
  	return (struct thread_info *)
  		(current_stack_pointer & ~(THREAD_SIZE - 1));
  }
  #define preempt_count() (current_thread_info()->preempt_count)
  #define irq_count()     (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK))
  #define in_interrupt()  (irq_count())

If we can agree on changing from (inatomic) to (in_interrupt()),
--

From: Peter Zijlstra
Date: Friday, May 7, 2010 - 5:52 am

I really dislike all of that, just don't allow usage from interrupt
context.



--

From: KOSAKI Motohiro
Date: Wednesday, May 12, 2010 - 9:45 pm

low order GFP_KERNEL allocation never fail. then, this doesn't works

On x86, vmalloc area is only 128MB address space. it is very rare 
resource than physical ram. vmalloc fallback is not good idea.


--

From: Jiri Slaby
Date: Thursday, May 13, 2010 - 1:43 am

Hi, I suppose you mean the kmalloc allocation -- so kmalloc should fail

These functions are a replacement for explicit
if (!(x = kmalloc()))
   x = vmalloc();
...
if (is_vmalloc(x))
  vfree(x);
else
  kfree(x);
in the code (like fdtable does this).

The 128M limit on x86_32 for vmalloc is configurable so if drivers in
sum need more on some specific hardware, it can be increased on the
command line (I had to do this on one machine in the past).

Anyway as this is a replacement for explicit tests, it shouldn't change
the behaviour in any way. Obviously when a user doesn't need virtually
contiguous space, he shouldn't use this interface at all.

thanks,
-- 
js
suse labs
--

From: KOSAKI Motohiro
Date: Thursday, May 13, 2010 - 2:05 am

I mean, if size of alloc_pages_exact() argument is less than 8 pages,


Why can't we make fdtable virtually contiguous free?
Anyway, alloc_fdmem() also don't works as author expected.



--

From: Jiri Slaby
Date: Thursday, May 13, 2010 - 2:19 am

Sorry, I don't see what's the problem with that. I can see only that

Indeed. I didn't mean that as the users should change that. They should


Pardon my ignorance, why? (There are more similar users:
init_section_page_cgroup, sys_add_key, ext4_fill_flex_info and many others.)

-- 
js
suse labs
--

From: KOSAKI Motohiro
Date: Thursday, May 13, 2010 - 2:40 am

I don't talk about kmalloc. it's ok to never fail.  but low order alloc_pages_exact() never fail too.


I think init_section_page_cgroup is ok. it's called at boot time. we don't enter forever page reclaim.

but other case, I don't know the reason. I guess they also have specific assumption.
I only said, generically it isn't right.



--

From: Jiri Slaby
Date: Thursday, May 13, 2010 - 3:16 am

Well, could you explain what exactly is broken about
x = kmalloc(size, GFP_KERNEL);
if (!x)
  x = vmalloc(size);
? Is is that kmalloc doesn't return until is has the memory to return
when asking for order(size) <= COSTLY_ORDER? I think this is expected.

thanks,
-- 
js
suse labs
--

From: KOSAKI Motohiro
Date: Thursday, May 13, 2010 - 3:43 am

Well, but fdtable doesn't really need contenious memory. no?
To make API mean we recommend to use it. but I don't hope to spread this 
wrong habit.  Instead, to kill it seems better.



--

Previous thread: [GIT PULL] Additional RCU fixes for 2.6.34 by Paul E. McKenney on Wednesday, May 5, 2010 - 5:28 pm. (7 messages)

Next thread: Your Webmail Quota Has Exceeded The Set Quota Please Upgrade by Marcia A. Shobe on Wednesday, May 5, 2010 - 5:26 pm. (1 message)