Re: spinlocks -- why are releases inlined and acquires are not?

Previous thread: [PATCH] plip: replace spin_lock_irq with spin_lock_irqsave in irq context by Mikulas Patocka on Monday, March 31, 2008 - 4:22 pm. (1 message)

Next thread: Re: + hotplug-memory-adding-non-section-aligned-memory-is-bad.patch added to -mm tree by KAMEZAWA Hiroyuki on Monday, March 31, 2008 - 5:48 pm. (1 message)
From: Jiri Kosina
Date: Monday, March 31, 2008 - 5:08 pm

Hi,

include/linux/spinlock.h shows:

	#define spin_lock_irq(lock)             _spin_lock_irq(lock)

unconditionally, i.e. irrespectible of config options, we always (on SMP) 
call kernel/spinlock.c:_spin_lock_irq(), which is even not inlined.

Contrary to that, unlocks are written as one would expect, i.e:

	#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || \
        	!defined(CONFIG_SMP)
	# define spin_unlock_irq(lock)          _spin_unlock_irq(lock)
	#else
	# define spin_unlock_irq(lock)                  \
	do {                                            \
        	__raw_spin_unlock(&(lock)->raw_lock);   \
	        __release(lock);                        \
	        local_irq_enable();                     \
	} while (0)

and __raw_spin_unlock() is of course properly inlined.

What is the reason for this asymetry? Shouldn't the acquiring functions be 
implemented in the very same way? Or at least, shouldn't all the 
__lockfunc functions be inlined?

Thanks,

-- 
Jiri Kosina
SUSE Labs
--

From: Jiri Kosina
Date: Monday, March 31, 2008 - 5:40 pm

i.e. is there any particular reason why we don't have something like the 
patch below (implemented for all the lock variants of course, this is 
just to demonstrate what I mean)?


spinlocks: inline spin_lock() in non-debug cases

inline spin_lock() in the non-debug case.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 576a5f7..30922ba 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -178,7 +178,12 @@ do {								\
 #define read_trylock(lock)		__cond_lock(lock, _read_trylock(lock))
 #define write_trylock(lock)		__cond_lock(lock, _write_trylock(lock))
 
+#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || \
+	!defined(CONFIG_SMP)
 #define spin_lock(lock)			_spin_lock(lock)
+#else
+#define spin_lock(lock) \
+	 do { __raw_spin_lock(&(lock)->raw_lock); __acquire(lock); } while (0)
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define spin_lock_nested(lock, subclass) _spin_lock_nested(lock, subclass)
--

From: Ingo Molnar
Date: Tuesday, April 1, 2008 - 1:33 am

IIRC the main reason we decided to uninline them was image size. So i'd 
suggest for you to check how this change impacts vmlinux size (on both 
64-bit and 32-bit), a typical distro config (or allyesconfig with lock 
debugging disabled).

If you do the test on x86.git/latest you'll also have the 
CONFIG_OPTIMIZE_INLINING=y and CONFIG_CC_OPTIMIZE_FOR_SIZE=y combination 
as well, which generates the most compact x86 kernel image ever.

	Ingo
--

From: Jiri Kosina
Date: Tuesday, April 1, 2008 - 1:38 am

In fact we have received report from one of our users that he is seeing 
approximately 15% performance degradation of mmap() when spinlocks are not 
inlined. I am going to do some performance measurements myself shortly, as 
it seems quite strange, but while at it, I have noticed the aforementioned 
asymetry in spinlock.h, so I just wanted to know if there is any 
particular reason behind that.

-- 
Jiri Kosina
SUSE Labs
--

From: Ingo Molnar
Date: Tuesday, April 1, 2008 - 1:43 am

inlining decisions almost never have effects of that order of magnitude 
- especially on new CPUs, so that 15% looks quite suspicious to me. If 
it's real then an easy-to-run testcase would be nice.

	Ingo
--

From: Jiri Kosina
Date: Tuesday, April 1, 2008 - 1:44 am

Definitely, that's why I am a little bit suspicious. Thanks,

-- 
Jiri Kosina
SUSE Labs
--

From: Andi Kleen
Date: Tuesday, April 1, 2008 - 4:12 am

At some point -- but that was before queued locks -- I noticed that
for i386 spin unlocks the call sequence for the sub function is
actually larger in code than the actual spin unlock operation and for
x86-64 it was about the same. That was not even counting any negative
register allocation effects the call has on the caller. Spinlocks
don't clobber a lot of registers, but the compiler doesn't know that
when calling the function so it has to assume all ABI callee clobbered
are gone.

I didn't do anything back then because at this point Ingo was
reorganizing the spinlock code hourly[1] for his lockdep etc. merge and
wanted to wait for it to settle down and then it dropped from 
the radar.

Anyways without queued spinlocks that has probably changed again,
might be still worth rechecking.

-Andi

[1] ok I'm exaggerating...
--

From: Jiri Kosina
Date: Tuesday, April 1, 2008 - 5:40 am

spin unlocks seem to be properly inlined anyway, so that should be fine. 
My concern here is the non-inlining of spin locks, for which I don't think 
your argument above is also valid, right?

Thanks,

-- 
Jiri Kosina
--

From: Andi Kleen
Date: Tuesday, April 1, 2008 - 6:30 am

When I did the original numbers (again before queued locks) the
call was slightly shorter than the actual spin lock operation.

This was not counting register allocator effects though which
are difficult to measure generally.

-Andi
--

Previous thread: [PATCH] plip: replace spin_lock_irq with spin_lock_irqsave in irq context by Mikulas Patocka on Monday, March 31, 2008 - 4:22 pm. (1 message)

Next thread: Re: + hotplug-memory-adding-non-section-aligned-memory-is-bad.patch added to -mm tree by KAMEZAWA Hiroyuki on Monday, March 31, 2008 - 5:48 pm. (1 message)