... continuing a discussion on a similar issue some months ago ... Hi experts, To do Futex in userspace and many things in the Kernel there are several predefined "atomic" macros. Depending on the arch, there are several ways to do this: - normal "read-modify-write" processor instructions that by design are not interrupted (e.g. x86 and 68K non-SMP) - normal "read-modify-write" processor instructions that by design are not interrupted and bus locking (e.g. Ubicom 7K SMP) - normal "read-modify-write" processor instructions with bus lock prefix instruction (x86 SMP) - disable, enable the Interrupt (non-SMP archs that allow for this, often used in user space with uClinux) - "atomic region": the global ISR (return) code detects and finalizes atomic functions, because they are in a certain code region (non-SMP archs that don't have atomic instructions, e.g. BlackFin) Now I (we ?) face a new class of archs. Here I use the (currently non-SMP), MMU-enabled NIOS2 (provided as IP-Core in Altera FPGA). I suppose there are other IP-Core processors or otherwise expendable CPUs with similar specs: - no normal processor "read-modify-write" instructions that by design are not interrupted or even bus-locking - new "custom" processor instructions can be defined that might work atomically, as well not interruptible as kind of bus-locking for SMP use - these custom instructions can't access the memory in normal way (through the MMU and the cache). So to implement atomic instructions a dedicated RAM area would be needed to hold the atomically accessible data. Same can't be accessed by the CPU in other ways. This RAM area would be implemented with the new atomic instructions and be located within the FPGA and thus could accessed very fast (no cache issues). Of course this would need for the Kernel and the libraries to avoid any access to that RAM area with normal instructions. All code would need to use the set of "atomic" macros to access that RAM. Here I suppose we would ...
At the moment I am just planning a single core NIOS project. But SMP is a decent option for NIOS, as it is no hardware problem - and often has been requested - to design multiple CPUs in a single in such an FPGA and run SMP Linux on it. OTOH, to do FUTEX in full (MMU) - Linux, atomic operations are definitively necessary, as in Userland you can't disable the interrupt without a Kernel call (which to avoid FUTEX is all about.) Currently the plan with the NIOS (MMU, non-SMP) arch is to simulate atomic operations with the said "atomic region" (as is done with e.g. the BlackFin arch). (BTW.: when enumerating the ways how atomic macros are done I forgot to mention the "new ARM" method: dedicated "load locked, store conditional" operations that help simulating atomic behavior in user space without the Kernel's help.) -Michael --
You could create unpriviledged 'disable interrupts for 10 instructions' and 'test if interrupts are still disabled' instructions, and base your mutex implementation on that. But you'll have to stop calling it futex at that point... Or you could just optimize syscalls to be really fast... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
That would be great, but AFAIK, it's not decently possible with NIOS. The interrupt enable flag of the CPU can't be accessed by custom instructions. The CPU has up to 32 interrupt lines that it exposes to the custom FPGA "hardware". of course it _would_ be possible to gate all of them and manage this additional flag by a custom instruction. We would need to investigate how big the min/max delay between setting the interrupt and the CPU acknowledging it is. According to this, the count of NOPs between the "custom interrupt disable" and the load of the atomic value needs to be chosen and how many clock cycles the interrupt lock needs to be held. Unfortunately, AFAIK, the CPU-external FPGA "hardware" (that implements the custom instructions) can't see the shifting of the CPU's instruction queue. So the lock duration only can be counted in clock cycles, but not in instructions. The CPU might need to wait for a very large count of clocks for accessing (instructions or data) words e.g. in external dynamic RAM. That is why (when considering how to implement the atomic user land macros necessary for FUTEX) I did consider your idea to reduce the average overhead imposed by the necessity of having the Kernel ISR code finish a would be atomic operation. But as the delay is very uncertain, I feel that the ISR-trick can't be dropped completely. It might be a good idea to ask Altera to implement such a userland-enabled instruction (disable interrupt for the next n instructions). That would be really easy within the NIOS CPU iP-Core, but with custom instructions we are out of luck :(. FUTEX (e.g. the userland part of same) is just one paradigm (and IMHO the most important one, as pthread_mutex_..() uses it for fast POSIX compatible thread synchronization) that requires atomic user land operations. So any implementation of the appropriate atomic userland I trust that the Kernel developers already did that :). My test showed that with a x86 PC, the system calls really ...
Of course I do know the Altera hardware Mutex, but same just implements a single Mutex. To allow for FUTEX, which is the base of doing threaded application with the pthread functions provided by libc, the application programmer can create any number of pthread_mutex'es (of course a decent restriction by hardware to some 1000 would usually not harm with typical embedded applications). Of course a multithreaded application should not be done in an architecture depending way, but should be compilable for any arch and thus the count of allowable pthread_mutex'es should not be greatly restricted by the underlying hardware (be it NIOS or whatever else). Of course you are right that using custom instructions to do a decent count of atomic (hardware-) mutex'es necessary for implementing decent SMP aware FUTEX (or even Kernel Mutex) stuff with NIOS and similar archs is not strictly necessary, but I think it's handy to use them instead of I/O-elements, as the MMU would disallow user space access to "normal" I/O addresses. Michael --
Take a look at sparc 32bit. That only has a single meaningful atomic instruction (swap byte with 0xFF). It provides all the kernel atomic_t operations via this: arch/sparc/lib/atomic32.c. That bitops are done a similar way, which leaves spinlocks and the like. More importantly if your true locks in the FPGA are really fast in CPU terms then you can think of every other atomic instructions as being implemented using lock(cpu_atomic_instruction_lock) do bits unlock(cpu_atomic_instruction_lock) (its just this is normally done in hardware/microcode) Doing it per instruction might be a bit naïve but I think you can reasonably do it so that things like spinlocks use a single (or a hashed set) of non kernel locks to implement "atomic" instructions, and as sparc32 shows you only need a tiny subset of them to implement the rest in their terms. So I don't actually think you need any kernel core changes to get going, and given the kernel dynamically allocates a lot of locks I suspect trying to dynamically manage atomic ram allocations is going to cost more than executing a few instructions here and there under a single very fast hardware assisted lock. Alan --
As the NIOS and similar "load/store" archs by the underlaying hardware design can't provide any atomic memory read-modify write operations (without or with bus lock) at all, we need to search for a completely CPU-hardware independent way of doing atomic stuff both with non-SMP and SMP designs. The new ARM processors provide "load locked" / "store conditional" instructions to overcome this with a combination of I feel that this does not help. The important task (for me right now) is providing decent FUTEX (multiple of those ! ). Here you need to do atomic instructions in user spaces on the (multiple) FUTEX handling word (so interrupt disabling/enabling is not possible). If you try to implement this by using a _single_ lock (surrounding the would-be atomic instruction sequence), this IMHO only lifts the problem to another level: If one thread locks the "cpu_atomic_instruction_lock" and now the Kernel does a task switch and now a second thread tries to lock it as well, same would need to do a kernel call to do the waiting. So the "cpu_atomic_instruction_lock" is nothing but a FUTEX itself and asks for the same complexity the FUTEX handling needs: (A) it needs an SMP-safe user space atomic read-modify-write <here of course only a single one instead of multiple> and (B) it needs the Kernel-infrastructure to handle this (see the multiple articles available on Futex being nasty ;) . The "cpu_atomic_instruction_lock" Futex only provides for atomic operations and thus the normal (here: second level) Futex needs to stay in place, as well. I have no idea how the Kernel infrastructure for two-level Futex could I suppose the current NIOS _Kernel_ code implements atomic operations by disabling/enabling the interrupts. This of course is not possible with SMP designs and OTOH it's not possible in user space. That is why implementing FUTEX and SMP seems to ask for similar considerations. -Michael --
From: Michael Schnell <mschnell@lumino.de> Using the spinlock array idea also doesn't work in userspace because any signal handler that tries to do an atomic on the same object will deadlock on the spinlock. You'll have have to do this entirely in the kernel, and your FUTEX implementation will have to always make the futex() system call. It's the only way to do this and have it work completely. We have to do something similar on sparc32, and the signal handler deadlocks are very real, many glibc testcases that use threads and pthread mutexes will deadlock because of this very issue if you try to do the spinlock trick in userspace. --
Yep. I was beeing afraid of signal issues when thinking about this stuff (on and off for several months :) ), too. That is why I finally think that a _completely_ hardware based solution for each necessary atomic operation is necessary, as well to do Futex (if not using said "atomic region" workaround for non-SMP), as to do SMP. I finally think that this might be possible in a decent way with custom instructions using a - say - 1K Word internal FPGA memory space. But this might need some changes in the non-arch dependent Kernel and/or library code as the atomic macros would work on "handles" instead of pointers (of course these handles would be the old pointers with "normal" archs) and the words used by the macros would need to be explicitly allocated and deallocated instead of potentially being just static variables - even though the "atomic_allocate" macro would just create a static variable for "normal archs" and return it's address. -Michael --
One really expensive but safe way to do atomic operations is to always
have them done on one CPU only, and provide a mechanism for other CPUs
Why can't you do a hash by memory address for this?
I would guess you can define an instruction to atomically set and check
a bit in a shared array of implementation-specific size, by passing
a token in that by convention is the memory address you want to lock.
Given two priviledged instructions
/* returns one if we got the lock, zero if someone else holds it */
bool hashlock_addr(volatile void *addr);
void hashunlock_addr(volatile void *addr);
you can do
int atomic_add_return(int i, atomic_t *v)
{
int temp;
while (!hashlock_addr(v))
;
smp_rmb();
temp = v->counter;
temp += i;
v->counter = temp;
smp_wmb();
hashunlock_addr(v);
}
static inline unsigned long __cmpxchg(volatile unsigned long *m,
unsigned long old, unsigned long new)
{
unsigned long retval;
unsigned long flags;
while (!hashlock_addr(m))
;
smp_rmb()
retval = *m;
if (retval == old) {
*m = new;
smp_wmb();
}
hashunlock_addr(m);
return retval;
}
Anything else you can build on top of these two, including the system calls
that are used from user applications. Since you never hold that bit lock for
more than a few cycles, you could do with much less than 1K bits, in theory
a single global mutex (ignoring the address entirely) would be enough.
That said, a real load-locked/store-conditional would be much more powerful,
in particular because it can also be used from user space, and it is typically
more efficient because it uses the same mechanisms as the cache coherency
protocol.
Arnd
--
I don't see how (to do FUTEX) a hashlock can be implemented in a way that we stay in user mode when locking it and - if it's already locked - we do a Kernel call for waiting on it being unlocked by another thread. (This is what FUTEX does.) -Michael --
Of course. This is what pthread_mutex_...() does, if the arch does not provide the appropriate atomic functions and/or does not have FUTEX. But this discussion is about fast thread communication, thus avoiding I don't see why optimizing for speed and especially latency is uninteresting (with embedded systems like the one I'm planning). Multi-threaded user applications is exactly the case that is extremely interesting to me and that is why I started this discussion. The non-SMP-Kernel ( and non-FUTEX) case already is solved for NIOS (supposedly by interrupt disabling). An SMP-Linux is not yet crafted (and for me its a lot lower priority than decent user-space Of course doing load-locked, store-conditional custom instructions was an option I did consider, but as there is no way to access memory through cache and MMU with custom instructions, I don't see how this could be done, as the current way FUTEX works, the code will define the DWORDs to be handled atomically anywhere in the user space memory. Of course disabling the cache completely is not an option for a task that The user space cache coherency is *somehow* provided by the current (non-SMP) Kernel. I understand that special considerations only are necessary when the MMU tables are modified. There is no SMP Kernel for NIOS yet. I in fact have no idea how cache coherency can/will be handled AFAIK, the current NIOS cache hardware design does not provide for features such as external invalidating of cache lines. So those who start the NIOS SMP project will need to deal with that by having Altera enhance the NIOS design and/or use additional "hardware" outside of the CPU/cache/MMU block. Thanks, -Michael --
Ok. Your initial post didn't make it clear that this is all you are looking for. While atomic CPU operations would solve this problem, you don't really need to make the RAM access itself atomic, Right. So if you cannot implement a 'test-and-set', 'exchange' or 'store-conditional' instruction, I don't think any custom instructions will help you. You can probably implement an atomic function in a VDSO though, without any CPU extensions, I think this has been discussed for blackfin before. The idea is to let the kernel check if the instruction pointer is in the critical section of the VDSO while returning to user space. If it is, the kernel can jump back to the caller of that function instead of the function itself, and indicate failure so the user can retry. Arnd --
That is not the point. Custom instructions easily can implement thread-safe and SMP safe atomic read-modify-write user-land-enabled operations such as "test-and-set", "exchange" or "inc" (I don't think that "store conditional" would be necessary). But they can't work if the same data can be accessed by other (non-custom) CPU instructions, as the custom-instructions would necessarily bypass the cache and the MMU while the normal CPU instructions would use both (and as all this is done with performance in mind disabling or invalidating the cache is out of question and in user-land such operations are impossible, anyway. This is why the idea of a dedicated "atomic RAM" has been developed (which requires the arch-independent Kernel and Library code to access all would-be "atomic" locations _only_ through arch-depending macros <including allocating and freeing these memory locations>, even though for "standard" archs, the resulting ASM code would (hopefully, nearly Right., I already mentioned this "Atomic Region" way to implement an "atomic operation" workaround for non-SMP systems in the first message of this thread. Of course it would be advantageous to have this method for NIOS, as well, as it can be implemented without "hardware" support and without modifications of the arch-independent Kernel and library code. But it will not work with SMP and it will introduces an (even if supposedly not huge) overhead with any interrupt. A hardware implementation would avoid this overhead, would allow for atomic SMP user-land operations and could be the base of implementing an SMP Kernel (even though with NIOS/SMP other issues <like cache-synchronization> might need to be solved additionally). -Michael --
Of course you are right that custom instructions can access the Avalon Bus (similar as a DMA-ip's can access it) and thus the memory that is used by Linux, too. But as with DMA this access bypasses the cache and the MMU. The problem this discussion is about is, that the architecture-independent code Linux provides (e.g. in FUTEX Kernel code and in user-land "atomic" macros) defines the memory-words to be accessed in an atomic way just somewhere in the user-land address space and accesses them as well via the "atomic" macros (that _are_ arch-depending and thus can be done appropriately for NIOS) as with normal read and write operations done in C. If the arch-independent Code could be modified in a way that _all_ accesses (including memory allocation and "free") to the memory-words to be accessed in an atomic way, is done through arch-depending macros, an "atomic" RAM could be managed that is accessed thread- and SMP-safe and very fast as well in user-land as in Kernel code (with NIOS is this supposedly would be done via (a set of) custom instructions. Of course this "atomic RAM" _could_ be just a portion of normal Avalon-based RAM, but as it can't use the cache, it would be very slow if allocated in DRAM. But as all this is about a speed optimization, the "atomic RAM" should be _internal_ and now, as it never would be accessed by the CPU in Unfortunately (AFAIK) the arch independent Linux code currently does not provide for avoiding normal CPU access for FUTEX variables. Otherwise this discussion could be done completely here in the NIOS group instead Yep. A custom instruction is not interruptible in a non-SMP system. That is why it can do any of the necessary atomic read-modify-write operations. That is why implementing FUTEX (etc) with same is possible. But with additional normal CPU instructions accessing the same memory word, the data path can provide unexpected results, especially if the data cache is active. And deactivating or invalidating the cache (impossible ...
Sorry, but FUTEX is *irrelevant*, utterly and totally. It's an implementation of a model of fast user space locking for certain classes of processor, and its not exposed to applications in that form. Your first problem is to implement spin_lock and friends in the kernel, which you can do with a single fast lock in your special memory area/instructions. So with your extra bit of magic FPGA memory and instructions you can implement kernel locking. At that point you've got the main thing you need - a bootable SMP kernel that doesn't require mangling the entire kernel. Futexes or not you need a workable SMP kernel The cpu atomic instruction lock lets you do kernel space. That is all. There are other approaches including a semi-insane ways of creating an SMP No The kernel needs a way of implementing a tiny set of operations fast. Those operations need to match the existing kernel locking system. They don't have security properties, they don't have to deal with pageable memory. They have a fixed, definite API. FUTEX is to all intents and purposes an internal kernel magic interface with arch specific corner cases used by the C library to provide posix locking. You don't even need futex. If its not the right model for your platform you make the C library use your own totally unrelated locking scheme internally. Indeed if your FPGA memory doesn't go via the MMU etc I don't see how you can implement any kind of futex like system. If you have got some kind of way to assign/protect futex space then I'd expect your interfaces will look something like this at the low level open /dev/lockspace [some kind of operation to select if we are mapping a page of locks from another task or making a new set] mmap /dev/lockspace [map a suitable page of lockspace somewhere into the task] do lock ops on the lockspace mapping providing userspace sees a correct fast implementation of posix locks which is what actually gets used by well behaved apps (and most people ...
FUTEX is kind of _invisible_ to the user code.Usually it's hidden behind pthread_mutex() and pthread_mutex_...() automatically falls back to non-FUTEX based locking (that always do Kernel calls). But it is very relevant, as without it, a multithreaded application will perform a lot poorer. Here many locks might be necessary to protect resources (mostly modifications to memory locations) used by multiple threads (running on multiple CPUs when doing SMP). This can happen very often and always doing two Kernel calls (lock and unlock) is a very poor option. This problem is even worse with the archs in question as even a simple inc can't be done in an atomic way (even using ASM) and even for this, a lock is necessary (or directly using the "atomic" macros we talk Yep. I suppose in Kernel space this can be easily handled either by disabling / enabling the interrupt in non-SMP designs or by a hardware If "you" is You that might be true. But if "you" is me its utterly and totally wrong. For my heavily multithreaded application I need FUTEX but not SMP (yet). For me, SMP is no advantage if it does not support FUTEX and I suppose the SMP solution with a single hardware mutex can't do this (but maybe I'm wrong here and a software workaround is possible). Happily Thomas (the maintainer of the NIOS distribution) agrees with me that FUTEX is important and I hope we soon will work together on making it possible for the MMU based NIOS distribution. Right now I just want to discuss if doing a hardware based thing - that might help with doing SMP one day, too - would be more agreeable than the currently suggested way with the "atomic region" software workaround (for non-SMP). I suppose the current NIOS _Kernel_ code implements atomic operations by pthread_mutex..() uses FUTEX if available with the arch, so FUTEX is a way of complying to the POSIX standard. Of course there are other ways (that pthread_mutex_...() use if FUTEX is not available) but this asks for Kernel calls with any lock and ...
You are completing missing the point Linux + glibc platforms don't "need" futex - you need fast user space locks. Futex is an implementation of those locks really based around platforms with atomic instructions. People were doing fast user space locks before Linus was even born and on machines without atomic operations. Seperate out - the purpose for which the system exists (fast user locking) - the interfaces by which it must be presented (posix pthread mutex) Nope. Glibc allows you to implement arch specific code for these locks which may not be FUTEX but need not be kernel based. The user space mechanics of the futex stuff include platform specific stuff for all platforms. You might do the blocking kernel parts of it via the futex syscall but what matters are the uncontended fast paths which are arch You clearly need a pthread_mutex that is fast - but the idea that this means FUTEX is misleading and futex on each platform in the user space side is different per architecture anyway. The idea that you need atomic operations to do fast user space locking is also of course wrong - you only need store ordering. Alan --
Of course you are right, but IMHO this is why FUTEX was invented (and AFAIK, Linux himself did the first implementation). With FUTEX there is a standard way of speeding up Posix compatible thread locking (by implementing the user space part of FUTEX in the pthread part of libc and defining a Kernel interface for the fast thread locking/unlocking functions that is not (much ?) more arch depending than other Kernel interfaces. Of course you are right that my suggestion in fact contradicts to this by defining the FUTEX Kernel interface to work on a kind of Handles instead of user-space pointers (even though same would still use the same C-type an in fact can be understood as pointers into the "Atomic RAM, accessible only by some special ASM instructions). Anyway, working on FUTEX for the arch allows for community based work (in the library and in the Kernel code) instead of having anybody interested do their own implementation right within the (propriety) user IMHO the only decent "community-compatible" implementation is doing it in a POSIX way and allowing for "standard Linux user space code", thus Same as any and libc and Linux Kernel stuff: community based and done under GPL, modifying common (arch-independent) code only if necessary Of course you are right again. But is there rally a libc version that implements pthread_mutex() with user space locking without using FUTEX ? I wonder what Kernel interface it uses to perform the waiting. In fact I did a testing program to prepare the implementation of fast user space locking. Here I tried out several methods e.g. - pthread_mutex_...() - system V sema - my own code (several variants taken from "Futexes are tricky by Ulrich Drepper") for the user space part of FUTEX, using the FUTEX Kernel interface - some hombrew buggy testing code I ran this program on PC (libc using FUTEX) and NIOS (libc using Kernel calls) Based on this, I do suppose that creating any _working_ method for user space based thread locking (on any ...
Lamport's Bakery is the classic algorithm for doing this. It has very minimal assumptions about visibility of writes and the order they are seen so should certainly work uniprocessor and may work SMP depending upon the cache coherency rules. It doesn't scale well to large numbers of threads however. Alan --
Hmm. The code implements the lock as a busy spinning wait. This of course is not possible in real world, as the thread that has the lock will not get any CPU time (e.g. in a non-SMP system). I understand that the main purpose of the FUTEX Kernel call(s) is doing a not-busy wait and having the "unlock" code wake the (next) waiting thread. I did implement something like this in my testing program: enhanced by a sleep to allow for the thread that has the lock to proceed it's work, but this of course is not fast at all, as a short sleep() produces too much CPU load and a long sleep produces too much latency. So maybe this algorithm can be used instead of the hardware stuff I suggested but it would need a FUTEX-like Kernel part, too. -Michael --
Alan, If that is true "in depth", there should be no FUTEX-relevant code, neither in the Kernel, nor in the library, that should be considered "arch-independent". If so, we could happily create a "atomic hardware RAM" - based (or whatever type of) FUTEX for the arch in question in the arch's source files, without needing to take care of what happens in the rest of the Kernel or Library code. Provided this implementing the idea I presented might be a doable task. -Michael --
We have a lot of kernel code which is only usable on some architectures and a few opt out of. The largest example of course being the mm layer which is strictly for MMU equipped systems. The important point is that you don't *have* to use futex to provide the posix pthread lock primitives if they don't fit your platform. --
I don't *have* to use Linux to provide the function the controller is supposed to do for the end-user :). But Linux is a very decent way to make it happen.... -Michael --
the Blackfin already has such an instruction (TESTSET), but since it has restrictions (cannot be used safely on cached memory), we found it unusable as a generic solution. this is why we went to the userspace fixed atomic code region. the only time we use this instruction is with the Blackfin SMP port, and only as an inter-core lock (so it has a fixed address and there is only one such lock). i think you underestimate the task of customizing every userspace point that may want atomic instructions. and no matter what sized atomic region you pick, having it arbitrarily fail at runtime because applications have "too many" locks is a poor solution. -mike --
I never said this would be easy ;) Maybe it's close to impossible :( I had hoped modifying the "atomic" macros for the arch in question in a way that they use a handle instead of a pointer to the "atomic" variable in a compatible way and then modifying the code that uses these macros in a way that theses handles are created by additional macros that with other archs can be uses optionally (in a compatible way: returning the appropriate pointer) could work. But of course you are correct that a Of course you are right, but not being able to allow for SMP is a poor solution, too. I suppose it's possible to generate a dedicated system error when an application runs out of "pthread_mutex" locks (i.e. when the atomic_malloc() function detects that the atomic RAM is exhausted). You also can unexpectedly get a normal "out of memory" system in an application error at any point (e.g. when opening a file) ;) I don't suppose any "normal" embedded application will use more than some 100. (In any case, it would be good to have a (of course no-SMP-) distribution that uses the "atomic region" software workaround, as well, in case the user does not want to implement the necessary custom instructions. This would be usable for applications excessively consuming of "atomic RAM", too.) -Michael --
