We are pursuing Linus' suggestion currently. This discussion is completely unrelated to that work.

On Thu, May 15, 2008 at 09:57:47AM +0200, Nick Piggin wrote:

We would need to deposit the payload into a central location to do the invalidate, correct? That central location would either need to be indexed by physical cpuid (65536 possible currently, UV will push that up much higher) or by some sort of global id, which is difficult because remote partitions can reboot, giving you a different view of the machine, and running partitions would then need to be updated. Alternatively, that central location would need to be protected by a global lock or atomic-type operation, but a majority of the machine does not have coherent access to other partitions, so they would need to use uncached operations. Essentially, take away from this paragraph that it is going to be really slow or really large.

Then we need to deposit the information needed to do the invalidate.

Lastly, we would need to interrupt. Unfortunately, here we have a thundering herd. There could be up to 16256 processors interrupting the same processor. That will be a lot of work. The interrupted processor will need to look up the mm (without grabbing any sleeping locks in either xpmem or the kernel) and do the tlb invalidates.

Unfortunately, the sending side is not free to continue (in most cases) until it knows that the invalidate is completed. So it will need to spin waiting for a completion signal, which could be as simple as an uncached word. But how will it handle the possible failure of the other partition? How will it detect that failure and recover? A timeout value could be difficult to gauge because the other side may be off doing a considerable amount of work and may just be backed up.

It is an assumption based upon some of the kernel functions we call doing things like grabbing mutexes or rw_sems.

That pushes back to us. I think the kernel's locking is perfectly reasonable.
The problem we run into is that we are trying to get from one context in one kernel to a different context in another, and the in-between piece needs to be sleepable.

XPMEM allows one process to make a portion of its virtual address range directly addressable by another process with the appropriate access. The other process can be on other partitions. As long as NUMAlink allows access to the memory, we can make it available. Userland has an advantage in that the kernel entrance/exit code contains memory errors, so we can contain hardware failures (in most cases) to only needing to terminate a user program and not lose the partition. The kernel enjoys no such fault containment, so it cannot safely directly reference memory.

Thanks,
Robin
