Ingo Molnar writes:Thinking of an alternative approach, I'd like to revisit the original problem for a moment: on a (big) system running a kernel based on 2.6.16 we found that in the following stack trace, pdflush has just enabled irqs again on its cpu after running with irqs disabled for about 2 minutes. ================================================================ STACK TRACE FOR TASK: 0xe0000168f4378000 (pdflush) 0 kdba_main_loop+0x14c [0xa0000001004190cc] 1 kdb+0x11fc [0xa0000001002a639c] 2 kdb_ipi+0x11c [0xa0000001002a12dc] 3 handle_IPI+0x27c [0xa00000010005671c] 4 handle_IRQ_event+0x8c [0xa0000001000fa40c] 5 __do_IRQ+0x12c [0xa0000001000fa5cc] 6 ia64_handle_irq+0x15c [0xa00000010001213c] 7 ia64_leave_kernel [0xa00000010000cb20] 8 _write_unlock_irqrestore+0x40 [0xa000000100558500] 9 test_set_page_writeback+0x18c [0xa000000100114b6c] 10 xfs_start_page_writeback+0x2c [0xa0000002e4ab914c] 11 xfs_page_state_convert+0xa0c [0xa0000002e4abcbac] 12 xfs_vm_writepage+0x19c [0xa0000002e4abd13c] 13 mpage_writepages+0x3fc [0xa0000001001bcb5c] 14 xfs_vm_writepages+0xac [0xa0000002e4ab9eac] 15 do_writepages+0xac [0xa0000001001136cc] 16 __writeback_single_inode+0x47c [0xa0000001001b8f7c] 17 sync_sb_inodes+0x4ac [0xa0000001001b9cec] 18 writeback_inodes+0x19c [0xa0000001001baafc] 19 wb_kupdate+0x26c [0xa000000100113bcc] 20 pdflush+0x38c [0xa00000010011582c] 21 kthread+0x22c [0xa0000001000c9dac] 22 kernel_thread_helper+0xcc [0xa000000100012f6c] 23 start_kernel_thread+0x1c [0xa0000001000094bc] ================================================================ There is, as far as I could tell, no code in the critical region of test_set_page_writeback that could take this much time, except for the call to _write_lock_irqsave itself. There the root of the problem seems to have been that many threads went after the read lock, and an rwlock_t will allow readers in while readers hold the lock, even if a writer needs it, thus potentially starving the writer indefinitely (up to 2 minutes in the case under discussion). Re-enabling irqs when looping in _write_lock_irqsave (and *_irq and *_bh) works here because _write_lock_irqsave() wasn't called with irqs disabled. But pdflush will still spend a long time getting the lock. The reason we're pursuing this nevertheless is that it is easy to do starting from the current code-base, and doesn't impose API or ABI changes. An alternative approach would to view this as a bug of the rwlock_t code in particular, and fix it by keeping new readers from a acquiring an rwlock_t if a writer is also trying to lock it. For x86 at least, I can sketch an implementation that is based on the new ticket-based spinlocks: - use the ticket-based lock for writers - add a reader count for readers (16 bits out of 32 are free) - readers get in if the write lock is idle - if the write lock wasn't idle, readers loop until it is - on obtaining the ticket-based lock, the writer waits for active readers to leave (eg reader count to go to zero) Contention by writers now keeps readers out, which may be a problem. Advantages of such an approach: - Writers are "fair" amongst themselves. - Readers cannot hold off writers. Disadvantages: - Contention among writers can hold off readers indefinitely. This scheme only works if the write lock is not "permanently" contended, otherwise it goes from frying pan to fire. - Requires changes to the asm-coded locking primitives, which means it is likely to be done for x86 and ia64 only, unless the platform maintainers step in. Given the above, is this worth pursuing for inclusion in mainline? A third solution would be to look at this problem on a case-by-case basis. In the case under discussion, there may be other kernel bugs that aggravate the problem (it is an old kernel after all) and maybe this just means the address_space.tree_lock ought to be replaced with something else (though I wouldn't claim to know what). -- Olaf Weber SGI Phone: +31(0)30-6696752 Veldzigt 2b Fax: +31(0)30-6696799 Technical Lead 3454 PW de Meern Vnet: 955-7151 Storage Software The Netherlands Email: olaf@sgi.com --
| Greg Kroah-Hartman | [PATCH 002/196] Chinese: rephrase English introduction in HOWTO |
| Linus Torvalds | Linux 2.6.27-rc8 |
| Parag Warudkar | BUG: soft lockup - CPU#1 stuck for 15s! [swapper:0] |
| James Bottomley | Re: Integration of SCST in the mainstream Linux kernel |
git: | |
| Jakub Narebski | Re: VCS comparison table |
| Wincent Colaiuta | Re: [ANNOUNCE] GIT 1.5.4 |
| Sam Song | Fwd: [OT] Re: Git via a proxy server? |
| Junio C Hamano | Re: More precise tag following |
| Nick Guenther | Re: Real men don't attack straw men |
| Raimo Niskanen | HP ProLiant DL140 G3 problems |
| Todd Pytel | IDE or SCSI virtual disks for VMWare image? |
| R. Fumione | OpenBSD speed on desktops |
| Patrick McHardy | [NET_SCHED 03/15]: act_api: fix netlink API conversion bug |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Theodore Tso | Re: [bug?] tg3: Failed to load firmware "tigon/tg3_tso.bin" |
| Linus Torvalds | Re: [GIT]: Networking |
