Hi Dave. A while ago I sent a message about long AGP delays upon starting and exiting X: http://marc.info/?l=linux-kernel&m=121647129632110&w=2 There was no reply (if that was due to the linux.ie address, could you perhaps update it in MAINTAINERS?) but today Shaohua Li posted a patch that made me wonder about PAT in this context: http://marc.info/?l=linux-kernel&m=121783222306075&w=2 http://marc.info/?l=linux-kernel&m=121783222406078&w=2 http://marc.info/?l=linux-kernel&m=121783222406081&w=2 His patch does not solve anything appreciable for me -- the delays are still as described in that previous post, with an exception for (with Option "AGSize" "64") delays upon exiting X that are now sometimes as bad as a full 12 seconds. What _does_ solve this though is booting with the "nopat" command line parameter. I'm on 2.6.26.1 and have enabled PAT for my AMD Duron myself. With "nopat", there's no problem to be seen anymore -- exiting X specifically is instantaneous. With or without PAT, my /proc/mtrr is always: reg00: base=0x00000000 ( 0MB), size= 512MB: write-back, count=1 reg01: base=0x20000000 ( 512MB), size= 256MB: write-back, count=1 reg02: base=0xe8000000 (3712MB), size= 64MB: write-combining, count=1 under X joined by: reg03: base=0xe4000000 (3648MB), size= 32MB: write-combining, count=2 This is a machine with 768M, the AGP aperture set to 64MB and a 32MB Matrox Millenium G550 AGP card. More detail in previous post. Is this something inherent to PAT? Inherent to PAT on AMD family 6? Inherent to DRM/AGP with PAT? On AMD family 6? This is probably fairly important to get sorted because although I don't know what's where at the moment, last I saw was a patch in x86/tip that enabled PAT on many more models including all of AMD. For reference, /proc/cpuinfo: processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 7 model name : AMD Duron(tm) Processor stepping : 1 cpu MHz : 1313.094 cache size : 64 KB fdiv_bug : ...
To get some more debug data, can you please retest with latest kernel (2.6.27-rc2) using "debugpat" kernel option and provide dmesg output plus contents of <debugfs>/x86/pat_memtype_list? Thanks, Andreas --
No... my kernel message buffer isn't large enough for that :-( Right, I guess I now know where the delay is coming from. I suppose this is not expected. dmesg as captured after starting X and without "debugpat" at: http://members.home.nl/rene.herman/pat/dmesg.x Truncated dmesg with "debugpat": Before starting X (1K): http://members.home.nl/rene.herman/pat/pat_memtype_list.console.debugpat After starting X (625K): http://members.home.nl/rene.herman/pat/pat_memtype_list.x.debugpat (This is with 64MB AGP memory) More data: http://members.home.nl/rene.herman/pat/config-2.6.27-rc2-current http://members.home.nl/rene.herman/pat/xorg.conf http://members.home.nl/rene.herman/pat/Xorg.0.log Thanks, Rene --
<ping> Please note this a 2.6.27 problem (given that PAT isn't enabled by default, not a _pure_ regression I guess, but still). I also still don't know if you (Andreas), Dave or Yinghai should be the To: on this but given that you've been the only one to react at all... --
(more people Cc:-ed) agreed - +12 seconds wait suggest some rather fundamental breakage. Did we go back to uncached for some critical display area that makes X start up (shut down) that slowly? Did we mark the BIOS uncacheable perhaps, causing X to execute BIOS code very slowly? Ingo --
Quite a lot "uncached-minus" in those lists. I am desperately trying to avoid a clue about mostly anything graphics related so, "I dunno". I haven't just disabled PAT yet (although I was about to just do so) and am available for testing. Rene. --
<waiting with bated breath> Additional observation with respect to first,next shutdown: With Option "AGPSize" "64", and booted with "nopat", X startup (from startx<enter> to functional desktop) is approximately 5 seconds, shutdown is 1 second as calibration times. Booted without "nopat", X startup seems to alternate between 10+ and 16+ seconds and for shutdown -- the first shutdown after boot takes some 14 seconds total, subsequent shutdowns settle at around 5 seconds. Rene. --
would it be possible to start up and shut down X in the slow case via strace, by doing something like this: strace -f -ttt -TTT -o trace.log startx and see which system calls (or other activities) took suspiciously long? Ingo --
It wouldn't it seems. Root X (needed for the strace) works fine but started this way hangs indefinitely. I believe the 14 seconds for first shutdown to 5 later might be telling. Sounds like something might have fixed up uncached entries. I'd really like a reply from the AGP or PAT side right about now. Rene. --
Hmm. Looks like there are more than 16000 entries in the PAT list! This delay may be due to the overhead of parsing this linked list everytime for a new entry, rather than any problem with cache setting itself. I am working on a patch to optimize this pat list parsing for the simple case. Should be able to send it out later today, for testing. Thanks, Venki --
Thanks for the reply. It's with 64MB of AGP memory which I guess is at the low end these days. Would your reply mean that basically everyone on 2.6.27 should now be experiencing this? I noticed it was PAT related due to Shaohua Li's: http://marc.info/?l=linux-kernel&m=121783222306075&w=2 which lists very different times (patch there did not help any). As another by the way, probably not surprising but I earlier also tried both unmounting and completely compiling out debugfs just in case I was seeing a debugging related sysmptom. No help either. It's evening here so I'll probably not be able to test until tomorrow. Rene. --
Below is the patch I am testing. Let me know if this patch helps.
Thanks,
Venki
Test patch. Adds cached_entry to list add routine, in order to speed up the
lookup for sequential reserve_memtype calls.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
---
arch/x86/mm/pat.c | 33 +++++++++++++++++++++++++++++++--
1 file changed, 31 insertions(+), 2 deletions(-)
Index: linux-2.6/arch/x86/mm/pat.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/pat.c 2008-08-19 15:21:07.000000000 -0700
+++ linux-2.6/arch/x86/mm/pat.c 2008-08-19 16:00:52.000000000 -0700
@@ -207,6 +207,9 @@ static int chk_conflict(struct memtype *
return -EBUSY;
}
+static struct memtype *cached_entry;
+static u64 cached_start;
+
/*
* req_type typically has one of the:
* - _PAGE_CACHE_WB
@@ -280,11 +283,17 @@ int reserve_memtype(u64 start, u64 end,
spin_lock(&memtype_lock);
+ if (cached_entry && start >= cached_start)
+ entry = cached_entry;
+ else
+ entry = list_entry(&memtype_list, struct memtype, nd);
+
/* Search for existing mapping that overlaps the current range */
where = NULL;
- list_for_each_entry(entry, &memtype_list, nd) {
+ list_for_each_entry_continue(entry, &memtype_list, nd) {
if (end <= entry->start) {
where = entry->nd.prev;
+ cached_entry = list_entry(where, struct memtype, nd);
break;
} else if (start <= entry->start) { /* end > entry->start */
err = chk_conflict(new, entry, new_type);
@@ -292,6 +301,8 @@ int reserve_memtype(u64 start, u64 end,
dprintk("Overlap at 0x%Lx-0x%Lx\n",
entry->start, entry->end);
where = entry->nd.prev;
+ cached_entry = list_entry(where,
+ struct memtype, nd);
}
break;
} else if (start < entry->end) { /* start > entry->start */
@@ -299,7 +310,20 @@ int reserve_memtype(u64 start, u64 end,
if (!err) {
dprintk("Overlap at 0x%Lx-0x%Lx\n",
entry->start, entry->end);
- where ...i've queued this fix up in tip/x86/urgent for more testing - as ~10 seconds delays are serious enough to warrant a quick fix. Rene, you might want to try tip/master, which has this integrated as well: http://people.redhat.com/mingo/tip.git/README Ingo --
Because 64M of AGP memory divided by 4K pages is 16K. That is, the underlying problem seems to be AGP drivers using order 0 allocations. I'm looking. Do note also that this means that Venki's change would not constitite a correct/final fix. Sure, caching the last entry speeds up traversing a 16K entry list but the issue is that there shouldn't be a 16K entry list. Through AGP, or maybe even by coalescing entries in the PAT list if that's at all possible (I guess it's not really). Even if such a more fundamental fix isn't (easily) available, the PAT code already comments that the list, which is sorted by ->start value, is expected to be short, and should be turned into an rbtree if it isn't which might be slightly less of a bandaid. Dave Airlie (as the MAINTAINERS entry) can't be arsed to answer email it seems so I've added Dave Jones for a possible comment from the AGP side. If I'm reading this right upto now, still many AGP driver (among which my amd-k7-agp) are affected. In the short run and if I'm not just mistaken, the best fix might be to make PAT dependent on not having a dumb AGP driver (but as said, still looking). Note that my chipset is capable of a 2G AGP aperture. That's 512K pages if fully used, 256K for 1G, 128K for 512M, ... Rene. --
This was based on a wrong reading; I was looking at the GATT allocation. I'm giving up looking until someone can tell me whether or not those 16K entries are expected though. I have just one AGP card in a PAT capable machine. How many entries in /debug/x86/pat_memtype_list are there on other AGP systems with Option "AGPSize" "64" in their xorg.conf:"Device" section (and their AGP aperture set to 64M or bigger in the BIOS)? Rene. --
OK. I have reproduced this list size issue locally and this order 1 allocation and set_memory_uc on that allocation is actually coming from agp_allocate_memory() -> agp_generic_alloc_page() -> map_page_into_agp() agp_allocate_memory breaks higher order page requests into order 1 allocs. On my system I see multiple agp_allocate_memory requests for nrpages 8841, 1020, 16, 2160, 2160, 8192. Together they end up resulting in more than 22K entries in PAT pages. Thanks, Venki --
Okay, thanks for the confirmation. Now, how to fix... Firstly, it seems we can conclude that any expectancy of a short PAT list is simply destroyed by AGP. I believe the best thing migh be to look into "fixing" AGP rather than PAT for now? In a sense the entire purpose of the AGP GART is collecting non contiguous pages but given that in practice it's generally still just one or at most a few regions, going to multi-page allocs sounds most appetising to me. All in tree AGP drivers except sgi-agp use agp_generic_alloc_page(), ali via m1541_alloc_page and i460 via i460_alloc_page. Rene. --
In the future we will be getting more smaller AGP allocs, so the other problem needs a fix as well. http://git.kernel.org/?p=linux/kernel/git/airlied/agp-2.6.git;a=shortlog;h=agp-pageattr2 contains some code I started on before that moves the interfaces around, Shaohua has been looking at it as it needs the changes to the set_pages interface as well, which is where I ran out of time/steam last time. However with alloc/free pages we could change to a higher order allocation function as long as it fell back to lower orders internally. --
Yes. Atleast during the bootup, we should be able to get higher order pages. And if we cannot get that, agp can internally fall back to zero order pages. That will reduce the number of times set_memory_* gets called, even without pat. We are also looking at changing the reserve_memtype in PAT, not to use linked list for RAM backed pages and track them in page struct. That way we will be using the list only for reserved region which should still be less in number and agp set_memory_* call will not have the list overhead. BTW, Rene: Did the patch from yday, caching the last list add, help in eliminating the slowdown in X startup? Thanks, Venki --
Back when I hacked on this I explicitely chose to not do this because it would make it impossible to put any normal anonymous pages into the PAT list. While that's not done today there's no reason it couldn't be done in the future. Also it doesn't fix the scalability of the data structure anyways (a list is a list), just saves some memory. -Andi --
Andi, we are planning to add couple of page flags which will track the memory attribute of the page. We need to do some checks like, allow the memory attribute of the page to be changed, only if it is not mapped any where and not on free lists(like the in the X driver case, where they allocate the page and then change the attribute). Similarly, in generic -mm, we need to ensure that the page before it gets added to free list, has the right memory attribute etc. If the driver is exposing this page with special attribute, then it is drivers responsibility to use the same attribute across all the mappings. Is there a reason why this won't work with anonymous pages? Can you please With this, we will track only the reserved regions using the linked list and typically these reserved regions will be small number (may be huge contiguous chunks but total number of such chunks will be reasonably smaller). thanks, suresh --
> Andi, we are planning to add couple of page flags which will track page flags are still scarce unfortunately, at least on 32bit. It's a little better now than it used to be (at some point they were nearly out before some were reclaimed), but adding a lot of flags is still difficult. In interest of full disclosure I need at least two for other work too. On 64bit adding a lot of new flags is fine, but 64bit only You want to handle that in __free_pages? I would have thought that should be handled in some higher level function which could just check the memattr. The issue is just if you reuse the two list heads in struct page because they're already used by the Reserved region defined how exactly? -Andi --
Would be nice to test tip/master - it has both that patch included and the latest pageattr-array API (with enablement in AGP drivers) patchset, done by Shaohua Li, based on Dave's original patch. Ingo --
Yes, it does. I was reluctant to say so as I feel it works around the problem instead of solving it but yes, it does (free_memtype() would That patch by itself doesn't help any -- the new set_memory_array_uc() still calls reserve_memtype() for each single page in the array. To help, it needs to coalesce adjacent entries, which is itself easy to do were it not for the need to keep the list of reserved (base,end) pairs around for free_memtype() time (and halfway fail time). Also just now looked at multipage allocs in AGP but that has the same issue really -- if you do multipage allocs, you then need to keep the list of regions around for free time. With worst case lists of AGPSize / 4K entries, this is not nice either (althought worst case is veryvery much worst and unlikely and best case is a 1 entry list). Should PAT just coalesce the memtype list itself? Rene. P.S. The pageattr-array patch that you currently have in tip/master only enables it for intel-agp, not the others. The attached enables it for all drivers currently directly using agp_generic_alloc_page() and agp_generic_destroy_page() (ocal driver is amd-k7-agp).
Worse yet, it appears to be broken. {reserve,free}_memtype() expect phys
addresses but it's being passed virtual ones...
Rene.
Yes. Noticed that too and sent a patch here for x86/tip. http://www.ussg.iu.edu/hypermail/linux/kernel/0808.2/2270.html It is not very critical as it sounds as only set_memory_uc sets PAT bits for RAM pages. Most other users (devmem mmap, ioramep, pci) set PAT bits on the reserved memory. And there will not be conflicts across RAM and reserveed regions. Regardless, this was a stupid bug that we had missed earlier. Thanks, Venki --
And unfortunately I don't think the above fully fixes it for AGP. __pa() gets the real physical address and the memtypes should be on the GART remapped physical addresses it seems. Rene. --
Page being marked here as uncached is the page got from alloc_page(). We are not really marking GART physical address as uncacheable. And that page returned from alloc_page is what we are tracking with reserve and free. IOW, the tracking is only to keep CPU accesses consistent across different va->pa and va across different CPUs and has nothing to do with GART physical address here. Thanks, Venki --
Okay, if you say so... it _used_ to be before this array change to AGP that the GART addresses were in the memtype list, but I'll take your word for that being okay. Rene --
Actually, might as well simply reconstruct the memtype list at free time I guess. How is this for a coalescing version of the array functions? NOTE: I am posting this because I'm going to bed but haven't stared comfortably long at this and might be buggy. Compiles, boots and provides me with: root@7ixe4:~# wc -l /debug/x86/pat_memtype_list 53 /debug/x86/pat_memtype_list otherwise (down from 16384+). <snore> Rene.
cool!
I'd do this in v2.6.27 but i forced myself to be reasonable and applied
your patches to tip/x86/pat instead, for tentative v2.6.28 merging
(assuming it all passes testing, etc.):
# 9a79f4f: x86: {reverve,free}_memtype() take a physical address
# c5e147c: x86: have set_memory_array_{uc,wb} coalesce memtypes.
# 5f310b6: agp: enable optimized agp_alloc_pages methods
( note that i flipped them around a bit and have put your
enable-agp_alloc_pages()-widely patch last, so that we get better
bisection behavior. )
The frontside cache itself is in x86/urgent:
# 80c5e73: x86: fix Xorg startup/shutdown slowdown with PAT
... and should at least solve the symptom that you've hit in practice
(the slowdown), without changing the underlying PAT machinery. (which
would be way too dangerous for v2.6.27)
And it's all merged up in tip/master, you might want to test that too to
check whether all the pieces fit together nicely.
Tens of thousands of page granular memtypes was Not Nice.
Venki, Suresh, Shaohua, Dave, Arjan - any observations about this line
of action?
Ingo
--
The concern I have here is that the coalescing is not guaranteed to work. We may still end up having horrible worst case latency, even though this improves the normal case (boot the system, start X, exit X, reboot the system). It depends on how pages are allocated and how much memory is there in the system and what else is running etc. Here on my test system, without this coalescing change I see [root@localhost ~]# cat /proc/sys/debug/x86/pat_memtype_list | wc -l 19528 With the coalescing change I see [root@localhost ~]# cat /proc/sys/debug/x86/pat_memtype_list | wc -l 135 quit and restart X [root@localhost ~]# cat /proc/sys/debug/x86/pat_memtype_list | wc -l 985 quit and restart X [root@localhost ~]# cat /proc/sys/debug/x86/pat_memtype_list | wc -l 1468 quit and restart X [root@localhost ~]# cat /proc/sys/debug/x86/pat_memtype_list | wc -l 1749 quit and restart X [root@localhost ~]# cat /proc/sys/debug/x86/pat_memtype_list | wc -l 1916 quit and restart X [root@localhost ~]# cat /proc/sys/debug/x86/pat_memtype_list | wc -l 2085 untar a kernel tar.bz2 and restart X [root@localhost ~]# cat /proc/sys/debug/x86/pat_memtype_list | wc -l 2737 compile the kernel and restart X [root@localhost ~]# cat /proc/sys/debug/x86/pat_memtype_list | wc -l 3089 I think this as a good workaround for now. But, for long run we still need to look at other ways of eliminating this overhead (like using page struct that Suresh mentioned in the other thread). Also, there seems to be a bug in the error path of the patch. Below should fix it. Thanks, Venki Fix the start addr for free_memtype calls in the error path. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> --- arch/x86/mm/pageattr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: t/arch/x86/mm/pageattr.c =================================================================== --- t.orig/arch/x86/mm/pageattr.c 2008-08-22 10:38:35.000000000 -0700 +++ ...
Yes, I agree. Independent of the current trigger PAT wants a more scalable approach and yes, worst case is still single page entries. That worst case is the guaranteed case now though, so I do feel it's a generic fix. After all, there wouldn't seem to be a reason to _not_ [ constantly growing number of entries ] Yes, absolutely right, PAT definitely needs something other than the simple linked list. I do believe we also want the coalescing change though - it seems to make sense regardless of trigger and it's only Ah, yes, thanks, just sent out a final version with this fixed as well. Rene. --
Thanks. Stared at it a little longer now and there was a small cut and
paste error in the error path (s/start/tmp/) but other than that, yes,
I'll stand by this. set_memory_array_{uc,wb}() set all pages to the same
type, so coalescing them makes sense in any usage case it seems.
The attached version fixes the out path, is otherwise identical and this
time comes with a proper changelog and a sign-off. Given that you needed
the changelog and the sign-off anyway, I thought there wouldn't be much
point in doing that incrementally, but if you disagree and refactor -- a
changelog for the out path fix would be a simple:
===
x86: fix "have set_memory_array_{uc,wb} coalesce memtypes".
Fix copy and paste error in out path
Signed-off-by: Rene Herman <rene.herman@gmail.com>
Well, please note that that specific commit only fixes X startup -- it
doesn't do anything for shutdown. With only that one, I'm still at 14
seconds for X shutdown (first time after boot that is, 5 seconds
subsequent shutdowns) versus 1 (or sub 1, feels immediate) normally.
It's also a black-screen "hang", so we'll probably be getting a lot of
"long hang at shutdown" reports without something additionally for .27.
Wasn't in tip/master when I just now fetched it. It does indeed sit in
tip/x86/pat.
Rene.
Haven't been subcribed to any lists recently and someone was talking about "the other thread" before but just noticed that an -rc6 was cut. Please note that the shutdown issue remains unfixed in it (I'm doing my coalescing changes locally). Rene --
here's what is pending in tip/x86/pat for v2.6.28:
110e035: x86: make sure the CPA test code's use of _PAGE_UNUSED1 is obvious
c09ff7e: linux-next: fix x86 tree build failure
01de05a: x86: have set_memory_array_{uc,wb} coalesce memtypes, fix
5f310b6: agp: enable optimized agp_alloc_pages methods
c5e147c: x86: have set_memory_array_{uc,wb} coalesce memtypes.
9a79f4f: x86: {reverve,free}_memtype() take a physical address
8b53b57: Merge branch 'x86/urgent' into x86/pat
ab7e792: x86: fix pageattr-test
bd07928: agp: add agp_generic_destroy_pages()
37acee1: agp: generic_alloc_pages()
d75586a: x86, pageattr: introduce APIs to change pageattr of a page array
cacf890: Revert "introduce two APIs for page attribute"
9326d61: Revert "reduce tlb/cache flush times of agpgart memory allocation"
5843d9a: x86, pat: avoid highmem cache attribute aliasing
466ae83: reduce tlb/cache flush times of agpgart memory allocation
1ac2f7d: introduce two APIs for page attribute
012f09e: x86: compile pat debugfs interface only if CONFIG_X86_PAT is set
that includes your coalescing optimizations and fixes. None of that is
really v2.6.27 material i think.
Btw., if you track -tip as per:
http://people.redhat.com/mingo/tip.git/README
you can merge these changes alone into latest -git just by doing:
git merge tip/x86/pat
or you can use tip/master for the full range of pending features and
fixes.
Ingo
--
Only talking about .27 and just making sure again the issue is known. The above mentioned subject for the entry cache one ("fix Xorg startup/shutdown slowdown with PAT") after all mistakingly says it does something for shutdown. Rene. --
Can you try the patch here http://www.ussg.iu.edu/hypermail/linux/kernel/0809.1/2074.html That should resolve both reserve and free issues.. Thanks, Venki --
Hi! What's the status on this? There are graphics devices we're working with that require large (non-AGP) write-combined memory buffers at any time. We can solve the tlb- and cache flush latencies by using pools of these pages to allocate from and free to, but sooner or later we'll probably end up with _huge_ memtype lists. In the (very unlikely) worst case I guess they'd contain every other lowmem page in they system. If I understand things correctly the patch above will fix this issue? Will it be considered for future kernel inclusion. Any enlightenment on this subject would be greatly appreciated. Thanks, /Thomas --
I haven't anything to add, I'm the maintainer not the author, all the people who wrote the offending code were already involved. --
The underlying problem is the order 0 allocations (agp_allocate_memory --> agp_generic_allocate_page) where each single page is set uncached individually, creating a PAT entry. Non order 0 allocations generally would ofcourse help. That's very much AGP internal -- do you feel that's the way to go? All the current AGP drivers except sgi-agp use agp_generic_alloc_page(). Doing a quick local hack to collect pages in agp_allocate_memory() into regions and set the regions (generally 1) UC in one fell swoop, but I don't know if that's safe (and it feels like a rather poor hack anyway). (not to mention that it's time for bed again). Rene. --
