Re: Linux 2.6.27-rc5

From: Linus Torvalds
Date: Thursday, August 28, 2008 - 4:26 pm

Another week (my weeks do seem to be eight days, don't they? Very odd), 
another -rc.

The dirstat pretty much says it all:

  43.0% arch/arm/configs/
  43.9% arch/arm/
  25.5% arch/powerpc/configs/
  26.8% arch/powerpc/
  73.9% arch/
   4.4% drivers/usb/musb/
   5.4% drivers/usb/
   4.0% drivers/watchdog/
  16.0% drivers/
   3.5% fs/

yeah, the bulk of it is all config updates, and with arm and powerpc 
leading the pack.

But seriously, while the config updates amount to about three quarters of 
the diff, and if you don't use a rename-aware diff the blackfin include 
file movement pretty much accounts for the rest, hidden behind all those 
trivial (but bulky) changes are a lot of small changes that hopefully fix 
a number of regressions.

The most exciting (well, for me personally - my life is apparently too 
boring for words) was how we had some stack overflows that totally 
corrupted some basic thread data structures. That's exciting because we 
haven't had those in a long time.  The cause turned out to be a somewhat 
overly optimistic increase in the maximum NR_CPUS value, but it also 
caused some introspection about our stack usage in general. Including 
things like a patch to gcc to fix insane stack usage for vararg functions 
on x86-64.

But that one would only hit anybody who was a bit too adventurous and 
selected the big 4096 CPU configuration. The rest of the regressions fixed 
are a bit more pedestrian.

---
Adel Gadllah (1):
      block: clean up cmdfilter sysfs interface

Adrian Bunk (5):
      ocfs2/cluster/tcp.c: make some functions static
      removed unused #include <linux/version.h>'s
      KVM: fix userspace ABI breakage
      Blackfin arch: let PCI depend on BROKEN
      [ARM] use bcd2bin/bin2bcd

Al Viro (8):
      fix efs_lookup()
      fix osf_getdirents()
      fix hpux_getdents()
      fix regular readdir() and friends
      fix ->llseek() for a bunch of directories
      deal with the first call of ->show() generating no ...
From: Alistair John Strachan
Date: Friday, August 29, 2008 - 8:42 am

(Don't know who's responsible for this one, so I've just added Rafael to CC)

I only noticed this recently but it's probably been happening for a while
(doesn't seem to happen on 2.6.26):

ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         2 83033595  0.0  0     0 ?        S<   10:30 21114574:23 [kthreadd]
root      1740 83078470  0.0  0     0 ?        S<   10:31 21114574:23 [md0_raid1]

Seems to happen only to kernel threads and at random. Last time I booted
it was two XFS threads.

Before I start another bisection, does anybody have any ideas?

-- 
Cheers,
Alistair.
--

From: Alistair John Strachan
Date: Friday, August 29, 2008 - 8:56 am

Okay this is a duplicate report of:

http://bugzilla.kernel.org/show_bug.cgi?id=11209

Which seems to have stalled..

-- 
Cheers,
Alistair.
--

From: Rafael J. Wysocki
Date: Friday, August 29, 2008 - 10:13 am

Doesn't boot on my quad core test box, apparently because of an AHCI failure.

Bisecting ...

Rafael
--

From: Rafael J. Wysocki
Date: Friday, August 29, 2008 - 12:57 pm

Bisection turned up commit a2bd7274b47124d2fc4dfdb8c0591f545ba749dd as the culprit:

commit a2bd7274b47124d2fc4dfdb8c0591f545ba749dd
Author: Yinghai Lu <yhlu.kernel@gmail.com>
Date:   Mon Aug 25 00:56:08 2008 -0700

    x86: fix HPET regression in 2.6.26 versus 2.6.25, check hpet against BAR, v3

Reverting this commit helps.

The symptom is that AHCI probe fails with this commit applied.

Thanks,
Rafael
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 2:13 pm

Can I get the whole boot log with "debug"?

YH
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 2:19 pm

can you try tip/master?  we have another fix according to Linus..

YH
--

From: Rafael J. Wysocki
Date: Friday, August 29, 2008 - 3:32 pm

I have tested the patch that Linus sent me and it works.  Please see my reply
to Linus for the link to the dmesg output.

Thanks,
Rafael
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 4:24 pm

pci 0000:00:00.0: BAR has MMCONFIG at e0000000-ffffffff
pci 0000:00:12.0: BAR 5: can't allocate resource
pci 0000:00:13.0: BAR 0: can't allocate resource
pci 0000:00:13.1: BAR 0: can't allocate resource
pci 0000:00:13.2: BAR 0: can't allocate resource
pci 0000:00:13.3: BAR 0: can't allocate resource
pci 0000:00:13.4: BAR 0: can't allocate resource
pci 0000:00:13.5: BAR 0: can't allocate resource
pci 0000:00:14.2: BAR 0: can't allocate resource

Your MMCONFIG in the BAR is broken....

After it is forcibly inserted, it blocks all the others...

YH
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 5:08 pm

And that seems utter crap to begin with.

	PCI: Using MMCONFIG at e0000000 - efffffff

Where did it get that bogus "ffffffff" end address?

Anyway, that whole MMCONFIG/BAR thing was totally broken to begin with, 
and it's reverted now in my tree, so I guess it doesn't much matter.

			Linus
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 5:11 pm

On Fri, Aug 29, 2008 at 5:08 PM, Linus Torvalds

The BAR is from pci_read_bases..., so that chipset is broken...
They are even supposed to hide that BAR from the OS.

YH
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 5:45 pm

Ok, can we please

 - *do* get a quirk for known-broken chipsets (at a *PCI* level, this is 
   not an x86 issue)

 - *not* get any more random PCI work-arounds that go through the x86 tree 
   and aren't even looked at by the (very few) people who actually 
   understand the PCI resource handling?

IOW, for the first issue, just teach pci_mmcfg_check_hostbridge() about 
this broken bridge, and have it fix things up (including hiding the thing, 
but also just verifying that the dang thing even -works- etc).

For the second issue - please do realize that we have had much over a 
_decade_ of work on the PCI resource handling, and it's fragile. The thing 
I reverted really isn't something that Ingo should ever have committed in 
the first place. It's not something an x86 maintainer can even make sane 
decisions on.

Resource handling things _need_ to get ACK's from people like Ivan 
Kokshaysky or me. Or at least _several_ other people who actually really 
understand not just PCI resource handling, but have actually seen all the 
horrible crap it causes, and understand how fragile this stuff is. It's 
all different, and it's all about all the million of broken machines out 
there that screw things up.

			Linus
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 6:14 pm

On Fri, Aug 29, 2008 at 5:45 PM, Linus Torvalds

The quirk works for the first point on David's system.

[PATCH] x86: protect hpet in BAR for one ATI chipset v3

This keeps the kernel from allocating a new resource for it when it
cannot allocate the old address from the BIOS.

This is the same way some IO APIC addresses in BARs are handled.

v3: device id should be 0x4385

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>

---
 drivers/pci/quirks.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Index: linux-2.6/drivers/pci/quirks.c
===================================================================
--- linux-2.6.orig/drivers/pci/quirks.c
+++ linux-2.6/drivers/pci/quirks.c
@@ -1918,6 +1918,22 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_B
                         PCI_DEVICE_ID_NX2_5709S,
                         quirk_brcm_570x_limit_vpd);

+static void __init quirk_hpet_in_bar(struct pci_dev *pdev)
+{
+       int i;
+       u64 base, size;
+
+       /* the BAR1 is the location of the HPET...we must
+        * not touch this, so forcibly insert it into the resource tree */
+       base = pci_resource_start(pdev, 1);
+       size = pci_resource_len(pdev, 1);
+       if (base && size) {
+               insert_resource(&iomem_resource, &pdev->resource[1]);
+               dev_info(&pdev->dev, "HPET at %08llx-%08llx\n", base,
+                        base + size - 1);
+       }
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, 0x4385, quirk_hpet_in_bar);
+
 #ifdef CONFIG_PCI_MSI
 /* Some chipsets do not support MSI. We cannot easily rely on setting
  * PCI_BUS_FLAGS_NO_MSI in its bus flags because there are actually



Should I stop working on the following patch?
[PATCH] x86: split e820 reserved entries record to late v4

Sounds good, I will look at it after I get lspci -tv and lspci -vvxxx from Rafael.

also quirk between probe::pci_read_bases and pci_resource_survey

YH
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 7:16 pm

Now, this is probably fine too in theory, but

 - you didn't check if the BAR is even enabled, afaik

 - the other patch - to move the reserved e820 range later - should make 

No, I think this is worth doing, BUT IT MUST NOT BE MERGED BY JUST SENDING 
IT TO INGO.

It's not an "x86 patch". It's about the PCI resources.

And those kinds of patches need to be acked by people who know and 
understand the PCI resource issues and have some memory of just how 
broken machines can exist.

			Linus
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 7:29 pm

On Fri, Aug 29, 2008 at 7:16 PM, Linus Torvalds


i see.

YH
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 6:11 pm

Btw, what was the original regression that commit 
a2bd7274b47124d2fc4dfdb8c0591f545ba749dd was trying to fix?

It's not listed in that commit, even though the commit has a "Bisected-by: 
David Witbrodt <dawitbro@sbcglobal.net>".

In fact, I can find it with google by searching for

	David Witbrodt bisect

and I see that it is 3def3d6ddf43dbe20c00c3cbc38dfacc8586998f.

I'm wondering why that commit wasn't just reverted? Because now that I see 
it, I notice that _that_ is the real bug to begin with.

That commit really was buggy. NO WAY can you insert the code/bss/data 
resources before you've done e820 handling, because it may well be that 
some strange e820 table contains things that cross the resources.

So that original thing was buggy, and made x86-64 do odd things. They were 
doubly odd, since x86-32 did it differently (and better, I think).

Then, when actually doing the common arch/x86/kernel/setup.c, the commit 
that does so _claims_ that the common code came from the 32-bit version, 
but that doesn't seem to be true, at least wrt this thing. The current 
setup.c comes from the *broken* cleanup of setup_64.c that had been 
bisected to be broken.

And that, in turn, happened in 41c094fd3ca54f1a71233049cf136ff94c91f4ae 
("x86: move e820_resource_resources to e820.c") which also did "and make 
32-bit resource registration more like 64 bit.", so it got the bug into 
32-bit code that had been introduced in 64-bit code.

Ugh.

So why was then that other broken commit added to paper it over, even 
though the original broken commit had been bisected and the breakage was 
known to have been due to _that_?

Hmm?

Yinghai - I'm hoping that the code movement is all over and done with, but 
you need to be a _lot_ more careful here. And Ingo, this really wasn't 
very well done.

			Linus
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 6:30 pm

On Fri, Aug 29, 2008 at 6:11 PM, Linus Torvalds

We reverted that commit, and David's problem still happens.

The root cause is:
Before 2.6.26, we called init_apic_mapping, which did insert_resource for
the lapic address, and then called e820_reserve_resources (with
request_resource) to register the e820 entries.
So the lapic entry in the resource tree prevented some e820 entries
from being registered, and the later request_resource for the BAR res
(== hpet) succeeded.

From 2.6.26 on, we moved the lapic address registration to late_initcall,
so the reserved e820 entry gets into the resource tree first, and the
later pci_resource_survey::request_resource for the BAR res (== hpet,
0xfed00000) fails. Then pci_assign_unassigned... gets a new
res for the BAR, which messes up the hpet setting.

Solutions would be:
1. Use a quirk to protect the hpet in the BAR. Ingo said it is not generic.
2. Or the one you reverted: check_bar_with_valid (hpet, ioapic,
mmconfig), which happened to reveal another problem with Rafael's
system/chipset.
3. Or sticky resources..., but those could partially overlap.
4. Or don't register reserved entries in e820. Eric NACKed that.
5. Or what you suggest: register some reserved entries later, and have
insert_resource_expand_to_fit...
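
The ordering problem described above can be sketched with a toy, flat
resource list. This is only an illustration, not kernel code: toy_request()
stands in for request_resource() at a single level, and every address
except the hpet base 0xfed00000 is an assumption chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

struct toy_res {
	u64 start, end;		/* inclusive range */
	const char *name;
};

#define TOY_MAX 8

struct toy_tree {
	struct toy_res slot[TOY_MAX];
	size_t n;
};

/* Like request_resource() at one level: refuse anything that overlaps. */
static int toy_request(struct toy_tree *t, u64 start, u64 end, const char *name)
{
	size_t i;

	assert(start <= end);
	for (i = 0; i < t->n; i++)
		if (t->slot[i].start <= end && start <= t->slot[i].end)
			return -1;	/* stands in for -EBUSY */
	if (t->n == TOY_MAX)
		return -1;
	t->slot[t->n].start = start;
	t->slot[t->n].end = end;
	t->slot[t->n].name = name;
	t->n++;
	return 0;
}

/* Hypothetical layout: only the hpet base comes from the thread. */
#define HPET_START	0xfed00000ULL
#define HPET_END	0xfed003ffULL
#define LAPIC_START	0xfee00000ULL
#define LAPIC_END	0xfee00fffULL
#define E820_START	0xfec00000ULL	/* assumed reserved e820 entry */
#define E820_END	0xffffffffULL

/* Pre-2.6.26 order: lapic first, then e820, then the BAR. */
int hpet_bar_claim_old_order(void)
{
	struct toy_tree t = { .n = 0 };

	toy_request(&t, LAPIC_START, LAPIC_END, "Local APIC");
	toy_request(&t, E820_START, E820_END, "reserved");	/* refused */
	return toy_request(&t, HPET_START, HPET_END, "BAR (hpet)");
}

/* 2.6.26 order: e820 first (lapic moved to late_initcall), then the BAR. */
int hpet_bar_claim_new_order(void)
{
	struct toy_tree t = { .n = 0 };

	toy_request(&t, E820_START, E820_END, "reserved");	/* accepted */
	return toy_request(&t, HPET_START, HPET_END, "BAR (hpet)");
}
```

In this model the BAR claim only succeeded in the old order because the
lapic entry happened to knock the reserved e820 entry out of the tree.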

YH
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 7:33 pm

So the problem there was that traditionally, e820_reserve_resource() 
expected to be the first one to populate any resources. That's changed, 
and that's why it now needs to use "insert_resource()" rather than 


Yeah, I don't like it. The quirk I was talking about was the one about 




Yeah, no, we do want reserved entries from e820 to show up to at least 

Yes. And I do think this is a workable model.

			Linus
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 7:56 pm

Ok, and here's the patch to do

	insert_resource_expand_to_fit(root, new);

and while I still haven't actually tested it, it looks sane and compiles 
to code that also looks sane.

I'll happily commit this as basic infrastructure as soon as somebody ack's 
it and tests that it works (and I'll try it myself soon enough, just for 
fun)

		Linus

---
 include/linux/ioport.h |    1 +
 kernel/resource.c      |   88 ++++++++++++++++++++++++++++++++++-------------
 2 files changed, 64 insertions(+), 25 deletions(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 22d2115..8d3b7a9 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -109,6 +109,7 @@ extern struct resource iomem_resource;
 extern int request_resource(struct resource *root, struct resource *new);
 extern int release_resource(struct resource *new);
 extern int insert_resource(struct resource *parent, struct resource *new);
+extern void insert_resource_expand_to_fit(struct resource *root, struct resource *new);
 extern int allocate_resource(struct resource *root, struct resource *new,
 			     resource_size_t size, resource_size_t min,
 			     resource_size_t max, resource_size_t align,
diff --git a/kernel/resource.c b/kernel/resource.c
index f5b518e..72ee95b 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -362,35 +362,21 @@ int allocate_resource(struct resource *root, struct resource *new,
 
 EXPORT_SYMBOL(allocate_resource);
 
-/**
- * insert_resource - Inserts a resource in the resource tree
- * @parent: parent of the new resource
- * @new: new resource to insert
- *
- * Returns 0 on success, -EBUSY if the resource can't be inserted.
- *
- * This function is equivalent to request_resource when no conflict
- * happens. If a conflict happens, and the conflicting resources
- * entirely fit within the range of the new resource, then the new
- * resource is inserted and the conflicting resources become children of
- * the new resource.
+/*
+ * Insert a ...
From: Yinghai Lu
Date: Friday, August 29, 2008 - 8:07 pm

On Fri, Aug 29, 2008 at 7:56 PM, Linus Torvalds

We need to use insert_resource_split_to_fit instead...

Otherwise __request_region will not be happy.

I have a shrink version.

It only works with
  |----------------|
          |---------------------|

It still has a problem with

  |----------------| |------------|     |-----------|
          |------------------------------------|

We need to get rid of the middle one too.
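
A rough sketch of what "split to fit" could mean for the second picture:
keep only the parts of the new range that no existing resource covers. The
function name split_to_fit() and the addresses are made up for this
example; it works on a flat, sorted array rather than the real resource
tree.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

struct span {
	u64 start, end;		/* inclusive */
};

/*
 * Write the parts of 'want' not covered by any entry of 'busy' into 'out'.
 * 'busy' must be sorted by start and non-overlapping. Returns the number
 * of pieces written (never more than 'max_out').
 */
size_t split_to_fit(const struct span *busy, size_t nbusy,
		    struct span want, struct span *out, size_t max_out)
{
	size_t i, n = 0;
	u64 cur = want.start;

	assert(want.start <= want.end);
	for (i = 0; i < nbusy; i++) {
		if (busy[i].end < cur)
			continue;	/* entirely before what is left */
		if (busy[i].start > want.end)
			break;		/* entirely after the request */
		if (busy[i].start > cur && n < max_out) {
			out[n].start = cur;
			out[n].end = busy[i].start - 1;
			n++;
		}
		if (busy[i].end >= want.end)
			return n;	/* nothing left past this one */
		cur = busy[i].end + 1;
	}
	if (n < max_out) {
		out[n].start = cur;
		out[n].end = want.end;
		n++;
	}
	return n;
}

/* The three-resource picture above, with made-up addresses. */
size_t demo_split(struct span *out, size_t max_out)
{
	static const struct span busy[] = {
		{ 0x1000, 0x1fff }, { 0x3000, 0x3fff }, { 0x6000, 0x6fff },
	};
	struct span want = { 0x1800, 0x67ff };

	return split_to_fit(busy, 3, want, out, max_out);
}
```

For that layout the sketch leaves two pieces, one in each gap between the
existing resources, and drops the parts that overlap them.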

YH


---
 arch/x86/kernel/e820.c |   20 +++++++++++++-
 include/linux/ioport.h |    2 +
 kernel/resource.c      |   66 ++++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 74 insertions(+), 14 deletions(-)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -1319,8 +1319,24 @@ void __init e820_reserve_resources_late(

        res = e820_res;
        for (i = 0; i < e820.nr_map; i++) {
-               if (!res->parent && res->end)
-                       insert_resource(&iomem_resource, res);
+#if 1
+               /* test for shrink_with fit */
+               if (!res->parent && res->end) {
+                       if (res->start == 0xe0000000)
+                               res->start = 0xde000000;
+               }
+#endif
+
+               if (!res->parent && res->end &&
+                   insert_resource(&iomem_resource, res)) {
+
+                       printk(KERN_WARNING "found conflict for %s [%08llx, %08llx], try to insert with shrink\n",
+                               res->name, res->start, res->end);
+
+                       insert_resource_shrink_to_fit(&iomem_resource, res);
+
+                       printk(KERN_WARNING "   shrink to %s [%08llx, %08llx]\n",
+                               res->name, res->start, res->end);
+               }
                res++;
        }
 }
Index: linux-2.6/include/linux/ioport.h
===================================================================
--- ...
From: Linus Torvalds
Date: Friday, August 29, 2008 - 8:24 pm

Are you really really sure?

Try just removing the IORESOURCE_BUSY. As mentioned, if we expect the PCI 
BAR's to work with the e820 resources, then BUSY really is simply not 
right any more. Not that I think it should matter either..

The ones that are added _early_ should be IORESOURCE_BUSY (ie the ones 
that cover RAM), but the others we now expect to nest with PCI BARs.

But since we add them after we have parsed the BAR's, I don't even see why 
the BUSY bit should even matter - we've already added the fixed BARs, and 
any newly allocated non-fixed ones shouldn't be allocated in e820 areas 
_regardless_ of whether the BUSY bit is set or not.

So pls explain why it matters?

		Linus
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 9:41 pm

On Fri, Aug 29, 2008 at 8:24 PM, Linus Torvalds

Not all. Some are for MMCONF, some are for GART, and some for the fixed lapic.

If we don't add IORESOURCE_BUSY, why bother to add these entries...

With a good layout from the BIOS, it should only reserve MMIO ranges that
do not show up in a BAR.

For example:
0xdc000000 - 0xdd000000 for GART ( some offset BAR 0x94)
0xdd000000 - 0xde000000 is for bus 0x80
0xde000000 - 0xdf000000 is for bus 0x00
0xe0000000 - 0xf0000000 is for mmconfig ( CPU set it in MSR for amd fam 10h)

If one stupid BIOS sets
0xdc000000 - 0x100000000 as reserved,

then when we insert that range late
we should still set the ranges other than 0xdd000000 - 0xdf000000.

Also, do we need to set other leaf ranges in 0xdd000000 - 0xdf000000?

YH
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 10:02 pm

We may not need to put reserved entries from e820 into the resource tree,
and only insert those sticky resources (with _BUSY) before
pci_assign_unassign and _request_region etc.

YH
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 10:52 pm

You don't understand how the resource allocator works.

IORESOURCE_BUSY is really more of a "legacy bit". It has almost no bearing 
on the actual allocations.

Just grep for IORESOURCE_BUSY in kernel/resource.c. The _only_ thing that 
cares about busy/non-busy is the legacy "request_region()" function. That 
one isn't actually used by any core PCI code - it's more of a driver 
issue to claim exclusive ownership of particular resources by inserting a 
marker in that resource.

So IORESOURCE_BUSY is a red herring. The only reason I said you can clear 
it is because you claimed it causes problems, but the more I look at it, 
the more I think you're likely just mistaken - because IORESOURCE_BUSY 
doesn't make any difference at all to normal resource handling until you 
get to actual drivers.

The bigger issue is that just inserting the resource (and it really 
doesn't matter if it is marked busy or not) is in itself a mark of 
"there's something here". THAT is what all the resource code cares about. 
The IORESOURCE_BUSY bit is almost immaterial (ie _is_ immaterial except 
for some very specific cases).
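
To make the distinction concrete, here is a toy model of a driver-style
region claim. It is a sketch under stated assumptions, not the code in
kernel/resource.c: TOY_BUSY stands in for IORESOURCE_BUSY, the tree is
built by hand, and the function only reports whether a claim would succeed
without inserting anything. The addresses are the ones from the qla2xxx
conflict reported in this thread.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

#define TOY_BUSY 0x1u	/* stands in for IORESOURCE_BUSY */

struct toy_res {
	u64 start, end;			/* inclusive */
	unsigned int flags;
	struct toy_res *child, *sibling;
};

/* Return the node that blocks [start, end] under 'parent', or NULL. */
static struct toy_res *find_conflict(struct toy_res *parent, u64 start, u64 end)
{
	struct toy_res *c;

	if (start < parent->start || end > parent->end)
		return parent;		/* does not fit inside the parent */
	for (c = parent->child; c; c = c->sibling)
		if (c->start <= end && start <= c->end)
			return c;
	return NULL;
}

/*
 * Driver-style claim: descend through non-busy conflicts, give up on the
 * first busy one. Returns 0 if the region could be claimed, -1 otherwise.
 */
int toy_request_region(struct toy_res *parent, u64 start, u64 end)
{
	assert(start <= end);
	for (;;) {
		struct toy_res *conflict = find_conflict(parent, start, end);

		if (!conflict)
			return 0;
		if (conflict != parent && !(conflict->flags & TOY_BUSY)) {
			parent = conflict;	/* try again one level down */
			continue;
		}
		return -1;
	}
}

/* The qla2xxx claim, with the covering e820 entry busy or not. */
int demo_claim(unsigned int e820_flags)
{
	struct toy_res reserved = {
		0xdd000000ULL, 0xefffffffULL, e820_flags, NULL, NULL
	};
	struct toy_res root = { 0, ~0ULL, 0, &reserved, NULL };

	return toy_request_region(&root, 0xddffc000ULL, 0xddffffffULL);
}
```

In this model the mere presence of the "reserved" entry does not stop the
claim; only the busy flag on it does.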

And the reason we need to add the e820 resources is exactly so that we 
don't try to allocate PCI resources on top of some system resources we 

I agree, but "good layout" and "BIOS" don't really go together. There's 

Sure, but really, the only point of even caring about e820 resources in 
the first place has really nothing to do with the BAR's we can see 
(because the kernel can handle _those_ perfectly well on its own), and has 
everything to do with the fact that a lot of devices have invisible 
resources that we _cannot_ see (ie magic non-standard BAR's for the 
motherboard chips).

And those are exactly why we want to populate the resource map with the 
e820 information - to avoid having dynamic resources (like Cardbus or PCI 
hotplug, or just devices that weren't set up statically by the BIOS) be 
then allocated by the kernel on top of those "invisible" ...
From: Linus Torvalds
Date: Friday, August 29, 2008 - 11:18 pm

And just to clarify - I think that while you get that error for the 
qla2xxx driver, I suspect that your actual resource tree is all good, and 
that the PCI allocations were fine.

And then the problem you hit is now that the driver literally thinks that 
some other driver already took that resource.

The patch I just sent is not actually the patch I think you should do: the 
proper patch is to just remove IORESOURCE_BUSY from the e820 resources, 
simply because they are _not_ indicative of a driver already holding on to 
the resource.

Of course, the sad part is that potentially IORESOURCE_BUSY might actually 
be a really good bit for exactly that - we've had tons of issues with 
hardware sensors literally having a kernel driver _and_ a system level 
driver (ie ACPI), and things get confused exactly because there are now 
two drivers trying to drive the same piece of hardware.

But basically, if you have BAR's and the e820 resource areas co-existing, 
then the e820 resources shouldn't be marked BUSY.

Anyway - to just re-cap - you might as well just ignore the patch I just 
sent out, and instead just avoid doing that BUSY bit to begin with in the 
"late e820" case. Simpler and more correct.

		Linus
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 1:02 am

please check fix v3

[PATCH] x86: split e820 reserved entries record to late v4 - fix v3

Try insert_resource a second time, by expanding the resource...

for the case where an e820 reserved entry partially overlaps a BAR res...

hope it will never happen

v3: use reserve_region_with_split() instead to handle overlapping
       with a test case that extends 0xe0000000 - 0xefffffff to 0xdd800000 -
       get
               e0000000-efffffff : PCI MMCONFIG 0
                        e0000000-efffffff : reserved
       in /proc/iomem
       get
               found conflict for reserved [dd800000, efffffff], try to reserve with split
                   __reserve_region_with_split: (PCI Bus #80) [dd000000, ddffffff], res: (reserved) [dd800000, efffffff]
                   __reserve_region_with_split: (PCI Bus #00) [de000000, dfffffff], res: (reserved) [de000000, efffffff]
               initcall pci_subsys_init+0x0/0x121 returned 0 after 381 msecs
       in dmesg

YH
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 10:22 pm

On Fri, Aug 29, 2008 at 8:24 PM, Linus Torvalds


please check

  __request_region: conflict: (reserved) [dd000000, efffffff], res: (qla2xxx) [ddffc000, ddffffff]
busy flag
qla2xxx 0000:83:00.0: BAR 1: can't reserve mem region [0xddffc000-0xddffffff]

YH

...



Initializing cgroup subsys cpuset
Linux version 2.6.27-rc5-tip-00672-ge5c5407-dirty (yhlu@linux-zpir) (gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036] (SUSE Linux) ) #220 SMP Fri Aug 29 22:02:53 PDT 2008
Command line: console=uart8250,io,0x3f8,115200n8 initrd=kernel.org/mydisk11_x86_64.gz rw root=/dev/ram0 debug show_msr=1 i8042.noaux initcall_debug apic=verbose pci=routeirq ip=dhcp load_ramdisk=1 ramdisk_size=131072 BOOT_IMAGE=kernel.org/bzImage_2.6.27_k8.1
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  Centaur CentaurHauls
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 0000000000097400 (usable)
 BIOS-e820: 0000000000097400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000d7fa0000 (usable)
 BIOS-e820: 00000000d7fae000 - 00000000d7fb0000 (usable)
 BIOS-e820: 00000000d7fb0000 - 00000000d7fbe000 (ACPI data)
 BIOS-e820: 00000000d7fbe000 - 00000000d7ff0000 (ACPI NVS)
 BIOS-e820: 00000000d7ff0000 - 00000000d8000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
 BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000002028000000 (usable)
Early serial console at I/O port 0x3f8 (options '115200n8')
console [uart0] enabled
insert_resource: parent: (PCI mem) [0, ffffffffffffffff], new: (Kernel
code) ...
From: Linus Torvalds
Date: Friday, August 29, 2008 - 11:11 pm

Ok, this is actually when the driver wants to reserve the BAR, and then it 
notices that there is an existing "reservation" there.

So yes, drivers will care - they literally will think that somebody else 
owns their resource if they have a BUSY resource inside of them. So this 
is a driver protecting against another driver.

The sad part is that it looks like it's entirely due to the PCI code 
trying to emulate an ISA driver model, and use a flat resource space - so 
it hits the upper resources first.

Does this patch make a difference? It actually removes a fair chunk of 
code, by just saying "we really don't care if the resource is IO or MEM, 
we just want to reserve space inside of it, regardless of type".

Untested - obviously.

		Linus

---
 drivers/pci/pci.c |   26 +++++++++-----------------
 1 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index c9884bb..a3de4fe 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1304,15 +1304,11 @@ pci_get_interrupt_pin(struct pci_dev *dev, struct pci_dev **bridge)
 void pci_release_region(struct pci_dev *pdev, int bar)
 {
 	struct pci_devres *dr;
+	struct resource *res = pdev->resource + bar;
 
 	if (pci_resource_len(pdev, bar) == 0)
 		return;
-	if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
-		release_region(pci_resource_start(pdev, bar),
-				pci_resource_len(pdev, bar));
-	else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM)
-		release_mem_region(pci_resource_start(pdev, bar),
-				pci_resource_len(pdev, bar));
+	__release_region(res, pci_resource_start(pdev, bar), pci_resource_len(pdev, bar));
 
 	dr = find_pci_dr(pdev);
 	if (dr)
@@ -1336,20 +1332,16 @@ void pci_release_region(struct pci_dev *pdev, int bar)
 int pci_request_region(struct pci_dev *pdev, int bar, const char *res_name)
 {
 	struct pci_devres *dr;
+	struct resource *res = pdev->resource + bar;
 
 	if (pci_resource_len(pdev, bar) == 0)
 		return 0;
-		
-	if ...
From: Linus Torvalds
Date: Friday, August 29, 2008 - 8:15 pm

.. and it even works (apart from a missing '\n' for the expansion report 
;).

I tested it with the appended silly test-case, and it shows

	...
	 BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
	 BIOS-e820: 0000000100000000 - 0000000160000000 (usable)
	Expanded resource Kernel dummy due to conflict with Kernel code
	Expanded resource Kernel dummy due to conflict with Kernel data
	last_pfn = 0x160000 max_arch_pfn = 0x3ffffffff
	x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
	...

and /proc/iomem shows

	...
	00100000-9cf64fff : System RAM
	  00200000-006ea27f : Kernel dummy
	    00200000-00561f37 : Kernel code
	    00561f38-006ea27f : Kernel data
	  00777000-007d6cc7 : Kernel bss
	...

so it correctly expanded that "Kernel dummy" resource to cover the 
resources it had clashed with.

And no, it's not perfect. We certainly _could_ split things instead. But I 
hope that odd "e820 resources were bogus" case almost never would actually 
trigger in practice, and the expansion case is not only simpler, it's also 
slightly more robust in the sense that a single big resource is likely to 
fit the things we need than multiple smaller resources that have been 
chopped up.
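
The expansion idea can be sketched outside the kernel as follows. This is
not the patch itself: it uses a flat array instead of the resource tree,
and the starting range chosen for "Kernel dummy" is an assumption, since
the values used in the real test are not shown here. The existing ranges
are the ones from the /proc/iomem output above.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

struct span {
	u64 start, end;		/* inclusive */
};

/*
 * Grow 'want' until every existing range it touches lies entirely inside
 * it, then return the grown range. Ranges it never touches are ignored.
 */
struct span expand_to_fit(const struct span *existing, size_t n,
			  struct span want)
{
	int changed;
	size_t i;

	assert(want.start <= want.end);
	do {
		changed = 0;
		for (i = 0; i < n; i++) {
			const struct span *e = &existing[i];

			if (e->end < want.start || e->start > want.end)
				continue;	/* no overlap */
			if (e->start < want.start) {
				want.start = e->start;
				changed = 1;
			}
			if (e->end > want.end) {
				want.end = e->end;
				changed = 1;
			}
		}
	} while (changed);	/* growing may reach further ranges */
	return want;
}

/* Kernel code, data and bss from /proc/iomem; the dummy range is assumed. */
struct span demo_expand(void)
{
	static const struct span existing[] = {
		{ 0x00200000, 0x00561f37 },	/* Kernel code */
		{ 0x00561f38, 0x006ea27f },	/* Kernel data */
		{ 0x00777000, 0x007d6cc7 },	/* Kernel bss */
	};
	struct span dummy = { 0x00300000, 0x00600000 };	/* hypothetical */

	return expand_to_fit(existing, 3, dummy);
}
```

With those inputs the dummy range grows to cover code and data and leaves
bss alone, which matches the nesting shown in /proc/iomem.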

		Linus

--- dummy test patch for the 'insert-resource-expand-to-fit' thing ---
 arch/x86/kernel/setup.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 362d4e7..6265a38 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -578,6 +578,14 @@ static struct x86_quirks default_x86_quirks __initdata;
 
 struct x86_quirks *x86_quirks __initdata = &default_x86_quirks;
 
+static struct resource dummy_resource = {
+	.name	= "Kernel dummy",
+	.start	= 0,
+	.end	= 0,
+	.flags	= IORESOURCE_BUSY | IORESOURCE_MEM
+};
+
+
 /*
  * Determine if we were loaded by an EFI loader.  If so, then we have also been
  * passed the efi memmap, systab, etc., so we should use ...
From: Yinghai Lu
Date: Friday, August 29, 2008 - 8:00 pm

On Fri, Aug 29, 2008 at 7:33 PM, Linus Torvalds

Originally it worked, because the lapic address entry opened the big hole for


if update res->end according mmconfig end, before insert it forcibly,

BTW, insert_resource_expand_to_fit needs to be replaced with
insert_resource_split_to_fit....
A test stub reveals that expanding makes __request_region fail for
some devices, because the reserved entries from e820 carry
IORESOURCE_BUSY...

YH
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 8:10 pm

Except it's still a horrible patch that special-cases all the wrong things 
(ie random resources that we just happen to know that ACPI etc cares 
about).

There's no way to know in general if ACPI might care deeply where some 
random resource is (say, graphics memory) and it might be done with a BAR.


Well, we should probably just remove the IORESOURCE_BUSY part.

Again, that comes from the fact that the e820 resources used to _override_ 
everything - they were inserted first, and nothing else was _ever_ allowed 
to allocate in that region. 

But if we're changing that, then the whole IORESOURCE_BUSY part doesn't 
make sense.

In fact, in general, IORESOURCE_BUSY doesn't much make sense any more in 
general, because it was actually more of an ISA-timeframe locking model 
saying "you can't touch this region". But if the whole point is that we 
now try to allow PCI device BAR's and the e820 maps to co-exist, then the 
whole - and only - reason for IORESOURCE_BUSY for them goes away..

			Linus
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 5:20 pm

On Fri, Aug 29, 2008 at 5:08 PM, Linus Torvalds

We need to handle it. Otherwise, if that BAR goes first, it will stop
other BARs from being registered...

A quirk should do the work....

Rafael, can you send out lspci -tv and lspci -vvxxx too?

YH
--

From: Yinghai Lu
Date: Friday, August 29, 2008 - 5:27 pm

cat /proc/iomem please.

YH
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 9:05 am

00:00.0 Host bridge: ATI Technologies Inc RD790 Northbridge only dual slot PCI-e_GFX and HT3 K8 part
	Subsystem: ATI Technologies Inc RD790 Northbridge only dual slot PCI-e_GFX and HT3 K8 part
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
	Latency: 0
	Region 3: Memory at <ignored> (64-bit, non-prefetchable)
	Capabilities: [c4] HyperTransport: Slave or Primary Interface
		Command: BaseUnitID=0 UnitCnt=12 MastHost- DefDir- DUL-
		Link Control 0: CFlE- CST- CFE- <LkFail- Init+ EOC- TXO- <CRCErr=0 IsocEn- LSEn- ExtCTL- 64b-
		Link Config 0: MLWI=16bit DwFcIn- MLWO=16bit DwFcOut- LWI=16bit DwFcInEn- LWO=16bit DwFcOutEn-
		Link Control 1: CFlE- CST- CFE- <LkFail+ Init- EOC+ TXO+ <CRCErr=0 IsocEn- LSEn- ExtCTL- 64b-
		Link Config 1: MLWI=8bit DwFcIn- MLWO=8bit DwFcOut- LWI=8bit DwFcInEn- LWO=8bit DwFcOutEn-
		Revision ID: 3.00
		Link Frequency 0: [b]
		Link Error 0: <Prot- <Ovfl- <EOC- CTLTm-
		Link Frequency Capability 0: 200MHz+ 300MHz- 400MHz+ 500MHz- 600MHz+ 800MHz+ 1.0GHz+ 1.2GHz+ 1.4GHz- 1.6GHz- Vend-
		Feature Capability: IsocFC- LDTSTOP+ CRCTM- ECTLT- 64bA- UIDRD-
		Link Frequency 1: 200MHz
		Link Error 1: <Prot- <Ovfl- <EOC- CTLTm-
		Link Frequency Capability 1: 200MHz- 300MHz- 400MHz- 500MHz- 600MHz- 800MHz- 1.0GHz- 1.2GHz- 1.4GHz- 1.6GHz- Vend-
		Error Handling: PFlE- OFlE- PFE- OFE- EOCFE- RFE- CRCFE- SERRFE- CF- RE- PNFE- ONFE- EOCNFE- RNFE- CRCNFE- SERRNFE-
		Prefetchable memory behind bridge Upper: 00-00
		Bus Number: 00
	Capabilities: [40] HyperTransport: Retry Mode
	Capabilities: [54] HyperTransport: UnitID Clumping
	Capabilities: [9c] HyperTransport: #1a
00: 02 10 56 59 06 00 30 22 00 00 00 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 04 00 00 e0

so bar3 of 00:00.0 has 0xe0000000 - 0xffffffff
and request_resource failed, so
	Region 3: Memory at <ignored> (64-bit, non-prefetchable)
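
(Illustrative sketch, not part of the original mail: the dump above can be decoded by hand. The helper names are made up for this example; the byte values are the 32 bytes shown in the dump, and the layout assumed is the standard type 0 header, with the command word at 0x04 and BAR n at 0x10 + 4*n.)

```c
#include <assert.h>
#include <stdint.h>

/* First 32 bytes of 0000:00:00.0 config space, copied from the dump above. */
static const uint8_t cfg[32] = {
    0x02, 0x10, 0x56, 0x59, 0x06, 0x00, 0x30, 0x22,
    0x00, 0x00, 0x00, 0x06, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0xe0,
};

/* Config space is little-endian. */
static uint16_t cfg_read16(const uint8_t *c, unsigned off)
{
    return (uint16_t)(c[off] | (c[off + 1] << 8));
}

static uint32_t cfg_read32(const uint8_t *c, unsigned off)
{
    return (uint32_t)c[off] | ((uint32_t)c[off + 1] << 8) |
           ((uint32_t)c[off + 2] << 16) | ((uint32_t)c[off + 3] << 24);
}

/* BAR n lives at offset 0x10 + 4*n in a type 0 header. */
static unsigned bar_offset(unsigned n)
{
    assert(n < 6);
    return 0x10 + 4 * n;
}

/* Bit 1 of the command word enables memory decoding ("Mem+" in lspci). */
static int mem_decode_enabled(const uint8_t *c)
{
    return (cfg_read16(c, 0x04) >> 1) & 1;
}

/* Bits 2:1 of a memory BAR give its type; binary 10 means 64-bit. */
static int bar_is_64bit(uint32_t bar)
{
    return (bar & 0x6u) == 0x4u;
}

/* The low four bits are flags; the rest is the base address. */
static uint32_t bar_base_low(uint32_t bar)
{
    return bar & ~0xfu;
}
```

Under those assumptions, BAR 3 sits at offset 0x1c, reads 0xe0000004, and decodes as a 64-bit non-prefetchable memory BAR based at 0xe0000000, which agrees with the "Region 3" line.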

BIOS should hide ...
From: Rafael J. Wysocki
Date: Saturday, August 30, 2008 - 10:14 am

Could you please rebase them on top of current -git?

Rafael
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 10:55 am

please check attached quilt series based on linus tree.

YH
From: Yinghai Lu
Date: Saturday, August 30, 2008 - 11:11 am

there is some problem with fix -v4...on one test machine.

please don't use it now

YH
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 12:06 pm

this one should work.

YH
From: Yinghai Lu
Date: Saturday, August 30, 2008 - 1:10 pm

calling  pci_subsys_init+0x0/0x120
PCI: Using ACPI for IRQ routing
request_resource: root: (PCI IO) [0, ffff], new: (PCI Bus 0000:01)
[9000, 9fff] conflict 0
request_resource: root: (PCI mem) [0, ffffffffffffffff], new: (PCI Bus
0000:01) [fe700000, fe7fffff] conflict 0
request_resource: root: (PCI mem) [0, ffffffffffffffff], new: (PCI Bus
0000:01) [d8000000, dfffffff] conflict 0
request_resource: root: (PCI IO) [0, ffff], new: (PCI Bus 0000:02)
[a000, bfff] conflict 0
request_resource: root: (PCI mem) [0, ffffffffffffffff], new: (PCI Bus
0000:02) [fe800000, fe8fffff] conflict 0
request_resource: root: (PCI IO) [0, ffff], new: (PCI Bus 0000:03)
[c000, cfff] conflict 0
request_resource: root: (PCI mem) [0, ffffffffffffffff], new: (PCI Bus
0000:03) [fe900000, fe9fffff] conflict 0
request_resource: root: (PCI IO) [0, ffff], new: (PCI Bus 0000:04)
[d000, efff] conflict 0
request_resource: root: (PCI mem) [0, ffffffffffffffff], new: (PCI Bus
0000:04) [fea00000, feafffff] conflict 0
request_resource: root: (PCI mem) [0, ffffffffffffffff], new: (PCI Bus
0000:05) [feb00000, febfffff] conflict 0
request_resource: root: (PCI mem) [0, ffffffffffffffff], new:
(0000:00:00.0) [e0000000, ffffffff] conflict 1
pci 0000:00:00.0: BAR 3: can't allocate resource

so pci_resource_survey is depth first. sub buses request some resource
at first...
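
(Illustrative sketch, not kernel code: why the request for BAR 3 reports a conflict once the sub-bus windows are already in the tree. It assumes the simple rule that a new range conflicts with any already-registered sibling it overlaps; `struct range`, `ranges_overlap` and `first_conflict` are hypothetical names, and the ranges are the PCI mem windows from the log above, in the order they were requested.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range {
    uint64_t start;
    uint64_t end;   /* inclusive, like the [start, end] pairs in the log */
};

/* PCI mem bus windows from the log, in registration order. */
static const struct range registered[] = {
    { 0xfe700000u, 0xfe7fffffu },   /* PCI Bus 0000:01 */
    { 0xd8000000u, 0xdfffffffu },   /* PCI Bus 0000:01 (second window) */
    { 0xfe800000u, 0xfe8fffffu },   /* PCI Bus 0000:02 */
    { 0xfe900000u, 0xfe9fffffu },   /* PCI Bus 0000:03 */
    { 0xfea00000u, 0xfeafffffu },   /* PCI Bus 0000:04 */
    { 0xfeb00000u, 0xfebfffffu },   /* PCI Bus 0000:05 */
};

static int ranges_overlap(struct range a, struct range b)
{
    return a.start <= b.end && b.start <= a.end;
}

/* Index of the first registered sibling that req overlaps, or -1 if none. */
static int first_conflict(const struct range *siblings, size_t n,
                          struct range req)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (ranges_overlap(siblings[i], req))
            return (int)i;
    return -1;
}
```

With the windows registered first, a request for [e0000000, ffffffff] overlaps the fe700000 window straight away, while a request that stays below d8000000 would go through.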

we don't need quirk to handle that strange BAR res.

and we got the reserved regions registered correctly
in /proc/iomem
d7fe0000-d7ffffff : reserved
..
fff00000-ffffffff : reserved

for
 BIOS-e820: 00000000d7fe0000 - 00000000d8000000 (reserved)
 BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
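
(Illustrative sketch: the BIOS-e820 lines give an exclusive end address, while /proc/iomem shows an inclusive one. The conversion below mirrors the `end = addr + size - 1` line in the e820 patch; the `struct range` type and function name are made up for the example.)

```c
#include <assert.h>
#include <stdint.h>

struct range {
    uint64_t start;
    uint64_t end;   /* inclusive, as printed in /proc/iomem */
};

/* Turn an e820 (addr, size) entry into an inclusive resource range. */
static struct range e820_to_resource(uint64_t addr, uint64_t size)
{
    struct range r;

    assert(size > 0);
    r.start = addr;
    r.end = addr + size - 1;
    return r;
}
```

For the first entry above, addr is 0xd7fe0000 and size is 0xd8000000 - 0xd7fe0000 = 0x20000, giving d7fe0000-d7ffffff.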

YH
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 2:44 pm

Heh, interesting, since we were talking about reverting that one for other 
reasons entirely.

See the thread "x86: split e820 reserved entries record to late" (yeah, I 
know that subject isn't very grammatical or sensible) for some patches 
worth trying _after_ you've reverted that one.

Anyway, clearly that commit needs to be reverted regardless, so I'll do 
the revert. Can you please test the appended test-patch by Yinghai on top 
of the revert?

(This is not the final version, but it should be sufficient to be tested)

And if you have the whole dmesg, that would be useful.

		Linus

---
From: Yinghai Lu <yhlu.kernel@gmail.com>
Subject: [PATCH] x86: split e820 reserved entries record to late v3
Date: Thu, 28 Aug 2008 17:41:29 -0700

so could let BAR res register at first, or even pnp?

v2: insert e820 reserve resources before pnp_system_init
v3: fix merging problem in tip/x86/core
    please drop the one in tip/x86/core use this one instead

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>

---
 arch/x86/kernel/e820.c |   20 ++++++++++++++++++--
 arch/x86/pci/i386.c    |    3 +++
 include/asm-x86/e820.h |    1 +
 3 files changed, 22 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -1271,13 +1271,15 @@ static inline const char *e820_type_to_s
 /*
  * Mark e820 reserved areas as busy for the resource manager.
  */
+struct resource __initdata *e820_res;
 void __init e820_reserve_resources(void)
 {
 	int i;
-	struct resource *res;
 	u64 end;
+	struct resource *res;
 
 	res = alloc_bootmem_low(sizeof(struct resource) * e820.nr_map);
+	e820_res = res;
 	for (i = 0; i < e820.nr_map; i++) {
 		end = e820.map[i].addr + e820.map[i].size - 1;
 #ifndef CONFIG_RESOURCES_64BIT
@@ -1291,7 +1293,8 @@ void __init e820_reserve_resources(void)
 		res->end = end;
 
 		res->flags = ...
From: Rafael J. Wysocki
Date: Friday, August 29, 2008 - 3:30 pm

dmesg from -rc5 with the offending commit reverted and with the patch
below applied is at:

http://www.sisk.pl/kernel/debug/mainline/2.6.27-rc5/2.6.27-rc5-git.log

Thanks,


--

From: Linus Torvalds
Date: Saturday, August 30, 2008 - 10:39 am

Ok, the more I look at this, the more interesting it gets.

In particular, this:

	...
	ACPI: bus type pnp registered
	pnp 00:08: mem resource (0xfec00000-0xfec00fff) overlaps 0000:00:00.0 BAR 3 (0xe0000000-0xffffffff), disabling
	pnp 00:08: mem resource (0xfee00000-0xfee00fff) overlaps 0000:00:00.0 BAR 3 (0xe0000000-0xffffffff), disabling
	pnp 00:09: mem resource (0xffb80000-0xffbfffff) overlaps 0000:00:00.0 BAR 3 (0xe0000000-0xffffffff), disabling
	pnp 00:09: mem resource (0xfff00000-0xffffffff) overlaps 0000:00:00.0 BAR 3 (0xe0000000-0xffffffff), disabling
	pnp 00:0b: mem resource (0xe0000000-0xefffffff) overlaps 0000:00:00.0 BAR 3 (0xe0000000-0xffffffff), disabling
	pnp 00:0c: mem resource (0xfec00000-0xffffffff) overlaps 0000:00:00.0 BAR 3 (0xe0000000-0xffffffff), disabling
	pnp: PnP ACPI: found 13 devices
	ACPI: ACPI bus type pnp unregistered
	SCSI subsystem initialized
	libata version 3.00 loaded.
	usbcore: registered new interface driver usbfs
	usbcore: registered new interface driver hub
	usbcore: registered new device driver usb
	PCI: Using ACPI for IRQ routing
	pci 0000:00:00.0: BAR 3: can't allocate resource
	...

there's a few things to note here:

 - the resource at 0000:00:00.0 BAR 3 is totally bogus.

   We know it's totally bogus because you actually have other resources in 
   the 0xf....... range, and they work fine. It's also likely to be 
   totally bogus because it so happens that the end-point of 0xffffffff is 
   commonly something that the BIOS leaves as a "I sized this resource", 
   because that's how resources are sized (you write all ones into them 
   and look what you can read back).

   But your lspci -vxx output clearly shows that (a) MEM is enabled in 
   the command word, and (b) the BAR register at 0x1c does indeed have 
   value 0xe0000000. So it's just the length that is really bogus.

 - pnp clearly sees that bogus resource at 0xe0000000-0xffffffff

 - BUT: the "can't allocate resource" thing is from 
   ...
From: Yinghai Lu
Date: Saturday, August 30, 2008 - 11:07 am

On Sat, Aug 30, 2008 at 10:39 AM, Linus Torvalds

again, should use MCFG end as the res _end

YH
--

From: Linus Torvalds
Date: Saturday, August 30, 2008 - 11:43 am

No. Again - we shouldn't DO that insane crap.

We simply shouldn't try to compare the BAR start with randomly chosen 
things.

So the crap got reverted, and it's not going to get done again. Get over 
it.


			Linus
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 12:10 pm

On Sat, Aug 30, 2008 at 11:43 AM, Linus Torvalds

do you agree to use a quirk to give the BAR res the correct end
between pci_probe and pci_resource_survey?

YH
--

From: Linus Torvalds
Date: Saturday, August 30, 2008 - 12:31 pm

In general I would agree, but now that I've looked at it a bit more, I 
actually don't think it's a bug in the chipset any more. See my previous 
email that crossed with yours.

I suspect that that northbridge resource is basically acting as a bridge 
resource. So 0xe0000000 - 0xffffffff is actually _correct_. And MCFG being 
in that window (and being first in it) is just a detail.

Look at the resource allocations on Rafael's machine: there are two 
different classes:

 - outside that BAR3 window:

   The "external gfx0 port A" decode (bridged by device 0000:02.0):

	d8000000-dfffffff : PCI Bus 0000:01
	  d8000000-dfffffff : 0000:01:00.0
	    d8000000-d8ffffff : vesafb

   and I suspect the graphics port is special (considering that this is an 
   ATI chipset)

 - inside that BAR3 window: everything else (PCI express):

	e0000000-efffffff : PCI MMCONFIG 0
	fe6f4000-fe6f7fff : 0000:00:14.2
	  fe6f4000-fe6f7fff : ICH HD audio
	fe6fa000-fe6fafff : 0000:00:13.4
	  fe6fa000-fe6fafff : ohci_hcd
	fe6fb000-fe6fbfff : 0000:00:13.3
	  fe6fb000-fe6fbfff : ohci_hcd
	fe6fc000-fe6fcfff : 0000:00:13.2
	  fe6fc000-fe6fcfff : ohci_hcd
	fe6fd000-fe6fdfff : 0000:00:13.1
	  fe6fd000-fe6fdfff : ohci_hcd
	fe6fe000-fe6fefff : 0000:00:13.0
	  fe6fe000-fe6fefff : ohci_hcd
	fe6ff000-fe6ff0ff : 0000:00:13.5
	  fe6ff000-fe6ff0ff : ehci_hcd
	fe6ff800-fe6ffbff : 0000:00:12.0
	  fe6ff800-fe6ffbff : ahci
	fe700000-fe7fffff : PCI Bus 0000:01
	  fe7c0000-fe7dffff : 0000:01:00.0
	  fe7e0000-fe7effff : 0000:01:00.1
	  fe7f0000-fe7fffff : 0000:01:00.0
	fe800000-fe8fffff : PCI Bus 0000:02
	  fe8ffc00-fe8fffff : 0000:02:00.0
	    fe8ffc00-fe8fffff : ahci
	fe900000-fe9fffff : PCI Bus 0000:03
	  fe9c0000-fe9dffff : 0000:03:00.0
	  fe9fc000-fe9fffff : 0000:03:00.0
	    fe9fc000-fe9fffff : sky2
	fea00000-feafffff : PCI Bus 0000:04
	  feaffc00-feafffff : 0000:04:00.0
	    feaffc00-feafffff : ahci
	feb00000-febfffff : PCI Bus 0000:05
	  febff000-febfffff : 0000:05:08.0
	    febff000-febff7ff ...
From: Yinghai Lu
Date: Saturday, August 30, 2008 - 1:14 pm

On Sat, Aug 30, 2008 at 12:31 PM, Linus Torvalds

wonder:
in the old kernel, after the BAR3 request failed, pci_assign_unassigned_resources
should have updated the resource for that... but it could find that big
space for it.

that is interesting...

YH
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 1:38 pm

please check

[PATCH] x86: split e820 reserved entries record to late v4
[PATCH] x86: split e820 reserved entries record to late v4 - fix v6

YH
--

From: Rafael J. Wysocki
Date: Saturday, August 30, 2008 - 1:46 pm

What kernel should I apply those to and in what order?

Rafael
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 2:12 pm

linus git tree
1. [PATCH] x86: split e820 reserved entries record to late v4
2. [PATCH] x86: split e820 reserved entries record to late v4 - fix v6

tip/master
1. Resource handling: add 'insert_resource_expand_to_fit()' function
2. [PATCH] x86: split e820 reserved entries record to late v4 - fix v6

YH
From: Yinghai Lu
Date: Saturday, August 30, 2008 - 2:13 pm

actually it is almost the same as the tarball I sent you for your system...

YH
--

From: Rafael J. Wysocki
Date: Saturday, August 30, 2008 - 2:34 pm

I've just tested these two patches on top of the current Linus' tree and the
system works normally.

Thanks,
Rafael
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 2:49 pm

thanks,

David, can you test those two patches on top of linus tree?

YH
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 6:10 pm

Can you try the attached patch in addition to those two patches?

want to check if the BAR3 get new resource..., and after that what
could happen...

YH
From: Rafael J. Wysocki
Date: Sunday, August 31, 2008 - 5:27 am

Works, dmesg is at:
http://www.sisk.pl/kernel/debug/mainline/2.6.27-rc5/2.6.27-rc5-test.log

Please let me know if you want it with any more command line options.

Thanks,
Rafael
--

From: Linus Torvalds
Date: Sunday, August 31, 2008 - 10:42 am

That BAR is indeed "locked". Now that we try to reallocate it, you get 
this in the log:

	pci 0000:00:00.0: BAR 3: error updating (0x40000004 != 0xe0000004)
	pci 0000:00:00.0: BAR 3: error updating (high 0x000001 != 0x000000)

ie now the code _tried_ to update the BAR to point to 0x1_4000_0000 
instead, but the hardware refused, and it is still at 0x0_e000_0000.
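
(Illustrative sketch of how those two log lines map to one 64-bit address. It assumes the usual 64-bit memory BAR layout: the low dword carries address bits 31:4 plus four flag bits, the next dword carries bits 63:32. The function names are hypothetical.)

```c
#include <assert.h>
#include <stdint.h>

/* Low dword written for a 64-bit BAR: address bits 31:4 plus flag bits. */
static uint32_t bar_lo_dword(uint64_t addr, uint32_t flags)
{
    return (uint32_t)(addr & 0xfffffff0u) | (flags & 0xfu);
}

/* High dword written for a 64-bit BAR: address bits 63:32. */
static uint32_t bar_hi_dword(uint64_t addr)
{
    return (uint32_t)(addr >> 32);
}

/* Rebuild the address from the two dwords read back from the device. */
static uint64_t bar_address(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | (lo & ~0xfu);
}

/* Nonzero if what was read back differs from what was written. */
static int update_failed(uint64_t wanted, uint32_t flags,
                         uint32_t read_lo, uint32_t read_hi)
{
    return bar_lo_dword(wanted, flags) != read_lo ||
           bar_hi_dword(wanted) != read_hi;
}
```

Writing 0x1_4000_0000 with flags 0x4 produces 0x40000004 and 0x00000001; reading back 0xe0000004 and 0x00000000 means the BAR still decodes 0x0_e000_0000.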

So Yinghai's patch "worked", but it worked by doing nothing.

See my earlier guess about locked read-only resources a few emails back. 
IOW, I'm not at all surprised. I really do suspect that that BAR is some 
very special "this is the HT->PCIE region" BAR.

			Linus
--

From: Yinghai Lu
Date: Sunday, August 31, 2008 - 10:54 am

On Sun, Aug 31, 2008 at 10:42 AM, Linus Torvalds

so the code could allocate the 64 bit resource above 4g,...

wonder how the probe could find out that the size is 1fff_ffff..

YH
--

From: Linus Torvalds
Date: Sunday, August 31, 2008 - 11:03 am

Heh. That's how PCI sizing works: you write all ones to the register, and 
read back the result. The low bits won't change, and that indicates the 
size.

But if _none_ of the bits change, then that simply means that the size 
will be calculated to be 0xffffffff-start.

So the sizing will "work", it will just always report that the BAR covers 
everything from start to the 4G limit.
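
(Illustrative sketch of the sizing rule, run against a made-up register model rather than real hardware. `struct model_bar` and its `writable` mask are assumptions for the example: only the bits set in `writable` can be changed by software. The size is taken from the lowest address bit that reads back as one.)

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit memory BAR register. */
struct model_bar {
    uint32_t value;     /* what a config read returns */
    uint32_t writable;  /* bits that software is able to change */
};

static void model_write(struct model_bar *b, uint32_t v)
{
    b->value = (b->value & ~b->writable) | (v & b->writable);
}

/* Write all ones, read back, restore; return the last address covered. */
static uint32_t probe_end(struct model_bar *b)
{
    uint32_t orig = b->value;
    uint32_t readback, mask, size;

    model_write(b, 0xffffffffu);
    readback = b->value;
    model_write(b, orig);

    mask = readback & ~0xfu;        /* drop the low flag bits */
    assert(mask != 0);
    size = mask & (~mask + 1u);     /* lowest set address bit */
    return (orig & ~0xfu) + (size - 1u);
}
```

A normal 1MB BAR at 0xfe900000 (writable mask 0xfff00000) reads back 0xfff00000 and ends at 0xfe9fffff. A BAR stuck at 0xe0000004 with no writable bits reads back unchanged, so the lowest set address bit is 0x20000000 and the computed end is 0xffffffff, which is the 4G limit seen in this thread.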

		Linus
--

From: Yinghai Lu
Date: Sunday, August 31, 2008 - 2:03 pm

On Sun, Aug 31, 2008 at 11:03 AM, Linus Torvalds

how about


diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index cce2f4c..3b5269a 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -240,6 +240,11 @@ static int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
        pci_read_config_dword(dev, pos, &l);
        pci_write_config_dword(dev, pos, mask);
        pci_read_config_dword(dev, pos, &sz);
+
+       /* sticky and not changeable */
+       if (sz == l)
+               goto fail;
+
        pci_write_config_dword(dev, pos, l);

        /*


Rafael,

can you check the attached one to see if we still get the warning?

YH
From: Linus Torvalds
Date: Monday, September 1, 2008 - 10:53 am

No, because a resource really _can_ be at the end. It's perfectly ok to 
have something like a memory resource at 0xff000000-0xffffffff, and then 
the BAR register would always read 0xff000000 (or 0x...4 for a 64-bit 
resource).

So calling that a failure case would be wrong.
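
(Illustrative sketch of why the proposed `sz == l` test cannot tell the two cases apart. The model is an assumption for the example: `writable` is the set of address bits the device lets software change, and the function names are hypothetical.)

```c
#include <assert.h>
#include <stdint.h>

/* Value read back after writing all ones to a modelled 32-bit BAR. */
static uint32_t readback_after_ones(uint32_t current, uint32_t writable)
{
    return (current & ~writable) | writable;
}

/* The check from the proposed patch: treat "nothing changed" as failure. */
static int proposed_check_rejects(uint32_t l, uint32_t sz)
{
    return sz == l;
}
```

A perfectly valid 16MB BAR at 0xff000000 has writable bits 0xff000000, so writing all ones reads back 0xff000000, the same as before, and the check would reject it just like the locked 0xe0000004 BAR.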

		Linus
--

From: Linus Torvalds
Date: Saturday, August 30, 2008 - 3:41 pm

Exactly. So what happens is that it doesn't actually re-allocate it at 
all. Not that it is necessarily even possible - it's quite possible that 
that field is effectively locked some way and is read-only. Without 
knowing the chipset details, we can only guess.

		Linus
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 3:50 pm

On Sat, Aug 30, 2008 at 3:41 PM, Linus Torvalds

wait, THAT BAR is 64BIT capable, So kernel should assign 64bit range to it...

not touch it at this time...

except Jordan could find some clue with the DOC.

YH
--

From: Linus Torvalds
Date: Saturday, August 30, 2008 - 4:28 pm

I don't think we've ever done new allocations in 64 bits. Although looking 
for it, I have to admit that I don't see what would limit us right now. 
There used to be some paths that weren't 64-bit clean, but I think we 
fixed all of those.

		Linus
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 4:39 pm

On Sat, Aug 30, 2008 at 4:28 PM, Linus Torvalds

would be some corner case...

didn't see anything there.

calling  pcibios_assign_resources+0x0/0x90
request_resource: root: (PCI Bus 0000:01) [fe700000, fe7fffff], new:
(0000:01:00.0) [fe7c0000, fe7dffff] conflict 0
request_resource: root: (PCI Bus 0000:03) [fe900000, fe9fffff], new:
(0000:03:00.0) [fe9c0000, fe9dffff] conflict 0
pci 0000:00:02.0: PCI bridge, secondary bus 0000:01
pci 0000:00:02.0:   IO window: 0x9000-0x9fff
pci 0000:00:02.0:   MEM window: 0xfe700000-0xfe7fffff
pci 0000:00:02.0:   PREFETCH window: 0x000000d8000000-0x000000dfffffff
pci 0000:00:04.0: PCI bridge, secondary bus 0000:02
pci 0000:00:04.0:   IO window: 0xa000-0xbfff
pci 0000:00:04.0:   MEM window: 0xfe800000-0xfe8fffff
pci 0000:00:04.0:   PREFETCH window: disabled
pci 0000:00:06.0: PCI bridge, secondary bus 0000:03
pci 0000:00:06.0:   IO window: 0xc000-0xcfff
pci 0000:00:06.0:   MEM window: 0xfe900000-0xfe9fffff
pci 0000:00:06.0:   PREFETCH window: disabled
pci 0000:00:07.0: PCI bridge, secondary bus 0000:04
pci 0000:00:07.0:   IO window: 0xd000-0xefff
pci 0000:00:07.0:   MEM window: 0xfea00000-0xfeafffff
pci 0000:00:07.0:   PREFETCH window: disabled
pci 0000:00:14.4: PCI bridge, secondary bus 0000:05
pci 0000:00:14.4:   IO window: disabled
pci 0000:00:14.4:   MEM window: 0xfeb00000-0xfebfffff
pci 0000:00:14.4:   PREFETCH window: disabled
pci 0000:00:02.0: setting latency timer to 64
pci 0000:00:04.0: setting latency timer to 64
pci 0000:00:06.0: setting latency timer to 64
pci 0000:00:07.0: setting latency timer to 64

YH
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 5:27 pm

pci_assign_unassigned_resources==>pci_bus_assign_resources==>pbus_assign_resources_sorted(struct

static void pbus_assign_resources_sorted(struct pci_bus *bus)
{
        struct pci_dev *dev;
        struct resource *res;
        struct resource_list head, *list, *tmp;
        int idx;

        head.next = NULL;
        list_for_each_entry(dev, &bus->devices, bus_list) {
                u16 class = dev->class >> 8;

                /* Don't touch classless devices or host bridges or ioapics.  */
                if (class == PCI_CLASS_NOT_DEFINED ||
                    class == PCI_CLASS_BRIDGE_HOST)
                        continue;


it skips the host bridge...
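
(Illustrative sketch of that class test in isolation. The two class constants are the standard PCI values; the function names are made up, and the input is the dword at config offset 0x08, which for 00:00.0 in the earlier dump is 0x06000000.)

```c
#include <assert.h>
#include <stdint.h>

#define PCI_CLASS_NOT_DEFINED 0x0000
#define PCI_CLASS_BRIDGE_HOST 0x0600

/* The dword at 0x08 holds the revision in its low byte; the upper 24 bits
 * are base class, subclass and prog-if. */
static uint32_t class_from_config(uint32_t dword08)
{
    return dword08 >> 8;
}

/* Same test as in the quoted function: drop prog-if, compare class/subclass. */
static int skipped_by_assignment(uint32_t dev_class)
{
    uint16_t cls = (uint16_t)(dev_class >> 8);

    return cls == PCI_CLASS_NOT_DEFINED || cls == PCI_CLASS_BRIDGE_HOST;
}
```

With 0x06000000 at offset 0x08 the class works out to 0x0600, so the device is skipped; a PCI-to-PCI bridge (0x060400) or an AHCI controller (0x010601) is not.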

YH
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 5:50 pm

what's the story for not touching host bridges?

YH
--

From: Linus Torvalds
Date: Saturday, August 30, 2008 - 8:00 pm

Ahh. Exactly because of things like this. The host bridge BAR's are often 
special.

That code comes from almost four years ago, the commit message was:

  Author: Maciej W. Rozycki <macro@mips.com>
  Date:   Thu Dec 16 21:44:31 2004 -0800

    [PATCH] PCI: Don't touch BARs of host bridges
    
     BARs of host bridges often have special meaning and AFAIK are best left
    to be setup by the firmware or system-specific startup code and kept
    intact by the generic resource handler.  For example a couple of host
    bridges used for MIPS processors interpret BARs as target-mode decoders
    for accessing host memory by PCI masters (which is quite reasonable).
    For them it's desirable to keep their decoded address range overlapping
    with the host RAM for simplicity if nothing else (I can imagine running
    out of address space with lots of memory and 32-bit PCI with no DAC
    support in the participating devices).
    
     This is already the case with the i386 and ppc platform-specific PCI
    resource allocators.  Please consider the following change for the generic
    allocator.  Currently we have a pile of hacks implemented for host bridges
    to be left untouched and I'd be pleased to remove them.
    
    From: "Maciej W. Rozycki" <macro@mips.com>
    Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>

and we've had other things where host bridges are special (ie iirc, if you 
turn off PCI_COMMAND_MEM from a host bridge, it stops access to real RAM 
from the CPU for some bridges - so you must never turn those things off or 
you get a dead system).

(But at least Intel host bridges will just ignore writes to the CMD 
register, I think - you cannot turn MEM off).

			Linus
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 8:53 pm

On Sat, Aug 30, 2008 at 8:00 PM, Linus Torvalds

then
1. we should not probe them in probe.c
2. at least we should not try to request_resource for them in
pcibios_resource_survey...

just pretend that they are not existing.

YH
--

From: Linus Torvalds
Date: Saturday, August 30, 2008 - 8:58 pm

You are missing the fact that we need to know where existing resources 
are, even if we can't do anything about them!

Read my explanation from yesterday about why we need to add the e820 
resources to the resource map in the first place.

Short recap:

 - we need to populate the resource map with as much information about 
   the system as we possibly can..

 - .. because when we assign _dynamic_ resources, we need to make sure 
   that they don't clash with random system resources that we don't really 
   otherwise have a lot of visibility into.

So the resource tree is not just about resources we control, it's also 
about resources that others control(led) and we don't necessarily know a 
lot about.
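
(Illustrative sketch of the recap above, not the kernel's allocator: a first-fit search that skips every range the map knows about. `struct range`, `known` and `alloc_avoiding` are hypothetical; the three ranges are taken from the /proc/iomem output quoted earlier in this thread. `align` is assumed to be a power of two, and 0 is returned when nothing fits.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range {
    uint64_t start;
    uint64_t end;   /* inclusive */
};

/* Ranges already in the map; the e820 "reserved" entry is listed last. */
static const struct range known[] = {
    { 0xd8000000u, 0xdfffffffu },   /* PCI Bus 0000:01 */
    { 0xe0000000u, 0xefffffffu },   /* PCI MMCONFIG 0 */
    { 0xd7fe0000u, 0xd7ffffffu },   /* reserved (from e820) */
};

/* Lowest aligned start in [lo, hi] where size bytes avoid all n ranges. */
static uint64_t alloc_avoiding(const struct range *map, size_t n,
                               uint64_t lo, uint64_t hi,
                               uint64_t size, uint64_t align)
{
    uint64_t start = (lo + align - 1) & ~(align - 1);

    assert(size > 0 && align > 0 && (align & (align - 1)) == 0);
    while (start + size - 1 >= start && start + size - 1 <= hi) {
        size_t i;
        int moved = 0;

        for (i = 0; i < n; i++) {
            if (start <= map[i].end && map[i].start <= start + size - 1) {
                if (map[i].end == UINT64_MAX)
                    return 0;
                /* Step past the clashing range and re-align. */
                start = (map[i].end + align) & ~(align - 1);
                moved = 1;
                break;
            }
        }
        if (!moved)
            return start;
    }
    return 0;
}
```

Searching from 0xd7f00000 for 1MB with all three ranges known lands at 0xf0000000. Dropping the reserved entry (n = 2) returns 0xd7f00000, which sits right on top of the region the BIOS marked reserved, and that is the kind of clash the extra entries are there to prevent.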

			Linus
--

From: Linus Torvalds
Date: Saturday, August 30, 2008 - 9:12 pm

Btw, this is a problem that we seldom actually have on most desktops, 
because the BIOS will normally set up just about _all_ the resources, and 
we seldom have to worry about anything but just enumerating them (and the 
occasional buggy setup).

The problems with resource allocation mostly happen on laptops, and 
especially with cardbus controllers. Now, that's obviously going away 
(people mostly use USB for most things that Cardbus/PCMCIA was used for a 
few years ago), but it still exists and with docking stations etc it can 
actually be even worse (although that's mainly because access to docking 
stations is much more limited, I suspect).

So what used to happen _all_ the time was that cardbus worked fine on 99% 
of all machines, but then some machines would lock up when you inserted a 
card in them, or the card just wouldn't work. And the reason was that some 
stupid motherboard resource (like the ACPI sleeping registers or the LPC 
control regs) was not done as a normal BAR, so the kernel wouldn't know 
about them, and the BIOS didn't necessarily even list it because it never 
mattered with Windows (since Windows has a different algorithm for laying 
out the bus resources, and wouldn't hit the magic resource).

So this is why we populate the resources with everything we can _possibly_ 
try to find, including hardware-specific quirks (see things like 
quirk_ali7101_acpi or all the quirk_ich4_lpc_acpi things etc) for finding 
resources that aren't done by BAR's.

And the hardware quirks have generally worked pretty well. I'd love to add 
some quirk for the RD790 chipset, but I'd like to know what the rules are 
for it. I know we have some AMD contacts, I wonder if they could give docs 
(I don't personally do NDA's, but I can do "gentleman's agreements" where 
I just say I won't spread things further, as long as I can write code 
based on them. I know other kernel developers do similar things).

Jordan?

			Linus
--

From: Linus Torvalds
Date: Saturday, August 30, 2008 - 12:14 pm

Btw, looking at that bogus BAR#3 some more: I don't actually think it's 
even an MCFG resource.

I think it's literally the resource that describes the HT window for the 
host bridge. So it's literally like the "root" resource - all external 
MMIO resources that go over HT have to be in that window.

IOW, I'm starting to think that it's not even broken. It is probably 
perfectly real. It's not a "PCI bridge" in the sense that it doesn't 
bridge one PCI bus to another, but it's a host bridge, and it bridges the 
CPU memory accesses to another bus. 

The fact that the MCFG area happens to be at the start of that window is 
probably just a random detail.

Does anybody know how to find chipset docs for AMD/ATI chipsets? I find 
CPU docs, and the GPU docs, but not the 790 chipset docs anywhere (yeah, 
it looks promising with a link that says "AMD 790FX Chipset 
Specifications", but the link just takes you to some trivial overview, not 
any actual specs).

Anybody?

			Linus
--

From: Rafael J. Wysocki
Date: Saturday, August 30, 2008 - 12:29 pm

There are some at:
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_15137,00.html

Well, that's 690/SB600 only and I'm not sure how useful this is.

Thanks,
Rafael
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 12:29 pm

that is for HW guys.

YH
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 12:26 pm

On Sat, Aug 30, 2008 at 12:14 PM, Linus Torvalds


AMD CPU/NB (quad core aka fam 10h and later) has an MSR to state MMCONFIG, and
the ATI bridge BAR that has the same address as MMCONFIG does not even get a
chance to decode that.

it seems the ATI chipset doesn't have a public version of the docs... like reg
info and a BIOS/Kernel porting guide.

YH
--

From: Linus Torvalds
Date: Saturday, August 30, 2008 - 12:41 pm

Ok, so it's similar to the local APIC in that respect (and presumably IO 
APIC too, I haven't checked).

But that still just implies that the BAR probably means something else 
totally, and the fact that it happens to have the same value as the MCFG 

Yeah, I'm not finding anything either. The 690G databook that Rafael 
pointed to does mention the config registers in passing, but it's really 
just about electricals (pin setup etc). No BIOS writers guide indeed..

			Linus
--

From: Yinghai Lu
Date: Saturday, August 30, 2008 - 12:48 pm

On Sat, Aug 30, 2008 at 12:41 PM, Linus Torvalds

Those BIOS porting guides need an extra NDA...
they don't want everyone to know that there are lots of workarounds for
their silicon bugs.

YH
--

From: Rafael J. Wysocki
Date: Saturday, August 30, 2008 - 12:20 pm

Well, I thought something like this happened, but I wasn't quite sure about the
exact mechanism.  Thanks for the explanation. :-)

Rafael
--

From: Jeff Garzik
Date: Friday, August 29, 2008 - 3:34 pm

Just to be sure...  Does "helps" imply that unresolved AHCI behavior 
exists after reverting that commit?

Thanks,

	Jeff



--

From: Rafael J. Wysocki
Date: Friday, August 29, 2008 - 3:47 pm

No, after reverting this commit AHCI works normally.

Thanks,
Rafael
--

From: J.A.
Date: Sunday, August 31, 2008 - 4:27 pm

r8169 is not working on an Aspire One. It looked like it was working for some
time, but now it has begun to say:

Sep  1 01:09:35 one klogd: r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
Sep  1 01:09:35 one klogd: r8169 0000:02:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
Sep  1 01:09:35 one klogd: r8169 0000:02:00.0: cache line size of 32 is not supported
Sep  1 01:09:35 one klogd: r8169 0000:02:00.0: PCI INT A disabled
Sep  1 01:09:35 one klogd: r8169: probe of 0000:02:00.0 failed with error -22

Any ideas ? Any more info needed ?

TIA

one:/var/log# lspci -vv -s 02:00.0
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8101E PCI Express Fast Ethernet controller (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel modules: r8169


-- 
J.A. Magallon <jamagallon()ono!com>     \               Software is like sex:
                                         \         It's better when it's free
Mandriva Linux release 2009.0 (Cooker) for i586
Linux 2.6.25-jam18 (gcc 4.3.1 20080626 (GCC) #1 SMP
--
