Re: What to do about the 2TB limit on HDIO_GETGEO ?

Previous thread: Re: 2.6.24.3 bug in sysfs with md. by Neil Brown on Monday, March 24, 2008 - 11:52 pm. (1 message)

Next thread: [PATCH 0/4, v11] PCI, ACPI: Physical PCI slot objects by Alex Chiang on Tuesday, March 25, 2008 - 12:13 am. (10 messages)
To: Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Greg KH <gregkh@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 12:02 am

(resending .. forgot to copy the lists originally)

We have a problem coming down the pipeline.

Practically all utilities that care about it use
ioctl(fd, HDIO_GETGEO) to determine the starting
sector offset of a hard disk partition.

SCSI, libata, IDE, USB, Firewire.. you name it.

The returned struct hd_geometry stores the start offset in an "unsigned long",
which on a 32-bit system limits partition offsets to 2TB (2^32 sectors of 512 bytes).

There will be single drives exceeding this limit within
the next 12 months or less, and we already have RAID arrays
that exceed 2TB.

So.. what's the replacement for HDIO_GETGEO on 32-bits ?

One candidate might seem to be the existing /sys/block/dev/partition/start
which I expect is already 64-bit friendly.

But this requires about 150 lines of somewhat complex C code to access,
using only the dev_t (from stat(2) on a file) as a starting point,
or less if one relies upon the udev device name matching the sysfs device name.

Is it time now for HDIO_GETGEO64 to make an appearance?
Similar to how the existing BLKGETSIZE64 is supplanting BLKGETSIZE ?

??
--

To: Mark Lord <lkml@...>
Cc: Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Greg KH <gregkh@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 11:17 am

Perhaps I've missed something, but surely geometry doesn't make sense on
a >2TB drive does it? The only reason we use it on modern disks (which
usually make it up specially for us) is that the DOS partition scheme
requires it. Once we're over 2TB, isn't it impossible to use DOS
partitions (well, OK, unless you increase the sector size, but that's
only delaying the inevitable), so we can just go with a proper disk
labelling scheme and use BLKGETSIZE64 all the time.

James

--

To: James Bottomley <James.Bottomley@...>
Cc: Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Greg KH <gregkh@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 1:45 pm

On Tue, Mar 25, 2008 at 11:17 AM, James Bottomley

I believe GUID Partition Tables (GPTs) are the answer.

I believe one of the features of GPT is the elimination of the 32-bit
sector restrictions.

http://en.wikipedia.org/wiki/GUID_Partition_Table

Windows Vista 64-bit supports GPTs on data disks and new Mac OS based
systems have been using it on internal drives for a couple years at
least.

GPTs are part of the Extensible Firmware Interface (EFI), so they
should be usable for PC bootable disks at some point. (Maybe now in
some cases?)

I'm not sure what the Linux Kernel support is for GPTs.

Greg
--
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--

To: Greg Freemyer <greg.freemyer@...>
Cc: James Bottomley <James.Bottomley@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Greg KH <gregkh@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Sunday, March 30, 2008 - 12:28 am

It has been supported since the first Itanium systems shipped. It's
the first code I wrote 7+ years before it was really needed. :-) Most
distributions have it enabled, as do userspace tools like GNU Parted.

--
Matt Domsch
Linux Technology Strategist, Dell Office of the CTO
linux.dell.com & www.dell.com/linux
--

To: Greg Freemyer <greg.freemyer@...>
Cc: James Bottomley <James.Bottomley@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Greg KH <gregkh@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 1:52 pm

It's implemented. Not sure about how well used/tested it is.

config EFI_PARTITION
	bool "EFI GUID Partition support"
	depends on PARTITION_ADVANCED
	select CRC32
	help
	  Say Y here if you would like to use hard disks under Linux which
	  were partitioned using EFI GPT.

---
~Randy
--

To: Randy Dunlap <randy.dunlap@...>
Cc: Greg Freemyer <greg.freemyer@...>, James Bottomley <James.Bottomley@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Greg KH <gregkh@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 2:09 pm

ia64 uses it exclusively ... at least on discs that you want to use from
EFI.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Randy Dunlap <randy.dunlap@...>, Greg Freemyer <greg.freemyer@...>, James Bottomley <James.Bottomley@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Greg KH <gregkh@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Wednesday, March 26, 2008 - 5:58 am

I think Intel Macs do too.

Boaz

--

To: James Bottomley <James.Bottomley@...>
Cc: Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Greg KH <gregkh@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 1:31 pm

..

I haven't thought much about problems with the virtual geometry,
because, as you say, we really don't care about it for the most part.
We use LBA values from the partition tables rather than CHS.
I suppose those are also likely to be 32-bit limited.

The "partition offset", or "starting sector" is the important
bit of information for most things. And that's currently available
from HDIO_GETGEO, and from /sys/block/XXX/XXXn/start, if sysfs is mounted.

We just need an easy way to get it, given a dev_t from stat(2).
Currently there isn't an easy way, and HDIO_GETGEO returns
only 32-bits on a 32-bit system.
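
Once the right sysfs directory is known, fetching the value itself is the small part. A minimal sketch, assuming the "start" attribute holds a decimal sector count followed by a newline; the helper names parse_sectors and read_start_sector are invented for this example. Parsing through unsigned long long keeps the full 64-bit range even in a 32-bit program.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a decimal sector count such as "4294967296\n".
 * Returns 0 on success, -1 on malformed or out-of-range input. */
static int parse_sectors(const char *text, uint64_t *out)
{
    char *end;
    unsigned long long v;

    assert(text != NULL && out != NULL);
    if (*text < '0' || *text > '9')
        return -1;               /* reject empty, signed or non-numeric input */
    errno = 0;
    v = strtoull(text, &end, 10);
    if (errno != 0 || (*end != '\0' && *end != '\n'))
        return -1;
    *out = (uint64_t)v;
    return 0;
}

/* Read a "start" attribute, e.g. /sys/block/sda/sda1/start.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
static int read_start_sector(const char *path, uint64_t *out)
{
    char buf[32];
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = fgets(buf, sizeof buf, f) != NULL;
    fclose(f);
    return ok ? parse_sectors(buf, out) : -1;
}
```

The hard part described in this thread, finding which directory to read from given only a dev_t, is not covered by this sketch.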

Cheers

--

To: Mark Lord <lkml@...>
Cc: Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Greg KH <gregkh@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 3:32 pm

But I think where this is leading is that you've been using the geometry
call, but all you really want to know is the actual partition start in
sector units, so a new BLKGETPARTSTART (or something) ioctl that was
designed to return a u64 would work for you? That sounds reasonable to
me; so not a HDIO_GETGEO64 which gets us into trouble with geometries,
but a simple ioctl that gives you exactly what you're looking for.

James

--

To: Mark Lord <lkml@...>
Cc: Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Greg KH <gregkh@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 1:13 am

Probably a better thing to have would be a way to look up block devices
in sysfs by device number.

-hpa
--

To: H. Peter Anvin <hpa@...>, Greg KH <gregkh@...>
Cc: Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 9:37 am

[Empty message]
--

To: Mark Lord <lkml@...>
Cc: Greg KH <gregkh@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 9:55 am

It shouldn't be under /sys/block... there are enough things that
scan /sys/block and assume any directory underneath it has the current
format.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Greg KH <gregkh@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 1:37 pm

..

So long as we only add things, and not remove them, then any software
that scans /sys/block/ shouldn't care, really.

But yes, it could go elsewhere, too.
Perhaps a /sys/dev/ directory, populated with symbolic links
(or hard links?) back to the /sys/block/ entries, something like this:

/sys/dev/block/8:0 -> ../../../block/sda
/sys/dev/block/8:1 -> ../../../block/sda/sda1
/sys/dev/block/8:2 -> ../../../block/sda/sda2
...

That's just a suggestion, really.
And what about character devices?

Perhaps Greg will chime in.
--

To: Mark Lord <lkml@...>
Cc: H. Peter Anvin <hpa@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 3:25 pm

I've been waiting to see if sanity will take hold of anyone here.

Come on people, adding symlinks for device major:minor numbers in sysfs
to save a few 10s of lines of userspace code? Can things get sillier?

You can add a single udev rule to probably build these in a tree in /dev
if you really need such a thing...

And what's wrong with your new ioctl recommendation?

greg k-h
--

To: Greg KH <gregkh@...>
Cc: H. Peter Anvin <hpa@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 8:34 pm

..

So have we. sysfs is a total nightmare to extract information from
under program / script control. The idea presented in this thread,
is to have it cross-index the contents with a method that actually
makes it easy to access in many common scenarios, without requiring
huge gobs of code in user space. Or in kernel space.

And it's not just a few 10s of lines of code currently,
but rather about 80-100 lines just to find the correct device subdir,
and *then* a few more 10s of lines of code to retrieve the value.

In a bulletproof fashion, that is. Sure it can be slightly smaller
if niceties such as error checking/handling are omitted.

There's no guarantee that udev is present, and even if it were present,
there's no guarantee that the names in /dev/ will match sysfs pathnames,
since udev is very configurable to do otherwise.

So lookups are by dev_t, which sysfs has no simple or even easy way
of accomplishing. O(n) at a minimum.

If we make it easier to access, then more programs will use it
rather than us having to expand our tricky binary ioctl interfaces.

Isn't that part of the idea of sysfs -- to limit the need for new ioctls ?

Cheers
--

To: Mark Lord <lkml@...>
Cc: Greg KH <gregkh@...>, H. Peter Anvin <hpa@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Thursday, March 27, 2008 - 2:51 pm

Hmm, 100 lines? What else do you need?

$ grep -l 8:3 /sys/class/block/*/dev
/sys/class/block/sdc/dev

Kay
--

To: Kay Sievers <kay.sievers@...>
Cc: Mark Lord <lkml@...>, Greg KH <gregkh@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Thursday, March 27, 2008 - 2:55 pm

That's particularly funny, because your very own example gives the wrong
result -- sdc is 8:32 not 8:3 (which is sdc3, which is also excluded by
your search.)

Not to mention the fact that it is still O(n).
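
Both points can be seen in a small sketch of the scan done in C. It assumes each entry under a class directory such as /sys/class/block has a "dev" attribute containing "MAJ:MIN" and a newline; the helper names dev_attr_matches and find_by_devnum are made up for this example. Comparing the numbers rather than the text avoids "8:3" matching "8:32", but the walk still visits every entry.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the text of a "dev" attribute (e.g. "8:32\n") names exactly
 * maj:min, 0 otherwise. Anything after the numbers other than a single
 * newline is rejected. */
static int dev_attr_matches(const char *text, unsigned maj, unsigned min)
{
    unsigned a, b;
    char tail;
    int n;

    assert(text != NULL);
    n = sscanf(text, "%u:%u%c", &a, &b, &tail);
    if (n < 2 || (n == 3 && tail != '\n'))
        return 0;
    return a == maj && b == min;
}

/* Linear scan of class_dir for the entry whose "dev" attribute equals
 * maj:min. Copies the entry name into name. Returns 0 if found, -1 if not. */
static int find_by_devnum(const char *class_dir, unsigned maj, unsigned min,
                          char *name, size_t len)
{
    DIR *d = opendir(class_dir);
    struct dirent *e;
    int found = -1;

    if (d == NULL)
        return -1;
    while (found != 0 && (e = readdir(d)) != NULL) {
        char path[512], text[32];
        FILE *f;

        if (e->d_name[0] == '.')
            continue;            /* skip ".", ".." and hidden entries */
        if (snprintf(path, sizeof path, "%s/%s/dev", class_dir, e->d_name)
                >= (int)sizeof path)
            continue;            /* path would not fit, skip it */
        f = fopen(path, "r");
        if (f == NULL)
            continue;
        if (fgets(text, sizeof text, f) != NULL &&
            dev_attr_matches(text, maj, min) &&
            strlen(e->d_name) < len) {
            strcpy(name, e->d_name);
            found = 0;
        }
        fclose(f);
    }
    closedir(d);
    return found;
}
```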

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Mark Lord <lkml@...>, Greg KH <gregkh@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Thursday, March 27, 2008 - 3:03 pm

Very true, but I guess you get the idea, and know how to add the proper

Any real numbers from a large setup, which show that we want to have a
reverse devnum map in sysfs?

Thanks,
Kay

--

To: Mark Lord <lkml@...>
Cc: Greg KH <gregkh@...>, H. Peter Anvin <hpa@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 8:54 pm

Hello,

The questions are...

1. Are we gonna push sysfs as the primary interface and not provide an
alternative interface (ioctl here) which can provide equivalent
information? There are people running their systems w/o sysfs but I
think we're getting closer to this every day.

2. Is udev an essential part of all systems? I'm not sure about this
one. Lots of small machines run w/o udev and I think udev is a bit too
high level to depend on for every system.

If both #1 and #2 are true, I agree with Mark that we need an easy way to
map from device number to matching sysfs nodes. Tools which are used
early during boot and emergency sessions need this mapping and many of
them are minimal C programs w/o many dependencies for a good reason.
Requiring each of them to implement their own way to map device node to
sysfs node is too awkward.

Probably something like /sys/class/block/MAJ:MIN or
/sys/class/devnums/bMAJ:MIN?

--
tejun
--

To: Tejun Heo <htejun@...>
Cc: Mark Lord <lkml@...>, Greg KH <gregkh@...>, H. Peter Anvin <hpa@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Thursday, March 27, 2008 - 3:29 pm

"Devices directories" are not supposed to contain duplicate entries.

There are no devices belonging to the class "devnums", so it may
confuse things which crawl these directories to get "all devices".
Current coldplug-like setups will likely add duplicate devices with
the wrong subsystem. There are also bus devices which have a dev_t, and
that will make them show up in /sys/class, which might confuse some
tools too.

I guess we will need to find some other solution than a /sys/class/
directory for that. And we must prefix the links with 'c' and 'b'
because dev_t is not unique across char and block devices.

Thanks,
Kay
--

To: Kay Sievers <kay.sievers@...>
Cc: Tejun Heo <htejun@...>, Mark Lord <lkml@...>, Greg KH <gregkh@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Thursday, March 27, 2008 - 3:38 pm

It doesn't really seem to me to belong under class at all. I would
suggest /sys/dev/char/ and /sys/dev/block/, for char and block respectively.

-hpa

--

To: H. Peter Anvin <hpa@...>
Cc: Kay Sievers <kay.sievers@...>, Tejun Heo <htejun@...>, Mark Lord <lkml@...>, Greg KH <gregkh@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Friday, April 11, 2008 - 7:25 pm

This thread fizzled out without a patch... here goes:

[ note: I'm replying via gmail, so if it has whitespace mangled the
patch please see the attachment ]

-----snip---->
sysfs: add /sys/dev/{char,block} to lookup sysfs path by major:minor

From: Dan Williams <dan.j.williams@intel.com>

Why?:
There are occasions where userspace would like to access sysfs
attributes for a device but it may not know how sysfs has named the
device or the path. For example what is the sysfs path for
/dev/disk/by-id/ata-ST3160827AS_5MT004CK? With this change a call to
stat(2) returns the major:minor then userspace can see that
/sys/dev/block/8:32 links to /sys/block/sdc.

What are the alternatives?:
1/ Add an ioctl to return the path: Doable, but sysfs is meant to reduce
the need to proliferate ioctl interfaces into the kernel, so this
seems counter productive.

2/ Use udev to create these symlinks: Also doable, but it adds a
udev dependency to utilities that might be running in a limited
environment like an initramfs.

Cc: NeilBrown <neilb@suse.de>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg KH <gregkh@suse.de>
Cc: Mark Lord <lkml@rtr.ca>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

drivers/base/core.c | 37 ++++++++++++++++++++++++++++++++++++-
1 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 24198ad..de925f8 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -27,6 +27,9 @@

int (*platform_notify)(struct device *dev) = NULL;
int (*platform_notify_remove)(struct device *dev) = NULL;
+static struct kobject *dev_kobj;
+static struct kobject *char_kobj;
+static struct kobject *block_kobj;

#ifdef CONFIG_BLOCK
static inline int device_is_not_partition(struct device *dev)
@@ -759,6 +762,11 @@ static void device_remove_class_symlinks(struct
device *dev...

To: Dan Williams <dan.j.williams@...>
Cc: H. Peter Anvin <hpa@...>, Kay Sievers <kay.sievers@...>, Tejun Heo <htejun@...>, Mark Lord <lkml@...>, Greg KH <gregkh@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, April 15, 2008 - 3:18 am

Crickets are chirping and I can't remember what the conclusion to all this
was. In fact the thread was more than ten-deep so I probably fell asleep.

I queued it up so that others cannot do the same ;)
--

To: Andrew Morton <akpm@...>
Cc: Dan Williams <dan.j.williams@...>, H. Peter Anvin <hpa@...>, Kay Sievers <kay.sievers@...>, Tejun Heo <htejun@...>, Mark Lord <lkml@...>, Greg KH <gregkh@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, April 15, 2008 - 10:20 am

The expressed preference was simply to expand the ioctl (or add a new
one that got the required information without having to go through the
old HDIO_GETGEO path to extract the value from a fictitious geometry).

Greg was a bit sceptical of the value of the above proposal ...

James

--

To: James Bottomley <James.Bottomley@...>
Cc: Andrew Morton <akpm@...>, Dan Williams <dan.j.williams@...>, Kay Sievers <kay.sievers@...>, Tejun Heo <htejun@...>, Mark Lord <lkml@...>, Greg KH <gregkh@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, April 15, 2008 - 2:16 pm

However, you have to admit that kind of defeats the whole point of
having this information in sysfs. IMNSHO, even scanning sysfs is better
than to keep adding binary ioctls.

-hpa

--

To: Andrew Morton <akpm@...>
Cc: James Bottomley <James.Bottomley@...>, Kay Sievers <kay.sievers@...>, Tejun Heo <htejun@...>, Mark Lord <lkml@...>, Greg KH <gregkh@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>, H. Peter Anvin <hpa@...>
Date: Tuesday, April 15, 2008 - 7:43 pm

[Empty message]
--

To: <dan.j.williams@...>, <James.Bottomley@...>, <akpm@...>, <axboe@...>, <gregkh@...>, <hpa@...>, <htejun@...>, <jgarzik@...>, <kay.sievers@...>, <linux-ide@...>, <linux-kernel@...>, <linux-scsi@...>, <lkml@...>, <neilb@...>, <steve@...>, <torvalds@...>
Date: Wednesday, April 16, 2008 - 4:55 pm

This is a note to let you know that I've just added the patch titled

Subject: sysfs: add /sys/dev/{char,block} to lookup sysfs path by major:minor

to my gregkh-2.6 tree. Its filename is

sysfs-add-sys-dev-char-block-to-lookup-sysfs-path-by-major-minor.patch

This tree can be found at
http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/

From dan.j.williams@intel.com Wed Apr 16 13:49:38 2008
From: Dan Williams <dan.j.williams@intel.com>
Date: Tue, 15 Apr 2008 16:43:15 -0700
Subject: sysfs: add /sys/dev/{char,block} to lookup sysfs path by major:minor
To: Andrew Morton <akpm@linux-foundation.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>, Kay Sievers <kay.sievers@vrfy.org>, Tejun Heo <htejun@gmail.com>, Mark Lord <lkml@rtr.ca>, Greg KH <gregkh@suse.de>, Jens Axboe <axboe@kernel.dk>, Jeff Garzik <jgarzik@pobox.com>, Linus Torvalds <torvalds@linux-foundation.org>, Linux Kernel <linux-kernel@vger.kernel.org>, IDE/ATA development list <linux-ide@vger.kernel.org>, linux-scsi <linux-scsi@vger.kernel.org>, "H. Peter Anvin" <hpa@zytor.com>
Message-ID: <1208302995.21877.12.camel@dwillia2-linux.ch.intel.com>

From: Dan Williams <dan.j.williams@intel.com>

Why?:
There are occasions where userspace would like to access sysfs
attributes for a device but it may not know how sysfs has named the
device or the path. For example what is the sysfs path for
/dev/disk/by-id/ata-ST3160827AS_5MT004CK? With this change a call to
stat(2) returns the major:minor then userspace can see that
/sys/dev/block/8:32 links to /sys/block/sdc.

What are the alternatives?:
1/ Add an ioctl to return the path: Doable, but sysfs is meant to reduce
the need to proliferate ioctl interfaces into the kernel, so this
seems counter productive.

2/ Use udev to create these symlinks: Also doable, but it adds a
udev dependency to utilities that might be running ...

To: Andrew Morton <akpm@...>
Cc: Dan Williams <dan.j.williams@...>, H. Peter Anvin <hpa@...>, Kay Sievers <kay.sievers@...>, Tejun Heo <htejun@...>, Mark Lord <lkml@...>, Greg KH <gregkh@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, April 15, 2008 - 9:47 am

..

Last I recall, Greg was vehemently opposed to having direct path access
by device number in sysfs, but many other people saw benefit.

Myself (the originator), I simply decided that my sysfs access code has
to work with older kernels too, so for now I'm just doing a brute force
tree search to find things in sysfs. I did get the code size down smaller
for it, but it's still a pain.

When the direct access feature goes in, I'll just change my code to try it first,
..

Good!
--

To: Tejun Heo <htejun@...>
Cc: Mark Lord <lkml@...>, H. Peter Anvin <hpa@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 11:38 pm

I think you are using either the wrong programming language, or your
sysfs walking logic is quite convoluted. Look at the udev and HAL code

Exactly, originally you suggested a new ioctl, which would be trivial to
add, and trivial to switch any program that was currently using an ioctl
to get the disk size, to use it instead.

Since when is the major:minor view of devices the "standard" one that
userspace uses? Last I looked, userspace uses symlinks and lots of
other ways of directly accessing block devices in /dev/, and does not
rely on major:minor.

And finally, I haven't seen a patch that implements this "shadow" tree,

My tiny little phone runs udev, I don't see why anyone wouldn't run it
these days, except in very limited embedded applications with no dynamic
devices. But if you are in that situation, you aren't querying the size
of any random block device either :)

And heck, this phone is a very limited embedded application, with razor
thin margins, if it can use udev, I'd be interested in hearing the
justifications for anyone who says it is too large for their systems to

Why the preoccupation with major:minor? Just because you are able to
grab it from an open file handle? Heck, why not just an ioctl to get
the path within sysfs for the device currently open? :)

thanks,

greg k-h
--

To: Greg KH <gregkh@...>
Cc: Mark Lord <lkml@...>, H. Peter Anvin <hpa@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Wednesday, March 26, 2008 - 12:24 am

Hello, Greg.

The fact that major:minor is the unique identifier of a device makes it

It's possible, all that's needed are symlinks. We do similar things all

I agree udev is affordable for most cases but it's still a major step to
require it for every system. I would hate to hear that hdparm or fdisk
doesn't work unless udev is online. These are tools which are used to

Because major:minor is the key attribute to devices?

Thanks.

--
tejun
--

To: Tejun Heo <htejun@...>
Cc: Greg KH <gregkh@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Wednesday, March 26, 2008 - 2:04 am

In particular, stat() and friends return the device number, not a
device name.

-hpa
--

To: Greg KH <gregkh@...>
Cc: Mark Lord <lkml@...>, H. Peter Anvin <hpa@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 3:34 pm

Ah, there's some sanity. :)

---
~Randy
--

To: Randy Dunlap <randy.dunlap@...>
Cc: Greg KH <gregkh@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 4:36 pm

It's not so much an issue of a few tens of lines of user space code, but
rather the fact that something that should be O(1) is currently O(n).

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Randy Dunlap <randy.dunlap@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 5:20 pm

"should"? why? Is this some new requirement that everyone needs? I've
_never_ seen anyone ask for the ability to find sysfs devices by
major:minor number in O(1) time. Is this somehow a place where such
optimization is warranted?

thanks,

greg k-h
--

To: Greg KH <gregkh@...>
Cc: Randy Dunlap <randy.dunlap@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 5:26 pm

Well, when dealing with shell scripts a O(n) very easily becomes O(n^2).
For the stuff that I, personally, do, it's not a big deal, but people
with large number of disks have serious gripes with our boot times.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Greg KH <gregkh@...>, Randy Dunlap <randy.dunlap@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Thursday, March 27, 2008 - 3:05 pm

This should be a solved problem with scsi_mod.scan=async (or equivalent
compile option). Are people still complaining about it, and if so, have
they tried this option?

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: H. Peter Anvin <hpa@...>
Cc: Randy Dunlap <randy.dunlap@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 7:00 pm

How does this have anything to do with boot times? Do you really have a
foolish shell script that iterates over every single disk in the
sysfs tree for every disk? What does it do that for?

I thought we were talking about 2TB disks here, with a proposed new
ioctl, not foolishness of boot scripts...
--

To: Greg KH <gregkh@...>
Cc: Randy Dunlap <randy.dunlap@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 7:05 pm

Any time you want to get the sysfs information for a filesystem which is

I pointed out that having a way to map device numbers to sysfs
directories would have the same effect, *and* would be usable for other
purposes. I'd rather see that than a new ioctl, and another, and another...

ioctl()s are also nasty since they're generally root-only (or rather,
device-owner only). Since the information is already in sysfs, there is
no benefit to this hiding. Otherwise one could consider a ioctl() "give
me the sysfs name of this device."

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Randy Dunlap <randy.dunlap@...>, Mark Lord <lkml@...>, Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 7:22 pm

Again, a simple udev rule will give you that today if you really want
it...

And I think 'udevinfo' can be used to retrieve this information as well.

thanks,

greg k-h
--

To: Mark Lord <lkml@...>
Cc: Jens Axboe <axboe@...>, Jeff Garzik <jgarzik@...>, Tejun Heo <htejun@...>, Greg KH <gregkh@...>, Linus Torvalds <torvalds@...>, Linux Kernel <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, linux-scsi <linux-scsi@...>
Date: Tuesday, March 25, 2008 - 12:19 am

That sounds useful.

But you're the one who has investigated this - please make a recommendation?
--
