My impression from asking questions on the linux-scsi mailing list is that the scsi upper/middle/lower layers don't use the block layer described in Documentation/block/*. For example, the scsi guys say:

http://marc.info/?l=linux-scsi&m=118633268527856&w=2

Instead of using the block layer, SCSI reinvents this particular wheel itself. There's a scsi "upper layer" that provides /dev nodes, scsi low-level drivers, and a gigantic glue layer in between called the "scsi midlayer" that's something like a networking stack, and is responsible for losing track of all your devices so that the one SATA disk hardwired into your laptop might be sda or sdc depending on whether or not you had a USB key plugged in when you booted up.

Anyway, the block layer isn't between any of these three, that I can tell.

Now that IDE disks have been rerouted through the scsi layer, SATA goes through the scsi layer, USB goes through the scsi layer, firewire goes through the scsi layer... What's left? It seems like everything but ramdisks has now been routed through the scsi layer. My laptop hasn't got a single SCSI device but it also hasn't got any block devices that don't show up as scsi.

So what's still using the block layer? How do the scsi layers and the block layer relate? I'm confused! (This is normal for me, but still...)

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.
-
That's nice. Why not take a look in drivers/block? Floppy, CCISS,
sd and sr are block drivers. In fact, the whole SCSI subsystem ...
depends on BLOCK
Just take a look at sd.c. The init code reads:
	for (i = 0; i < SD_MAJORS; i++)
		if (register_blkdev(sd_major(i), "sd") == 0)
			majors++;
Then look at struct scsi_cmnd. It has a pointer to the block request
that was passed down to it. struct scsi_device has a pointer to the
block request_queue that's associated with the device. Block is what
has elevators and io schedulers -- that work isn't duplicated by scsi.
There's work to push more of scsi's infrastructure up into the block
layer, so non-scsi block devices can take advantage of it.
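
The pointer relationships described above can be pictured with a small user-space sketch. This is not kernel code: the structs below are toy stand-ins that model only the two pointers mentioned (a command pointing at its block request, a device pointing at its request queue), and wrap_request() is a hypothetical helper invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; only the pointers discussed
 * in the message are modelled, everything else is left out. */
struct request_queue { const char *elevator_name; };
struct request       { struct request_queue *q; unsigned long sector; };
struct scsi_device   { struct request_queue *request_queue; };
struct scsi_cmnd     { struct scsi_device *device; struct request *request; };

/* Hypothetical helper: wrap a block request in a SCSI command for the
 * device whose queue the request was taken from. */
struct scsi_cmnd wrap_request(struct scsi_device *sdev, struct request *rq)
{
	struct scsi_cmnd cmd;

	assert(sdev != NULL && rq != NULL);
	/* The request must come from this device's block queue. */
	assert(rq->q == sdev->request_queue);

	cmd.device = sdev;
	cmd.request = rq;
	return cmd;
}
```

The point of the sketch is only the direction of the pointers: the command is built around a request that the block layer's queue (and its elevator) already handed down, rather than SCSI keeping a separate queueing mechanism of its own.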
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
-
That is so rude. You need to learn some manners. -
Such responses sometimes happen after provocative posts like the thread starter's. He could have asked straight away for help with fixing his boot environment instead of wrapping his question into a feigned design discussion. It appeared as if he was out for a fight rather than interested in help.
--
Stefan Richter
-=====-=-=== =-=- -===-
http://arcgraph.de/sr/
-
Provocation is often in the eye of the beholder, and basic manners

No, he couldn't have. He quite obviously didn't even know enough to understand his boot environment might be at fault, and hence

It may have appeared like that from the highly antagonistic mindset that seems so prevalent in LKML. But if one just stepped back and took a breath before answering it should have been quite obvious that he wasn't. (out for a fight, that is)

Granted, it can be difficult to comprehend the point of view of someone who does not know or understand something you yourself know or understand well. But you should at least be aware of that inability, and consequently refrain from accusing others of provocation where there may be none.

Hanlon's razor, cynical as it may sound at first, is an eminently humanistic principle.

--
Tilman Schmidt                    E-Mail: tilman@imap.cc
Bonn, Germany
This message consists of 100% recycled bits.
Unopened, best before: (see reverse side)
When a reply contains as a reply to the first paragraph "you're wrong" with no elaboration, and as a reply to the second paragraph nothing but expletives and personal insults, I tend to stop reading. It really doesn't come across as a serious reply.

Actually, I was going through Documentation/block thinking about making a 00-INDEX for it, but my earlier questions of the scsi guys left me with the impression that the block layer is _not_ used by the SCSI layer. And since every non-embedded modern storage device I'm aware of has been consumed by the SCSI layer (despite none of them actually having a discernibly closer relationship to SCSI than ATA did), I didn't know whether or not it was more appropriate to index this directory or request its deletion. So I asked.

Back when I asked the scsi guys about this, I got no direct answer. I asked "where does the block layer work into this" in the context of questions about the relationship between the scsi upper, middle, and lower layers, and I never got a reply, even though the question was quoted back at me here:

http://www.mail-archive.com/linux-scsi%40vger.kernel.org/msg09086.html

The closest I got to an answer was later in the thread:

http://www.mail-archive.com/linux-scsi%40vger.kernel.org/msg09131.html

The gist of the thread (and the documentation I was referred to) is that the scsi "upper layer" presents /dev nodes and ioctls, the scsi mid-layer is a routing layer very roughly analogous to a TCP/IP stack, and the scsi low-layer drivers interface with specific pieces of hardware. Apparently, the block layer is not between any of these, they talk directly to each other. This would seem to indicate that I/O requests made to scsi devices are never routed through a common block I/O request handling layer shared with non-SCSI block devices. I was not, however, certain of this, hence my attempt to bring the topic back up.

Oh, and sending a patch correcting Jens Axboe's address in this old ...
Ah, so it was about your documentation work. I already forgot the context of your previous inquiries. Alas the tone of them already did some damage, leading to responses like these.

...

The Linux SCSI subsystems don't consume, they provide services; nowadays not only for SCSI hardware and SCSI protocols but also for a number of subsystems whose tasks are similar enough to SCSI subsystems to make the SCSI core and upper SCSI layer useful to them too.

BTW:
| Now that IDE disks have been rerouted through the scsi layer, SATA goes
| through the scsi layer, USB goes through the scsi layer, firewire goes
| through the scsi layer...

As a side note, SBP-2 is a SCSI transport protocol, hence ieee1394/sbp2 and firewire/fw-sbp2 are Linux SCSI low-level drivers. Anything else would be just wrong and infeasible in this particular case.
--
Stefan Richter
-
Well, triggered by. (This documentation stuff makes me poke into corners of the kernel I ordinarily otherwise avoid, for various reasons. I don't currently have the luxury of saying "beats me how this bit works, not my Sorry about that. My social skills are finite, I tend to exhaust them when I do too much at once. :( This discussion has clarified for me that my objection isn't the scsi layer itself, it's the /dev/sd? namespace combining devices that would otherwise be /dev/hda, /dev/nd0, /dev/ub0 (or usb0 or some such), and /dev/sata into a My "scsi mid layer" vs "block layer" question was about whether I should read up on the block layer if the scsi mid layer didn't use it. Neil Brown just sent me a nice email (which I'll have to reread in the morning when I'm more awake) that helps there. The "ide/sata/usb/firewire->scsi" complaint didn't belong in the same email as the original question, it's a line of questioning I put on hold on linux-scsi back in August when the thread started getting a bit heated for my tastes. To clarify, I think that merging ide, sata, usb, firewire, and others into a single device namespace causes each type of device to inherit that namespace's cumulative ordering issues, which is a bad thing. I have no real attachment to the underlying scsi or block layers. I've never seriously worked on either (although I'm trying to understand both). For example, usb devices are never easy to order. IDE devices (back when they had their own namespace) were trivial to order back when /dev/hda couldn't move without use of a screwdriver. USB and IDE devices are very different in that it's not possible to plug a USB device into an IDE controller (not without one _heck_ of an adapter) and vice versa. USB devices usually live outside the computer's case, and IDE devices inside the case. They're not the same thing. Combining USB and IDE into the same /dev/sd? namespace makes enumerating the IDE devices much harder than in ...
Ah, but it could. If you had more than one IDE controller (which is even possible on laptops; the Fujitsu P7120 is one that I'm familiar with that has more than one), the initialisation order *of the

It's not something anyone particularly set out to do, it's just how it worked out. It was justified by saying "ok, this goes from a 99% solution to a 96% solution, but there's a 100% solution called uuids". I don't particularly agree with this line of argumentation, but it did hold sway.
-
Low-level networking drivers suggest a default interface name (per interface or as a template like eth%d into which the networking core inserts the lowest spare number). Userspace can rename interfaces, but nevertheless it's nice to have different default kernel names for ethernet, wlan etc.

Could low-level SCSI drivers provide similar name templates which give a hint on the transport involved? It's a bit more difficult than with networking interfaces though because
  - SCSI devices can have sd, sr, st, osst, ch, sg interfaces,
  - SCSI device files share a namespace with all other device files.

E.g.
  /dev/sd-ide-b    - second IDE HDD,
  /dev/sd-iscsi-e  - fifth iSCSI direct access device,
  /dev/sr-sata-0   - first SATA CD-ROM,
  /dev/sr-usb-0    - a USB CD-ROM,
  /dev/st-fw-0     - a FireWire tape drive,
  /dev/sda         - a device whose transport driver didn't propose a name

Of course the really interesting names will still be provided by udev-generated symlinks.
--
Stefan Richter
-
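
The "template plus lowest spare number" idea from the message above could look roughly like the following user-space sketch. It is only an illustration: pick_default_name() is a made-up function, the list of names already in use is passed in as a plain array, and the prefixes are the hypothetical ones from the examples in the message, not anything the kernel provides.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "<prefix><n>" for the lowest n in
 * [0, limit) that does not collide with any name in 'taken'.
 * Returns n and leaves the name in 'out', or returns -1 (with 'out'
 * set to the empty string) if every candidate is already in use.
 */
int pick_default_name(const char *prefix, const char *const taken[],
		      size_t ntaken, int limit, char *out, size_t outlen)
{
	int n;

	assert(prefix != NULL && out != NULL && outlen > 0);

	for (n = 0; n < limit; n++) {
		size_t i;

		snprintf(out, outlen, "%s%d", prefix, n);
		for (i = 0; i < ntaken; i++)
			if (strcmp(out, taken[i]) == 0)
				break;
		if (i == ntaken)	/* no collision: n is the lowest spare number */
			return n;
	}
	out[0] = '\0';
	return -1;
}
```

With "sr-usb-0" and "sr-usb-1" already taken, asking for the prefix "sr-usb-" yields "sr-usb-2", while a prefix nobody has used yet starts at 0. A real implementation would of course have to consult the actual device namespace rather than a caller-supplied array.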
This is a nice option, and since most of the existing userspace code is looking for /dev/sd*, /dev/sr*, etc., this should be able to work for new installs with no userspace changes. Since it would break existing installs it would need to be optional.

One other option that could be considered (and I do realize I'm bringing up flame-bait here) is that drivers that have fixed addresses could offer up a device name that includes that address. I.e. depending on the config option a device could show up as either sda, sd-scsi-a, sd-scsi-0:0:0:0, or even sd-scsi-<WWN>.

If the driver or bus doesn't have a real numbering, it wouldn't invent a fake one (which is a big problem with most of the prior suggestions that have tried to offer a numbering option), it would just offer the most specific information it has.

David Lang
-
... That's already implemented. :-) Transport drivers expose transport specific information in sysfs; udev scripts examine it and create by-id and by-path symlinks to device files of HDDs.

Not everybody agrees, but many think that it's sensible to implement only mechanism in the kernel and leave policy to userspace. My suggestion and the default network interface names already violate this principle to a degree, but it can still be implemented in a transport independent way, and userspace can continue to create whatever names the user needs.
--
Stefan Richter
-
I wouldn't try dividing those by pata v sata. You'll cause all sorts of problems in the process because of PATA-SATA and SATA-PATA bridges. -
If you use a PATA-SATA bridge (IDE drive, SATA controller), it would look to the system like a SATA drive and be addressed and enumerated as SATA.

If you use a SATA-PATA bridge (SATA drive, PATA controller), it would look to the system like a PATA drive and be addressed and enumerated as PATA.

Prior to libata the device would be /dev/hdX, with X depending on how it's cabled and whether it's set to master or slave. It wouldn't matter if that device then converts to other things, the system would still know it as an IDE drive.

This works exactly the same way that external enclosures that hold SATA or IDE drives, but have SCSI interfaces to the system, have always worked, so it's what sysadmins will expect.

David Lang
-
But you don't know where the bridge is. It might be on the drive's board, it might be an explicit enclosure, or it might be on the motherboard. Each of those scenarios is going to have a different user expectation. -
If the bridge is on the drive's board or in an enclosure, the user's
expectations are fully met.
If the bridge is on the motherboard, then the user may be surprised
unless he knows the motherboard well enough.
But this is _far_ less of an issue than
- the hda<->sda confusion,
- the confusion caused by all kernel default names put into a single
namespace.
I don't have a personal interest in PATA/SATA distinction though. I
suppose once PATA went into the SCSI namespace and then this namespace
is divided again, it's not a big issue anymore whether PATA and SATA
share an ATA namespace or are distinct, except perhaps for people with
IDE drive and eSATA slots.
--
Stefan Richter
-
And worse yet, depending on what BIOS options you set at config time, or what might happen after you upgrade the BIOS, whether the drive looks like PATA or SATA could change over time. So if you have /dev/hda hard-coded in your /etc/fstab file, you could, and probably will, lose after you change a BIOS option or take a BIOS upgrade that causes the BIOS configs to get reset and disables PATA emulation, such that your disk that had previously been /dev/hda now shows up as /dev/sda. (And this is something you will very badly *want* since your disk drive access will be **much** faster once you stop using PATA emulation.)

Yet another reason why people who desperately are trying to cling to the good old days of stable device enumerations are going to be disappointed; the *type* of the drive can change over time, even for something as simple as a laptop's primary hard drive, which seems to be some people's favorite example.

Unfortunately, people are just going to have to suck it up and get used to a much more complicated world.

- Ted
-
Sure enough; stable device enumeration is a thing of the past. This doesn't have to stop us though from providing descriptive default names for device files, just like we already provide descriptive default names for network interfaces. (Not for all, but for many.)
--
Stefan Richter
-
The only one of these that I would find unexpected would be the one on the motherboard.

Why is this any different from the external enclosures? They have always appeared as the type of device that connects them to the motherboard (and even with SCSI, there are some controllers that don't generate sdX devices).

The driver for the controller is what has historically determined what the device appears as to the system. An example of this is the 3ware driver that is a SCSI driver, but the drives attached to the card are IDE drives. Another example is the I2O drivers (which give you access to the RAID array and to the individual drives, in different namespaces). While I may disagree with some of the selections that have been made (the 3ware has always seemed odd to me for example) it's pretty simple to figure out.

But in any case, historically IDE (PATA) and SATA drives have been handled differently: IDE drives have had fixed device names based on how they are connected, SATA devices have had 'order found' device names from the SCSI heritage. Mixing the two types into one namespace requires changing one or the other. While I would love to see SATA gain hardware path dependent names I'm not holding my breath, but I hate to lose the predictable naming (even if the names change) for the IDE drives.

David Lang
-
Nope. Historically it depended on whether you had a PATA controller with SATA bridge, a SATA controller with SATA drives, a PATA controller with PATA drives or a SATA controller with PATA bridge. Often the bridges are on the card or mainboard. So some VIA systems would historically use /dev/hda for the first SATA device. Even more fun is stuff like JMicron, where the BIOS settings determined whether PATA or SATA was /dev/hda.

Alan
-
In the past enclosures supported only one kind of connector so this
assumption was fine. But nowadays an external disk may have several
connectors (like USB, FireWire and eSATA). Why should the disk's name
depend on which type of cable I managed to grab first? It is the
_same_ disk regardless of the cable type.
There is one thing however that could be improved: renaming a disk in a
udev rule should propagate the new name back to the kernel, just like
renaming an ethernet interface does. That way mapping error messages to
physical disk locations could be made much easier.
Gabor
--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
-
Yes, but even udev won't give you one and the same symlink to the disk's device file then.¹ There isn't a persistent unique target/unit property which all of these transports have in common.

The only thing that could be common in the best case is the symlink to the partition's device file, based on filesystem UUID or filesystem label.

¹) unless you write your own rule specific to this one particular enclosure
--
Stefan Richter
-
the right type for the type of cable you choose to use.

Yes, it's the same disk, but by choosing to hook it up in a different way you get different results from it (different performance, different predictability).

Again, if you want to have a udev rule that then maps these different names onto the same name, more power to you, but why do you insist on making _everyone_ work that way (or go to significant extra effort to find the info in the changing directory structure of sysfs to track down the info definitively)?

David Lang
-
I have udev here, and it generates several useful symlinks. /dev/disk/by-path/pci-0000:00:1f.1-scsi-0:0:0:0-part2 will always point to the second primary partition of the IDE master on the first IDE channel here, be there as many USB sticks as there may. (But still I'd like it if it wasn't named "scsi-0:0:0:0", because the

I don't think there was any intent to merge namespaces. It "just happened" as a byproduct of having sata/pata use the scsi subsystem.

Wilfried
--
There's always something...
-
OK, right ... could we please get a sense of decorum back on this list. Rob, if you didn't ask your alleged questions in such a pejorative manner, we'd get a lot further; and Matthew, if you didn't rise to the bait so spectacularly it wouldn't prolong these threads. Really, both of you, I have better things to do with my time than mediate behaviours that should have been educated out of you in the kindergarten sand pit. James -
I'm not attempting to be pejorative. I admit a certain amount of personal annoyance that once the SCSI layer consumes a category of device (USB, SATA, PATA), they can often _only_ be used by going through the SCSI midlayer. (This strikes me as analogous to TCP/IP claiming ethernet and PPP devices so thoroughly that you can no longer address them as eth1 or /dev/ttyS0.) This has the annoying effect of bundling together different types of devices and making device enumeration unnecessarily difficult: my laptop only has one SATA hard drive and can't gain another without a soldering iron, but that drive could move from /dev/sda to /dev/sdb if I reboot the system with a USB key plugged in. This seems like a regrettable loss of orthogonality to me. I remember back when /dev/usb0 and /dev/hda were separate devices that showed up in /dev, but these days "it's SCSI" seems to trump "it's USB", "it's ATA", or "it's SATA". (Even though none of those are actually SCSI hardware, they just send a similar packet protocol across the wire.) The fact that udev can theoretically unwind this hairball is not an excuse for conflating different categories of devices in the first place. Avoiding an unnecessary problem seems superior to trying to get udev to solve it. Note that Ubuntu 7.04 solves it by sticking a UUID on every _partition_, and then spinning up my external USB hard drive trying to find the root partition on a Conflating categories of hardware that cannot easily be enumerated (USB) with categories that can (the SATA hard drive in my laptop, of which there can be only one) strikes me as a bad thing. Putting them in a common "scsi device pool" within which they do not enumerate consistently is not something I enjoy dealing with. However, the response to my attempts to express this dissatisfaction on the SCSI list a few months ago came too close to a flamewar for me to consider continuing it productive. I'd still love to update the "2.4 scsi howto" and ...
That's because modern USB, ATAPI (what was once known as IDE), and SATA are really *all* using the SCSI command protocols at the low level, just as Ethernet and PPP interfaces really are fundamentally the same thing. You can rail against it, but that's the mark of someone who

You're showing your ignorance here. In fact in the past few years, ATA and SCSI have been converging significantly, with the ATAPI specification essentially incorporating the SCSI protocol by reference and by value --- to the point that SAS was developed by the SCSI Trade Association, and SAS is effectively a superset of SATA, so that with care you can actually mix SAS and SATA drives in the same enclosure (SAS and SATA are physically compatible on the connector level).

More to the point, with SATA, hot plugging has been designed in, so probing order is not going to be well defined, just as with USB devices. And there are already relatively common situations where the same disk can show up via multiple different interfaces. For example, if you have a modern Thinkpad with a secondary SATA hard drive in an Ultrabay, and you plug it into the Ultrabay in your T60, it will show up as a SATA drive. However, if you plug it into the Advanced dock, it shows up as a USB device. And with iSCSI not only can you encapsulate a SCSI command stream over USB, you can do so over IP as well.

In any case, regardless of how the physical SATA drive is attached to the system, you want it to show up as the same device and be mounted in the same location. That's why identifying filesystems by UUIDs or labels is so critical. This is not a new concept; we've had the capability to do this for over a decade, and I always knew it would be necessary for us to do this sooner or later --- which is why I added the UUID support to ext2

See the Thinkpad Ultrabay drive example above.

You address hosts by IP address; it doesn't matter whether you access them via a PPP interface, or a wireless interface, or a ...
Ok, I'll bite. If it's all "real" scsi, why does ioctl(SG_EMULATED_HOST)

They're the same thing? Do you mean that on a system with both, going:

  ifconfig eth1 66.92.53.140
  ifconfig ppp 192.168.0.42

Would be functionally equivalent to:

  ifconfig eth1 192.168.0.42
  ifconfig ppp 66.92.53.140

So if on one boot the addresses are assigned the first way, and upon reboot they're assigned in the second way by exactly the same set of commands... well that's not IMPORTANT, is it?

(Or is it that everyone everywhere should use dhcp for everything, and static addressing is obsolete and no longer supported? Apparently dhcp addresses should be delivered by machines with only one network interface of any type...)

This is my objection. Even when enumerating multiple devices of the same type is tricky, enumerating multiple devices of _different_ types should not be. There's a great big type indicator that is being _deliberately_ ignored, and large classes of devices (millions of laptops) where you know there's only going to be _one_ instance of a given type.

By the way, ethernet cards contain a unique MAC address. Hard drives do not seem to, or if they do it's not being consistently exposed in a way I can find. This is sad. (No, reading data from the device to determine this gets us back to the "spinning up the external USB drive to find my root partition"

Let me clarify: I'm talking about device enumeration. I've never had trouble enumerating a device that was _not_ routed through the scsi layer, largely because the systems I work with don't usually have more than one device of the same type. (There are millions of laptop and desktop devices out there where this is the common case. As I said, I may have four USB ports and the ability to plug hubs into them, but you can't add another SATA hard drive to my laptop without a soldering iron.)

However, as soon as a device _is_ routed through the scsi layer (as PATA was a few versions back), it gets ...
I hate to go completely offtopic here, but disks are so incredibly slow when compared to RAM that there is really nothing the kernel can do about this. Presumably the job will finish, given infinite time. How much swap do you have configured? You really shouldn't configure so much unless you do want the kernel to actually use it all, right? Because if we're not really conservative about OOM killing, then the user who actually really did want to use all the swap they configured gets angry when we kill their jobs without using it all. Would an oom-kill-someone-now sysrq be of help, I wonder? -
I gave it about half an hour, then it locked solid and stopped writing to the disk at all. (I gave it another 5 minutes at that point, then held down the power button.)

Two words: "Software suspend". I've actually been thinking of increasing it

I tend to lower "swappiness" and when that happens all sorts of stuff goes weird. Software suspend used to say it can't free enough memory if I put swappiness at 0 (dunno if it still does). This time the OOM killer never triggered before hard deadlock. (I think I had it around 20 or 40 or some

*shrug* It might. I was letting it run hoping it would complete itself when it locked solid. (The keyboard LEDs weren't flashing, so I don't _think_ it panicked. I was in X so I wouldn't have seen a message...)

(To be honest, I can never remember how to trigger sysrq on a laptop keyboard. Presumably X won't intercept it the way it does alt-f1 and ctrl-alt-del...)

Rob
-
Kernel doesn't know that you want to use it for suspend but not If you can work out where things are spinning/sleeping when that happens, along with sysrq+M data, then it could make for a useful bug report. Not entirely helpful, but if it is a reproducible problem for you, then you might be able to get that data from outside X. -
Couldn't you mount swap before suspend and unmount it after resume? -
sysrq works even in X, and should be pressable on today's laptop keyboards...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
About 6 weeks ago, on a 2.6.23-rc kernel, I accidentally typed "make -j", and left off the 4 before I hit the return key. About 2-3 minutes later, the box locked pretty tight. I managed to switch to a VT console before I lost total control of X (took many, many minutes to do the switch), but after many minutes, managed to get logged into the console, but I wasn't able to get a ps command to complete so I could start killing processes. (I probably should have just done a "killall make" right away, but hindsight is 20/20.)

The console was showing that the OOM killer was attempting to kill processes, but apparently not fast enough to stem the tide of all of the new processes getting generated by the make -j. (I'm guessing I tried sysrq-f (oom_kill), but no dice. Given that the oom killer was active and apparently triggering on its own, this wasn't all that surprising.

The interesting thing is I tried to do a sysrq-e (send SIGTERM to all processes except init), waited 5 minutes or so, then tried an alt-sysrq-i (send SIGKILL to all processes except init), and the system was still thrashing itself to death, even after giving it plenty of time to try to recover. I finally gave up and held down the power button.

This was on a box with 4 gigs memory (but only 3 gigs visible thanks to a cheap BIOS/chipset) and 4 gigs swap (mainly intended for suspend/resume). I chalked it up to me being stupid (I should have noticed and Ctrl-C'ed the make -j much more quickly, or if I were a sysadmin on a time-sharing system with users I didn't trust, configured RLIMIT_NPROC and/or per-user container resource limits) and the OOM killer not being aggressive enough in such a situation. But having better things to do, I didn't go whining on LKML about it, although I have to say that the kernel behavior isn't exactly ideal.

One of these days when I have time, I'll try investigating it with a few memlocked processes running at real-time priorities and Systemtap and figure out what the heck was ...
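
For readers who haven't used the RLIMIT_NPROC limit mentioned above, here is a minimal sketch of lowering it from a program (a shell would use its ulimit builtin instead). The function name cap_nproc() and the example cap of 4096 are assumptions made up for this illustration; only getrlimit()/setrlimit() and RLIMIT_NPROC are the real interface.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: lower the soft limit on the number of processes
 * the calling user may have to at most 'cap'. It never raises the limit
 * and leaves the hard limit alone. Returns 0 on success, -1 on error.
 */
int cap_nproc(rlim_t cap)
{
	struct rlimit rl;

	assert(cap > 0);	/* a cap of zero would forbid any further fork() */

	if (getrlimit(RLIMIT_NPROC, &rl) != 0)
		return -1;

	/* Only ever tighten; since the soft limit never exceeds the hard
	 * limit, the new value stays within the hard limit too. */
	if (rl.rlim_cur > cap)
		rl.rlim_cur = cap;

	return setrlimit(RLIMIT_NPROC, &rl);
}
```

Called as cap_nproc(4096) early in a login session or build wrapper, a runaway "make -j" would start getting EAGAIN from fork() once the user's process count hits the cap, instead of growing until the machine thrashes. Whether that is enough to keep a box responsive is a separate question; this only shows the mechanism.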
swapon -a; swsusp; swapoff -a? Pavel -
No. There are three basic swapping scenarios.
- Pushing unused data out of ram
- Swapping
- Thrashing

To effectively swap you need SWAP > RAM because after a little while of swapping all of your pages in RAM should be assigned a location in the page cache.

I have not heard of many people swapping and not thrashing lately. I think part of the problem is that we do random access to the swap partition which makes us seek limited. And since the number of seeks per unit time has been increasing at a linear or slower rate, if we are doing random disk I/O then the amount we can use the disk for is very limited.

I wonder if we could figure out how to push and pull 1M or bigger chunks into and out of swap?

I don't know if swap has actually worked since we vmscan stopped

I totally agree. The fact that the OOM killer started is a sign that the system was completely overwhelmed and nothing better could happen. In this case my gut feel says limiting the total number of processes would have been much more effective than anything at all to do with

Well we have SAK which should kill everything on your current VT which should include X and all of its children.

Eric
-
On some kernel versions you are correct about needing swap > ram, but on current versions you are not. The swap space gets allocated as needed, and re-used as needed (I don't know the mechanism of this, but I remember the

It has been noted by many people that linux is very slow to pull things back into ram from swap, significantly slower than simple seek limiting would seem to account for.

David Lang
-
I don't think I can recall a linux kernel that required swap > ram. However for serious swapping under linux having swap > ram was very useful and pretty much a requirement for a workload that involved

Yes. It may be the large amount of random access (my current guess) or it may be something else.

I wonder if I should build an application with a configurable data set and working set that can be used for swap testing. I don't think it would be very hard and it might help sort through some of the swap performance problems.

Eric
-
I don't follow your logic. We don't need SWAP > RAM in order to swap I don't know if there is a causal relationship there. I mean, I think it's been a long time since thrashing was ever a viable mode of operation, right? Maybe desktops just have less need for swapping now, so nobody sees it much until something goes _really_ bad. When I'm using my 256MB Pulling in 1MB pages can really easily end up compounding the thrashing problem unless you're very sure a significant amount Which is exactly what you don't want to do if you've just forkbombed yourself. I missed the fact that we now have a manual oom kill... -
The steady state of a system that is heavily and usably swapping but not thrashing is that all of the pages in RAM are in the swap cache, Right. But swapping heavily has been a viable mode of operation and the vast gap in disk random IO performance seems to have hurt significantly. To be very clear, I used to be able to run a problem at a little below full speed with the disk pegged with swap traffic, and I did this There is a bit of truth in the fact that there is less need for swapping now. At the same time however swapping simply does not It's a hard call. The I/O time for 1MB of contiguous disk data You probably have a point there. Eric -
Or, just not improved as fast as everything else is improving. There isn't too much the kernel can do about that. It just relatively changes the point at which you'd consider "swapping I can do this now. In make -jhuge tests for example, you can get a 4GB, 4 core machine to max out a disk with swapping and still have 0 idle time. Of course you can also go past that point and And if you're thrashing, then by definition you need to throw We had several bugs and things that caused swapping performance regressions vs 2.4 in earlyish 2.6. After those were fixed, we're pretty competitive with 2.4 in some basic tests I was using. I haven't run them for a fair while, so something might have broken since then, I don't know. -
Right. But you need a differential hit rate of only a few percent on that 1020 extra kb of data you swapped in versus the 1Mb of data you swapped out for this to be advantageous. With "differential hit rate" I mean the chances of getting a hit on the 1Mb of data just paged in, minus the chances of getting a hit on the 1Mb of data just paged out. With a little luck that 1Mb that is paged out didn't get used for quite a while, while there is a hint that the 1Mb you're paging in is active, as one of its sub-pages just got a hit. So... IMHO, it would be useful to implement something that pages out chunks of memory larger than a single hardware page. This would reduce the size of the memory management tables (*), as well as improve disk throughput if things DO come to paging.... This should of course be configurable. Some workloads are better off with a virtual page size of 8k, some with 128k, some with 1M. As far as I can see, the "page-cluster" parameter defines how many pages are selected for page-out at a time. This increases the page-out efficiency. Improving the page-in efficiency is also useful: It is the other half of the equation. Roger. (*) If the kernel starts working with a 1Mb virtual page size, you need a 256 times smaller mapping table between processes and memory or swap. Of course, the hardware doesn't support this (actually, it does for 1Mb virtual pages), so you'll have to create 256 page table entries for the hardware instead of just one. -- ** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 ** ** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 ** *-- BitWizard writes Linux device drivers for any device you may have! --* Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement. Does it sit on the couch all day? Is it unemployed? Please be specific! Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ -
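The "few percent" break-even claim above can be checked with a rough cost model. The sketch below is only an illustration: the seek time, the transfer rate, and the function names are assumptions that do not come from the thread, and the model ignores the cost of the matching page-out.

```python
# Illustrative assumptions only; none of these numbers are measurements.
SEEK_MS = 8.0            # assumed average seek plus rotational latency
THROUGHPUT_MB_S = 40.0   # assumed sequential transfer rate
PAGE_KB = 4              # hardware page size
CHUNK_KB = 1024          # proposed 1 MB "virtual page"


def read_ms(size_kb):
    """Time to fetch one contiguous region: one seek plus the transfer."""
    return SEEK_MS + (size_kb / 1024.0) / THROUGHPUT_MB_S * 1000.0


def pages_per_chunk():
    """Hardware pages covered by one chunk (the factor 256 in the footnote)."""
    return CHUNK_KB // PAGE_KB


def break_even_hit_rate():
    """Fraction of the extra pages that must be used for the big read to pay off."""
    extra_cost = read_ms(CHUNK_KB) - read_ms(PAGE_KB)  # price of reading the whole chunk
    saved_per_hit = read_ms(PAGE_KB)                   # each hit avoids one small random read
    extra_pages = pages_per_chunk() - 1                # the "1020 extra kb"
    return extra_cost / saved_per_hit / extra_pages


if __name__ == "__main__":
    print(pages_per_chunk())
    print(round(break_even_hit_rate() * 100, 2), "percent")
```

Under these assumed numbers the break-even point comes out at a little over one percent of the extra pages, which is in line with the "few percent" estimate; slower seeks or faster sequential transfer lower it further.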
I believe that was more or less the topic of this paper: http://kernel.org/doc/ols/2006/ols2006v2-pages-73-78.pdf Although these seem sort of tangentially related: http://kernel.org/doc/ols/2006/ols2006v1-pages-369-384.pdf http://kernel.org/doc/ols/2006/ols2006v2-pages-125-130.pdf Rob -- "One of my most productive days was throwing away 1000 lines of code." - Ken Thompson. -
Not really. They are talking about doing this for the page cache. That's where filesystem files are cached in memory. I'm talking about the memory that programs use while they are running. Roger. -
Mind if I throw in some vague and questionable numbers? :) I vaguely recall that my old 486 laptop with 16 megabytes of ram (circa 1998) used to be able to do 3 point something megabytes per second to/from disk, according to hdparm -t. (That was with DMA enabled.) This means that my old laptop, using sequential writes and not being bogged down by excessive seeking, could write its entire memory contents to disk and read it back in again in about 10 seconds total (5 write, 5 read). My current laptop has 2 gigabytes of ram, and hdparm -t /dev/sda says: /dev/sda: Timing buffered disk reads: 116 MB in 3.01 seconds = 38.54 MB/sec So that's a little over a factor of 10 speed improvement. (Although I note that I got 30 megabytes/second off of an ATA/100 adapter in 2002, so it's barely any faster than it was 5 years ago.) This means I can expect my current laptop to write out its memory in 50 seconds (2000/40), and another 50 seconds to read it back in. So 10 seconds to cycle through memory 10 years ago, vs a little under 2 minutes today, on systems at roughly the same price point. And that's limited by what the hardware is doing, assuming a _perfect_ linear read/write pattern with no seeks. Oh, and my old 486 had its RAM maxed out. This one can hold twice as much. And heavy seeking sucks more than it used to relative to sequential reads by something like a proportional amount (hence the rise of I/O elevators as a The problem is the gap is getting bigger. The 486-75 laptop mentioned above had a 25 mhz 32 bit front side bus. A quick google suggests my core 2 duo has a 667 mhz FSB and I'm guessing a 128 bit data path (two 64-bit channels). I could boot up memtest86 and get actual benchmarks, but total handwaving for a moment, 25*32=800 and 667*128=85376, and the second divided by the first is over 100 times as big. That concurs with the 16mhz->1733 mhz processor speed increase. Factor of 10 disk speed increase, factor of 100 memory ...
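The back-of-envelope arithmetic in the message above can be reproduced in a few lines. This is a sketch only: the 3.2 MB/s figure stands in for the "3 point something" quoted for the old laptop and is an assumption, and the function name is made up for illustration.

```python
def cycle_seconds(ram_mb, disk_mb_per_s):
    """Seconds to write all of RAM out sequentially and then read it back."""
    one_way = ram_mb / disk_mb_per_s
    return 2 * one_way


# Old 486 laptop: 16 MB of RAM; 3.2 MB/s is an assumed stand-in value.
old_cycle = cycle_seconds(16, 3.2)

# Current laptop: 2 GB of RAM, rounded to 2000 MB and 40 MB/s as in the message.
new_cycle = cycle_seconds(2000, 40.0)

# Memory bus comparison using the handwaved MHz * width product from the message.
bus_ratio = (667 * 128) / (25 * 32)

if __name__ == "__main__":
    print(old_cycle, new_cycle, bus_ratio)
```

With these inputs the disk takes roughly ten times longer to cycle through memory than it did a decade earlier, while the bus product grows by about a factor of a hundred, which is the widening gap the message describes.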
I'm pretty certain Intel's architecture only has a 64-bit front side bus. Well it will be interesting to see what happens with NAND flash. So far it is pricey but you can easily make it faster than today's hard drives. An interesting point. What would really impress me is actually finding a current work load that can productively swap after everything kernel side is fixed up and optimized. So far it seems like real swapping is so painful that everyone is simply avoiding it. Eric -
Funnily enough someone thought of that many years ago. They even added and documented it, then they made it adjustable. See the vm section of Documentation/filesystems/proc.txt Alan -
I presume you refer to: page-cluster ------------ page-cluster controls the number of pages which are written to swap in a single attempt. The swap I/O size. It is a logarithmic value - setting it to zero means "1 page", setting it to 1 means "2 pages", setting it to 2 means "4 pages", etc. The default value is three (eight pages at a time). There may be some small benefits in tuning this to a different value if your workload is swap-intensive. I didn't know that controlled whether the pages were contiguous (or written to contiguous locations in swap). I thought it was just how many the VM tried to free at a time. Rob -- "One of my most productive days was throwing away 1000 lines of code." - Ken Thompson. -
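The logarithmic encoding described in the quoted documentation can be spelled out as a small calculation. A minimal sketch, assuming a 4096-byte page (the documentation excerpt does not state the page size); the function names are hypothetical.

```python
def swap_io_pages(page_cluster):
    """Pages written per swap I/O attempt: 2 ** page_cluster."""
    if page_cluster < 0:
        raise ValueError("page-cluster must be non-negative")
    return 1 << page_cluster


def swap_io_bytes(page_cluster, page_size=4096):
    """Swap I/O size in bytes; page_size=4096 is an assumption, not from the docs."""
    return swap_io_pages(page_cluster) * page_size


if __name__ == "__main__":
    for value in range(4):
        print(value, swap_io_pages(value), swap_io_bytes(value))
```

Under the 4096-byte assumption, the default of three corresponds to 32 KB per attempt, and a value of eight would correspond to the 1 MB chunks discussed earlier in the thread. As the reply notes, this says nothing about whether those pages end up contiguous in swap.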
On Mon, 15 Oct 2007 23:37:44 +1000 Is already there: sysrq-f. -
Umm, not quite, from my experiences with pre-production wireless drivers, (another story, another time) fancy stuff is being done in udev to make sure that your gigabit card is always assigned to eth0. -- Julian Calaby Email: julian.calaby@gmail.com -
I remember building a 2.4 kernel, statically linking in all the drivers, and getting the ethernet devices showing up in a reliable order for years. Where does the need for fancy stuff come in? Rob -- "One of my most productive days was throwing away 1000 lines of code." - Ken Thompson. -
Because PCI devices reorder their bus numbers all the time. And we have ethernet devices hanging off of USB connections now (yes, even built-in to the machine), and we have network connections on other hot-pluggable busses (remember, PCI is hot pluggable.) So, the distros need to name network devices in a persistent way, that is why the distros now do this. If you don't like the distro doing it, complain to them, it's not a kernel issue :) thanks, greg k-h -
do PCI devices reorder their bus numbers spontaneously, or only if you I have, at least the response was to tell me how to kill this 'feature' even if they won't change it. David Lang -
The only system I've had that reordered PCI bus numbers was when I had a partitionable system and changed the partitioning. Not quite "change the hardware", but neither was it "spontaneous". It was certainly unexpected (for me). Greg probably has quite different examples. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
I would definitely be interested in hearing some of them. Greg's comment makes it sound like this is something that (with modern hardware) could happen to anyone at any time (which, if true, would be sufficient to require 'best effort' naming of devices for everything), while my experience is that if the hardware is static (i.e. you don't plug in or unplug PCI devices) the numbering of existing PCI devices and buses is static. and while I understand that consumer distros want to have everything 'best effort' named to make it easier for users, I disagree that this should force everyone to use 'best effort' when there are many situations where it's unnecessary overhead and chances for errors. David Lang -
On Mon, 15 Oct 2007 22:04:01 -0600 a very common one is booting your laptop docked (a real dock, not just a port extender) versus non-docked.... -
Changing the hardware (adding a new PCI device or removing one) are the most common times this happens. But I have seen reports of this happening when you upgrade/downgrade BIOS versions, and, in some oops-we-messed-up cases, when we changed things in the kernel. thanks, greg k-h -
BIOS upgrades qualify as changing hardware (or close to it) oops-we-messed-up cases of kernel changes don't justify 'best effort' naming, it's a regression that needs to be fixed. now the other example given of docking a laptop is closer to reasonable (and is definitely a reason to have 'best effort' naming as an option), but that's still a relatively special case, and it _is_ definitely changing the hardware David Lang -
How do you define real SCSI ? The definition of SCSI in the kernel is
"a device that accepts the SCSI command set" (more precisely "a
suitably large subset of the SCSI command set"). It looks as if your
definition of SCSI is "a device that is sold with SCSI written on the
box and that attaches to a card with SCSI written on the box"; is it
correct ?
The host is the expansion card that connects the device to the
motherboard. If it is emulated this means that it is not a native
Your objection is interesting. It is lost in the middle of e-mails which,
to the untrained eye, look like you are trying to fight everyone and
As far as I can tell the hard drives do not have serial numbers easily
readable by the kernel (I think it's only printed on the label). However
(feverishly plugging his USB key in the laptop), you can tell how a drive
is attached to the motherboard:
Laptop's SATA drive:
cognac $ readlink /sys/block/sda/device
../../devices/pci0000:00/0000:00:12.0/host0/target0:0:0/0:0:0:0
USB key:
cognac $ readlink /sys/block/sdb/device
../../devices/pci0000:00/0000:00:13.5/usb6/6-3/6-3:1.0/host4/target4:0:0/4:0:0:0
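The two sysfs paths above already carry enough information to tell the drives apart. The sketch below parses them as plain strings; the function names are hypothetical, the "usbN path component" test is an assumption based only on these two examples, and other machines may lay out sysfs differently.

```python
import re


def scsi_host_number(device_path):
    """Return the N from the hostN component of a sysfs device path, or None."""
    for part in device_path.split("/"):
        match = re.fullmatch(r"host(\d+)", part)
        if match:
            return int(match.group(1))
    return None


def is_usb_attached(device_path):
    """Guess whether the device hangs off USB by looking for a usbN component."""
    return any(re.fullmatch(r"usb\d+", part) for part in device_path.split("/"))


# The two paths quoted in the message above.
SATA_PATH = "../../devices/pci0000:00/0000:00:12.0/host0/target0:0:0/0:0:0:0"
USB_PATH = ("../../devices/pci0000:00/0000:00:13.5/usb6/6-3/6-3:1.0/"
            "host4/target4:0:0/4:0:0:0")

if __name__ == "__main__":
    for path in (SATA_PATH, USB_PATH):
        print(scsi_host_number(path), is_usb_attached(path))
```

On a live system the path could come from `os.readlink("/sys/block/sda/device")`, which is the same lookup the readlink commands above perform.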
By the way, did you look in /dev/disk/by-id (udev magic) ? It's probably
not very difficult to reconfigure udevd to not read the UUIDs of the
partitions and not spin up your holy external disk at each reboot. I think
the one that is spinning up your holy external hard drive is udevd. By
the way, how many times do you reboot instead of resuming from
suspend-to-disk ? Have you given a try to TuxOnIce ?
If you had asked your first question in a way similar to this one:
"I have my laptop hard drive that shows as different devices depending
whether there are USB drives plugged in or not, what should I do ?
Shouldn't SATA/USB drives/PATA/iSCSI drives be enumerated in different
queues ?"
You would probably have received more interesting answers and less
Indeed. Propose a solution. ...This is where I hit my ad hominem attack quota and stopped reading. Rob -- "One of my most productive days was throwing away 1000 lines of code." - Ken Thompson. -
SG_EMULATED_HOST was added before Linux 2.4, at least six or seven years ago. Back then the migration of ATA devices through the various versions of the ATAPI specification and then into SATA was very early in its evolution, and back then, yes there were people who considered anything that didn't use the honking huge parallel SCSI cables not "real" SCSI. Over time, that distinction at both the physical connector level and logical level has declined to the point of almost non-existence. It's not quite at the point where SAS exists only to justify massive price differences between commodity and "data-center grade" disks to the benefit of hard drive manufacturers, but it's darned close. (There are differences such as voltage levels so that the max cable differences for SAS are larger, etc., but those could have been optional additions to the SATA spec, and allegedly SAS drives are supposedly manufactured to be more robust --- although some recent papers published at FAST have raised some interesting questions No, of course not. But we don't have separate IP stacks for ethernet and ppp devices. And how we connect to a host via ssh makes no difference whether we accessed it via Ethernet or PPP. And I would argue that how we address a filesystem should also make no difference You can pull a Model and Serial number via hdparm -i, but it's not as easy to manipulate as a fixed-length MAC address. That's why people That may be true for laptops today, but Linux doesn't run just on servers. You can easily get home servers with hot-swap SATA bays. My home fileserver, which is a white box I purchased on my own nickel, NOT IBM big iron, has 3TB of raw storage for less than $10,000 a year ago. Today, that amount of home storage with hot-swap SATA drives and a battery-backed hardware RAID controller could probably be purchased for about half that price. And even for laptops, if you need the performance, you can get Cardbus cards that will allow you to connect eSATA drives to ...
ATA8 at the moment looks set to add a true "MAC" or "WWN" type identifier to each device.. Right now model/serial is not always unique. -
True, but most manufacturers try to make the serial number unique for their own reasons (like warranty service), and you can have manufacturing errors with MAC assignment just as easily as you can with serial numbers. I still remember when SGI shipped MIT 20 SGI Indy pizza boxes that all had the same MAC addresses (that we knew about --- we found out because all 20 were installed on the same subnet). That was a mildly entertaining bug to track down.... especially since IIRC, Irix at the time didn't print warning messages when someone else with a different IP address responded to your MAC address. - Ted -
WWN was added in ATA-7, AFAICS. However, I've seen quite a few ATA-7 devices that do not bother to fill it in. I wonder if ATA-8 device firmwares will act with similar slackness. :) Jeff -
SG_EMULATED_HOST was present when I started maintaining the sg driver in 1997. Back then some folks (one German name comes to mind) toyed with the idea of sending SCSI Parallel Interface (SPI) messages through a pass through interface. SPI messages are obviously transport specific and hence any app trying to send them needed to ascertain what the transport was. There were really only two to choose from at the time (in linux): SPI and the ATA Packet Interface (ATAPI). If SG_EMULATED_HOST was ever used I'm not sure. It is just an historical remnant now. On the contrary, the distinction between the logical (command) level and the transport level (down to the physical/connector level) is pivotal. There is one industry accepted storage architecture (SAM (yes, ATA documents defer to it)), two command sets: ATA and SCSI (and ways to tunnel one within the other and translate between the two) and about 10 transports (interconnects) that I can think of. Comparisons between PATA and SCSI (SPI) are now history. More precise terminology is now required. For example the "ATAPI specification" IMO is a handful of ATA commands designed to convey a packet based protocol (which the rest of the ATA command set is not). So ATAPI could be used to send IP over ATA! Is that what you meant? You should read more about SAS. Anyway Seagate have announced a ES.2 family of 3.5" disks that rotate at 7200 rpm. One would not normally expect disks below 10000 rpm to come with a SCSI transport (FCP, SAS or SPI) but the ES.2 series breaks the pattern since it comes with either a SATA or a SAS interface. What will be really interesting is how Seagate will price the two versions. Apart from the SAS variant having dual ports it is pretty close to an apples versus apples comparison. A port selector could be added to the SATA variant to provide dual port functionality. However the SCSI command set offers persistent reservations which are beyond the scope of ATA command sets which assume a ...
I think a close analogy would be that after a partition is mounted you don't need to know the path to the hard drive, and that is already true today. when you mount a drive (or assign an IP address to a network I also have a 3TB raid I built at home, it uses 3ware cards and a dozen 300G IDE drives. since the 3ware driver is classified as SCSI if a drive fails all the other drives get renumbered on the next boot and it's painful to figure out which drive has a problem. I have to reboot and go into the 3ware BIOS to figure out which drive isn't reporting. This system also has an adaptec raid card in it and an adaptec regular SCSI card. The fact that these three cards take different drivers, and so the order of detection changes the drive numbering is a real pain when I'm installing a new distro onto it. once I get it installed I compile my own monolithic kernel and this problem stops because the kernel linking order determines the detection order. this replaced a 1.2TB raid that I just about filled up, and then started having drive failures due to age on. It used 8 160G IDE drives, and when I had problems with a drive it was easy to see that /dev/hdk was missing from the set, and I was still able to have a removable drive bay for /dev/hdc that I could hook my tivo drive into (on a reboot for safety) and not have things go haywire if I left the bay empty (or switched off) when I booted. this may not be hundreds of drives, but it should be enough to show that I have experienced the pain that some people claim is the reason all of this must be dynamic with a userspace helper to sort it all out. My take is that adding the userspace helper and not enumerating things that are easy but these are separate SATA buses, while you could run into ordering issues if you hook multiple devices to one bus, you should be able to have no ordering issues if you don't have more than one device of a type on any one bus (you could have a SATA hard drive on the internal ...
I've gotten burned by that heuristic enough times to not rely on it. My last laptop had an ethernet on the motherboard, a *separate* ethernet in the docking station, an ethernet on a multifunction pcmcia card (I usually just used the modem side), and a wireless that looked like an ethernet - so it was possible for a given interface to be eth1 (if no dock and no pcmcia card) or eth3 (if both were present). And that's on a laptop from almost 5 years ago. And then there's the recent Sun and Dell 1U rack-mounts that have 4 ethernets on the motherboard, and they *never* seem to assign in a 0,1,2,3 order that matches the 0 1 2 3 printed above the 4 RJ45's ;) So I have for years been a proponent of 'ethN is nailed by MAC address' :)
on the other hand, I have two systems in my lab with identical hardware, loaded with the same OS image, but one calls the interfaces eth0, eth1, eth2 while the other calls them eth12, eth13, eth14 because it had three quad cards installed in it for a few days several months ago. also think what happens to a system if you replace a failed NIC with a card identical except for the MAC addresses. instead of everything just working as before, you now have new ethX devices and are missing the old ethX devices. both ways of doing things can yield nonsense results in cases where the other one gives perfectly useable results. nobody is arguing that the ability to nail things down by MAC address (or drives by UUID) should be removed, we're just arguing that the option to get useable consistent names from hardware that is consistent is being removed and that it shouldn't be, it has its place just like the 'best effort' naming. David Lang -
If you hate USB storage devices using scsi, please use the ub driver, When did usb-storage devices ever show up as /dev/usb0? USB flash disks are really SCSI devices, look at the USB storage spec for proof of that. thanks, greg k-h -
The ub driver is a really dumb piece of shit. It only drives usb storage devices using a scsi protocol set, and duplicates the scsi stack in a very suboptimal way. -
For the embedded space, the ability to configure out the scsi layer is interesting from a size perspective. I bookmarked that a while back, but had forgotten about it. Thanks for the reminder. For the desktop I don't object to the scsi layer. I object to the naming. Merging a half-dozen different types of devices into a single name space, and then warning us that the order they appear within that namespace could be the result of race conditions... Seems like an artificially inflated problem to me. Don't merge them together and each namespace is a smaller problem, often with only a single device or with a stable relationship between the devices. (That said, the answer to my original question, "is the block layer still in use" seems to be yes, so creating a 00-INDEX for Documentation/block is a good thing, and I'll go do that. I acknowledge that I asked this question Um, possibly I _was_ playing with the ub driver and got a /dev/ub0. (I vaguely recall playing with it back around... February? When did it wander across Pavel's blog... I don't actually remember if I got it to work or not.) Possibly this is from playing with a usb scanner back around 2004. (I just dragged out my other USB device from that period, an ethernet dongle, but it doesn't create /dev anything. Just shows up as usb2. :) The point I was trying to make is that it seems to me like it would be possible to keep the namespace separate here, and thus reduce the enumeration problems to the point where common cases (like my laptop) aren't impacted by them during early boot. I don't think anybody (outside the embedded space) is actually upset that /dev/hda now goes through the scsi layer: they're Thank you, Rob -- "One of my most productive days was throwing away 1000 lines of code." - Ken Thompson. -
They *are* SCSI devices. USB storage is a SCSI over USB transport. ATAPI is a SCSI over ATA transport. SAS is much the same thing, as is FC, and it continues. With the exception of ATA disk for historical reasons SCSI essentially For the embedded CF using world we could do with a truly dumb ATA only CF driver, possibly even with pure polled support that used neither the IDE or the ATA layer. Alan -
On Mon, 15 Oct 2007 03:36:15 -0500 that's a choice Ubuntu made in their udev scripts... if you don't like it, complain to them. I'm surprised you would even need to care about what device name things are though.... with mount-by-label (deployed for a bunch of years now in most distros), and various helpful links like /dev/cdrom .... anyway.. if you don't like your distro's udev configuration, lkml is the wrong forum. -
Keeping the naming as hda while changing the semantics (such as the reduced number of partitions) would have been differently confusing. We did look into keeping compatibility symlinks, but decided to just transition everything to UUIDs instead. -- Matthew Garrett | mjg59@srcf.ucam.org -
Proposals on how to do this would be gladly reviewed. But again, please remember that these USB devices are really SCSI devices. Same for SATA devices. There is a reason they are using the Use mount-by-label instead, it's much saner and handles device name movement just fine (as does the UUID method that you seem to hate.) Look in /dev/disk/ for a wide range of options that you have in which to choose how to pick your block devices. Oh, and this seems like a very Ubuntu specific rant, might I suggest you contact the Ubuntu developers about this? The kernel doesn't dictate that the distro has to use these long identifiers, and there is nothing we can do about it. good luck, greg k-h -
But you still have to spin up the disc to read the label (which seems like a legitimate complaint to me). -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
/somewhat/ true I'm afraid: libata uses the SCSI layer for ATAPI devices because they are essentially bridges to SCSI devices. It uses the SCSI layer for ATA devices because the SCSI layer provided a huge amount of infrastructure that would need to have been otherwise duplicated, /then/ massaged into coordinating between <jgarzik's ATA layer> and <SCSI layer> when dealing with ATAPI. There is also a detail that was of /huge/ value when introducing a new device class: distro installers automatically work, if you use SCSI. If you use a new block device type, that behaves differently from other types and is on a different major, you have to poke the distros into action or do it yourself. IOW, it was the high Just Works(tm) value of the SCSI layer when it came to ATA (not ATAPI) devices. For the future, ATA will eventually be more independent (though the SCSI simulator will be available as an option, for compat), but the value is big enough to put that task on the back-burner. Jeff -
- move the networking core's facilities to build the default name of
an interface into lib/
- expand it to optionally use base-26 numbering (a...nn...zzz) as
alternative to decimal numbering
- let SCSI low-level drivers optionally provide a short constant
string, resembling its transport name, in the host template or
transport template
- let SCSI high-level drivers make use of the new naming functions in
lib/, providing either just "sd", "sr" etc. or "sd-$transport-" as
name prefix
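The base-26 numbering in the proposal above (a, b, ... z, aa, ab, ...) is a bijective base-26 encoding. The following is an illustrative sketch only, not the proposed kernel code: both function names are hypothetical, and the "sd-$transport-" form simply follows the wording of the list.

```python
def base26_suffix(index):
    """Map 0 -> 'a', 25 -> 'z', 26 -> 'aa', 701 -> 'zz', 702 -> 'aaa'."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = []
    n = index + 1  # bijective base 26 has no zero digit, so shift by one
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters.append(chr(ord("a") + rem))
    return "".join(reversed(letters))


def device_name(prefix, index, transport=None):
    """Build 'sda'-style names, or 'sd-usb-a'-style when a transport is given."""
    suffix = base26_suffix(index)
    if transport:
        return f"{prefix}-{transport}-{suffix}"
    return prefix + suffix


if __name__ == "__main__":
    print(device_name("sd", 0))
    print(device_name("sd", 26))
    print(device_name("sd", 1, "usb"))
```

Keeping a separate index per transport is what would give each transport its own small namespace, which is the point being argued elsewhere in the thread.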
No patch yet, and alas I'm currently short of spare time.
--
Stefan Richter
-=====-=-=== =-=- =----
http://arcgraph.de/sr/
-
I remember being told that I didn't understand the problem when I suggested using ide-scsi for everything and just hiding the transport. I get great pleasure from having been (mostly) right on that one. I still have old systems running ZIP drives as scsi... -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot -
I was just trying to use the strangeness in a large distributor's first attempt at this functionality as evidence that it's apparently not trivial to get even the common cases right under the new model, while the common cases used to be trivial to get right under the old model. (Or at least it seemed so to me.) I think I've exhausted this line of argument, though, and will stop now. Rob -- "One of my most productive days was throwing away 1000 lines of code." - Ken Thompson. -
OK, so could we get back to the original discussion? The question I think you meant to ask is "does SCSI use the block layer, and if so; how?" The answer is yes (just do an ls /sys/block on any scsi machine). The how is that it basically uses the block layer as a service library (i.e. most SCSI services are built on top of those already provided by block). The email you cited was basically from our one area of confusion: SCSI and block both provide services to decode the SG_IO ioctl. This is partly historical; block and SCSI are very much intertwined; so much so that they both tend to drive each other's development. The programme over the last few years has been to identify features in SCSI that should be more generic (and hence moved to block). SG_IO is one of these, so we end up with the situation where Block provides this as a service (and sr, st and sd make use of it) while the sg driver still doesn't use what the block layer provides but rolls its own. I think the layout of how all this works is illustrated at a reasonably high level here on slide 15: OK. But that's the bit I need you to separate from your inquiry into how SCSI actually works. You can't go on a research trip if you allow preconceived notions to spill over into it. For the record, USB and firewire are SCSI at their core, so they can never really be separated. SATA (but not SATAPI) is a separate protocol, so it can theoretically be separated later, and we are actually working on that. It's only in SCSI because there's a well defined and standardised way to place it there (called the SAT layer---SCSI to ATA Translation) and because it's a lot easier since SCSI has all the features and quite a few of the necessary ones aren't However, by design choice, we got the SCSI layer in the kernel out of the business of trying to provide a stable name space, since Richard Gooch did a brilliant job of demonstrating the insolubility of that problem. There are many ways to identify a device (UUID ...
Sorry about that. Not my intent. I was aiming more at "I'm trying to document this and I don't understand how it works at all, or why it does things this way. It seems backwards from what I would expect." Rob -- "One of my most productive days was throwing away 1000 lines of code." - Ken Thompson. -
I really didn't find Rob's email "pejorative" at all. It seems to me he was just asking for clarification, information and trying to understand how it all works and ties together. His email seemed genuine enough of a person just asking to understand how it all works. Matthew's expletive and extremely rude response really shows the general attitude of the linux-scsi people. Heck, I got a similar response just a week ago here on the list, trying to convince Garzik and his band, that storage nodes SHOULD NOT be SAS WWN generators. Should I have even tried? That's the question. Good luck everyone, Luben -
No, it doesn't. James Bottomley has been exceedingly polite and helpful, as were several other people on the linux-scsi list when I asked them about this stuff back in August. Religion, politics, and anything remotely related to hotplug appear to be topics to avoid in polite company if you want it to remain polite. (My gripes with scsi mostly have to do with device enumeration. My attempts to use sysfs also have to do with device enumeration. I've spotted a trend here.) Rob -- "One of my most productive days was throwing away 1000 lines of code." - Ken Thompson. -
I wasn't referring to him specifically. He also stepped into the WWN -
