I follow -current on an i386 at work and an amd64 at home, and rarely run into any problem which is not self-inflicted. So when I had a weird experience this weekend, I assumed it was my fault.
What happened was that after the usual sequence of [build kernel; reboot; build userland; reboot] the system complained that it could not fsck wd1j and dropped into single-user mode. wd1j is mounted on /usr/obj, and I thought that something in the last build had messed it up, so I ran "newfs wd1j" and got
newfs: /dev/rwd1j: Device not configured
"disklabel wd1" showed partitions d-i and k-p, but no j. I added the partition, ran newfs, and everything seemed fine.
This afternoon I installed the i386 snapshot downloaded this morning (dated Jun 3 19:19) on the work PC, and after reboot it was missing the /usr/obj partition (sd0g in this case).
Everything seems to be working fine on both computers, but I didn't expect the partitions to disappear. Did nobody else run into this "problem"? Or did everybody else who saw it think it was too obvious to mention to the mailing list?
Emilio
I had a similar problem on sparc64 with a snapshot from Jun 2. The system was unable to fsck some partitions and dropped to single-user mode. Here the problems were with the /usr, /var, /tmp and /home partitions. Some further (and larger) partitions weren't affected. I installed an older snapshot. Any suggestions on how to get this fixed or what to test/try? Regards, Markus
There were some validation checks added to partitions. If a bad partition is found, it will be marked "unused". The checks were a little too strict for some cases. A fix for that went in yesterday, so try a new snap. If the problem persists, please report with full disklabel output. -Otto
The problem showed up on the latest snapshot as of now, which may well have been built before the fix you mention was incorporated. The home PC running -current has not had a problem since Saturday afternoon.
The daily insecurity reports show four changes in this partition during the last couple of months. (Note that since this is on /usr/obj on a PC running -current, newfs is run just about every day.) It seems "funny" that on May 29 the fsize and bsize were changed to 0, but nothing weird happened until the day after they were changed to what appeared to be more reasonable numbers.
Anyhow, in case the information is useful, the insecurity messages and current disklabel follow:
======
sd0 diffs (-OLD +NEW)
======
--- /var/backups/disklabel.sd0.current Fri Apr 21 01:31:35 2006
+++ /var/backups/disklabel.sd0 Tue Apr 17 01:31:10 2007
@@ -26,4 +26,4 @@
 d: 1048128 3144384 4.2BSD 2048 16384 416 # Cyl 1236 - 1647
 e: 1048128 4192512 4.2BSD 2048 16384 416 # Cyl 1648 - 2059
 f: 8387568 5240640 4.2BSD 2048 16384 480 # Cyl 2060 - 5356
- g: 4139682 13628208 4.2BSD 2048 16384 480 # Cyl 5357 - 6984*
+ g: 4139682 13628208 4.2BSD 2048 16384 1 # Cyl 5357 - 6984*
======
sd0 diffs (-OLD +NEW)
======
--- /var/backups/disklabel.sd0.current Tue Apr 17 01:31:10 2007
+++ /var/backups/disklabel.sd0 Wed May 30 01:32:08 2007
@@ -26,4 +26,4 @@
 d: 1048128 3144384 4.2BSD 2048 16384 416 # Cyl 1236 - 1647
 e: 1048128 4192512 4.2BSD 2048 16384 416 # Cyl 1648 - 2059
 f: 8387568 5240640 4.2BSD 2048 16384 480 # Cyl 2060 - 5356
- g: 4139682 13628208 4.2BSD 2048 16384 1 # Cyl 5357 - 6984*
+ g: 4139682 13628208 4.2BSD 0 0 1 # Cyl 5357 - 6984*
======
sd0 diffs (-OLD +NEW)
======
--- /var/backups/disklabel.sd0.current Wed May 30 01:32:08 2007
+++ /var/backups/disklabel.sd0 Fri Jun 1 ...
The cpg change is due to making newfs "cylinder unaware". Here you are running with a new kernel, but userland is still old. newfs is run, but it is still using the old struct partition format. We have now seen some reports of disappearing partitions. On sparc and sparc64, there were actual bugs that have been fixed now. For all platforms, the suspect new consistency checking code has now been disabled until we find out what is causing the mishap, and (very) recent kernels should be back to normal. Please report with disklabel info and dmesg if things are still going wrong, preferably with fdisk (if applicable) and old disklabel information as well. -Otto
I have been thinking a bit more about the problem, and it is very likely the following scenario happened:
1. Kernel upgrade by source.
2. Reboot.
3. Kernel reads old disklabel format and converts it in-memory to the new v1 format.
4. Run a newfs using the old executable that does not know about the new disklabel format. newfs writes the block and fragment size info the old way, on a spot that is used in v1 labels to store the high 16 bits of the offset and size of a partition. The label is written with version = 1, since the in-memory copy is v1.
5. Reboot. The kernel now sees a v1 disklabel with a very high offset and/or size, the new consistency code (which is now disabled) kicks in and marks the partition as unused.
So the lesson here is: keep userland and kernel in sync, or use a snapshot to upgrade.
-Otto
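The overlap described in step 4 can be illustrated with a small Python sketch. The field layout here is an assumption made up for illustration (a 32-bit fsize field in the old format sharing its four bytes with two 16-bit "high" fields in v1, little-endian as on i386); it is not copied from the OpenBSD headers, and the names OLD_FMT, V1_FMT, write_old and read_v1 are hypothetical. The partition numbers are those of sd0g from the diffs earlier in the thread.

```python
import struct

# Assumed, simplified per-partition layouts (illustration only):
#   old format: u32 size, u32 offset, u32 fsize
#   v1 format:  u32 size, u32 offset, u16 offset_high, u16 size_high
OLD_FMT = "<III"
V1_FMT = "<IIHH"


def write_old(size, offset, fsize):
    """Bytes an old newfs would store for one partition entry."""
    return struct.pack(OLD_FMT, size, offset, fsize)


def read_v1(raw):
    """Interpret the same bytes the way a v1-aware reader would."""
    size_lo, offset_lo, offset_hi, size_hi = struct.unpack(V1_FMT, raw)
    size = (size_hi << 32) | size_lo
    offset = (offset_hi << 32) | offset_lo
    return size, offset


# sd0g: size 4139682, offset 13628208, fsize 2048
raw = write_old(size=4139682, offset=13628208, fsize=2048)
size, offset = read_v1(raw)
print("size:", size)
print("offset:", offset, "(high bits:", offset >> 32, ")")

# With fsize = 0 the shared bytes are all zero, so nothing is misread.
print(read_v1(write_old(4139682, 13628208, 0)))
```

Under these assumptions the fragment size ends up in the high half of the offset, so the partition appears to start far beyond the end of the disk. With fsize set to 0 the high halves stay 0, which would be consistent with the label causing no trouble while those fields were zero.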
I believe that's exactly what happened the first time. The catch is that kernel and userland were being built from the same cvs update, and I thought I was keeping them in sync. In this case it would probably have been better to skip the reboot between building the kernel and the userland. I'll take newfs out of my build script (back to "rm -rf /usr/obj/*") and try to remember to use newfs before rebooting with a new kernel if I want to avoid the wait. Thanks again! Emilio
It might have been better to start a whole new thread, but it seemed
logical to believe that the problems might be related. Using recent
snapshots, last night's insecurity output showed another disklabel
change:
======
sd1 diffs (-OLD +NEW)
======
--- /var/backups/disklabel.sd1.current Fri Apr 20 01:31:19 2007
+++ /var/backups/disklabel.sd1 Fri Jun 8 01:31:55 2007
@@ -1,4 +1,4 @@
-# Inside MBR partition 0: type A6 start 63 size 71681967
+disklabel: warning, DOS partition table with no valid OpenBSD partition
# /dev/rsd1c:
type: SCSI
disk: da0s1
*----------------------------------------------------------------------*
The full output of disklabel and dmesg follow, but as I was getting
ready to send it, I remembered that this same disk had problems with the
disklabel changes last October. For some reason it was shown as having
a FreeBSD disklabel. Most of the correspondence regarding it was off-list,
but involved several developers and ended with Ken Westerback suggesting
some tests before setting it to OpenBSD.
This was fdisk then:
Disk: sd1 geometry: 4462/255/63 [71682030 Sectors]
Offset: 0 Signature: 0xAA55
Starting Ending LBA Info:
#: id C H S - C H S [ start: size ]
------------------------------------------------------------------------
*0: A6 0 1 1 - 4461 254 63 [ 63: 71681967 ] OpenBSD
1: 00 0 0 0 - 0 0 0 [ 0: 0 ] unused
2: 00 0 0 0 - 0 0 0 [ 0: 0 ] unused
3: 00 0 0 0 - 0 0 0 [ 0: 0 ] unused
This is now:
Disk: sd1 geometry: 4462/255/63 [71687370 Sectors]
Offset: 0 Signature: 0xAA55
Starting Ending LBA Info:
#: id C H S - C H S [ start: size ]
------------------------------------------------------------------------
0: 00 0 0 0 - 0 0 0 [ 0: 0 ] unused
1: 00 0 0 0 - 0 0 0 [ 0: ...
This is very odd on several fronts.
First, someone has obviously been writing on the MBR for no good reason. I just tested an fdisk compiled today and noticed no oddities on my i386.
Second, the fact that you find a disklabel. Since we no longer store or look for disklabels in FreeBSD partitions, it is being read from sector 1 if I recall the code correctly. But it should not have been writing the disklabel there when there was an OpenBSD partition to store it in.
Do you know if this is exactly the same disklabel you were using before? Have you changed anything in the disklabel recently that would identify this as an artifact that just happened to be lying in sector 1 for a while?
Can you copy the MBR and send it to me? There might be a clue as to what overwrote it.
Then I would do "fdisk -i" and see what happens. This will move the OpenBSD partition to partition 3, but cover the entire disk as your original MBR did. Then see what the disklabel, which should be read from the OpenBSD partition, says.
.... Ken
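For anyone who wants to look at a saved copy of an MBR themselves, here is a minimal Python sketch that decodes the four primary partition entries from a 512-byte sector, using the standard MBR layout (table at byte 446, 16 bytes per entry, 0xAA55 signature at byte 510). The function name parse_mbr is hypothetical, and the demo sector is synthetic: it is built from the "fdisk then" values quoted above, not read from the real disk.

```python
import struct

ENTRY_FMT = "<B3xB3xII"  # flag, (CHS start), type id, (CHS end), LBA start, size


def parse_mbr(sector):
    """Return the four primary partition entries of a 512-byte MBR."""
    if len(sector) != 512:
        raise ValueError("expected exactly 512 bytes")
    if sector[510:512] != b"\x55\xaa":  # 0xAA55 stored little-endian
        raise ValueError("missing 0xAA55 signature")
    entries = []
    for i in range(4):
        raw = sector[446 + 16 * i:446 + 16 * (i + 1)]
        flag, ptype, start, size = struct.unpack(ENTRY_FMT, raw)
        entries.append({"index": i, "active": flag == 0x80,
                        "id": ptype, "start": start, "size": size})
    return entries


# Synthetic sector mimicking the old table: active A6 entry in slot 0.
demo = bytearray(512)
demo[446:462] = struct.pack(ENTRY_FMT, 0x80, 0xA6, 63, 71681967)
demo[510:512] = b"\x55\xaa"

for e in parse_mbr(bytes(demo)):
    mark = "*" if e["active"] else " "
    print("%s%d: %02X [ %10d: %10d ]" % (mark, e["index"], e["id"],
                                         e["start"], e["size"]))
```

The CHS fields are skipped here because the start and size columns in the fdisk output are enough to see whether an OpenBSD (A6) entry is present.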
Ah -- your 'c' partition does not start at 0. It's an old FreeBSD partition on your disk. That should not work; it is bunk. We are removing the code from the kernel that allows it to work, because it requires extra stupid checks all over the place to support an old 386BSD stupidity. I hope that our new disklabel command, upon re-writing that label, will repair that. Todd? That's the way to handle this, right?
It appears I have the very same issue, though with a much larger
offset. I created an OpenBSD partition on an existing partition table
towards the end of the drive.
jimmym@lappy:~> sudo fdisk wd0
Disk: wd0 geometry: 11978/255/63 [192426570 Sectors]
Offset: 0 Signature: 0xAA55
Starting Ending LBA Info:
#: id C H S - C H S [ start: size ]
------------------------------------------------------------------------
0: E8 15356 77 8 - 229721 118 4 [ 246698998: 3443776305 ] <Unknown ID>
1: 01 0 0 1 - 267349 89 4 [ 0: 0 ] DOS FAT-12
2: 00 0 0 0 - 0 0 0 [ 0: 0 ] unused
3: 3F 0 0 1 - 267349 89 4 [ 0: 0 ] <Unknown ID>
jimmym@lappy:~> sudo disklabel wd0
# /dev/rwd0c:
type: ESDI
disk: ad0s3
label:
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 11978
total sectors: 192426570
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0 # microseconds
track-to-track seek: 0 # microseconds
drivedata: 0
8 partitions:
# size offset fstype [fsize bsize cpg]
a: 208845 155557395 4.2BSD 2048 16384 13 # Cyl 9683 - 9695
b: 4192965 155766240 swap # Cyl 9696 - 9956
c: 36869175 155557395 unused 0 0 # Cyl 9683 - 11977
d: 401625 159959205 4.2BSD 2048 16384 25 # Cyl 9957 - 9981
e: 20964825 160360830 4.2BSD 2048 16384 328 # Cyl 9982 - 11286
f: 11100915 181325655 4.2BSD 2048 16384 328 # Cyl 11287 - 11977
disklabel: warning, unused partition i: size 1413615339 offset -2147417768
disklabel: warning, unused partition j: size -1900006918 offset 402701520
disklabel: warning, unused partition k: size 503365533 offset 1463353529
disklabel: warning, unused partition l: size -1407327343 offset -1382830702
disklabel: warning, ...
Other than reducing the size of the last partition a couple of months
I'll send the file attached to the next message, since I assume it would
be stripped from the mailing list.
After running fdisk -i sd1:
# fdisk sd1
Disk: sd1 geometry: 4462/255/63 [71687370 Sectors]
Offset: 0 Signature: 0xAA55
Starting Ending LBA Info:
#: id C H S - C H S [ start: size ]
------------------------------------------------------------------------
0: 00 0 0 0 - 0 0 0 [ 0: 0 ] unused
1: 00 0 0 0 - 0 0 0 [ 0: 0 ] unused
2: 00 0 0 0 - 0 0 0 [ 0: 0 ] unused
*3: A6 0 1 1 - 4461 254 63 [ 63: 71681967 ] OpenBSD
It's back as an OpenBSD disklabel, but the c partition still starts at
63 rather than 0:
# disklabel sd1
# Inside MBR partition 3: type A6 start 63 size 71681967
# /dev/rsd1c:
type: SCSI
disk: da0s1
label:
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 4462
total sectors: 71687370
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0 # microseconds
track-to-track seek: 0 # microseconds
drivedata: 0
15 partitions:
# size offset fstype [fsize bsize cpg]
c: 71681967 63 unused 0 0 # Cyl 0*- 4461
d: 2104452 63 4.2BSD 2048 16384 132 # Cyl 0*- 130
e: 8385930 2104515 4.2BSD 2048 16384 328 # Cyl 131 - 652
f: 23294250 48387780 4.2BSD 2048 16384 328 # Cyl 3012 - 4461
h: 4112640 15936480 4.2BSD 2048 16384 256 # Cyl 992 - 1247
i: 2104515 40933620 4.2BSD 2048 16384 1 # Cyl 2548 - 2678
j: 18828180 20049120 4.2BSD 2048 16384 328 # Cyl 1248 - 2419
k: 5349645 43038135 4.2BSD 2048 16384 16 # Cyl 2679 - 3011
l: 2056320 ...
Thanks for your info. After rebuilding kernel and userland the problem still exists, but now the affected partitions are /var, /home and /data.
Hmm. Unmounting /data
$ cat /etc/fstab
/dev/wd0a / ffs rw 1 1
/dev/wd0d /tmp ffs rw,nodev,nosuid 1 2
/dev/wd0e /usr ffs rw,nodev 1 2
/dev/wd0f /var ffs rw,nodev,nosuid 1 2
/dev/wd0g /home ffs rw,nodev,nosuid 1 2
/dev/wd0h /data ffs rw,nodev,nosuid 1 2
/dev/wd1d /backup ffs rw,nodev,nosuid 1 2
with a current kernel:
$ sudo disklabel wd0
# /dev/rwd0c:
type: ESDI
disk: ESDI/IDE disk
label: ST3120213A
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 16
sectors/cylinder: 1008
cylinders: 16383
total sectors: 16514064
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0 # microseconds
track-to-track seek: 0 # microseconds
drivedata: 0
16 partitions:
# size offset fstype [fsize bsize cpg]
a: 1024128 0 4.2BSD 2048 16384 16 # Cyl 0 - 1015
b: 3072384 1024128 swap # Cyl 1016 - 4063
c: 234441648 0 unused 0 0 # Cyl 0 -232580
d: 2048256 4096512 4.2BSD 2048 16384 16 # Cyl 4064 - 6095
e: 20479536 6144768 4.2BSD 2048 16384 16 # Cyl 6096 - 26412
disklabel: partition c: partition extends past end of unit
disklabel: partition e: partition extends past end of unit
older kernel:
$ sudo disklabel wd0
[...]
16 partitions:
# size offset fstype [fsize bsize cpg]
a: 1024128 0 4.2BSD 0 0 16 # Cyl 0 - 1015
b: 3072384 1024128 swap # Cyl 1016 - 4063
c: 234441648 0 unused 0 0 # Cyl 0 -232580
d: 2048256 4096512 4.2BSD 0 0 16 # Cyl 4064 - 6095
e: 20479536 6144768 4.2BSD 0 0 16 # Cyl 6096 - 26412
f: 4095504 26624304 4.2BSD 0 ...
Your disk size and c partition size do not match. Can you send a dmesg, to see what the actual size of your disk is? This is really needed to see what is going on. If possible, please leave the disk as is until we've done further diagnosis. If that is not possible, you can use the 'e' command in disklabel to set the actual size of the disk to the size (in sectors) reported in the dmesg. You might need to adjust the 'c' partition as well. -Otto
After having seen your dmesg, I see that your disk size is really 234441648 sectors. The disklabel says 16514064 though. The new consistency checks did not like that. The consistency checks have been disabled in two steps (rev 1.44 and rev 1.66 of sys/kern/subr_disk.c), so a current kernel should not trip on this anymore.
There remain two questions: how did the size end up being wrong in the disklabel, and how to repair it. To the first question I can only guess: it could be that you dd'ed an image from another disk, that you edited the size by hand, or that we are seeing the results of an (old?) bug in disklabel handling that has now surfaced because of the consistency checks. The second question I already answered: using the 'e' command in disklabel lets you set the size of the disk in the label. After that, things should be back to normal. Let us know how it goes. -Otto
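The mismatch can be checked by hand with the numbers from the disklabel output above. The following Python sketch is only an illustration of the "extends past end of unit" arithmetic, not the kernel's or disklabel's actual code; the function name past_end is hypothetical.

```python
def past_end(partitions, total_sectors):
    """Return the letters of partitions whose offset + size exceeds the unit."""
    return [name for name, (size, offset) in partitions.items()
            if offset + size > total_sectors]


# (size, offset) pairs as printed by "disklabel wd0" in the report above
parts = {
    "a": (1024128, 0),
    "b": (3072384, 1024128),
    "c": (234441648, 0),
    "d": (2048256, 4096512),
    "e": (20479536, 6144768),
}

label_total = 16514064    # "total sectors" stored in the label
dmesg_total = 234441648   # real size according to the dmesg

# The label's total is exactly cylinders * sectors/cylinder from the label.
print(16383 * 1008 == label_total)

print("against label size:", past_end(parts, label_total))
print("against dmesg size:", past_end(parts, dmesg_total))
```

Measured against the size in the label, c and e run past the end of the unit, matching the two disklabel warnings. Measured against the size from the dmesg, none of the listed partitions do.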
After updating the kernel sources and applying a patch from Theo,
everything now looks normal to me. Using the 'e' command in disklabel
wasn't necessary.
Many thanks to you guys for the great help!
Regards,
Markus
PS: sorry for the late reply. I had some mail problems recently.
