Hi Harold,
I just also discovered this problem independently, and when I tracked it
down to stty and googled for it, I found your post. In my test case, it
seems to get stuck in stty as run from the user's .bashrc (i.e., "su
user", where the user's .bashrc has the stty command). In my case, the
arguments to stty do not seem to matter (well, I've tried "-ixany" and
"echoctl" - same results). Also, the problem is made more reliable if a
sleep is done before the stty. E.g., here's my test .bashrc:

sleep 2
stty -ixany

Note that if run from the console or a tty, having the user logged in
already seems to avoid the hang, but doing it within an xterm shows the
hang. Strange, since with my original [more complex] test case, it
seemed to require *not* running X (tty/console only).

Most recent kernels show the issue - the only one that doesn't is
2.6.25-git17. I am running Gentoo. It does happen in a recent 2.6.26
git (an rc4 git from a couple of days ago).

Doing "ps" while hung shows stty in the "T" state. "killall -9 stty"
releases it.
-Joe
P.S. Please cc my address on reply.
--
Hi Joe,
Does strace give you the same output if you attach it to the blocking
stty (strace -p $pid)?

ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted)
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---

Regards
Harri
--
Yep, almost the same. I get (repeating):
ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon echo
...}) = ? ERESTARTSYS (To be restarted)
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
-Joe
--
Guys, you should test if "kill -CONT $pid" wakes the process up.
It might be possible that some obscure bug appeared in the tty
code resulting in SIGTTOU sometimes being sent to the caller,
although that seems rather strange :-/
Willy
--
Not really. The task would get suspended if it attempted to change the
tty settings while not being session leader. This is part of the POSIX
and BSD job control. A race (either kernel or in something like
sshd/bash) would do that and could have been caused by any of the timing
changes recently.

That would also explain why I can't duplicate it, and the sleep
observation.
--
I haven't heard about this new restriction, but it begs the observation
that stty, when forked from a shell (the usual case), is never a session
leader.
--
On Mon, 02 Jun 2008 18:31:34 +0930
Sorry, I meant part of the current session. I was thinking about the
specific case of bash or the ssh->bash setup where the question would be
whether the shell was session leader.

Someone who can dup this needs to instrument it in tty_ioctl really.
Alan
--
Alan, since I can get it to happen faithfully, I can try this - any
suggestions on where to instrument?
Thanks, Joe
P.S. My stty process sits in "T" - did you say that it would be in "R"
if straced and that is correct?
--
T would be correct. I'll put together a small diff to printk useful stuff
when it happens and send it to you tonight/tomorrow.
--
[Alan, thanks for the tips on where to instrument this]
What I have verified so far is that when the problem occurs, it gets to
this point in [tty_io.c] tty_check_change():

	kill_pgrp(task_pgrp(current), SIGTTOU, 1);
	set_thread_flag(TIF_SIGPENDING);
	ret = -ERESTARTSYS;
out:
	return ret;

So the error that gets returned to set_termios() is -512.
Also, the various checks before this point (of course) did not pass
(current->signal->tty != tty, !tty->pgrp, task_pgrp(current) ==
tty->pgrp, is_ignored(SIGTTOU), is_current_pgrp_orphaned()). I have not
printed out the various values from these - let me know if this would be
helpful. I wanted to pass this info along now in case it is of help.
-Joe
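
For readers following along, the checks listed above can be written down as a small user-space model. This is only a sketch, not the kernel source: struct tty_check_model and model_tty_check_change() are names made up for illustration, process groups are plain integers rather than struct pid pointers, and the -EIO result for an orphaned group is an assumption based on the POSIX description of tcsetattr(). The value 512 matches the -512 that set_termios() saw.

```c
#include <errno.h>
#include <stdbool.h>

/* Kernel-internal errno value; the report above shows it as -512. */
#define MODEL_ERESTARTSYS 512

/* Hypothetical snapshot of the state the checks look at. */
struct tty_check_model {
    bool is_controlling_tty; /* current->signal->tty == tty */
    int tty_pgrp;            /* number held by tty->pgrp, 0 if unset */
    int task_pgrp;           /* number held by task_pgrp(current) */
    bool sigttou_ignored;    /* is_ignored(SIGTTOU) */
    bool pgrp_orphaned;      /* is_current_pgrp_orphaned() */
};

/* Returns 0 when the change is allowed, a negative value otherwise. */
int model_tty_check_change(const struct tty_check_model *s)
{
    if (!s->is_controlling_tty)
        return 0;               /* not our controlling tty: no job control */
    if (s->tty_pgrp == 0)
        return 0;               /* tty has no foreground group recorded */
    if (s->task_pgrp == s->tty_pgrp)
        return 0;               /* caller is in the foreground group */
    if (s->sigttou_ignored)
        return 0;               /* caller opted out of SIGTTOU */
    if (s->pgrp_orphaned)
        return -EIO;            /* assumption: orphaned groups get EIO */
    /* The kernel sends SIGTTOU to the caller's group at this point. */
    return -MODEL_ERESTARTSYS;
}
```

Only a caller in a background, non-orphaned group that has not ignored SIGTTOU reaches the last branch, which is the path that was observed.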
--
See what tty->pgrp is at that point when it hangs and that might identify
who is owning the tty and tty setup
--
tty = current->signal->tty = -142080000 or 0xf7880800
task->pgrp = -142405824 or 0xf7830f40
-Joe
--
task->pgrp is a struct pid - you need the value it holds
--
Yeah, I figured later that giving you the addresses was rather useless. :)
Anyway, here is more info:
tty_check_change: current->signal->tty = f7880800
tty_check_change: tty = f7880800
tty_check_change: tty->pgrp = f7b99e40
tty->pgrp->count = 5
tty->pgrp->level = 0
tty->pgrp->numbers[0].nr = 6951
tty_check_change: task_pgrp(current) = f7b99d40
task_pgrp(current)->count = 1
task_pgrp(current)->level = 0
task_pgrp(current)->numbers[0].nr = 6952
tty_check_change: kill_pgrp called; returning -ERESTARTSYS
set_termios: error return value (-512) from tty_check_change
foo 6951 0.0 0.1 2332 1096 tty1 S+ 14:18 0:00 su foo
foo 6952 0.0 0.1 2988 1464 tty1 S 14:18 0:00 bash

So, looks like the tty->pgrp's process is the "su" command itself, and
the task_pgrp(current)'s process is "bash" - the shell started by the su.
-Joe
--
If anyone has any tips for my further debugging of this, given the
above, let me know. I'd like to help resolve this.
Thanks! Joe
--
I think knowing the pgrps of the above processes (there is possibly
one more involved, stty?) would be useful; try:

$ ps -eo pid,pgrp,tpgid,user,args

..as this problem occurs because a process tries to change the
terminal settings (and subsequently gets suspended because of that)
while it's not the owner of the terminal.

This can happen if you fork something off to the background, e.g. like

$ stty 9600 &

(which should immediately give you [1]+ Stopped stty 9600),
so can you please look for anything like that in your login scripts or
shell rc files?

I don't know any other way to debug this further, sorry :-(
Thanks.
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
OK, I performed this test again (getting the su to hang), and here is
the info:

tty_check_change: current->signal->tty = f7879800
tty_check_change: tty = f7879800
tty_check_change: tty->pgrp = f78639c0
tty->pgrp->count = 5
tty->pgrp->level = 0
tty->pgrp->numbers[0].nr = 7036
tty_check_change: task_pgrp(current) = f7863f00
task_pgrp(current)->count = 1
task_pgrp(current)->level = 0
task_pgrp(current)->numbers[0].nr = 7037
tty_check_change: kill_pgrp called; returning -ERESTARTSYS
set_termios: error return value (-512) from tty_check_change

scorpius ~ # ps aux | grep 7036
foo 7036 0.0 0.1 2336 1100 tty1 S+ 19:30 0:00 su foo

scorpius ~ # ps aux | grep 7037
foo 7037 0.0 0.1 2988 1460 tty1 S 19:30 0:00 bash

scorpius ~ # ps -eo pid,pgrp,tpgid,user,args | grep 7036
6902 6902 7036 root /bin/login --
6922 6922 7036 root -bash
7036 7036 7036 foo su foo
7037 7037 7036 foo bash
7042 7037 7036 foo stty -ixany

scorpius ~ # ps -eo pid,pgrp,tpgid,user,args | grep 7037
7037 7037 7036 foo bash
7042 7037 7036 foo stty -ixany

scorpius ~ # ps aux | grep 7042
foo 7042 0.0 0.0 1608 376 tty1 T 19:30 0:00 stty -ixany

scorpius ~ # ps -eo pid,pgrp,tpgid,user,args | grep 7042
7042 7037 7036 foo stty -ixany

(I omitted, of course, when grep found itself, and I compressed some

I do use stty in my .bashrc (that's why this happens), but I do not put
it in the background.

Anyway, hope the additional info above is of help...
Thanks, Joe
--
So this clearly shows what's wrong; 7036 is the "controlling process"
group id. But only "su foo" is in this group, the bash and stty
processes have their own group, 7037.

On my own system, when I do "su", I get this:

2891 2891 2892 root su temp
2892 2892 2892 temp bash

...and here the "bash" process is in the right group, 2892, while "su"
is the one in the background!

Can you try to run strace on the su to see where things go wrong, i.e.

$ strace -f -e trace=process su foo

...and we're only interested in what happens up to the point where it
hangs. That should hopefully tell us which process is doing the wrong
thing. In either case, as Alan pointed out, this seems unlikely to be
a kernel problem.

Yeah, most likely the process that calls stty is first put in the
background itself (or never brought to the foreground?). But I don't
know why... when you get the trace, we can compare and find out where
it deviates.
Vegard
--
OK, I attached this as a text file at the end. But (*bummer*), using
strace makes it impossible to reproduce the hang (figures, and I believe
someone earlier in the thread also had this problem).

As for whether the kernel is at fault, not sure (i.e. does this hang
behavior implicate the kernel automatically or can a user-space process
cause itself such an issue?). But I *do* see different behavior
depending on the kernel version. There were a couple of git kernels in
which I could not reproduce it. Still, if it is a race or something, it
might be that the conditions were just slightly perturbed.

I attached the strace log just in case it is of help.
-Joe
--
Yeah, but doesn't it loop indefinitely calling ioctl() and getting a
Yeah, a user-space process can do this, and it's the right behaviour
for the kernel. I did post a program that would "reproduce" what
you're seeing. I do now believe that it's something timing-related, as
Alan suggested initially. (But timing-related with your scripts, that
is. I must say, that "sleep 2" does look a bit suspicious; I have no
idea what that is supposed to do :-))

I suppose it would be more useful to see a trace where you include a
few more system calls, can you try:

# strace -e trace=process,ioctl,setpgid -f su foo

instead?
Just for the record, I'm probably not the best person to debug this,
so I'm just trying to figure it out as we go. On the other hand, I
don't see better suggestions from anybody else. Thank you for
persisting, though! :-)

(And the fact that the results differ with the kernel versions does
make this relevant for LKML still.)
Vegard
--
Ah, that is something I put in there to artificially make it more
reproducible. Here's the reason: when I first encountered the problem,
it was happening if the home dir of the user was on the "btrfs"
filesystem (the new checksumming one from Oracle). This made me suspect
btrfs initially. But I reproduced the problem [more sporadically] when
the home was on ext3 as well. Since btrfs has a different performance
profile, especially when first accessed after a mount (and it is a
filesystem still under development, so some optimizations are yet to
come), I figured it might be timing-related, and sure enough, adding the
"sleep 2" proved that.

So without the sleep 2 and with a home of ext3, it rarely happens, since
it takes very little time to read the homedir files (.bashrc, etc.).
Putting in the sleep makes it almost always happen. It seems like the

Thanks for helping. Yes, this is the kind of nagging issue that really
bugs me, since it is intermittent and makes things feel unstable. If we
determine the problem is in something else (like stty or bash), then at
least I can file a bug with them.
-Joe
--
I'm not sure it is. Try adding sleep 3 instead. Because I have the
"sleep 2" when I run "su foo" as well, and I _didn't_ put it there:

[pid 6298] execve("/bin/sleep", ["sleep", "2"], [/* 47 vars */]
<unfinished ...>

Vegard
--
I have done some more investigation on this problem, and I am posting
here my results in hope that someone can point me in the right direction
for further investigation...

Summary: during the initialization of a new bash shell, the terminal
foreground process group often reverts back to that of the parent of the
bash shell (after being set *to* the bash shell pgrp by bash),
prohibiting commands like stty from being run by the init scripts. The
result is that the execution of these commands will hang until killed,
causing the bash prompt to not appear. Adding a delay in the script
(using sleep) increases the chance of this having time to happen.

For example, putting the following in a user's .bashrc:

sleep 2
stty -ixany

is a good way to reproduce this. Doing "su <user>" from root (note that
the fact that no password is required helps the timing) will then often
hang. Killing -9 stty will allow the bash prompt to appear.

I have instrumented the bash source code in an attempt to see why this
is happening, partly because I suspected a bug in bash. What I have
found is this:

1) bash calls tcsetpgrp() with the pgrp of the bash process (two times)
before starting to execute init scripts. This makes sense, since bash
needs to be the session leader. It is never called again until just
before the bash shell exits normally (at which time it returns control
to the parent).

2) During the processing of the init scripts (sometimes .bashrc, but
sometimes a system script that is processed first), calling tcgetpgrp()
shows that the pgrp has reverted back to the "su <user>" process. It
does not appear that bash reverted it in my testing so far. Running
stty while in the reverted state causes a hang, since bash is not the
session leader.

So here is the question: is there a way/reason the kernel would revert
the pgrp of the session leader after bash sets it? Is there some more
instrumenting in the kernel or in bash that might reveal what is going
on? I have hear...
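
The tcgetpgrp() check described in 2) can also be made from a small standalone helper rather than from inside bash. A minimal sketch, assuming a hypothetical function name in_foreground_pgrp() and that the terminal is passed in as a file descriptor; it is not the instrumentation that was actually added to bash:

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if the caller's process group is the foreground process
 * group of the terminal behind fd, 0 if the caller is in a background
 * group, and -1 if fd is not a terminal the caller can query.
 */
int in_foreground_pgrp(int fd)
{
    pid_t fg = tcgetpgrp(fd);

    if (fg == (pid_t)-1)
        return -1;              /* e.g. ENOTTY or EBADF */
    return fg == getpgrp() ? 1 : 0;
}
```

A helper built around in_foreground_pgrp(STDIN_FILENO) and run from .bashrc just before the stty would be expected to report 0 in the reverted state, since it inherits the shell's process group.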
In fact, in various laptops (Eeeepc, dell inspiron 1520, Dell inspiron
4000), I've got various tty screwups that have been introduced since
circa 2.6.19.

The 6 year old inspiron 4000 gets stuck at stty erase ^? . Randomly, but
most of the time.

All of my machines exhibit the ctrl-C being slower than ctrl-Z discussed
elsewhere (I've almost developed a habit of typing ctrl-Z kill %1 <RET>).
Although even ctrl-Z recently has been reluctant to always work. I wonder
if this is the cause of dpkg recently not responding to ctrl-Z's? (debian
bug #486222). dpkg does respond to kill -STOP

ctrl-s doesn't always work anymore. Again, what prompted me to write this
email, was I couldn't pause dpkg. It's particularly unreliable at
stopping scrolling messages at bootup, and if I press it at the wrong time
at bootup (not a specific place - it can be starting up any number of
scripts), something deadlocks and won't resume upon a ctrl-q.
alt-sysrq-k is enough to kill whatever has deadlocked. I have a feeling,
but don't want to test on this system right now, that pressing scroll-lock
as opposed to ctrl-q once unlocked such a stuck display.

In summary, something in tty is certainly screwed. Does anyone see a
connection between all of these?
--
Electromagnetic pulse received (core dumped)
--
I have done more investigation, and I now know the cause of the
bash/stty problem. It appears to be a race condition in bash (well,
between two different bash shells, actually). I saw a post from a while
back about something similar by Ingo Molnar, so I have copied him here too.

Here is the ps tree of the test case where stty has hung:
4704 ? S 0:00 \_ xterm
4706 pts/3 Ss 0:00 | \_ -bash
4739 pts/3 S 0:00 | \_ su
4742 pts/3 S 0:00 | \_ bash
4746 pts/3 S+ 0:00 | \_ su foo
4747 pts/3 S 0:00 | \_ bash
4752 pts/3 T 0:00 | \_ stty -ixany

What should happen is: when "su foo" (4746) is run, it spawns a bash
shell (4747) that then makes itself the session leader when it
initializes its job control. The stty command (in the child bash's
.bashrc) will then be able to work (and not hang).

However, the hang happens when the parent bash (4742) interferes by
reverting the tty session leader back to its child (the "su foo"
process: 4746) shortly after the child bash (4747) becomes the leader.
The parent does this when it calls
execute_command_internal()->stop_pipeline()->give_terminal_to(). This
seems to happen at a slightly random time, making the issue intermittent
- it depends which one wins the race.

In summary, when the bug does *not* occur, here is the approximate
sequence (note I am :

1) parent bash (4742) runs 'su foo' (4746)
2) parent bash sets tty leader to 'su' (4746)
3) child bash (4747) initializes and sets itself to be the leader
4) stty command in .bashrc runs successfully

When the bug occurs, here is the sequence:

1) parent bash (4742) runs 'su foo' (4746)
2) child bash (4747) initializes and sets itself to be the leader
3) parent bash sets tty leader *back* to 'su' (4746)
4) stty command runs and fails/hangs because its parent is not leader

The various calls to tcsetpgrp() that do this are interleaved from ...
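
The two orderings above can be replayed in a toy model to make the last-writer-wins nature of the race explicit. This is a sketch only: enum race_step, struct race_model and both function names are invented for illustration, each step stands in for one tcsetpgrp() call, and it assumes that the pgrp numbers equal the pids shown (4746 for "su foo", 4747 for the child bash) and that stty runs in the child bash's process group, as the earlier ps output suggests.

```c
#include <stddef.h>

/* The two competing tcsetpgrp() callers in the sequences above. */
enum race_step {
    PARENT_BASH_GIVES_TTY_TO_SU, /* parent bash: give_terminal_to(su) */
    CHILD_BASH_TAKES_TTY         /* child bash: job control init */
};

struct race_model {
    int su_pgrp;         /* process group of "su foo" */
    int child_bash_pgrp; /* process group of the child bash and stty */
};

/* Replays the steps in order; the last call decides the foreground pgrp. */
int replay_foreground(const struct race_model *m,
                      const enum race_step *steps, size_t n)
{
    int fg = 0; /* 0 = nothing set yet in this model */
    size_t i;

    for (i = 0; i < n; i++)
        fg = (steps[i] == PARENT_BASH_GIVES_TTY_TO_SU)
                 ? m->su_pgrp
                 : m->child_bash_pgrp;
    return fg;
}

/* stty is stopped with SIGTTOU when its group is not the foreground one. */
int stty_would_stop(const struct race_model *m,
                    const enum race_step *steps, size_t n)
{
    return replay_foreground(m, steps, n) != m->child_bash_pgrp;
}
```

With the parent's call first and the child's second, stty_would_stop() returns 0; with the order swapped it returns 1, which corresponds to the hang.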
That they don't happen for me - at all is the only one I can suggest ? Most
of your comments are also not ones I've seen reported before.

Unfortunately 'works for me' doesn't tell me whether that is luck, distribution
specific, user configuration choices, gcc version, bugs in code , or whatever
and someone who sees the ^C problem is going to have to track it down.
Alan
--
I cannot reproduce this with 2.6.25.9 (on Slackware 12.0)
--
left blank, right bald
Weird! OK, I tried it with "sleep 3" in .bashrc, and it says
"...execve("/usr/bin/sleep", ["sleep", "3"], [/* 30 vars */]) = 0".
This sounds like what I'd expect. I don't understand why you see a
sleep 2 when you did not have one in your config.....
-Joe
--
Awesome; that would be great - thanks!
-Joe
--
Hi,
I have written a short test program that seems to reproduce it for me
(see attachment), even though the original su/stty stuff wouldn't.

Basically, the strace shows this:
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo
...}) = ? ERESTARTSYS (To be restarted)
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo
...}) = ? ERESTARTSYS (To be restarted)
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
... (repeating)

The exact code path triggering this seems to be:
tcsetattr() -> ioctl(TCSETS) -> set_termios() -> tty_check_change()
This is on a 2.6.24.5-85.fc8 kernel.
I don't know what's wrong, but I hope this helps.
Vegard
--
This looks correct to me and in fact I see the behaviour you report on 2.6.23
when running it. If I tell it to ignore SIGTTOU that also then behaves as
expected.

If
your pgrp is not the pgrp of the tty
and you are not ignoring TTOU
and you are not orphaned (as a group)

Then we are *supposed* to send you SIGTTOU and kick you back
into touch.

This is so that if you do

someapp
^Z
bg
otherapp

And someapp wants to change the tty settings it blocks back to the shell.
This is correct behaviour and behaviour we've had for years.
Alan
--
OK, I am still baffled. I've thought of several different theories,
wondering if bash does not have the right parent process, how there
could be a race in the kernel or elsewhere, but as far as I can tell,
things are in order. Here's the ps -ax --forest output while hung:

6435 tty3 Ss 0:00 /bin/login --
7954 tty3 S 0:00 \_ -bash
7958 tty3 S+ 0:00 \_ su foo
7959 tty3 S 0:00 \_ bash
7964 tty3 T 0:00 \_ stty -ixany

I had logged into the tty as root (with shell set to bash), then su'd to
foo (with shell set to bash), so this tree makes sense. During the
sleep before the stty, sleep is under the final bash similar to the way
stty is while it is hung.

Note that the stty is a child of bash (which, BTW, sometimes appears as
"-su" instead - I am not clear on that), and they all lead back to the
original tty, which I gather is the session leader (or is it the "su"?).

Now, the debugging I did shows that the reason that tty_check_change()
returns an error is that the tty->pgrp != task_pgrp(current). The
former is the "su foo" process, and the latter is the bash child process.

So I guess that when it does work, they are the same process, but why
would they be the same (or not, as it were)? Does something happen
during bash startup that causes bash to become the session leader?

Please, please, someone who understands the mechanics better than I let
me know how I can explore this more deeply.
Thanks, Joe
--
The error seems to be that tty_check_change() returns -ERESTARTSYS.
Shouldn't it be EINTR to allow the signal to be processed and let the
process decide whether to retry the tcsetattr()?
Vegard
--
The signal is processed, and then application retries the tcsetattr and
gets another one. The default TTOU behaviour is to block and then fg
continues the call so RESTARTSYS is both correct and has been used for
years
--
Hm, yes, that seems correct. I'm sorry for the wrong suggestions.
I guess this still doesn't explain why TTOU doesn't block (IOW, stop
the process, right?) in this case, because my test program does not
touch it.
Vegard
--
I see the parent process sleeping and the child taking TTOU and going to
state T. That again is correct.

alan 3219 0.0 0.0 3652 384 pts/5 S 13:11 0:00 ./repro
alan 3220 0.0 0.0 3652 204 pts/5 T 13:11 0:00 ./repro

If you run it without any straces etc do you see it blocked in T or sitting
in R ?
Alan
--
Without any straces, it is blocked in T. Like Joe's report.
With strace, it's in R.
Exactly as you said, correct and expected behaviour.
So this is not a kernel problem at all.
I'm sorry for having wasted your time :-(
Vegard
--
Just tried this ("kill -CONT <pid>") - no luck.
BTW, it should be possible, I would think, for others to duplicate this
fairly easily. Just:

1) make a user, "foo", with login shell set to /bin/bash
2) create a .bashrc in foo's home dir with contents:

sleep 2
stty -ixany

3) cp .bashrc .bash_profile (only needed to test "su - foo" too)
4) become root
5) type "su foo" (or "su - foo")

Sometimes it takes a second try to get it to happen. If the su hangs,
check to see if the stty process is in state "T". Also, it may make a
difference if you are logged in already as foo or are using X. I first
noticed this with no users logged in (except root) and no X running (but
I can reproduce with X/xterm as well using this simple test case). It
seems timing is a factor, so it's worth trying various things.
-Joe
