Hello,
I am talking about
a) Floating point operations
b) I/O operations
Why are these a strict no for kernel programmers? Also, why is there no problem if floating point operations are done from userland? Doesn't it use the same FPU, or an emulator?
Dissociated Ambidexterous Association (not verified) on March 28, 2006 - 10:26am
now /I'm/ curious. :)
[though I have no idea what he means by i/o.. isn't i/o what the kernel is for?]
Googling gave some scattered results about FPINIT and processor states, things which I don't actually understand. All threads on the subject say "this has all been discussed many times before, search the archives", but searching the archives just brings up more messages saying to search the archives.
Perhaps someone with more knowledge of the appropriate keywords could provide a link to these fabled discussions?
Home work
I suggest you do your homework yourself instead of asking other people to do it for you!
/peter
do your *homework* !?
What the blankety blankety blank do you think the use of this forum is if not to ask questions?
questions ...
Asking questions is fine ... asking someone to do your homework for you is not. So, if you're doing your homework and can't understand something in the docs, it's fine to ask someone to explain it - asking for snippets of code because you're too lazy to read the docs is not acceptable. I remember the good ol' BBS days when people would get flamed with 'RTFM'. I've noticed with a lot of posts that people asking questions obviously never even bothered to look up the docs.
I'm also unsure what the post
I'm also unsure what the poster means about the IO.
But as far as the floating point is concerned:
No floating point or MMX
The FPU context is not saved; even in user context the FPU state probably won't correspond with the current process: you would mess with some user process' FPU state. If you really want to do this, you would have to explicitly save/restore the full FPU state (and avoid context switches). It is generally a bad idea; use fixed point arithmetic first.
Source: the _Unreliable Guide to Hacking The Linux Kernel_ by Paul "Rusty" Russell
http://people.netfilter.org/~rusty/unreliable-guides/kernel-hacking/basi...
yeah, saw that same thing
I saw that same thing, or something very similar to it, when I was googling for it. But the question of "what the hell does that mean" is still one that's on my mind :)
That is: the kernel would need to save FPU state.. or whatever.. of user-space, but then why don't user-space programs need to do that when using threads or IPC?
I guess what I would want to see is "Well, normally all floating-point operations go through glibc, which makes a kernel syscall in order to lock the FPU for its own use until the operation is completed [essentially making the operation somewhat atomic]. Because the kernel can interrupt at any time (unlike user programs), it would not respect these locks, and so it's best that it doesn't even touch that area"
That is a [completely] WILD GUESS [with absolutely nothing to back it up], based on pretty much nothing other than a post which I think was quoting the same book you just quoted, and also some dice.
Would that explanation be able to theoretically exist in a universe with similar physics to reality? [aka: am I completely 100% grade-A wrong, or am I just 99% not-even-remotely-close?]
it's just slow
Saving and restoring the FPU state on every syscall and interrupt is just slow, so it was a design decision not to do that. Architectures with more registers than the i386 even decided to use only some of them in kernel code to avoid saving and restoring the whole set, or they have a second set of registers just for supervisor code. As with the FPU, code that needs the registers that are unsaved by default can always save them itself.
The i387 at least (and other FPUs like the R2010, 68881, ...) was a second chip, somewhat independent of the i386. It only received data or memory addresses and commands from the i386 and executed the operations itself. Trigonometric functions, which are coded in microcode (and which RISC architectures don't have), could take 1000 cycles or so (the successors still have these slow instructions, but I don't know if they are still /that/ slow). Meanwhile the i386 happily runs its own code, unless you wait for the FPU operation to finish or catch a floating point exception. If syscalls saved the FPU state, you would lose this parallelism, for up to the mentioned 1000 cycles. The only funny thing is that if the FPU takes its time to decide you did something illegal (like calculating log(0)), you get the floating point exception in kernel context, and you have to forward it to the user space code which 'owns' the FPU. BTW, the next process doesn't have to 'lock' the FPU with a library call (most simple instructions are inlined in the normal program code); the kernel can e.g. just arrange to get an exception in this case.
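The "arrange to get an exception" idea is usually called lazy FPU switching. Here is a toy model of it in plain userland C, just to make the bookkeeping visible. It is a sketch, not kernel code: every name in it (`struct task`, `switch_to`, `fpu_trap`, `fpu_add`, `fpu_read`) is made up for illustration, a single `double` stands in for the whole FPU register file, and a flag stands in for the hardware trap.

```c
#include <stddef.h>

/* Toy model only: none of these names are real kernel APIs. */
struct task {
    double saved_fpu;   /* stand-in for the task's saved FPU registers */
};

static double fpu_reg;             /* stand-in for the one hardware FPU */
static struct task *fpu_owner;     /* task whose state is live in fpu_reg */
static struct task *current_task;  /* task currently on the CPU */
static int fpu_trap_armed;         /* when set, the next FPU use "faults" */
static int fpu_saves;              /* how often state was actually saved */

/* Context switch: the FPU is not saved here, the trap is only armed. */
static void switch_to(struct task *next)
{
    current_task = next;
    fpu_trap_armed = (fpu_owner != next);
}

/* Trap handler: runs on the first FPU use by a task that is not the owner. */
static void fpu_trap(void)
{
    if (fpu_owner != NULL) {
        fpu_owner->saved_fpu = fpu_reg;   /* save the old owner's state */
        fpu_saves++;
    }
    fpu_reg = current_task->saved_fpu;    /* load the new owner's state */
    fpu_owner = current_task;
    fpu_trap_armed = 0;
}

/* The current task executes a floating point instruction. */
static void fpu_add(double x)
{
    if (fpu_trap_armed)
        fpu_trap();
    fpu_reg += x;
}

/* The current task reads its FPU register. */
static double fpu_read(void)
{
    if (fpu_trap_armed)
        fpu_trap();
    return fpu_reg;
}
```

If task A uses the FPU, gets switched out for a task B that never touches it, and is switched back in, no save happens at all. The save is only paid for once B actually executes a floating point instruction. Kernel code that used the FPU without going through this bookkeeping would simply clobber `fpu_reg` while it still holds the owner's values.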
right cause, wrong mechanism
Yes and no ... you're right that it's a preempting kernel that is the problem. If a preemption occurs during an FPU operation (somewhere in the save / operate / restore previous sequence), and the other thread touches the FPU, at least one thread sees corrupted FPU state.
User programs handle this fine: any time they pop into kernel mode, the OS (optionally) saves the FPU registers (optionally := if the thread has used the FPU), and this code path always fires. Kernel threads are messier: there is no common syscall boundary to synchronize against, and they can't make the skip-FPU-register optimization because kernel threads have to be generalized. The only way to use the FPU is some global FPU lock, and that's just asking for races or lock contention. Besides, FPUs are slow compared to ALUs.
But if you haven't been studying OSes for a long time, you made a very fine guess. Applause. It's a pretty good way of solving the problem with user space threading.
@strcmp
I am not able to understand how parallelism is lost when we use the FPU?
Ok. Let's say that you tell y
Ok. Let's say that you tell your FPU to perform an operation. While the FPU is busy calculating the answer, your CPU has the opportunity to run non-floating-point code. If that code decides to do a syscall, the kernel must first wait for the FPU operation to finish before it can save the FPU state.
Think of it as 2 CPUs. When CPU #1 calls a syscall, it must first wait for CPU #2 to finish whatever it was doing. You lose the parallelism.
Many frequently used syscalls simply have no use for floating point variables (like read(), write(), open(), etc...), so saving the FPU state on every syscall would be a waste.
What?
If the CPU lets things get that far out of order, the CPU is seriously flawed, correct execution means the instructions must always commit in order!
You lose parallelism because FPU state is not saved on every kernel thread switch. It could be done, but it's much more expensive - both in terms of CPU time and in terms of memory usage. User mode can get away with it because FPU states can be captured at the system call boundary, which is already expensive.
Out of order is sometimes a good thing...
Out-of-order execution is one of the most effective optimizations of a CPU with pipelines...
E.g. if you have this code:
(1)MOV B,A (Copy A to B)
(2)ADD C,A (Add A to C)
(3)MOV D,C (Copy C to D)
If you can do the instructions in parallel, and they take 5 clocks each (normally it's 4 or 5), then (3) would have to wait for (2), and the total time would be 1+5+5=11 clocks. (This is not entirely true, since execution of (3) could probably start after the 4th clock of (2), but never mind, that'd just make the optimization more effective.)
If the processor detected that (1) could be done after (2) without anything going wrong, then you'd save a clock cycle.
(2)ADD C,A (Add A to C)
(1)MOV B,A (Copy A to B)
(3)MOV D,C (Copy C to D)
4+1+5 = 10 clocks
In more advanced situations, this could save A LOT of clock cycles. It should be noted that a C compiler could do a little of the same, but since the CPU is the one executing the code, it can detect more situations (for example when there's branching involved).
Ah, finally! A reason of /why
Ah, finally! A reason of /why/ the decision was made. Thanks!
Any other alternatives?
hai,
Was just going through in search of my answer and landed here. Felt there was some intensive discussion on why floating point is not supported in kernel mode. But my question starts from there on! Is there an alternative method which helps me get over this situation? I am writing some piece of code in kernel mode and have to use floating point operations (RED algorithm implementation). Can anybody help me with this please?
thanks and regards
karthik.
Convert to fixed point or integer code
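There's not a lot you can do to enable floating point operation. However, there are at least two tricks you can use to escape the need for floating point:
1. Use some form of fixed point representation. To do this, multiply all the numbers involved by a suitable constant on the way into the algorithm, and divide it out afterwards; for maximum efficiency, this constant should be a power of 2. This allows you to represent fractions, but doesn't give you the full range of floating point. For example, if I choose the constant 256 (2 to the power 8), 1.125 is represented as the integer 288, while 2.5 is represented as 640. If I add 288, 640 and 256, then divide by 256, I get 4.625 (which is the result of 1.125 + 2.5 + 1). Multiplication and division involve suitable corrections to keep the point in place (e.g. 1.125 * 2.5 is calculated as (288 * 640) / 256).
2. Convert the algorithm into a form where you don't need the floating point maths at all. Think about why you're using FP, and see if you can come up with an integer algorithm that works just as well. This is the trickier option, but tends to result in better code (since you're improving the algorithm itself).
It's also possible to manually represent FP numbers as mantissa and exponent, and emulate FP maths with integers, but this is messy and a sign of bad code.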
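For what it's worth, the fixed point trick with a scale of 256 (so 1.125 is stored as 288 and 2.5 as 640) can be sketched in a few lines of C. This is only a minimal illustration: the names `fp8_t`, `FP_SHIFT`, `FP_ONE` and the `fp_*` helpers are made up here, and overflow handling beyond the 64-bit intermediate is left out.

```c
#include <stdint.h>

/* Hypothetical Q8 fixed-point helpers: values are stored multiplied by 256. */
#define FP_SHIFT 8
#define FP_ONE   (1 << FP_SHIFT)   /* 1.0 is stored as 256 */

typedef int32_t fp8_t;

/* Convert a whole number to fixed point. */
static inline fp8_t fp_from_int(int32_t n)
{
    return n * FP_ONE;
}

/* Addition needs no correction: both operands carry the same scale. */
static inline fp8_t fp_add(fp8_t a, fp8_t b)
{
    return a + b;
}

/* The raw product carries the scale twice, so one copy is divided back out.
 * The 64-bit intermediate keeps the product itself from overflowing. */
static inline fp8_t fp_mul(fp8_t a, fp8_t b)
{
    return (fp8_t)(((int64_t)a * b) / FP_ONE);
}

/* Pre-scale the dividend so the quotient keeps the scale; b must be non-zero. */
static inline fp8_t fp_div(fp8_t a, fp8_t b)
{
    return (fp8_t)(((int64_t)a * FP_ONE) / b);
}

/* Integer part, truncated toward zero. */
static inline int32_t fp_to_int(fp8_t a)
{
    return a / FP_ONE;
}
```

With these, `fp_add(fp_add(288, 640), fp_from_int(1))` gives 1184, which is 4.625 * 256, and `fp_mul(288, 640)` gives 720, which is 2.8125 * 256. One caveat for real kernel code: on 32-bit targets a 64-bit division by a variable, as in `fp_div`, generally has to go through the kernel's own 64-bit division helpers instead of the plain `/` operator.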
Emulating Fixed point arithmetic
farnz, I understood your suggestions on the possible ways to perform floating point calculations using fixed point arithmetic and integer scaling operations.
I didn't understand your last comment that "manually representing FP numbers as mantissa and exponent and emulate FP maths with integers is messy and sign of bad code". May I know the specific reason for this? Is it because we have to handle overflow/underflow and other exceptions ourselves?
Not fully thought through code
The reason that emulating FP maths with integers is a sign of bad code is simply that it usually indicates that you've not thought the algorithm through. There are very few (if any) algorithms that are useful in kernel mode, and can't be implemented using fixed-point or integer arithmetic; further, emulating FP math with integers is slower than using the CPU's FP hardware.
Thus, needing some form of FP (real or emulated) in kernel space is usually a sign that the code's author has not understood the algorithm they're implementing, nor thought through whether this code belongs in the kernel, or should actually be in userspace instead.
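As an illustration with the RED case asked about earlier in the thread: the part of RED that looks like it needs floating point is the moving average of the queue length, avg = avg + w * (qlen - avg) with a small weight w. If w is restricted to a power of two, w = 1/2^n, and the average is stored pre-multiplied by 2^n, the update becomes a shift and two integer additions. The sketch below assumes that restriction; the names `RED_WLOG`, `struct red_avg`, `red_avg_update` and `red_avg_read` are invented for the example and this is not a complete RED implementation (no thresholds, no drop probability).

```c
#include <stdint.h>

#define RED_WLOG 3   /* hypothetical weight: w = 1/2^3 = 1/8 */

/* Average queue length, stored multiplied by 2^RED_WLOG. */
struct red_avg {
    uint32_t scaled;
};

/* avg = avg + w * (qlen - avg), rewritten for the scaled value:
 * scaled = scaled - scaled / 2^RED_WLOG + qlen.
 * The shift truncates, so the result can drift slightly from the exact value. */
static inline void red_avg_update(struct red_avg *a, uint32_t qlen)
{
    a->scaled = a->scaled - (a->scaled >> RED_WLOG) + qlen;
}

/* Integer part of the current average. */
static inline uint32_t red_avg_read(const struct red_avg *a)
{
    return a->scaled >> RED_WLOG;
}
```

Starting from zero and feeding in a queue length of 8 twice gives a scaled value of 15, i.e. an average of 15/8 = 1.875, which is exactly what the floating point formula produces for w = 1/8.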