On the discussion about GPUs for OGP:

Here is an off-the-wall idea a friend and I were talking about a while back. I think it would be of great use in a GPU. The idea is to create a "modularized" RISC processor. Here is a basic rundown.

The CPU accepts 3 instructions:

    load
    store
    mov

All operations are actually modules that expose one or more registers. An add module would look like this:

    addin  (in)
    addins (in)
    addout (out)

Every clock cycle the contents of addin and addins are added and the result is stored into addout.

Now this doesn't sound all that good, except when you realize that in current processors the pipeline must sit idle while a result is being computed. When you get to multiplication this can be 3-4 cycles!

So let's say we need to do four multiplications. On a normal processor it would look like this:

    command            (cycles)
    mov 1, reg1        (1)
    mov 2, reg2        (1)
    mul reg1, reg2     (4)
    mov reg3, result   (1)
    mov 1, reg1        (1)
    mov 2, reg2        (1)
    mul reg1, reg2     (4)
    mov reg3, result2  (1)
    mov 1, reg1        (1)
    mov 2, reg2        (1)
    mul reg1, reg2     (4)
    mov reg3, result3  (1)
    mov 1, reg1        (1)
    mov 2, reg2        (1)
    mul reg1, reg2     (4)
    mov reg3, result4  (1)

Total clock cycles: 28

Modularized RISC method:

    command (all 1 cycle)
    mov 1, mulin1
    mov 2, mulin2
    mov 1, mulin1
    mov 2, mulin2
    mov 1, mulin1
    #At this point the result from line 2 is ready so we move it out
    mov mulout, result1
    mov 2, mulin2
    #And now we can move the result from line 4
    mov mulout, result2
    mov 1, mulin1
    mov 2, mulin2
    #Result from line 7
    mov mulout, result3
    mov 0, sink #Wait a cycle
    mov 0, sink #Wait a cycle
    mov 0, sink #Wait a cycle
    mov mulout, result4

Total clock cycles: 15

Now let's say that we could execute two move instructions at a time. Then the code would look like this:

    mov 1, mulin1 : mov 2, mulin2
    mov 1, mulin1 : mov 2, mulin2
    mov 1, ...
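A minimal sketch of the two cycle counts, in Python. It assumes a multiplier module with a fixed 4-cycle latency that launches a multiply whenever mulin2 is written; the class name PipelinedMul, the trigger-on-mulin2 rule, and the distinct operand values (used so each result can be checked) are illustrative assumptions, not part of the proposal.

```python
from collections import deque

LATENCY = 4  # assumed multiplier latency, taken from the "(4)" in the listing


class PipelinedMul:
    """Hypothetical multiply module exposing mulin1, mulin2 and mulout."""

    def __init__(self, latency=LATENCY):
        self.latency = latency
        self.mulin1 = 0
        self.mulout = None
        self.in_flight = deque()  # (cycle the product becomes visible, product)

    def tick(self, cycle):
        # Products whose latency has elapsed become visible on mulout.
        while self.in_flight and self.in_flight[0][0] <= cycle:
            self.mulout = self.in_flight.popleft()[1]

    def write(self, reg, value, cycle):
        if reg == "mulin1":
            self.mulin1 = value
        else:  # assumption: writing mulin2 launches the multiply
            self.in_flight.append((cycle + self.latency, self.mulin1 * value))


def run(program):
    """Execute one mov per cycle; return (cycles used, results read out)."""
    mul = PipelinedMul()
    results = {}
    for cycle, (src, dst) in enumerate(program, start=1):
        mul.tick(cycle)
        value = mul.mulout if src == "mulout" else src
        if dst in ("mulin1", "mulin2"):
            mul.write(dst, value, cycle)
        elif dst != "sink":
            results[dst] = value
    return len(program), results


def blocking_cycles(n_multiplies, latency=LATENCY):
    # Conventional listing: two operand moves, one blocking mul, one result move.
    return n_multiplies * (1 + 1 + latency + 1)


# Same schedule as the modular listing, computing 1*2, 3*4, 5*6 and 7*8.
PROGRAM = [
    (1, "mulin1"), (2, "mulin2"),
    (3, "mulin1"), (4, "mulin2"),
    (5, "mulin1"),
    ("mulout", "result1"),
    (6, "mulin2"),
    ("mulout", "result2"),
    (7, "mulin1"), (8, "mulin2"),
    ("mulout", "result3"),
    (0, "sink"), (0, "sink"), (0, "sink"),
    ("mulout", "result4"),
]

cycles, results = run(PROGRAM)
print(blocking_cycles(4), cycles, results)
```

Under these assumptions the hand-written schedule reads every product on or after the cycle it becomes visible, and the counts come out as 28 cycles for the blocking version against 15 for the move-only version.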
This reminds me of a design someone was calling "MISC" (Minimal Instruction Set Computer). Basically, the only instructions were "moves". If you want to perform some computation, you move the source operands into special-purpose registers and then move the result out of its special-purpose register. Results queue up, so you can pop them out any time you like. I didn't go anywhere with this idea because it makes context switches hell, but for an embedded design with no interrupts, it would work fine.

Before we get into implementation details, however, we really need to figure out what our requirements are. Based on what little I know about shaders, I don't know enough to decide between what you describe, a stack architecture, a 3-operand load/store machine, a 2-operand load/store, etc.
_______________________________________________
Open-graphics mailing list
Open-graphics@duskglow.com
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
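A rough sketch of the "results queue up" behaviour, in Python. The class QueuedAddUnit, the rule that the second operand triggers the add, and the saved_state helper are hypothetical; the register names addin and addins are borrowed from the first post. The helper only illustrates why context switches get painful: every pending result is extra state to save.

```python
from collections import deque


class QueuedAddUnit:
    """Hypothetical move-only adder: operands go in, sums queue up."""

    def __init__(self):
        self.addin = 0
        self.results = deque()

    def move_in(self, reg, value):
        if reg == "addin":
            self.addin = value
        elif reg == "addins":  # assumption: the second operand triggers the add
            self.results.append(self.addin + value)
        else:
            raise ValueError(f"unknown register {reg!r}")

    def move_out(self):
        # Reading the output register pops the oldest pending result.
        return self.results.popleft()

    def saved_state(self):
        # Everything a context switch would have to save and restore.
        return {"addin": self.addin, "results": list(self.results)}


unit = QueuedAddUnit()
for a, b in [(1, 2), (10, 20), (100, 200)]:
    unit.move_in("addin", a)
    unit.move_in("addins", b)

pending = unit.saved_state()
sums = [unit.move_out() for _ in range(3)]
```

The three sums can be popped long after the operands were moved in, in the order they were issued.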
If you want a shader-centric GPU, you want CISC with (a) special-purpose floating point and (b) vectorized instructions...

Jeff
To this I ask why? Aren't GPUs vectorized RISCs? But yes, FPUs are a must. In fact, modern GPUs don't even use integers. They simply throw away the data after the decimal point.

Timothy
The ones I'm familiar with are, yes, but I think you can do better with CISC. Now that OpenGL is largely GLSL ("C for graphics"), there is a -lot- of room for optimization and improvement there.

Jeff
So what benefits are you going to have in using CISC over RISC? Smaller code sizes I guess, at the expense of complexity. I'm thinking for a GPU wider is going to be better than deeper. With FPGAs we are going to be limited by the clock. Why make it worse by going to CISC, which relies on higher clock speeds? Why not make a simple core so that we can pack 16-24 of them in one FPGA instead of one that will take up half the FPGA?

Timothy
Actually, deeper is always better unless you have control and dependency hazards. We can easily enough avoid the dependency hazards, so all we have to worry about are loops in shader code. When the shader loop is done with a pixel it can just toss it down the pipeline. We can unload a lot of computation from the microcode stage by doing it later in the pipeline. It doesn't matter how many stages don't do any useful work. The only time a GPU's pipeline becomes a real problem is when you have to read from something you've just written to, like if you've drawn to a texture and then want to use it as a texture. Then you have to flush the pipeline. You also have to worry about bitblts to and from the same surface (we can use some intelligence to avoid it sometimes).
In general... correct. But you can also do stuff like out-of-order execution to work around some dependencies, or add hardware multi-threading to enable multiple "simultaneous" rendering jobs. If the pipeline for one thread stalls, you can fill the pipeline with work from another thread.

I would certainly like to see such a beast, it would be very interesting. But ultimately I think deeper _and_ wider is the way to go ;)

Jeff
But what is the use of a deep CPU? It's all embarrassingly parallel. I must be missing something here, because it always seemed to me that GPUs were simply high-end vector processors. On the other hand, a GPU that could do hardware raytracing (e.g. http://www.artvps.com/page/15/pure.htm ) would be killer. And I agree, for something like that CISC would rule. However, it still seems to me that simpler is better in the case of a GPU.

On a side note, SGI years ago had a graphics processor (was it the SE series?) that understood native OpenGL. From what I understand, the hardware itself was OpenGL. All the driver did was repackage the data a bit.

Timothy
--
I think computer viruses should count as life. I think it says something about human nature that the only form of life we have created so far is purely destructive. We've created life in our own image. (Stephen Hawking)
No, I think you are talking about the VPro series (also known as Odyssey); the SE series is a little harder (take a look at the Linux fb driver for Odyssey: http://www.linux-mips.org/~skylark/ )
--
Pluralitas non est ponenda sine neccesitate
Frustra fit per plura quod potest fieri per pauciora
Entia non sunt multiplicanda praeter necessitatem
Occam's Razor
MiChele Carla` aKa Goldfinger <goldfinger@member.fsf.org>
It depends on how varied your task is. For graphics, I think quite a small number of CISC-style operations would suffice. For example, you are going to do loads and loads of dot products (matrix multiplies in vertex shaders, normal calculations for per-pixel lighting, bump mapping, and so on). Doing all the adding and multiplying one at a time means more instructions and more pressure on the scheduling hardware.

Can't we combine this with Timothy's MISC idea? Have a "CPU" with load/move/store, and a bunch of functional units that can each perform a complex (think Altivec/3DNow!/SSE3 or even more complex than that, like a dot product) instruction. Newer processors can simply have more functional units, and could be backwards compatible with their predecessors.

Lourens
The idea I keep thinking about is to have a pipeline of general functional units. As a fragment passes down the pipeline, it's like executing instructions. If the number of instructions to be executed exceeds the pipeline length, the fragment gets forwarded back up to the beginning. Loops would get unrolled to the pipeline length; longer ones would work via the forwarding mechanism. Any sequence of instructions shorter than the pipeline length would get padded with NOOPs. The problem is that any more than a few general purpose registers would make every pipeline stage a massive amount of logic, limiting the number of stages. But the idea is to get great throughput at a low clock rate. We cannot design something to run at 500MHz.
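The padding and forwarding rule can be sketched as below, in Python. The function name schedule and the choice to pad the final pass of a long program with NOOPs are assumptions made for illustration; the post only states that short sequences are padded and long ones are forwarded back to the start.

```python
import math

NOOP = "noop"


def schedule(instructions, pipeline_length):
    """Split a fragment program into pipeline passes, padding with NOOPs."""
    instructions = list(instructions)
    # A fragment always makes at least one trip down the pipeline.
    passes = max(1, math.ceil(len(instructions) / pipeline_length))
    padding = passes * pipeline_length - len(instructions)
    padded = instructions + [NOOP] * padding
    return [
        padded[start:start + pipeline_length]
        for start in range(0, len(padded), pipeline_length)
    ]


short_program = schedule(["mul", "add", "mul", "add", "mov"], pipeline_length=8)
long_program = schedule([f"op{i}" for i in range(20)], pipeline_length=8)
```

With an 8-stage pipeline, the 5-instruction program takes one pass with 3 NOOPs, and the 20-instruction program takes three passes with 4 NOOPs in the last one.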
So basically it would be a pipeline of processors. But that is possibly less efficient than having a single MISC "scheduler" in the middle, and a lot of functional units around it. Each processor in your pipeline only ever does one instruction, and all the hardware for the other functions it can perform is idle. In contrast, separate functional units could all work at the same time, if they could get data quickly enough. Perhaps there should be multiple MISC cores, they're likely to

You can still get high throughput with pipelined functional units. It doesn't matter much if it takes ten cycles to multiply two numbers (or vectors of numbers), as long as you can provide two new numbers to multiply every cycle, and read out the result of the calculation that started ten cycles ago. Throughput will still be ok (or at least as good as it gets at the given clock rate).

Lourens
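The throughput argument comes down to simple arithmetic, sketched here in Python. The model is an assumption: a fully pipelined unit accepts one new operation per cycle and delivers each result a fixed number of cycles later, while an unpipelined unit must finish one operation before starting the next. The function names are made up for the example.

```python
def unpipelined_cycles(n_ops, latency):
    # Each operation occupies the unit for its full latency.
    return n_ops * latency


def pipelined_cycles(n_ops, latency):
    # The first result appears after `latency` cycles, then one per cycle.
    return latency + n_ops - 1 if n_ops else 0


n_ops, latency = 1000, 10
slow = unpipelined_cycles(n_ops, latency)
fast = pipelined_cycles(n_ops, latency)
```

For 1000 multiplies at a 10-cycle latency, the pipelined unit needs 1009 cycles against 10000, so the latency is paid once rather than per operation.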
One of the things we're forgetting is that static scheduling is way behind the curve, but dynamic scheduling requires lots of extra hardware. Unless we hand-code most of what we run on this or have some massive peephole optimizer library, we're always going to get sub-optimal code.

The only way to keep the computing units busy with a new fragment every cycle is to avoid data dependency hazards. We can only do that if we can overlap the processing for different fragments (like threads). Then we have to keep track of multiple processor states.

Only slightly related, the statistics I have on branch delay slots say that they're only fillable about 60% of the time and they're only useful to the computation about 80% of the time when they're filled, making delay slots only useful about 50% of the time (0.6 x 0.8 = 0.48).
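A small sketch of why overlapping fragments hides dependency stalls, in Python. The model is an assumption made for illustration: every fragment runs a chain of fully dependent operations, each with the same latency, and the unit can issue one operation per cycle from any fragment whose previous result is ready. The function name issue_cycles is hypothetical.

```python
def issue_cycles(n_fragments, ops_per_fragment, latency):
    """Cycles needed to issue all ops when fragments are interleaved."""
    ready_at = [0] * n_fragments  # cycle at which each fragment may issue again
    remaining = [ops_per_fragment] * n_fragments
    cycle = 0
    while any(remaining):
        for frag in range(n_fragments):
            if remaining[frag] and ready_at[frag] <= cycle:
                # Issue one op; the next op of this fragment depends on it.
                remaining[frag] -= 1
                ready_at[frag] = cycle + latency
                break
        cycle += 1  # an empty slot if no fragment was ready
    return cycle


alone = issue_cycles(n_fragments=1, ops_per_fragment=4, latency=3)
overlapped = issue_cycles(n_fragments=3, ops_per_fragment=4, latency=3)
```

One fragment on its own needs 10 cycles to issue 4 dependent ops, leaving most slots empty; three interleaved fragments issue 12 ops in 12 cycles, at the cost of tracking three sets of state.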
I think one thing that we are missing here is the fact that we are thinking about a GPU design, not a full-blown CPU design. Has anyone here ever written a shader before?

Here is my vote: keep it simple. If you go with CISC or anything with a longer pipeline, you are going to have problems with data dependency, long pipelines,

A MISC design is going to need two, maybe three stages in the pipeline: fetch and execute, maybe decode, but maybe not. Data dependency is not going to be an issue. It would be a blast programming a compiler for this sort of GPU; you could optimize the shaders to death.

We have to stick with what is practical, and what will work well. Plus we are limited by the following restrictions:

    Low clock rate (200-300 MHz?)
    Small transistor space

Whatever we make must fit within these two restrictions.

I do have a question, though. Does the GPU on the current OGP design have direct access to the memory? Or does it contact the video memory through a memory controller of sorts? If somehow we could give the GPU direct access to video memory (basically 64MB of registers), then we would have a design that would give some powerful performance benefits. We could then design the MISC modules to accept memory locations. So you could say, "multiply 0x0004 with 0x01004, placing the result in 0x02004, executing it 0x0010 times."

We find ourselves in a catch-22 here. I'm afraid that a RISC design is not going to be fast enough. We'll be trying to push too many instructions through the chip too fast. However, a CISC design is not going to be much better. We cannot go with out-of-order execution because of the complexity. But performance is going to suffer unless we can execute more than one instruction at a time.

But what someone said here was right. We won't know how it works until we start trying to program it. That's the wonderful thing about OGP, right? So when we get the first prototypes out, those of us who feel like it can program our own GPU on ...
Yes, since you are going to compile custom for each revision of OGA, you can do all the scheduling in the compiler. This will increase code size (more NOOPs), but it simplifies the hardware by not

You cannot contact memory without some sort of memory controller. It's got to manage banks, row misses, refresh, etc. There is no such thing as "direct access to memory." Our memory controller, however,

As it turns out, we have an odd case where the memory is at least as fast as the logic we can afford to control it with. Modern processors use lots of registers because memory is a horrible bottleneck. Our problem here is that although memory is relatively fast, there's still a significant latency between request and receipt of read data. Plus it's variable (row misses incur extra delays) and non-deterministic (memory refreshes appear random to the compute engine).

The lesson I learned long ago with memory is to do as much batching as possible. Read requests get queued, as do the responses. That means the GPU has to be designed to absorb the latency. OGA has a fifo in the pipeline that sits between request and receipt stages just for that purpose. Writes are queued and forgotten. For performance, it's important that reads and writes all be allowed to complete out of order so that you can perform all accesses for one row before incurring the penalty to switch to another one. A sort of "memory barrier" is used to sync everything up when you need to read what you just wrote (fortunately a rare event in fixed-function pipelines, at least).

The MISC approach is interesting, but all you're really doing is encoding part of the opcode into the register number. Six of one, half a dozen of the other. If it saves you something, do it. But I don't think it does. (With TROZ, I encoded the rendering command into the address, reducing the number of bus cycles necessary to initiate drawing.) Still, there are also plenty of things we could gain by this approach, including the ...
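A minimal sketch of the row-batching point, in Python. It is not the OGA memory controller: the (row, column) request format and the function names are assumptions, and it only counts how often the open row changes when queued requests are served in arrival order versus grouped by row.

```python
def row_switches(requests):
    """Count how often the controller must open a different row."""
    switches = 0
    open_row = None
    for row, _column in requests:
        if row != open_row:
            switches += 1
            open_row = row
    return switches


def batch_by_row(requests):
    """Group queued requests by row, keeping first-seen row order."""
    groups = {}
    for request in requests:
        groups.setdefault(request[0], []).append(request)
    return [request for group in groups.values() for request in group]


queued = [(0, 1), (1, 5), (0, 2), (1, 6), (0, 3)]  # hypothetical (row, column)
in_order = row_switches(queued)
batched = row_switches(batch_by_row(queued))
```

Served in order, the five requests open a row five times; grouped, only twice. Reordering like this is only safe between the memory barriers described above, since a read must not overtake a write to the same location.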
Floating point isn't really RISC; you have a RISC processor with an FPU, since the FPU couldn't execute stuff in one clock. Then methods were developed to accelerate the FPU so that it could execute in one clock.

What we really need to decide is how to handle operations (such as matrix math) which require a series of steps for the ALU to complete. We can use CISC-like operations where the series of operations is in microcode. Or we can have RISC-like control of the ALU, where we have SIMD that can operate on up to 4 32-bit floats in parallel, and issue a series of instructions to do what one CISC-like operation would do. But you can combine these like the transputer and have the OP instruction, which calls a macrocode subroutine to do these things.

I can't see having more than one ALU per shader since it should have 4 32-bit float hardware multipliers. Since the standard shader operation is multiplying three 4x4 matrices, it is the hardware multipliers that are going to boost throughput.

-- JRT
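To put a number on the multiplier load, here is a sketch in Python that multiplies 4x4 matrices using a made-up 4-wide multiply helper and counts how often it is issued. The helper simd_mul4, the counter, and the row-times-column formulation are assumptions for illustration, not a description of any real shader ALU.

```python
def simd_mul4(a, b):
    """One issue of a hypothetical 4-wide multiplier: four products at once."""
    return [x * y for x, y in zip(a, b)]


def matmul4(a, b, counter):
    """4x4 matrix product; each output element costs one 4-wide multiply."""
    columns = list(zip(*b))
    result = []
    for row in a:
        out_row = []
        for column in columns:
            counter["simd_mul"] += 1
            out_row.append(sum(simd_mul4(row, column)))
        result.append(out_row)
    return result


IDENTITY = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
M = [[r * 4 + c for c in range(4)] for r in range(4)]

counter = {"simd_mul": 0}
product = matmul4(matmul4(IDENTITY, M, counter), M, counter)
```

Multiplying three 4x4 matrices is two matrix products, so 32 issues of the 4-wide multiplier, or 128 scalar multiplies if done one at a time.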
While I don't have much experience with this stuff, I do remember that someone at nekochan.net optimized mplayer's matrix operations using the madd and nmsub instructions on MIPS processors, gaining a 300% improvement. The interesting read can be found here: http://forums.nekochan.net/viewtopic.php?t=2976

Erik
--
http://www.ehtw.info (Dutch) Future of Enschede Airport Twente
http://www.ehofman.com/fgfs FlightGear Flight Simulator
I don't think it's wise to use a SIMD ALU here. All scalar code will use the SIMD FPU with 3 FMUL units idle. Because everything is strongly parallel, I think it's better to stay scalar. The 32-bit floating-point instruction is the most used op, so performance will depend on the number of such units and how efficiently they are used. Besides, a complex 128-bit data path is always harder to route, so it is necessarily slower than a CPU core with a 32-bit internal data path.
If there are enough independent scalars that can be scheduled, you can pack them and run them in parallel.
So you need the logic to detect that a pack is possible, and you need the switch that connects the different register banks to the FPU. For what? The only advantage over 4 cores depends on the size of the control logic: whether it is negligible compared with the size of a 32-bit FPU. On the other hand, a 32-bit switch could be big.

The goal of the shader is to maximise the use of the FPU, or more precisely the FMUL instruction. So you could create an instruction word which looks like this:

- Operation code 1 + addr read register 1 + addr read register 2 + addr write register 1 (these are for MOV, LOAD & STORE and maybe for logical ops such as "<" and ">", so it could also do FADD and FSUB)
- Operation code 2 + addr read register 4 + addr read register 5 + addr write register 2

OPCODE2 could be small (FMUL, integer MUL, what else?).

There are 2 register banks, so you could use 4-read, 1-write memories for the register banks. Each read could access both banks, but a write could only access a dedicated bank. It depends on the technology you can afford (full custom or not... 4-read, 2-write memories are maybe common nowadays).

Then you add:

- Predicate: a very easy way to make small "if" statements without breaking the pipeline (like CMOV in x86). The predicate addresses a register and tests whether it is null or not. If the register is null, the current operations are cancelled.
- Predicate + Imm8: the way to handle loops, jumps and the repeat instruction of some DSPs. If the register is not null, PC+IMM8 is performed with a delay slot of 1; otherwise PC+1 is used.

The instruction word is big :) I have read somewhere that 32 vec4 registers are needed, so you must have at least 128 32-bit registers. If you add some tricks such as R0 == 0 and some specific registers, a register address will need 8 bits. Depending on how you encode the opcode, you will reach 80 bits for an instruction word.

The "jump part" could manage the PC "directly" with a delay slot. The predicate could ...
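One way the 80-bit figure could add up, sketched in Python. The field names and the 8-bit opcode widths are assumptions: the post fixes only the 8-bit register addresses, the predicate register and the Imm8, and says the total depends on how the opcodes are encoded.

```python
# Hypothetical layout: (field name, width in bits), most significant first.
FIELDS = [
    ("op1", 8), ("read1", 8), ("read2", 8), ("write1", 8),
    ("op2", 8), ("read4", 8), ("read5", 8), ("write2", 8),
    ("pred", 8), ("imm8", 8),
]
WORD_BITS = sum(width for _name, width in FIELDS)


def pack(**values):
    """Pack named fields into one integer instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split an instruction word back into its named fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


word = pack(op1=3, read1=17, read2=200, write1=5, pred=9, imm8=255)
decoded = unpack(word)
```

Six register addresses, a predicate register and the Imm8 already account for 64 bits, which leaves 16 bits for the two opcodes if the word is to stay at 80.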
So you want to pack and unpack SIMD registers. That's a cutting-edge technique that is very rarely used in normal computing (with SSE2, AltiVec, ...). Compilers are quite bad at it. But if you optimise the pack/unpack instructions, they will require switches that are big and slow. FMUL is a one-cycle operation. If you need 3 pack instructions before,

See the real code posted here!

If I have understood shaders correctly, loads are only for textures; everything else is transmitted through specific registers. So these loads are implicit.

Basically, a DOT product takes one cycle in a vector arch, and 4 in a scalar LIW arch (like André Pouliot and I explained) if you can interleave the instructions correctly (7 instructions of latency otherwise, with a 3-cycle-latency FPU). So for DOT it's completely the same: because a scalar CPU will be almost 4 times smaller than a vector shader, you could put 4 scalar cores where you put 1 vector core.

When you look at the compiled code posted here, you see a lot of MOV, scalar MUL, etc. In this precise case, the SIMD unit is completely underused. I don't have access here to the ASM published. But this is the kind of code to optimise. I know that if only vector code is used, a vector shader will be faster (because there are fewer problems with read-after-write dependencies than in
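A back-of-the-envelope model of that trade-off, in Python. It assumes one issue per cycle on the vector core, four scalar cores occupying the same area, perfectly spreadable work, and no dependency stalls; all three function names are invented for the example.

```python
import math


def vector_core_cycles(vec4_ops, scalar_ops):
    # One 4-wide core: a scalar op still takes a slot and idles 3 lanes.
    return vec4_ops + scalar_ops


def scalar_cores_cycles(vec4_ops, scalar_ops, cores=4):
    # A vec4 op becomes 4 scalar ops, spread evenly over the cores.
    return math.ceil((4 * vec4_ops + scalar_ops) / cores)


def vector_lane_utilization(vec4_ops, scalar_ops):
    used = 4 * vec4_ops + scalar_ops
    available = 4 * (vec4_ops + scalar_ops)
    return used / available


mixed = (10, 30)  # hypothetical shader: 10 vec4 ops and 30 scalar ops
vector_time = vector_core_cycles(*mixed)
scalar_time = scalar_cores_cycles(*mixed)
```

For pure vec4 code both layouts take the same number of cycles under this model; for the mixed example the vector core needs 40 cycles against 18 for the four scalar cores, because the scalar ops use only a quarter of the SIMD lanes.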
Aren't the later x86 processors RISC cores that "emulate" CISC?

RISC doesn't need to reload data in the same manner as CISC has to, which in

The CISC commands use a lot more cycles than the RISC ones do; in
general they take the same amount of time, as CISC is X functions * XX cycles
while RISC is XX functions * X cycles. The drawback for CISC is that it
empties the registers, and if the next function needs to use the same data, it
has to be reloaded into the register, while RISC will still have the
data in the register, as it only empties the register when it needs to.
This is at least what is written in most RISC vs CISC pages I have read, none

What about VIS (SPARC's version of "AltiVec/SSE")? ;)

Something I did like with Transmeta was the possibility to load different
cores into it; sad that they only made 586 and no PPC/.../...

Isn't shortening the pipelines a good idea? This is what has been
done on both SPARC and PowerPC; both have much shorter pipelines than what
AMD has on its processors, which in turn are shorter than Intel's. IMHO the
old Apple video called "The MHz Myth" shows the benefits of shorter pipelines.
Don't know how things would be with even shorter pipelines than IBM's and
Freescale's PowerPCs have (Exponential had longer pipelines and did less at
higher MHz than the ones from Freescale).
--
//Aho
------------------------------------------------------------------------
E-Mail: trizt@iname.com URL: http://www.kotiaho.net/~trizt/
ICQ: 13696780
System: Linux System (PPC7447/1000 AMD K7A/2000)
------------------------------------------------------------------------
EU forbids you to send spam without my permission
------------------------------------------------------------------------
Smaller code size?
That is the primary advantage, although some RISC designs offer mechanisms to reduce code size (alternate instruction set or dynamic decompression).
Not really. This entire thread oversimplifies the differences. x86 is a vastly different beast from traditional RISC. Further, modern production RISC processors sometimes approach CISC in their complexity. See e.g. the 'sqrt' instruction on any modern RISC processor. And then there's super-scalar execution differences, out of order execution, vastly different TLB behavior, ...

Jeff
Yes, "Reduced Instruction Set Computer" is a bit of a misnomer these days. Perhaps it would be better to call them "Simplified Instruction Set Computers." Many aspects of the design (not just instruction decode) are simplified by having completely uniform instruction formats. RISC processors were originally designed around the pipeline. That's changed a bit, because the instruction sets are now a bit more of an abstraction from the hardware, but there are still distinguishing features between RISC and CISC.

In the late 80's, RISC was seen as the holy grail since it simplified processor designs and made room for significant improvements in performance. With the dominance of superscalar and OOO designs, that simplification is no longer as much of an advantage. At the same time, legacy instruction sets like x86 are even more suboptimal. Given the current state of processor designs, can we now design an instruction set and processor architecture that fits the new model more directly? Of course, we may already have those, with names like VLIW and EPIC.

This is kinda off topic, but CPU designs have always fascinated me. And I wonder what approach we may take to programmable shaders.
Agreed on all counts. FWIW, I'm largely trying to combat falsehoods here, not trying to argue

I think the "simplified processor design" part is the key for RISC.

Justification for the "suboptimal" claim? IMO, x86-64 ISA seems to most closely match the operations that a compiler wants to generate. It combines the best of RISC (oodles of registers) with an instruction set that matches the basic operations

This always sounds good in theory, but you run into a compiler barrier here. ia64 is a really smart, advanced EPIC architecture, but the compiler technology is still trying to catch up. If the software isn't capable of fully utilizing the hardware, you've

IMO ideally what is needed is practical experience, to answer that question (which again requires time and money). One needs to work inside a feedback loop:

1. design the hardware, based on guesses
2. design the shader JIT, based on initial hardware ISA
3. profile to see where the hardware spends most of its time, based on likely-common usage workloads.
4. update JIT and hardware to reflect profile data
5. go to step 3, if you have the time and energy.
6. get some hardware out to the general public
7. find out all your workload assumptions were wrong, and go back to step 3. :)

So for OGD, I would recommend the open source way: release early, release often. Design a very simple, just-to-get-going GPU that supports programmable shaders. _Just enough_ to get people working on the software. Ignore everyone's opinions on the mailing list [for now]. Then enter the feedback loop...

Jeff
Totally agree!!! What about having a hi-tech mailing list dedicated only to OpenGraphics?
--
Michele Carla` <goldfinger@member.fsf.org>
Are you complaining? This list is dedicated to Open Graphics. But since we are dependent on being profitable, it's important to explore what other options we have in terms of products to sell. (1) As soon as Open Audio becomes an official project, it'll get its own mailing list. Since no one has stepped up and volunteered to lead it, it's not an official project. (2) OGD1 is an integral part of the OGP timeline. But we are also dependent on it being used for many other things besides just graphics. As a result, we will discuss non-graphics things on this list. It's important that we do so. When we have the resources to be developing HDL full time for OGA, "on topic" for this list will shift. Rather than complaining, I suggest that you help us come up with profitable product ideas that will push us that much closer to being able to have a mailing list dedicated only to graphics. (I like it when the list is orderly and on-topic, but I have more important things to worry about, like getting OGD1 built.)
There is no such thing as x86 RISC. You are probably thinking of the x86 microarchitecture, which decodes CISC instructions into one or more RISC-like micro-ops inside the CPU.

I don't think you know what you are talking about, here. Data dependencies exist regardless of RISC or CISC, which is why modern RISC and CISC processors all have prefetch instructions and other data

Largely false, for the most common instructions. x86 instructions like 'mov' or common ALU instructions typically execute in a single cycle. In fact, with multiple ports, _multiple instructions_ can be executed in a single cycle.

And don't forget that RISC blows your i-cache out of the water. x86 is essentially a compressed instruction set, with the most common instructions requiring only 8 bits to describe, versus 32 bits for a

Registers are registers. They store values for use by multiple instructions, on either CISC or RISC. That's what registers do. 32-bit x86 even has _far more_ registers than the ISA suggests. Google for "register renaming" sometime. 64-bit x86 solved this problem, by adding a ton of registers to the ISA.

Finally, optimal register usage is a function of the compiler and the code being compiled. Code with heavy data dependencies will use few registers, constantly loading new data into the same registers for calculations in loops and whatnot. There is a ton of literature on

Again, you cannot make a blanket statement like "shorter pipelines are better."

Jeff
One of the downsides of MISC is that you can't be backwards compatible. You need more space to define new registers, and you need to define different latencies. Basically, the instruction word looks like two register fields, one read and one write. You could add a bit to the input register field for immediate numbers. If you use 64 registers, each field takes 6 bits, so a single move uses at least 12 bits + 1 bit. But you could be VLIW and do n moves at a time: a 128-bit instruction word could hold 10 MOVE instructions at 12 bits each (9 if each one carries the extra immediate bit). GPUs don't need to be backwards compatible like CPUs. MISC instructions could be used to get the best performance by hiding latency.
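The bit counting above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `moves_per_word` and the idea of a single shared immediate flag per move are my assumptions, not part of any agreed encoding.

```python
def moves_per_word(num_registers: int, word_bits: int, immediate_flag: bool) -> int:
    """How many MOVE slots fit in one instruction word (hypothetical encoding)."""
    # Bits needed to index one register, e.g. 6 bits for 64 registers.
    reg_bits = (num_registers - 1).bit_length()
    # A move names one source and one destination, plus an optional immediate flag.
    move_bits = 2 * reg_bits + (1 if immediate_flag else 0)
    return word_bits // move_bits


if __name__ == "__main__":
    print(moves_per_word(64, 128, immediate_flag=False))  # 12-bit moves
    print(moves_per_word(64, 128, immediate_flag=True))   # 13-bit moves
```

With 64 registers and a 128-bit word, the 12-bit form leaves 8 bits spare, while the 13-bit form loses one slot.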
You are right, I hadn't thought of that. If there were room for extensibility in the opcode format, and you didn't improve upon the functional units themselves, it could be made backward compatible, but as you say, it's not necessary. Lourens
If you keep room for a few more addresses, you waste code memory, and you solve only half of the problem. If you make a VLIW design, you could go from 4 to 8 instructions in the same word, but if you really only do 4 instructions, you add some constraints that make the design less
I knew it wasn't our original idea, but I didn't remember the name of it. Basically, current GPUs are nothing more than smart vector processors. I'm thinking that a design like the one I mentioned would allow the processor to be used as a GPU or a CPU. Since we are dealing with embedded designs (like you mentioned) we don't have to worry about threading issues. The thing I love is that this design is extremely flexible. It would be very simple to allow a developer to modify it by adding or removing arithmetic units. Here is another weird plus. It seems to me that a compiler could (if it knew enough about the underlying hardware) be created to translate from one revision of the processor to another. So, let's say that a driver for a revision of the chip that has two mov units would have optimized code. In the other example I gave, the code that was 15 instructions in length could be split into a series of blocks, and these blocks could then be resolved into the code that was 8 instructions in length. It's this idea of modularization that I find so attractive. CPUs are too linear, and GPUs are not linear enough. If we could find the sweet spot in the middle we may be on to something. Timothy
That is the basic idea behind MISC. Everything is done via special-purpose registers. But as I said before, all you're really doing is encoding the opcode into the register index. And think about this for a moment: In a MISC design, if you want to add, you specify the source operands (probably general purpose registers) to copy into "input registers" for your adder. That's two moves (which you can do in one instruction). Later, you can pop out the result and move it back to a GPR. (Another move.) So here's your code:

    mov rA0 <- r1, rA1 <- r2
    ...
    mov r3 <- rA2

Let's say you have a register space of 256 registers, so each instruction takes 16 bits. The two together require 32 bits. Now, let's consider a RISC design. In this case, you don't need so many registers, just the GPRs:

    add r3 <- r1,r2

If the add opcode is 4 bits and the three operands are each 4 bits, then you need 16 bits to encode this. The point to take is that you need twice as many bits to encode the same "instruction". With MISC, all you've done is move the upper nybble of the add ports (0xA) from the three operands into the one opcode. It may very well be worth it to use the extra bits (something I've hinted at earlier), but keep in mind where your redundancies are and make sure they're a net gain. Oh, one other thing: Compilers have a hard time with special-purpose registers.
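To make the "upper nybble" point concrete, here is a small Python sketch. The field layout (opcode in the top 4 bits, then destination and two sources) and the names `encode_risc_add` and `misc_adder_ports` are assumptions chosen for illustration; the bit totals simply take the figures in the message at face value.

```python
ADD_OPCODE = 0xA  # assumed 4-bit opcode, picked to match the 0xA port prefix above


def encode_risc_add(rd: int, ra: int, rb: int) -> int:
    """Pack 'add rd <- ra, rb' into 16 bits: opcode, then three 4-bit GPR indices."""
    for reg in (rd, ra, rb):
        if not 0 <= reg < 16:
            raise ValueError("GPR index must fit in 4 bits")
    return (ADD_OPCODE << 12) | (rd << 8) | (ra << 4) | rb


def misc_adder_ports() -> list[int]:
    """8-bit register indices of the adder ports rA0, rA1, rA2 in the MISC version."""
    return [(ADD_OPCODE << 4) | port for port in range(3)]


RISC_BITS = 16       # one 16-bit add instruction
MISC_BITS = 2 * 16   # two 16-bit move instructions, as counted in the message

if __name__ == "__main__":
    print(hex(encode_risc_add(3, 1, 2)))
    print([hex(p) for p in misc_adder_ports()])
    print(MISC_BITS // RISC_BITS)
```

Every MISC port index repeats the same 0xA prefix that the RISC encoding states once, which is where the extra bits go.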
MISC is a deceptively good idea. Managing pipelined units and latency is a horrible task: if your FMAC is a pipelined unit, you have exactly one time slot in which to read the output! If I remember correctly, shader programs use only ~256 instructions, so you could design an instruction set that looks like µcode or VLIW code. Such a shader should be small and fast. If you need more power, put in 2 of them. To me, the instruction set looks like this: one part for JUMP/LOOP management; one part for computation (very complete: ALU, dot product, MIN/MAX, CLAMP..., all in 1 cycle), where you could put 4 register reads and 2 writes, since such ILP can easily be found in typical code; and a third part that manages load & store with complex addressing modes (for 1D, 2D and 3D data, ...). You could also add some bits for predicate calculation, which introduces a cheap way to do if-clauses. In the end, you will have a long instruction word (>100 bits) and a few register sets (one with at least 32*4 32-bit floating-point words, one for address calculation, and maybe one for managing data read from memory, since a write port is very costly). One instruction could perform a load or a store, a calculation and a jump. It's important to keep the calculation unit always doing useful work; normally, it's the largest unit of the shader design.
So? The compiler should be able to schedule this with no problem. I could write the compiler to do this myself. If you can do 4 movs per clock cycle, it should be no issue whatsoever. If you find my comments in error, give me an example (and code to go with it) where the MISC idea fails. Okay, this has got me thinking now. What we need is a simple program for testing how different processors function: one that can describe (in software) the characteristics of the processor (latency, etc.) and run a simulation based on these characteristics. Give me a week and I'll see what I can hash out... -- I think computer viruses should count as life. I think it says something about human nature that the only form of life we have created so far is purely destructive. We've created life in our own image. (Stephen Hawking)
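As a rough idea of what such a test program could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Unit` class, the three move kinds (`in1`, `in2`, `out`), one move per cycle, an operation that starts when the second input is written, and a hardware stall when a result is read before it is ready.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Unit:
    """A pipelined functional unit exposed as registers (hypothetical model)."""
    latency: int                                   # cycles from issue to result
    op: Callable[[float, float], float]            # what the unit computes
    in1: float = 0.0                               # first input register
    ready: deque = field(default_factory=deque)    # queue of (ready_cycle, value)


def simulate(program, units):
    """Run a list of (kind, unit_name, arg) moves; return (cycles, results)."""
    cycle = 0
    results = {}
    for kind, name, arg in program:
        unit = units[name]
        if kind == "in1":
            unit.in1 = arg
        elif kind == "in2":
            # Writing the second input starts the operation.
            unit.ready.append((cycle + unit.latency, unit.op(unit.in1, arg)))
        elif kind == "out":
            ready_cycle, value = unit.ready.popleft()
            cycle = max(cycle, ready_cycle)        # stall until the result exists
            results[arg] = value
        cycle += 1                                 # every move costs one cycle
    return cycle, results


if __name__ == "__main__":
    units = {"mul": Unit(latency=4, op=lambda a, b: a * b)}
    program = [("in1", "mul", 3), ("in2", "mul", 4),
               ("in1", "mul", 5), ("in2", "mul", 6),
               ("out", "mul", "a"), ("out", "mul", "b")]
    print(simulate(program, units))
```

Changing `latency` or reordering the moves shows how much of the unit's latency a given schedule actually hides.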
MISC design was discussed a lot in the f-cpu project. It was abandoned because it's impossible to keep backward compatibility. That's not a problem here. From my point of view, the problem is how to manage the output of the units. RISC-like instructions let you name an output register, so you can schedule instructions. In MISC, most reads of the output will be at a fixed place:

    Mul.a <- Reg1 || Mul.b <- Reg2
    Reg3 <- Mul.res || Mul.a <- Reg4 || Mul.b <- Reg5
    Reg6 <- Mul.res

It's hard to do better. From a hardware point of view, the "big switch" will be slow. But the killer is long-latency instructions. Some could be predicted. Imagine how to schedule the read of a 32-cycle divide. Even worse: how do you manage the variable latency of a memory read? To me, a shader instruction looks like a µinstruction: PC control, calculation and load & store in the same instruction word.
Two ways: (1) Stall on attempt to read from empty read response queue. (2) Branch on status flag
To do an active wait?
Why would you implement it this way? If you have an outstanding read,

Yes. You probably wouldn't want to write an algorithm that did something different depending on whether or not data was available, so you'd end up just spinning on the branch instruction. When data is streaming, that's a wasted instruction for every read (because you always have to check). It's better to just schedule the read as far after the request as you can and just stall when it's not available yet.

Also if we're not careful, we'll think too single-threaded here. Every fragment shader needs to be split into two pipelined threads. One thread's output is the input to the other one, and there's a FIFO in between them. Either thread can do anything, but generally, when there are memory reads to be done (texture stuff, etc.), one thread's job is to make requests, while the other is the consumer of that requested data. All other work should just get split intelligently.
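The cost difference between the two approaches can be sketched with a toy cycle count in Python. The model is an assumption for illustration only: `arrivals` holds the absolute cycle at which each read response lands in the queue, `work_between` is the number of independent instructions scheduled before each read, and every instruction (including a status-check branch) costs one cycle.

```python
def cycles_with_stall(arrivals, work_between):
    """Reads simply block in hardware until the response queue is non-empty."""
    cycle = 0
    for arrival in arrivals:
        cycle += work_between          # independent work scheduled after the request
        cycle = max(cycle, arrival)    # stall only if the data is not there yet
        cycle += 1                     # the read itself
    return cycle


def cycles_with_polling(arrivals, work_between):
    """Software spins on a status-flag branch before every read."""
    cycle = 0
    for arrival in arrivals:
        cycle += work_between
        while True:
            data_ready = cycle >= arrival
            cycle += 1                 # the branch always costs an instruction
            if data_ready:
                break
        cycle += 1                     # the read itself
    return cycle


if __name__ == "__main__":
    streaming = [0, 0, 0]              # data is always waiting when we get there
    print(cycles_with_stall(streaming, 2), cycles_with_polling(streaming, 2))
```

In the streaming case the polling version pays exactly one extra instruction per read, which is the wasted check described above.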
The other question I have is: how much variance is there going to be in a read? Sure, on a CPU a lot of it depends on whether the data is in the L1 or L2 cache or in memory. But we are only going to be going to memory (please tell me no one here was actually thinking of putting a cache on a GPU), so what will the variance be?
Lots. A row miss takes one amount of time; refreshes cause random delays. The worst are video reads which have the highest priority and can tie up the memory for extended periods.
The video refresh reads mean that we almost have to have a cache. However, this doesn't have to be a general purpose cache; it can even be a cache which requires explicit instructions to manage. -- JRT
You cannot treat the output register of the load/store unit as an ordinary read register, but rather as a stack that is emptied by each read. That's different for

It looks so much easier to just write the result, when it's available, into a destination register, with a scoreboard that blocks any read of that precise register... So you could easily hide the latency without any software
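A scoreboard of the kind described here can be modelled in a few lines of Python. This is a sketch under simplifying assumptions: the class name `Scoreboard`, one instruction per cycle, and a load latency passed in by the caller (real memory latency would vary) are all invented for the example.

```python
class Scoreboard:
    """Register file where a read blocks until the pending load has landed."""

    def __init__(self, num_regs: int):
        self.values = [0] * num_regs
        self.ready_at = [0] * num_regs   # cycle at which each register becomes valid
        self.cycle = 0

    def issue_load(self, dest: int, value: int, latency: int) -> None:
        """Start a load into dest; the value is usable `latency` cycles later."""
        self.values[dest] = value
        self.ready_at[dest] = self.cycle + latency
        self.cycle += 1

    def execute(self, count: int = 1) -> None:
        """Run `count` instructions that do not touch any pending register."""
        self.cycle += count

    def read(self, reg: int) -> int:
        """Read a register, stalling only if that precise register is pending."""
        self.cycle = max(self.cycle, self.ready_at[reg])
        self.cycle += 1
        return self.values[reg]


if __name__ == "__main__":
    sb = Scoreboard(4)
    sb.issue_load(dest=1, value=42, latency=5)
    sb.execute(2)                 # independent work overlaps with the load
    print(sb.read(1), sb.cycle)
```

Reads of other registers are never delayed, so independent instructions hide part of the latency without the program having to check anything.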
That's kinda what I had in mind. I don't want too sophisticated of a scoreboard, because I'd like most scheduling static, but if we do it right, we can allow some things to complete out of order so as to reduce the impact of read latency. Also, if we do the FIFO thing I mentioned, it'll become a non-issue for many algorithms.
What if there is more than one adder, or multiplier? It's lower level than that, you don't just say which operation you want, but also where it should be executed. Lourens
You could do that too with a VLIW instruction word. You don't need out of order
What if all the functional units had the same latency, that is, they all have a FIFO on their output that increases their latency to some common maximum M. Scheduling would become trivial, you just generate the instructions in order, and then interleave M copies of the code. There are always M identical instructions in a row so you only need to load a new instruction every M clock cycles. ILP could be achieved through having multiple MISC cores, if the compiler makes sure that they don't access the same functional unit at the same time. Lourens
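The interleaving idea can be sketched in Python as follows. The helper names `interleave` and `has_hazard` are made up for this example, and the hazard check assumes the worst case where every instruction depends on the previous instruction of the same copy.

```python
def interleave(program, m):
    """Emit each instruction m times in a row, once per copy of the code."""
    return [(instr, copy) for instr in program for copy in range(m)]


def has_hazard(schedule, latency):
    """True if some copy issues an instruction before its previous result is ready.

    One schedule entry issues per cycle; a result is assumed usable `latency`
    cycles after issue.
    """
    last_issue = {}
    for cycle, (_instr, copy) in enumerate(schedule):
        if copy in last_issue and cycle - last_issue[copy] < latency:
            return True
        last_issue[copy] = cycle
    return False


if __name__ == "__main__":
    program = ["load", "mul", "add"]
    print(interleave(program, 2))
    print(has_hazard(interleave(program, 4), latency=4))
    print(has_hazard(interleave(program, 1), latency=4))
```

With M copies and a common latency of M, consecutive instructions of one copy are always exactly M cycles apart, so no stall logic is needed in this straight-line case.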
Latency is usually a killer. Pipelining is used to keep the speed high. Imagine an FMUL that issues every cycle with a 3-clock latency beside a 32-cycle divider. Then add the problems with loops and if conditions.
Ah, I hadn't thought of branching. Never mind then... Lourens
On 4/19/06, Lourens Veen wrote:
> [snip]
> Ah, I hadn't thought of branching. Never mind then...

What's the difference? PC is just another location in memory.

Tom
One other thing I'm thinking about: [a] We're going to be wanting to process some number of pixels in parallel. [b] We're going to have trouble scheduling instructions to make best use of functional units. So, let's take advantage of that. Let's assume we can have data dependencies that make different pixels require different instruction flow. We can pull a Niagara and feed instructions for four threads through a smaller number of execution units. So, our add/mul units are capable of both vector and scalar computations, so we have two such units (or two of each type; whatever) and can schedule two vector computations per clock or some arbitrary assortment of scalars on one or both. On empirical analysis of resource contention, we may add some functional units later, but the idea is to remain reasonably small. Just like with Niagara, we have lots of opportunities to avoid control and data hazards, so we don't need to account for them. (We may want to have some locks in place, but we can afford to just stall.) For each pixel, even the effective memory read latency is smaller.
Don't get too carried away with Niagara comparisons. A GPU has to execute exactly the same shader program for every pixel in a given triangle/primitive. There is a small amount of data that varies for each primitive: the coords/normal/ tex coords at vertex level; color/texcoords for fragments; which is about a dozen 4x32 bit registers at most. There's a K or two of OpenGL state that the shader can read but not write to as well, plus a K (?) or so of app state with the same restriction. Now that shaders have branches it's not guaranteed that they all execute in lockstep, but there is a very high probability that all the execution units will need to read from the same memory location at the same time. Brute force replication might work better than dynamic scheduling. -- Hugh Fisher DCS, ANU
You have a point. Aside from a small possibility for variation in instruction sequence, if one pixel's shader needs the vector multiplier, then they all do, at the same time. But what I was thinking was that if they all needed the vmul unit on one cycle but not on the next, then two of the threads' instructions could be scheduled on one cycle and two on the next. What are the chances that we'll get a long stream of vmuls all in a row with no breaks? In that case, it would definitely be better to have four completely independent functional units. There are definitely some things we would want to do about multiple threads accessing the same (or nearby) memory locations.
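The trade-off in question can be explored with a toy Python model of four threads sharing a pool of vmul units. The function `run_threads`, the fixed thread priority, and the assumption that every non-vmul operation always finds a free unit are simplifications invented for this sketch.

```python
def run_threads(streams, vmul_units):
    """Count cycles for all threads to finish when vmul units are shared.

    Each stream is a list of 'vmul' or 'other'. A thread issues at most one
    operation per cycle; a thread that finds no free vmul unit waits a cycle.
    """
    pcs = [0] * len(streams)
    cycles = 0
    while any(pc < len(stream) for pc, stream in zip(pcs, streams)):
        free = vmul_units
        for t, stream in enumerate(streams):   # lower thread index has priority
            if pcs[t] == len(stream):
                continue                        # this thread is done
            if stream[pcs[t]] == "vmul":
                if free == 0:
                    continue                    # no unit left: stall this thread
                free -= 1
            pcs[t] += 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    mixed = [["vmul", "other"]] * 4     # vmul needed on one cycle, not the next
    dense = [["vmul"] * 4] * 4          # long run of vmuls in every thread
    print(run_threads(mixed, 2), run_threads(mixed, 4))
    print(run_threads(dense, 2), run_threads(dense, 4))
```

In this model the mixed streams lose only one cycle with two shared units, while the dense streams take twice as long as with four independent units.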
Four vmuls (actually dot products) in a row is very common for matrix multiplies. The sample shaders I've got, from the OpenGL Shading Language book and GPU Gems, are all very math intensive. I doubt you're going to be able to share ALUs between threads. On the other hand, condition/branch logic probably could be. But on the gripping hand any statistics from generation 1 and 2 shaders are going to be biased in favour of math ops because that was before branches became widespread. So it is possible that shader code will have an instruction mix more like generic C/C++ over the next few years. I'd

You'll probably get some sequential access patterns across threads rather than within them. If a horizontal span of fragments is being done in parallel by 2/4/N threads, it's quite likely (especially for a 2D GUI) that thread #0 will need texel P+0, thread #1 P+1, ...

Sheesh, I'm glad I'm a software person and don't have to worry about designing and building this stuff :-) -- Hugh Fisher DCS, ANU
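For readers less familiar with shader code, the "four dot products in a row" pattern is just a 4x4 matrix times a 4-component vector. A minimal Python sketch, with the helper names `dp4` and `transform` invented here for illustration:

```python
def dp4(a, b):
    """4-component dot product, the scalar that a DP4-style instruction produces."""
    return sum(x * y for x, y in zip(a, b))


def transform(matrix_rows, vector):
    """Matrix * vector as four back-to-back dot products, one per matrix row."""
    return [dp4(row, vector) for row in matrix_rows]


if __name__ == "__main__":
    # A translation by (2, 3, 4) applied to the homogeneous point (1, 1, 1, 1).
    translate = [
        [1, 0, 0, 2],
        [0, 1, 0, 3],
        [0, 0, 1, 4],
        [0, 0, 0, 1],
    ]
    print(transform(translate, [1, 1, 1, 1]))
```

Each of the four results depends only on the inputs, so the four dot products can issue on consecutive cycles with no breaks, which is the case where shared multipliers hurt most.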
For matrix multiplies I might suggest APL, but it's Greek to me. :-)
A "normal" CPU is designed to handle interrupts and context switches; that's a
huge constraint on a lot of possible optimisations. A GPU looks more like a DSP
Instruction  Inputs  Output  Description
ABS          v       v       absolute value
ADD          v,v     v       add
ARL          s       a       address register load
DP3          v,v     ssss    3-component dot product
DP4          v,v     ssss    4-component dot product
DPH          v,v     ssss    homogeneous dot product
DST          v,v     v       distance vector
EX2          s       ssss    exponential base 2
EXP          s       v       exponential base 2 (approximate)
FLR          v       v       floor
FRC          v       v       fraction
LG2          s       ssss    logarithm base 2
LIT          v       v       compute light coefficients
LOG          s       v       logarithm base 2 (approximate)
MAD          v,v,v   v       multiply and add
MAX          v,v     v       maximum
MIN          v,v     v       minimum
MOV          v       v       move
MUL          v,v     v       multiply
POW          s,s     ssss    exponentiate
RCP          s       ssss    reciprocal
RSQ          s       ssss    reciprocal square root
SGE          v,v     v       set on greater than or equal
SLT          v,v     v       set on less than
SUB          v,v     v       subtract
SWZ          v       v       extended swizzle
XPD          v,v     v       cross product
That's mainly FP multiplication. So the design must be able to use the
FMUL on every cycle. Or we could choose a smaller, 2-cycle FMUL and use
more cores (the compiled code shows a lot of MOV instructions in the
meantime).
... block. Of those 5 units, one is dedicated to memory management (load/store, register and data movement); the others form a single execution unit for vector operations, or 4 distinct units for scalar operations. Each unit could do only a subset of the scalar operations; they don't all need to be able to do the same ones, which also helps reduce the overall size of each execution block. With such an architecture, and since the code to run is rather small, it will probably be possible to optimise the order of operations so that most things run in parallel, since all the units work at the same time. We just need to define a rather large on-chip instruction memory; it doesn't need to be deep, since a first-generation shader program could not exceed 255 instructions for a basic program. So one instruction line feeds 5 operations at a time. It looks a little bit like a DSP architecture. After that, you could replicate the meta block many times, depending on the performance you want. But with more than one block you will need some kind of dispatcher (in hardware, or in software in the driver) to divide the work. Since it is a small processor, if you don't have too many dependencies between different instructions and you know the number of clocks each takes to execute, you could have multiple instructions executing at the same time by pipelining the operations.
That's basically an LIW or VLIW instruction set. If you put many operations in parallel, you also need a lot of register reads/writes per cycle. Reading and writing a register bank is quite a slow operation (the LEON 3 processor, a 7-stage pipeline CPU, uses a complete cycle to do it). If you add more register ports, you slow down the access time of the bank, so the speed-up must be higher than this loss. But to achieve high throughput you need to fill the instruction slots well. That's hard (see Intel's problems with the Itanium compiler). In my previous proposal, I use the LIW trick for managing the program counter, and use predicates which read another register bank. You could also use multiple register banks to avoid the slowdown, but then you will have problems exchanging data between them. This could add a few more MOV instructions. If you want more speed on a task for a given silicon technology, you could use VLIW techniques, SIMD, a scoreboard, out-of-order execution, etc. All of this is used in today's big CPUs. But you could also go multi-CPU or multi-core. In traditional computing, the killer is communication between cores. That's why a 1000-CPU computer could be slower than a machine with 6 vector CPUs (from Cray or NEC): in terms of raw power the 1000-CPU machine is faster, but in real-world examples it is slower. In shaders, there is no communication. So multicore is, from my point of view, the better way to do it. Nicolas Boulay
So you can't pipeline anything, or you will have a few cycles of latency for each
