Re: [Open-graphics] Re: Looking towards the future: Graphics technology

From: Timothy Baldridge
Date: Thursday, April 13, 2006 - 10:56 am

On the discussion about GPUs for OGP:

Here is an off the wall idea a friend and I were talking about a while
back. I think that it would be of great use in a GPU. The idea is to
create a "modularized" RISC processor. Here is a basic rundown:

The CPU accepts 3 instructions:
load
store
mov

All operations are actually modules that expose one or more registers.
An add module would look like this:

addin (in)
addins (in)
addout (out)

Every clock cycle the contents of addin and addins are added and the
result is stored into addout. Now this doesn't sound all that good,
except when you realize that in current processors the pipeline must
sit idle while a result is being computed. When you get to
multiplication this can be 3-4 cycles! So let's say we need to
perform four multiplications. On a normal processor it would look like this:

command               (cycles)

mov 1, reg1           (1)
mov 2, reg2           (1)
mul reg1, reg2        (4)
mov reg3, result1     (1)

mov 1, reg1           (1)
mov 2, reg2           (1)
mul reg1, reg2        (4)
mov reg3, result2     (1)

mov 1, reg1           (1)
mov 2, reg2           (1)
mul reg1, reg2        (4)
mov reg3, result3     (1)

mov 1, reg1           (1)
mov 2, reg2           (1)
mul reg1, reg2        (4)
mov reg3, result4     (1)

Total clock cycles: 28

Modularized RISC method:

command            (cycles all 1)

mov 1, mulin1
mov 2, mulin2
mov 1, mulin1
mov 2, mulin2
mov 1, mulin1
#At this point the result from line 2 is ready so we move it out
mov mulout, result1
mov 2, mulin2
#And now we can move the result from line 4
mov mulout, result2
mov 1, mulin1
mov 2, mulin2
#Result from line 7
mov mulout, result3
mov 0, sink #Wait a cycle
mov 0, sink #Wait a cycle
mov 0, sink #Wait a cycle
mov mulout, result4

Total clock cycles: 15
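
The two cycle counts above can be checked with a small simulation. This is only a sketch under stated assumptions: the function names are hypothetical, the multiplier is assumed to start when mulin2 is written, and products are assumed to queue up in order with a fixed 4-cycle latency. The operand values are made up for illustration.

```python
from collections import deque

MUL_LATENCY = 4  # assumed multiply latency in cycles, taken from the post


def conventional_cycles(n_mults, latency=MUL_LATENCY):
    """Cycles when every multiply blocks the pipeline:
    two operand moves, the multiply itself, one result move."""
    return n_mults * (1 + 1 + latency + 1)


def run_move_machine(program, latency=MUL_LATENCY):
    """Execute one (src, dst) move per cycle on a hypothetical move-only core.

    Assumption: writing mulin2 starts a multiply with the current mulin1
    value, and the product is readable from mulout `latency` cycles later.
    """
    mulin1 = 0
    in_flight = deque()  # (cycle at which the product is readable, product)
    results = {}
    for cycle, (src, dst) in enumerate(program, start=1):
        if src == "mulout":
            ready_at, product = in_flight.popleft()
            if ready_at > cycle:
                raise RuntimeError(
                    f"cycle {cycle}: product not ready until cycle {ready_at}")
            results[dst] = product
        elif dst == "mulin1":
            mulin1 = src
        elif dst == "mulin2":
            in_flight.append((cycle + latency, mulin1 * src))
        # any other destination (e.g. "sink") just burns a cycle
    return len(program), results


# Same instruction order as the listing above, with made-up operand values.
program = [
    (1, "mulin1"), (2, "mulin2"),
    (3, "mulin1"), (4, "mulin2"),
    (5, "mulin1"),
    ("mulout", "result1"),
    (6, "mulin2"),
    ("mulout", "result2"),
    (7, "mulin1"), (8, "mulin2"),
    ("mulout", "result3"),
    (0, "sink"), (0, "sink"), (0, "sink"),
    ("mulout", "result4"),
]

cycles, results = run_move_machine(program)
print(conventional_cycles(4), cycles, results)
```

Under these assumptions the blocking version needs 28 cycles and the move-only schedule needs 15, matching the totals in the post.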

Now let's say that we could execute two move instructions at a time.
Then the code would look like this

mov 1, mulin1 : mov 2, mulin2
mov 1, mulin1 : mov 2, mulin2
mov 1, ...
From: Timothy Miller
Date: Thursday, April 13, 2006 - 11:55 am

This reminds me of a design someone was calling "MISC".  (Minimal
Instruction Set Computer)  Basically, the only instructions were
"moves".  If you want to perform some computation, you move the source
operands into special purpose registers and then move the result out
of its special purpose register.  Results queue up, so you can pop
them out any time you like.  I didn't go anywhere with this idea
because it makes context switches hell, but for an embedded design
with no interrupts, it would work fine.

Before we get into implementation details, however, we really need to
figure out what our requirements are.  Based on what little I know
about shaders, I don't know enough to decide between what you
describe, a stack architecture, a 3-operand load/store machine, a
2-operand load/store, etc.
_______________________________________________
Open-graphics mailing list
Open-graphics@duskglow.com
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
From: Jeff Garzik
Date: Thursday, April 13, 2006 - 12:58 pm

If you want a shader-centric GPU, you want CISC with (a) special-purpose 
floating point and (b) vectorized instructions...

	Jeff


From: Timothy Baldridge
Date: Thursday, April 13, 2006 - 1:05 pm

To this I ask why? Aren't GPUs vectorized RISCs?

But yes, FPUs are a must. In fact, modern GPUs don't even use
integers. They simply throw away the data after the decimal.

Timothy

From: Jeff Garzik
Date: Friday, April 14, 2006 - 1:13 pm

The ones I'm familiar with are, yes, but I think you can do better with 
CISC.

Now that OpenGL is largely GLSL ("C for graphics"), there is a -lot- of 
room for optimization and improvement there.

	Jeff



From: Timothy Baldridge
Date: Friday, April 14, 2006 - 1:23 pm

So what benefits are you going to have in using CISC over RISC?
Smaller code sizes I guess, at the expense of complexity.

I'm thinking for a GPU wider is going to be better than deeper. With
FPGAs we are going to be limited by the clock. Why make it worse by
going to CISC that relies on higher clock speeds? Why not make a simple
core so that we can pack 16-24 of them in one FPGA instead of one that
will take up half the chip?

Timothy

From: Timothy Miller
Date: Friday, April 14, 2006 - 4:06 pm

Actually, deeper is always better unless you have control and
dependency hazards.  We can easily enough avoid the dependency
hazards, so all we have to worry about are loops in shader code.  When
the shader loop is done with a pixel it can just toss it down the
pipeline.  We can unload a lot of computation from the microcode stage
by doing it later in the pipeline.  It doesn't matter how many stages
don't do any useful work.

The only time a GPU's pipeline becomes a real problem is when you have
to read from something you've just written to, like if you've drawn to
a texture and then want to use it as a texture.  Then you have to
flush the pipeline.  You also have to worry about bitblts to and from
the same surface (we can use some intelligence to avoid it sometimes).
From: Jeff Garzik
Date: Friday, April 14, 2006 - 4:17 pm

In general... correct.  But you can also do stuff like out-of-order 
execution to work around some dependencies, or add hardware 
multi-threading to enable multiple "simultaneous" rendering jobs.  If 
the pipeline for one thread stalls, you can fill the pipeline with work 

I would certainly like to see such a beast, it would be very 
interesting.  But ultimately I think deeper _and_ wider is the way to go ;)

	Jeff



From: Timothy Baldridge
Date: Friday, April 14, 2006 - 6:00 pm

But what is the use of a deep CPU? It's all embarrassingly parallel. I
must be missing something here, because it always seemed to me that
GPUs were simply high-end vector processors.

On the other hand, making a GPU that could do hardware raytracing
(e.g. http://www.artvps.com/page/15/pure.htm ), that would be killer.
And I agree, for something like that CISC would rule. However, it
still seems to me that simpler is better in the case of a GPU.

On a side note, SGI years ago had a graphics processor (was it the SE
series?) that understood native OpenGL. From what I understand, the
hardware itself was OpenGL. All the driver did was re-package the
data a bit.

Timothy



--
I think computer viruses should count as life. I think it says
something about human nature that the only form of life we have
created so far is purely destructive. We've created life in our own
image. (Stephen Hawking)
From: Michele Carla`
Date: Monday, April 17, 2006 - 3:42 am

No, I think you are talking about the VPro series (also known as Odyssey); the SE
series is a little harder (take a look at the Linux fb driver for Odyssey:
http://www.linux-mips.org/~skylark/ )

-- 
Pluralitas non est ponenda sine neccesitate
Frustra fit per plura quod potest fieri per pauciora
Entia non sunt multiplicanda praeter necessitatem

                                   Occam's Razor

MiChele Carla` aKa Goldfinger <goldfinger@member.fsf.org>
From: Lourens Veen
Date: Saturday, April 15, 2006 - 8:27 am

It depends on how varied your task is. For graphics, I think quite a
small number of CISC-style operations would suffice. For example, you
are going to do loads and loads of dot products (matrix multiplies in
vertex shaders, normal calculations for per-pixel-lighting, bump
mapping, and so on). Doing all the adding and multiplying one at a time
means more instructions and more pressure on the scheduling hardware.

Can't we combine this with Timothy's MISC idea? Have a "CPU" with
load/move/store, and a bunch of functional units that can each perform
a complex (think Altivec/3DNow!/SSE3 or even more complex than that,
like a dot product) instruction. Newer processors can simply have more
functional units, and could be backwards compatible with their
predecessors.
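
To make the instruction-count argument concrete, here is a minimal sketch that counts the scalar operations in a 4-element dot product and in a 4x4 matrix-vector transform. The function names are hypothetical; a dedicated dot-product functional unit, as proposed above, would be assumed to replace each group of seven scalar operations with a single instruction.

```python
def dot4_scalar(a, b):
    """4-element dot product issued as individual scalar operations.

    Returns (result, number of scalar multiply/add operations issued).
    """
    acc = a[0] * b[0]
    ops = 1  # the first multiply
    for x, y in zip(a[1:], b[1:]):
        acc += x * y
        ops += 2  # one multiply plus one add
    return acc, ops


def transform(matrix, vec):
    """4x4 matrix times vec4, expressed as four dot products."""
    rows = [dot4_scalar(row, vec) for row in matrix]
    return [value for value, _ in rows], sum(ops for _, ops in rows)


identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(dot4_scalar([1, 2, 3, 4], [5, 6, 7, 8]))
print(transform(identity, [1, 2, 3, 4]))
```

One dot product costs 7 scalar operations and one transform costs 28, against 1 and 4 instructions respectively if a dot-product unit exists.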

Lourens

From: Timothy Miller
Date: Saturday, April 15, 2006 - 9:21 am

The idea I keep thinking about is to have a pipeline of general
functional units.  As a fragment passes down the pipeline, it's like
executing instructions.  If the number of instructions to be executed
exceeds the pipeline length, the fragment gets forwarded back up to
the beginning.  Loops would get unrolled to the pipeline length;
longer ones would work via the forwarding mechanism.  Any sequence of
instructions shorter than the pipeline length would get padded with
NOOPs.
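
A rough sketch of the bookkeeping this implies, assuming each pipeline stage executes exactly one instruction and a fragment that needs more instructions than there are stages is sent back to the top for another full pass. The function name and the padding of the final pass are assumptions, not part of any OGP design.

```python
import math


def schedule_fragment(n_instructions, pipeline_length):
    """Return (passes, noop_slots) for one fragment.

    passes:     how many times the fragment travels down the pipeline
    noop_slots: stages that do no useful work for this fragment
    """
    if n_instructions <= 0 or pipeline_length <= 0:
        raise ValueError("both arguments must be positive")
    passes = math.ceil(n_instructions / pipeline_length)
    noop_slots = passes * pipeline_length - n_instructions
    return passes, noop_slots


for n in (5, 8, 20):
    print(n, schedule_fragment(n, pipeline_length=8))
```

With an 8-stage pipeline, a 5-instruction shader takes one pass with 3 NOOPs, while a 20-instruction shader takes three passes with 4 NOOPs.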

The problem is that any more than a few general purpose registers
would make every pipeline stage a massive amount of logic, limiting
the number of stages.  But the idea is to get great throughput at a
low clock rate.  We cannot design something to run at 500MHz.
From: Lourens Veen
Date: Sunday, April 16, 2006 - 11:21 pm

So basically it would be a pipeline of processors. But that is possibly
less efficient than having a single MISC "scheduler" in the middle, and
a lot of functional units around it. Each processor in your pipeline
only ever does one instruction, and all the hardware for the other
functions it can perform is idle. In contrast, separate functional
units could all work at the same time, if they could get data quickly
enough. Perhaps there should be multiple MISC cores, they're likely to

You can still get high throughput with pipelined functional units. It
doesn't matter much if it takes ten cycles to multiply two numbers (or
vectors of numbers), as long as you can provide two new numbers to
multiply every cycle, and read out the result of the calculation that
started ten cycles ago. Throughput will still be ok (or at least as
good as it gets at the given clock rate).
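
The latency-versus-throughput point can be put in numbers with a short sketch. It assumes an idealized unit that accepts one new operand pair per cycle and delivers each result a fixed number of cycles after issue; the function names are hypothetical.

```python
def pipelined_cycles(n_ops, latency):
    """Issue one operation per cycle; the last result is read
    `latency` cycles after the last issue."""
    return n_ops + latency if n_ops else 0


def blocking_cycles(n_ops, latency):
    """Each operation must finish before the next one starts."""
    return n_ops * latency


n_ops, latency = 1000, 10
total = pipelined_cycles(n_ops, latency)
print(total, blocking_cycles(n_ops, latency))
print(f"results per cycle: {n_ops / total:.3f}")
```

For 1000 multiplies at a 10-cycle latency, the pipelined unit needs 1010 cycles against 10000 for a blocking one, so throughput stays close to one result per cycle.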

Lourens

From: Timothy Miller
Date: Monday, April 17, 2006 - 5:29 am

One of the things we're forgetting is that static scheduling is way
behind the curve, but dynamic scheduling requires lots of extra
hardware.  Unless we hand-code most of what we run on this or have
some massive peep-hole optimizer library, we're always going to get
sub-optimal code.

The only way to keep the computing units busy with a new fragment
every cycle is to avoid data dependency hazards.  We can only do that
if we can overlap the processing for different fragments (like
threads).  Then we have to keep track of multiple processor states.

Only slightly related, the statistics I have on branch delay slots say
that they're only fillable about 60% of the time and they're only
useful to the computation about 80% of the time when they're filled,
making delay slots only useful about 50% of the time.
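
The arithmetic behind that last figure, using only the two percentages quoted above (the variable names are illustrative):

```python
fillable = 0.60            # fraction of delay slots the compiler can fill
useful_when_filled = 0.80  # fraction of filled slots that help the computation
useful_overall = fillable * useful_when_filled
print(f"{useful_overall:.0%}")
```

The product is 48%, which the post rounds to about 50%.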
From: Timothy Baldridge
Date: Monday, April 17, 2006 - 7:25 am

I think one thing that we are missing here is the fact that we are
thinking about a GPU design, not a full blown CPU design. Has anyone
here ever written a shader before?

Here is my vote, keep it simple. If you go with CISC or anything with
a longer pipeline, you are going to have problems with data
dependency, long pipelines,

A MISC design is going to need two maybe three stages in the pipeline.
Fetch, and Execute, maybe decode, but maybe not. Data dependency is
not going to be an issue. It would be a blast programming a compiler
for this sort of GPU, you could optimize the shaders to death.

We have to stick with what is practical. And what will work well. Plus
we are limited by the following restrictions:

Low clock rate (200-300 MHz?)
Small transistor space

Whatever we make must fit within these two restrictions.

I do have a question, though: does the GPU in the current OGP design
have direct access to the memory, or does it contact the video memory
through a memory controller of sorts?

If we could somehow give the GPU direct access to video memory,
basically 64MB of registers, then we would have a design with
some powerful performance benefits. We could then design the MISC
modules to accept memory locations, so you could say, "multiply 0x0004
with 0x01004, placing the result in 0x02004, executing it 0x0010 times."
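
As an illustration of what such a memory-to-memory command would compute, here is a minimal software model. It assumes the addresses are word indices into a flat memory array and that the repeat count advances all three addresses together; `stream_multiply` is a hypothetical name, not an OGP command.

```python
def stream_multiply(memory, src_a, src_b, dst, count):
    """Multiply `count` consecutive words at src_a and src_b into dst."""
    for i in range(count):
        memory[dst + i] = memory[src_a + i] * memory[src_b + i]


# A small flat memory, large enough for the addresses quoted in the post.
memory = [0.0] * 0x3000
for i in range(0x0010):
    memory[0x0004 + i] = float(i)  # made-up first operand stream
    memory[0x1004 + i] = 2.0       # made-up second operand stream

stream_multiply(memory, 0x0004, 0x1004, 0x2004, 0x0010)
print(memory[0x2004:0x2014])
```

A single command of this kind would replace sixteen separate load/multiply/store sequences in the instruction stream.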

We find ourselves in a catch-22 here. I'm afraid that a RISC design is
not going to be fast enough. We'll be trying to push too many
instructions through the chip too fast. However, a CISC design is not
going to be much better. We cannot go with Out-of-Order execution
because of the complexity. But performance is going to suffer unless
we can execute more than one instruction at a time.

But what someone said here was right. We won't know how it works until
we start trying to program it. That's the wonderful thing about OGP
right? So when we get the first prototypes out, those of us who feel
like it can program our own GPU on ...
From: Timothy Miller
Date: Tuesday, April 18, 2006 - 7:05 am

Yes, since you are going to compile custom for each revision of OGA,
you can do all the scheduling in the compiler.  This will increase
code size (more NOOPs), but it simplifies the hardware by not

You cannot contact memory without some sort of memory controller. 
It's got to manage banks, row misses, refresh, etc.  There is no such
thing as "direct access to memory."  Our memory controller, however,

As it turns out, we have an odd case where the memory is at least as
fast as the logic we can afford to control it with.  Modern processors
use lots of registers because memory is a horrible bottleneck.  Our
problem here is that although memory is relatively fast, there's still
a significant latency between request and receipt of read data.  Plus
it's variable (row misses incur extra delays) and non-deterministic
(memory refreshes appear random to the compute engine).

The lesson I learned long ago with memory is to do as much batching as
possible.  Read requests get queued, as do the responses.  That means
the GPU has to be designed to absorb the latency.  OGA has a fifo in
the pipeline that sits between request and receipt stages just for
that purpose.  Writes are queued and forgotten.  For performance, it's
important that reads and writes all be allowed to complete out of
order so that you can perform all accesses for one row before
incurring the penalty to switch to another one.  A sort of "memory
barrier" is used to sync everything up when you need to read what you
just wrote (fortunately a rare event in fixed-function pipelines, at
least).
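
The benefit of completing accesses out of order can be sketched by counting how often the open row changes. This is a toy model with assumed names and an assumed row size; a real controller would also have to honor the memory barrier described above and would not sort an unbounded queue.

```python
def row_switches(requests, words_per_row):
    """Count row activations when requests are served in the given order."""
    switches = 0
    open_row = None
    for addr in requests:
        row = addr // words_per_row
        if row != open_row:
            switches += 1
            open_row = row
    return switches


def batch_by_row(requests, words_per_row):
    """Reorder so that all accesses to one row are served together.

    sorted() is stable, so requests within a row keep their original order.
    """
    return sorted(requests, key=lambda addr: addr // words_per_row)


WORDS_PER_ROW = 1024  # assumed row size, for illustration only
requests = [0, 1024, 1, 1025, 2, 1026]  # alternating between two rows

print(row_switches(requests, WORDS_PER_ROW))
print(row_switches(batch_by_row(requests, WORDS_PER_ROW), WORDS_PER_ROW))
```

Served in order, the six requests open a row six times; batched by row, they open a row only twice.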

The MISC approach is interesting, but all you're really doing is
encoding part of the opcode into the register number.  Six of one,
half a dozen of the other.  If it saves you something, do it.  But I
don't think it does.  (With TROZ, I encoded the rendering command into
the address, reducing the number of bus cycles necessary to initiate
drawing.)  Still, there are also plenty of things we could gain by
this approach, including the ...
From: James Richard Tyrer
Date: Wednesday, April 19, 2006 - 11:24 pm

Timothy Miller wrote:

Floating point isn't really RISC; you have a RISC processor with an FPU
since the FPU couldn't execute stuff in one clock.  And then methods
were developed to accelerate the FPU so that it could execute in one clock.

What we really need to decide is how to handle operations (such as
matrix math) which require a series of steps for the ALU to complete.
We can use CISC-like operations where the series of operations is in
microcode.  Or we can have RISC-like control of the ALU where we have
SIMD that can operate on up to four 32-bit floats in parallel and issue a
series of instructions to do what one CISC-like operation would do.
But you can combine these, like the transputer, and have the OP
instruction which calls a macrocode subroutine to do these things.

I can't see having more than one ALU per shader since it should have four
32-bit float hardware multipliers.  Since the standard shader operation
is multiplying three 4x4 matrices, it is the hardware multipliers that
are going to boost throughput.
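
A back-of-the-envelope check of the multiplier load, as a sketch: it counts the scalar multiplies needed to multiply three 4x4 matrices together (two matrix products) and divides by four multipliers, assuming they can all be kept busy every cycle. The function name is hypothetical.

```python
def matmul4(a, b):
    """Naive 4x4 matrix product; returns (result, scalar multiplies used)."""
    mults = 0
    out = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            for k in range(4):
                out[i][j] += a[i][k] * b[k][j]
                mults += 1
    return out, mults


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

ab, m1 = matmul4(identity, identity)
abc, m2 = matmul4(ab, identity)
total_mults = m1 + m2
issue_cycles = total_mults // 4  # four hardware multipliers, all busy

print(total_mults, issue_cycles)
```

That is 128 multiplies, or 32 multiplier issue cycles at best with four multipliers, before counting the additions.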

-- 
JRT
From: Erik Hofman
Date: Thursday, April 20, 2006 - 1:25 am

While I don't have much experience with this stuff, I do remember someone
at nekochan.net optimizing mplayer using the madd and nmsub
instructions on MIPS processors to speed up the matrix operations,
gaining a 300% improvement. The interesting read can be found here:

http://forums.nekochan.net/viewtopic.php?t=2976

Erik

-- 
http://www.ehtw.info (Dutch)	Future of Enschede Airport Twente
http://www.ehofman.com/fgfs	FlightGear Flight Simulator
From: nico
Date: Thursday, April 20, 2006 - 1:43 am

I don't think it's wise to use a SIMD ALU here. All scalar code will use the
SIMD FPU with 3 FMUL units idle. Because everything is strongly parallel,
I think it's better to stay scalar.

The 32-bit floating-point instruction is the most-used op. So the
performance will depend on the number of such units and the efficiency of
their use.

Besides that, a complex 128-bit data path is always harder to route, so it's
necessarily slower than a CPU core with a 32-bit internal data path.



From: Timothy Miller
Date: Thursday, April 20, 2006 - 5:00 am

If there are enough independent scalars that can be scheduled, you can
pack them and run them in parallel.
From: nico
Date: Thursday, April 20, 2006 - 6:20 am

So you need the logic to detect that a pack is possible, and you need the
switch that connects the different register banks and the FPU.

For what?

The only advantage over 4 cores depends on the size of the control logic: it
depends on whether it's negligible compared to the size of a 32-bit FPU. On the
other hand, a 32-bit switch could be big.

The goal of the shader is to maximise the use of the FPU, or more precisely the
FMUL instruction.

So you could create an instruction word which looks like this:
- Operation code 1 + addr read register 1 + addr read register 2 + addr
write register 1
(this is for MOV, LOAD & STORE and maybe for logical ops such as "<" ">", so it
could do FADD and FSUB)
- Operation code 2 + addr read register 4 + addr read register 5 + addr
write register 2
OPCODE2 could be small (FMUL, integer MUL, what else?)

There are 2 register banks. So you could use 4-read, 1-write memories for
the register banks. Each read could access both banks, but a write could only
access a dedicated bank. It depends on the technology you can afford
(full custom or not... 4-read, 2-write memories are maybe common
nowadays).

Then you add:
- Predicate
That's a very easy way to make a small "if" statement without breaking the
pipeline (like CMOVE in x86). A predicate is an access to a register that says
whether this register is null or not. If the register is null, the current
operations are cancelled.
- Predicate + Imm8
That's the way to handle loops, jumps and the repeat instruction of some
DSPs. If the register is not null, PC+IMM8 is performed with a delay slot
of 1; otherwise PC+1 is used.

The instruction word is big :) I have read somewhere that 32 registers of
vec4 are needed. So you must have at least 128 32-bit registers. If you
add some tricks such as R0 == 0, and some specific registers, a register address
will need 8 bits. Depending on how you encode the opcode, you will reach 80
bits for an instruction word.
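
One way to see how such a word could add up to 80 bits is to write down a field layout and pack it. The layout below is purely an assumption made for illustration: the post fixes only the 8-bit register addresses and the 8-bit immediate, so the opcode widths (10 and 6 bits), the field names, and the unsigned treatment of imm8 are all hypothetical.

```python
# (name, width in bits), most significant field first. Widths are assumed.
FIELDS = [
    ("op1", 10), ("src1", 8), ("src2", 8), ("dst1", 8),  # first operation
    ("op2", 6), ("src3", 8), ("src4", 8), ("dst2", 8),   # second operation
    ("pred", 8),                                         # predicate register
    ("imm8", 8),                                         # branch offset
]
WORD_BITS = sum(width for _, width in FIELDS)


def encode(**values):
    """Pack the named fields into one integer instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Unpack an instruction word back into a dict of fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


insn = dict(op1=3, src1=10, src2=11, dst1=12,
            op2=1, src3=130, src4=131, dst2=132,
            pred=5, imm8=4)
word = encode(**insn)
print(WORD_BITS, hex(word))
print(decode(word) == insn)
```

With six 8-bit register addresses, a predicate register and an 8-bit immediate, 64 bits are already spent, which leaves 16 bits for the two opcodes in an 80-bit word.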

The "jump part" could manage the PC "directly" with a delay slot. The
predicate could ...
From: Tom Cook
Date: Thursday, April 20, 2006 - 3:47 pm

From: nico
Date: Friday, April 21, 2006 - 1:57 am

So you want to pack and unpack SIMD registers. That's a cutting-edge
technique that is very rarely used in normal computing (using SSE2 or AltiVec, ...).
Compilers are quite bad at it. But if you optimise the pack/unpack
instructions, this will mean switches that are big and slow.
FMUL is a one-cycle operation. If you need 3 pack instructions before,

See the real code posted here!


If I have understood shaders correctly, loads are only for textures; everything
else is transmitted through specific registers. So these loads are
implicit.

Basically, a DOT product takes one cycle in a vector arch, and 4 in a scalar LIW
arch (as André Pouliot and I explained) if you can interleave the instructions
correctly (7 instructions of latency otherwise, with a 3-cycle-latency
FPU). So for DOT it's completely the same. Because a scalar CPU will be
almost 4 times smaller than a vector shader, you could put 4 scalar cores
where you put 1 vector core.

When you look at the compiled code posted here, you see a lot of MOV,
scalar MUL, etc. In this precise case, the SIMD unit is completely
underused.
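
The underuse argument can be expressed as lane utilization. In this sketch each operation occupies one issue slot of a 4-wide SIMD unit regardless of how many components it actually needs; the instruction mix is made up for illustration and is not taken from the shader code posted on the list.

```python
SIMD_WIDTH = 4


def simd_utilization(op_widths, simd_width=SIMD_WIDTH):
    """Fraction of SIMD lane-slots doing useful work.

    op_widths lists, per operation, how many components it really uses
    (1 for a scalar op, 4 for a full vec4 op).
    """
    if not op_widths:
        raise ValueError("need at least one operation")
    return sum(op_widths) / (simd_width * len(op_widths))


mixed = [4, 1, 1, 3, 1, 1, 4, 1]  # hypothetical mix of vector and scalar ops
print(simd_utilization(mixed))
print(simd_utilization([1] * 8))  # purely scalar code
```

The hypothetical mix keeps half of the lanes busy, and purely scalar code keeps only a quarter busy, which is the case for several small scalar cores instead of one vector core.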

I don't have access to the published ASM here. But this is the kind of
code to optimise.

I know that if only vector code is used, a vector shader will be faster
(because there are fewer problems with read-after-write dependencies than in


From: J.O. Aho
Date: Sunday, April 16, 2006 - 3:31 am

Aren't the later x86 processors RISC that "emulates" CISC?
RISC doesn't need to reload data in the same manner as CISC has to, which in

CISC commands use a lot more cycles than RISC ones do; in
general they take the same amount of time overall, with CISC being
X functions * XX cycles and RISC XX functions * X cycles. The drawback for
CISC is that it empties the registers, and if the next function needs to use
the same data, it has to be reloaded into the register, while RISC will still
have the data in the register, as it only empties the register when it needs to.

This is at least what is written in most RISC vs CISC pages I have read, none

What about VIS (Sparc's version of "Altivec/SSE")? ;)

Something I did like with Transmeta was the possibility to load different
cores into it; sad they just made 586 and no PPC/.../...



Isn't a shortening of the pipelines a good idea? This is what has been
done on both Sparc and PowerPC; both have much shorter pipelines than what
AMD has on its processors, which are shorter than Intel's. IMHO the old Apple
video called "The MHz Myth" shows the benefits of shorter pipelines.
I don't know how things would be with even shorter pipelines than IBM's and
Freescale's PowerPCs have (Exponential had longer pipelines and did less at
higher MHz than the ones from Freescale).


-- 
      //Aho

  ------------------------------------------------------------------------
   E-Mail: trizt@iname.com            URL: http://www.kotiaho.net/~trizt/
      ICQ: 13696780
   System: Linux System                        (PPC7447/1000 AMD K7A/2000)
  ------------------------------------------------------------------------
             EU forbids you to send spam without my permission
  ------------------------------------------------------------------------
From: Dieter
Date: Sunday, April 16, 2006 - 2:33 am

Smaller code size?
From: Timothy Miller
Date: Sunday, April 16, 2006 - 2:43 pm

That is the primary advantage, although some RISC designs offer
mechanisms to reduce code size (alternate instruction set or dynamic
decompression).
From: Jeff Garzik
Date: Sunday, April 16, 2006 - 2:48 pm

Not really.  This entire thread oversimplifies the differences.

x86 is a vastly different beast from traditional RISC.  Further, modern 
production RISC processors sometimes approach CISC in their complexity. 
  See e.g. the 'sqrt' instruction on any modern RISC processor.

And then there's super-scalar execution differences, out of order 
execution, vastly different TLB behavior, ...

	Jeff



From: Timothy Miller
Date: Sunday, April 16, 2006 - 2:57 pm

Yes, "Reduced Instruction Set Computer" is a bit of a misnomer these
days.  Perhaps it would be better to call them "Simplified Instruction
Set Computers."  Many aspects of the design (not just instruction
decode) are simplified by having completely uniform instruction
formats.  RISC processors were originally designed around the
pipeline.  That's changed a bit, because the instruction sets are now
a bit more of an abstraction from the hardware, but there are still
distinguishing features between RISC and CISC.

In the late 80's, RISC was seen as the holy grail since it simplified
processor designs and made room for significant improvements in
performance.  With the dominance of superscalar and OOO designs, that
simplification is no longer as much of an advantage.  At the same
time, legacy instruction sets like x86 are even more suboptimal. 
Given the current state of processor designs, can we now design an
instruction set and processor architecture that fits the new model
more directly?

Of course, we may already have those, with names like VLIW and EPIC.

This is kinda off topic, but CPU designs have always fascinated me.
And I wonder what approach we may take to programmable shaders.
From: Jeff Garzik
Date: Sunday, April 16, 2006 - 3:22 pm

Agreed on all counts.

FWIW, I'm largely trying to combat falsehoods here, not trying to argue 

I think the "simplified processor design" part is the key for RISC. 

Justification for the "suboptimal" claim?

IMO, x86-64 ISA seems to most closely match the operations that a 
compiler wants to generate.  It combines the best of RISC (oodles of 
registers) with an instruction set that matches the basic operations 

This always sounds good in theory, but you run into a compiler barrier 
here.  ia64 is a really smart, advanced EPIC architecture, but the 
compiler technology is still trying to catch up.

If the software isn't capable of fully utilizing the hardware, you've

IMO ideally what is needed is practical experience, to answer that 
question (which again requires time and money).  One needs to work 
inside a feedback loop:

	1. design the hardware, based on guesses
	2. design the shader JIT, based on initial hardware ISA
	3. profile to see where the hardware spends most of its time,
	   based on likely-common usage workloads.
	4. update JIT and hardware to reflect profile data
	5. go to step 3, if you have the time and energy.
	6. get some hardware out to the general public
	7. find out all your workload assumptions were wrong,
	   and go back to step 3.  :)

So for OGD, I would recommend the open source way:  release early, 
release often.  Design a very simple, just-to-get-going GPU that 
supports programmable shaders.  _Just enough_ to get people working on 
the software.  Ignore everyone's opinions on the mailing list [for now]. 
  Then enter the feedback loop...

	Jeff



From: Michele Carla`
Date: Monday, April 17, 2006 - 3:54 am

Totalllly agreee !!!

What about having a HI-Tech mailing-list dedicated only to OpenGraphics
-- 
Michele Carla` <goldfinger@member.fsf.org>
From: Timothy Miller
Date: Monday, April 17, 2006 - 5:23 am

Are you complaining?  This list is dedicated to Open Graphics.  But
since we are dependent on being profitable, it's important to explore
what other options we have in terms of products to sell.

(1) As soon as Open Audio becomes an official project, it'll get its
own mailing list.  Since no one has stepped up and volunteered to lead
it, it's not an official project.

(2) OGD1 is an integral part of the OGP timeline.  But we are also
dependent on it being used for many other things besides just
graphics.  As a result, we will discuss non-graphics things on this
list.  It's important that we do so.  When we have the resources to be
developing HDL full time for OGA, "on topic" for this list will shift.

Rather than complaining, I suggest that you help us come up with
profitable product ideas that will push us that much closer to being
able to have a mailing list dedicated only to graphics.  (I like it
when the list is orderly and on-topic, but I have more important
things to worry about, like getting OGD1 built.)
From: Jeff Garzik
Date: Sunday, April 16, 2006 - 2:43 pm

There is no such thing as x86 RISC.

You are probably thinking of the x86 microarchitecture, which decodes 
CISC instructions into one or more RISC-like micro-ops inside the CPU.


I don't think you know what you are talking about, here.

Data dependencies exist regardless of RISC or CISC, which is why modern 
RISC and CISC processors all have prefetch instructions and other data 

Largely false, for the most common instructions.

x86 instructions like 'mov' or common ALU instructions typically execute 
in a single cycle.  In fact, with multiple ports, _multiple 
instructions_ can be executed in a single cycle.

And don't forget that RISC blows your i-cache out of the water.  x86 is 
essentially a compressed instruction set, with the most common 
instructions requiring only 8-bits to describe, versus 32 bits for a 


Registers are registers.  They store values for use by multiple 
instructions, on either CISC or RISC.  That's what registers do.

32-bit x86 even has _far more_ registers than the ISA suggests.  Google 
for "register renaming" sometime.  64-bit x86 solved this problem, by 
adding a ton of registers to the ISA.

Finally, optimal register usage is a function of the compiler and the 
code being compiled.  Code with heavy data dependencies will use few 
registers, constantly loading new data into the same registers for 
calculations in loops and whatnot.  There is a ton of literature on 

Again, you cannot make a blanket statement like "shorter pipelines are 
better."

	Jeff


From: Nicolas Boulay
Date: Monday, April 17, 2006 - 2:04 pm

One of the downsides of MISC is that you can't be backwards compatible. You
need more space to define new registers. You need to define different
latencies. Basically the instruction word looks like 2 registers: one read,
one write. You could add a bit to the input register for immediate numbers. If
you use 64 registers, each index takes 6 bits.

Then a single move uses 12 bits + 1 bit at least. But you could be VLIW, and
do n moves at a time: a 128-bit instruction word could hold 9 MOVE
instructions at 13 bits each (10 if only the 12 register bits are counted).
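
As a rough check of that arithmetic, here is a minimal sketch that counts the bits in one MOVE and how many whole MOVEs fit in a 128-bit word. The function names, and the choice to count exactly one immediate-flag bit per move, are assumptions made for illustration; nobody on the list proposed this exact encoding.

```python
def move_bits(num_registers, immediate_flag=True):
    """Bits for one MOVE: a source index, a destination index, optional flag."""
    index_bits = (num_registers - 1).bit_length()  # 64 registers -> 6 bits
    return 2 * index_bits + (1 if immediate_flag else 0)


def moves_per_word(word_bits, num_registers, immediate_flag=True):
    """How many whole MOVEs fit in one VLIW instruction word."""
    return word_bits // move_bits(num_registers, immediate_flag)


if __name__ == "__main__":
    print(move_bits(64))                    # 13
    print(moves_per_word(128, 64))          # 9
    print(moves_per_word(128, 64, False))   # 10
```

With the immediate bit counted, 9 whole moves fit in 128 bits; a count of 10 corresponds to 12 bits per move.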

GPUs don't need to be backwards compatible like CPUs. 

MISC instructions could be used to really have the best performance by hiding 
latency.
From: Lourens Veen
Date: Monday, April 17, 2006 - 11:21 pm

You are right, I hadn't thought of that. If there were room for
extensibility in the opcode format, and you didn't improve upon the
functional units themselves, it could be made backward compatible, but
as you say, it's not necessary.

Lourens

From: nico
Date: Tuesday, April 18, 2006 - 12:37 am

If you keep room for a few more addresses, you waste memory for the code, and
you solve only half of the problem. If you make a VLIW design, you could go
from 4 to 8 instructions in the same word, but if you really do 4
instructions, you add some constraints that make the design less


From: Timothy Baldridge
Date: Thursday, April 13, 2006 - 1:02 pm

I knew it wasn't our original idea, but I didn't remember the name of it.

Basically, current GPUs are nothing more than smart vector processors.
I'm thinking that a design like the one I mentioned would allow the
processor to be used for a GPU, or a CPU. Since we are dealing with
embedded designs (like you mentioned) we don't have to worry about
threading issues.

The thing I love is that this design is extremely flexible. It would
be very simple to allow a developer to modify it by adding or removing
arithmetic units.

Here is another weird plus. It seems to me, that a compiler could (if
it knew enough about the underlying hardware) be created to translate
from one revision of the processor to another. So, let's say that a
driver for a revision of the chip that has two mov units would have
optimized code.

In the other example I gave, the code that was 15 instructions in
length, could be split into a series of blocks, these blocks could
then be resolved into the code that was 8 instructions in length.

It's this idea of the modularization that I find so attractive.

CPUs are too linear, and GPUs are not linear enough. If we could find
the sweet spot in the middle we may be on to something.

Timothy

From: Tom Cook
Date: Monday, April 17, 2006 - 5:52 pm

From: Timothy Miller
Date: Tuesday, April 18, 2006 - 7:16 am

That is the basic idea behind MISC.  Everything is done via
special-purpose registers.

But as I said before, all you're really doing is encoding the opcode
into the register index.  And think about this for a moment:

In a MISC design, if you want to add, you specify the source operands
(probably general purpose registers) to copy into "input registers"
for your adder.  That's two moves (which you can do in one
instruction).  Later, you can pop out the result and move it back to a
GPR.  (Another move.)  So here's your code:

mov rA0 <- r1, rA1 <- r2
...
mov r3 <- rA2

Let's say you have a register space of 256 registers, so each
instruction takes 16 bits.  The two together require 32 bits.

Now, let's consider a RISC design.  In this case, you don't need so
many registers, just the GPRs:

add r3 <- r1,r2

If the add opcode is 4 bits and the three operands are each 4 bits,
then you need 16 bits to encode this.

The point to take is that you need twice as many bits to encode the
same "instruction".  With MISC, all you've done is move the upper
nybble of the add ports (0xA) from the three operands into the one
opcode.
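
To make the "opcode hidden in the register index" point concrete, here is a small sketch. It assumes the adder's ports sit at register indices 0xA0..0xA2 (taking the 0xA from the text) and that the RISC word is laid out as opcode, destination, source, source; both layouts and all names are hypothetical. The 16-bit and 32-bit totals are simply the counts given in the message.

```python
ADD_UNIT = 0xA  # hypothetical unit number, taken from the 0xA in the text


def misc_port(unit, port):
    """8-bit MISC register index: upper nybble = unit, lower nybble = port."""
    return (unit << 4) | port


def risc_encode(opcode, dst, src1, src2):
    """16-bit RISC word: 4-bit opcode followed by three 4-bit register fields."""
    return (opcode << 12) | (dst << 8) | (src1 << 4) | src2


# The three adder ports (rA0, rA1, rA2) all carry 0xA in their upper nybble.
adder_ports = [misc_port(ADD_UNIT, p) for p in range(3)]
assert all(port >> 4 == ADD_UNIT for port in adder_ports)

# RISC states that nybble once, as the opcode of "add r3 <- r1,r2".
risc_word = risc_encode(ADD_UNIT, 3, 1, 2)

MISC_BITS = 2 * 16     # two 16-bit instructions, as counted in the message
RISC_BITS = 4 + 3 * 4  # one opcode nybble plus three operand nybbles

if __name__ == "__main__":
    print([hex(p) for p in adder_ports])  # ['0xa0', '0xa1', '0xa2']
    print(hex(risc_word))                 # 0xa312
    print(MISC_BITS, RISC_BITS)           # 32 16
```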

It may very well be worth it to use the extra bits (something I've
hinted at earlier), but keep in mind where your redundancies are and
make sure they're a net gain.


Oh, one other thing:  Compilers have a hard time with special-purpose registers.
From: nico
Date: Tuesday, April 18, 2006 - 7:57 am

MISC only looks like a good idea. Managing pipelined units and latency is a
horrible task: if your FMAC is a pipelined unit, you have exactly one
time slot to read the output!

If I remember correctly, shader programs use only ~256 instructions. So you
could design an instruction set that looks like µcode or VLIW code.

Such a shader should be small and fast. If you need more power, put in 2 of them.

For me, the instruction set looks like this: a part for JUMP/LOOP management; a
part for computation (very complete: ALU, dot product, MIN/MAX, CLAMP...,
all in 1 cycle), where you could put 4 register reads and 2 writes, since such
ILP can easily be found in typical code; and a third part that manages
load & store with complex addressing modes (for 1D, 2D and 3D data, ...). You
could also add some bits for predicate calculation. This introduces a cheap way
to do if-clauses.

In the end, you will have a long instruction word (>100 bits) and a few
register sets (one with at least 32*4 32-bit floating-point words, one for
address calculation, and maybe one for managing data read from memory (a write
port is very costly)).

One instruction could perform a load or a store, a calculation and a jump. It's
important to keep the calculation unit always producing useful work.
Normally, it's the largest unit of the shader design.
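
A sketch of how such a word might be laid out, only to show how quickly the bits add up. Every field name and width below is a placeholder assumption, not a proposed encoding; the only numbers taken from the message are the 32*4 register file, the 4 reads and 2 writes of the compute part, and the ~256-instruction program size.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """One part of the long instruction word: named fields with bit widths."""
    name: str
    fields: dict = field(default_factory=dict)

    def bits(self):
        return sum(self.fields.values())


REG = 7  # bits to index 32*4 = 128 floating-point registers

# All widths below are placeholders chosen for illustration.
WORD = [
    Slot("control", {"op": 3, "target": 8}),   # JUMP/LOOP over a 256-entry program
    Slot("predicate", {"select": 4}),          # cheap if-clauses
    Slot("compute", {"op": 6, "reads": 4 * REG, "writes": 2 * REG}),
    Slot("memory", {"op": 2, "mode": 2, "addr_reg": 5, "data_reg": REG, "offset": 16}),
]


def word_bits(slots):
    """Total width of the instruction word, in bits."""
    return sum(slot.bits() for slot in slots)


if __name__ == "__main__":
    for slot in WORD:
        print(slot.name, slot.bits())
    print("total", word_bits(WORD))
```

Even with these modest widths the word lands close to 100 bits, and the register operands of the compute part account for the largest share.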



From: Timothy Baldridge
Date: Tuesday, April 18, 2006 - 11:18 am

So? The compiler should be able to schedule this with no problem. I
could write the compiler to do this myself. If you can do 4 movs per
clock cycle, it should be no issue whatsoever. If you find my comments
in error, give me an example (and code to go with it) where the MISC
idea fails.


Okay, this has got me thinking now. What we need is a simple program
for testing how different processors function. What we need is a
simple program that can describe (in software) the characteristics of
the processor (latency, etc.) and run a simulation based on these
characteristics. Give me  a week and I'll see what I can hash out...





--
I think computer viruses should count as life. I think it says
something about human nature that the only form of life we have
created so far is purely destructive. We've created life in our own
image. (Stephen Hawking)
From: Nicolas Boulay
Date: Tuesday, April 18, 2006 - 12:09 pm

MISC design was discussed a lot in the F-CPU project. It was abandoned because
it's impossible to keep backward compatibility. That's not a problem here.

From my point of view, the problem is how to manage the output of the units.
RISC-like instructions let you name an output register, so you can
schedule instructions. In MISC, most reads of the output will be at a fixed
place.

Mul.a <- Reg1    ||  Mul.b <- Reg2
Reg3 <- Mul.res || Mul.a <- Reg4 || Mul.b <- Reg 5
Reg6 <- Mul.res

It's hard to do better. 
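
The three-line schedule above can be replayed with a toy model of the multiplier. This is a sketch under assumptions: a one-cycle latency (which is what the schedule implies), reads that see the value from before the same cycle's writes, and register values picked arbitrarily. The class and function names are made up.

```python
from collections import deque


class PipelinedMul:
    """MISC-style multiplier: inputs go in, the product shows up `latency` cycles later."""

    def __init__(self, latency=1):
        self.pipe = deque([None] * latency)  # products still in flight

    def step(self, a=None, b=None):
        """Advance one clock and return what Mul.res shows during this cycle."""
        visible = self.pipe.popleft()        # product issued `latency` cycles ago
        self.pipe.append(None if a is None else a * b)
        return visible


def run_schedule():
    mul = PipelinedMul(latency=1)
    reg = {1: 2, 2: 3, 4: 4, 5: 5}
    mul.step(reg[1], reg[2])            # Mul.a <- Reg1 || Mul.b <- Reg2
    reg[3] = mul.step(reg[4], reg[5])   # Reg3 <- Mul.res || Mul.a <- Reg4 || Mul.b <- Reg5
    reg[6] = mul.step()                 # Reg6 <- Mul.res
    return reg[3], reg[6]


def run_late_read():
    mul = PipelinedMul(latency=1)
    mul.step(2, 3)
    mul.step(4, 5)      # 2*3 was visible during this cycle, but nobody read it
    return mul.step()   # only 4*5 is left


if __name__ == "__main__":
    print(run_schedule())   # (6, 20)
    print(run_late_read())  # 20
```

The second function shows the fixed-slot problem: when the unit is kept busy, a result that is not read in its one cycle is overwritten by the next one.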

From a hardware point of view the "big switch" will be slow. But the killers
are long-latency instructions. Some could be predicted. Imagine how to
schedule the read of a 32-cycle divide.
Even worse: how do you manage the variable latency of a memory read?

For me, a shader instruction looks like a µinstruction: PC control, calculation
and load & store in the same instruction word.

From: Timothy Miller
Date: Tuesday, April 18, 2006 - 1:45 pm

Two ways:

(1) Stall on attempt to read from empty read response queue.
(2) Branch on status flag
From: Nicolas Boulay
Date: Tuesday, April 18, 2006 - 1:53 pm

To do an active wait?
From: Timothy Miller
Date: Tuesday, April 18, 2006 - 2:26 pm

Why would you implement it this way?  If you have an outstanding read,

Yes.

You probably wouldn't want to write an algorithm that did something
different depending on whether or not data was available, so you'd end
up just spinning on the branch instruction.  When data is streaming,
that's a wasted instruction for every read (because you always have to
check).  It's better to just schedule the read as far after the
request as you can and just stall when it's not available yet.
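
A back-of-the-envelope model of the two options, assuming one instruction issues per cycle and that a branch on the status flag costs one cycle. Each entry of `waits` is the number of cycles the data is still outstanding when the program reaches that read. The function names and cost model are assumptions, not a description of any planned hardware.

```python
def stall_cost(waits):
    """Option 1: the read itself stalls until data arrives. One instruction per read."""
    instructions = len(waits)
    cycles = sum(w + 1 for w in waits)   # w stall cycles, then the read
    return instructions, cycles


def poll_cost(waits):
    """Option 2: spin on a status-flag branch until ready, then read."""
    instructions = sum((w + 1) + 1 for w in waits)  # w+1 branches, then 1 read
    cycles = instructions                            # one instruction per cycle
    return instructions, cycles


if __name__ == "__main__":
    streaming = [0] * 8                  # data is always there when we ask
    print(stall_cost(streaming))         # (8, 8)
    print(poll_cost(streaming))          # (16, 16)
    bursty = [3, 0, 0, 5]
    print(stall_cost(bursty))            # (4, 12)
    print(poll_cost(bursty))             # (16, 16)
```

In the streaming case polling doubles the instruction count, which is the "wasted instruction for every read" described above.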

Also if we're not careful, we'll think too single-threaded here. 
Every fragment shader needs to be split into two pipelined threads. 
One thread's output is the input to the other one, and there's a fifo
in between them.  Either thread can do anything, but generally, when
there are memory reads to be done (texture stuff, etc.), one thread's
job is to make requests, while the other is the consumer of that
requested data.  All other work should just get split intelligently.
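
Here is a minimal cycle-level sketch of that split: one "thread" issues memory requests, the other consumes the returned data, with a fifo in between. It assumes a fixed memory latency, in-order responses and one request per cycle, all of which are simplifications; the doubling in the consumer is a stand-in for real shading work, and every name is hypothetical.

```python
from collections import deque


def run_split_shader(addresses, memory, latency=4, fifo_depth=8):
    """Requester thread feeds a fifo; consumer thread drains it."""
    in_flight = deque()   # (ready_cycle, value) for requests memory is serving
    fifo = deque()        # data that has arrived, waiting for the consumer
    pending = deque(addresses)
    results = []
    cycle = 0
    while len(results) < len(addresses):
        # Memory: move finished reads into the fifo, in request order.
        while in_flight and in_flight[0][0] <= cycle:
            fifo.append(in_flight.popleft()[1])
        # Consumer thread: use one datum per cycle if available, else stall.
        if fifo:
            results.append(fifo.popleft() * 2)
        # Requester thread: issue one request per cycle while there is room.
        if pending and len(in_flight) + len(fifo) < fifo_depth:
            addr = pending.popleft()
            in_flight.append((cycle + latency, memory[addr]))
        cycle += 1
    return results, cycle


if __name__ == "__main__":
    texels = [10, 20, 30, 40]
    print(run_split_shader([0, 1, 2, 3], texels))  # ([20, 40, 60, 80], 8)
```

With four reads and a latency of 4, the run takes 8 cycles: the latency is paid once, after which the consumer gets one datum per cycle. Issuing each read and waiting for it in a single thread would pay the latency on every read.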
From: Timothy Baldridge
Date: Tuesday, April 18, 2006 - 2:43 pm

The other question I have is how much variance there is going to be
in a read. Sure, on a CPU a lot of it depends on whether the data is in the
L1 or L2 cache or in memory. But since we are only going to be going to
memory (please tell me no one here was actually thinking of putting a
cache on a GPU), what will the variance be?






From: Timothy Miller
Date: Tuesday, April 18, 2006 - 4:06 pm

Lots.  A row miss takes one amount of time; refreshes cause random
delays.  The worst are video reads which have the highest priority and
can tie up the memory for extended periods.
From: James Richard Tyrer
Date: Wednesday, April 19, 2006 - 11:52 pm

The video refresh reads mean that we almost have to have a cache. 
However, this doesn't have to be a general purpose cache, it can even be 
a cache which requires explicit instructions to manage.

-- 
JRT
From: Nicolas Boulay
Date: Wednesday, April 19, 2006 - 1:13 am

You cannot treat the output register of the load and store unit as an
ordinary read register, but as a stack that is emptied by each read. That's
different for

It looks so much easier to just write the result, when it's available, into a
destination register, with a scoreboard that blocks any read of that particular
register... So you could easily hide the latency without any software


From: Timothy Miller
Date: Wednesday, April 19, 2006 - 4:45 am

That's kinda what I had in mind.  I don't want too sophisticated of a
scoreboard, because I'd like most scheduling static, but if we do it
right, we can allow some things to complete out of order so as to
reduce the impact of read latency.
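
A very small scoreboard sketch along those lines: a load marks its destination register as pending, and only an instruction that reads that register stalls. The load latency is passed in as a fixed number to stand in for memory, which in reality varies; one instruction issues per cycle; the program format and all names are assumptions.

```python
class Scoreboard:
    """Tracks when each pending destination register will receive its data."""

    def __init__(self):
        self.ready_at = {}  # register -> cycle at which its value arrives

    def issue_load(self, reg, cycle, latency):
        self.ready_at[reg] = cycle + latency

    def read(self, reg, cycle):
        """Cycle at which a read attempted at `cycle` can actually proceed."""
        return max(cycle, self.ready_at.get(reg, cycle))


def run(program, latency):
    """program: list of (op, reg) with op in {"load", "use", "alu"}. Returns total cycles."""
    sb = Scoreboard()
    cycle = 0
    for op, reg in program:
        if op == "load":
            sb.issue_load(reg, cycle, latency)
        elif op == "use":
            cycle = sb.read(reg, cycle)  # stalls here only if reg is still pending
        cycle += 1                        # every instruction takes one issue slot
    return cycle


if __name__ == "__main__":
    early_use = [("load", "r1"), ("use", "r1"), ("alu", "r2"), ("alu", "r3")]
    late_use = [("load", "r1"), ("alu", "r2"), ("alu", "r3"), ("use", "r1")]
    print(run(early_use, latency=10))  # 13
    print(run(late_use, latency=10))   # 11
```

Moving the independent ALU work between the load and its use is the static scheduling part; the scoreboard only covers whatever latency is left over.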

Also, if we do the fifo thing I mentioned, it'll become a non-issue
for many algorithms.
From: Lourens Veen
Date: Tuesday, April 18, 2006 - 1:33 pm

What if there is more than one adder, or multiplier? It's lower level
than that, you don't just say which operation you want, but also where
it should be executed.

Lourens

From: Nicolas Boulay
Date: Tuesday, April 18, 2006 - 1:54 pm

You could do that too with a VLIW instruction word. You don't need out of order 
From: Lourens Veen
Date: Tuesday, April 18, 2006 - 10:22 pm

What if all the functional units had the same latency, that is, they all
have a fifo on their output that increases their latency to some common
maximum M? Scheduling would become trivial, you just generate the
instructions in order, and then interleave M copies of the code. There
are always M identical instructions in a row so you only need to load a
new instruction every M clock cycles. ILP could be achieved through
having multiple MISC cores, if the compiler makes sure that they don't
access the same functional unit at the same time.
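
The interleaving can be sketched in a few lines. The assumptions are a straight-line program with no branches, one instruction issued per clock, and a common latency of exactly M cycles for every unit; the function names are made up.

```python
def interleave(program, m):
    """Issue each instruction m times in a row, once per copy (pixel/thread)."""
    return [(instr, copy) for instr in program for copy in range(m)]


def issue_cycle(k, copy, m):
    """Clock at which instruction k of a given copy is issued."""
    return k * m + copy


def result_ready(k, copy, m):
    """Clock at which that instruction's result leaves the unit's output fifo."""
    return issue_cycle(k, copy, m) + m


if __name__ == "__main__":
    print(interleave(["mul", "add"], 3))
    # [('mul', 0), ('mul', 1), ('mul', 2), ('add', 0), ('add', 1), ('add', 2)]
    m = 4
    # Each copy's instruction k-1 finishes exactly when its instruction k issues.
    print(all(result_ready(k - 1, c, m) == issue_cycle(k, c, m)
              for k in range(1, 10) for c in range(m)))  # True
```

Because the result of instruction k-1 arrives exactly when instruction k of the same copy issues, no per-unit latency bookkeeping is needed as long as the code has no branches.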

Lourens

From: Nicolas Boulay
Date: Wednesday, April 19, 2006 - 1:07 am

Latency is usually a killer. Pipelining is used to keep the speed high.
Imagine an FMUL with 1-cycle throughput and 3-clock latency beside a 32-cycle
divider. Then add the problems with loops and if conditions.



From: Lourens Veen
Date: Wednesday, April 19, 2006 - 4:59 am

Ah, I hadn't thought of branching. Never mind then...

Lourens

From: Tom Cook
Date: Wednesday, April 19, 2006 - 3:43 pm

What's the difference?  PC is just another location in memory.

Tom
From: Timothy Miller
Date: Wednesday, April 19, 2006 - 4:48 pm

One other thing I'm thinking about:

[a] We're going to be wanting to process some number of pixels in parallel.
[b] We're going to have trouble scheduling instructions to make best
use of functional units.

So, let's take advantage of that.  Let's assume we can have data
dependencies that make different pixels require different instruction
flow.  We can pull a Niagara and feed instructions for four threads
through a smaller number of execution units.  So, our add/mul units
are capable of both vector and scalar computations, so we have two
such units (or two of each type; whatever) and can schedule two vector
computations per clock or some arbitrary assortment of scalars on one
or both.  On empirical analysis of resource contention, we may add
some functional units later, but the idea is to remain reasonably
small.

Just like with Niagara, we have lots of opportunities to avoid control
and data hazards, so we don't need to account for them.  (We may want
to have some locks in place, but we can afford to just stall.)  For
each pixel, even the effective memory read latency is smaller.
From: Hugh Fisher
Date: Wednesday, April 19, 2006 - 6:14 pm

Don't get too carried away with Niagara comparisons. A GPU
has to execute exactly the same shader program for every
pixel in a given triangle/primitive. There is a small amount
of data that varies for each primitive: the coords/normal/
tex coords at vertex level; color/texcoords for fragments;
which is about a dozen 4x32 bit registers at most. There's
a K or two of OpenGL state that the shader can read but not
write to as well, plus a K (?) or so of app state with
the same restriction.

Now that shaders have branches it's not guaranteed that
they all execute in lockstep, but there is a very high
probability that all the execution units will need to read
from the same memory location at the same time. Brute
force replication might work better than dynamic scheduling.

-- 
	Hugh Fisher
	DCS, ANU
From: Timothy Miller
Date: Wednesday, April 19, 2006 - 6:59 pm

You have a point.  Aside from a small possibility for variation in
instruction sequence, if one pixel's shader needs the vector
multiplier, then they all do, at the same time. But what I was
thinking was that if they all needed the vmul unit on one cycle but
not on the next, then two of the threads' instructions could be
scheduled on one cycle and two on the next.  What are the chances that
we'll get a long stream of vmuls all in a row with no breaks?  In that
case, it would definitely be better to have four completely
independent functional units.
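
A toy model of that trade-off: four threads run the same instruction stream, only "vmul" instructions compete for a limited number of vector multipliers, and a thread that finds no free unit stalls for a cycle. It assumes every other instruction always finds a free unit and that lower-numbered threads get priority; both are simplifications, and the names are hypothetical.

```python
def cycles_needed(program, threads=4, vmul_units=2):
    """Cycles for all threads to finish `program` when vmul units are shared."""
    pcs = [0] * threads
    cycle = 0
    while any(pc < len(program) for pc in pcs):
        free = vmul_units
        for t in range(threads):
            if pcs[t] >= len(program):
                continue              # this thread has finished
            if program[pcs[t]] == "vmul":
                if free == 0:
                    continue          # unit busy: this thread stalls one cycle
                free -= 1
            pcs[t] += 1
        cycle += 1
    return cycle


if __name__ == "__main__":
    mixed = ["vmul", "other"] * 4
    print(cycles_needed(mixed, vmul_units=2))        # 9
    print(cycles_needed(mixed, vmul_units=4))        # 8
    dense = ["vmul"] * 4
    print(cycles_needed(dense, vmul_units=2))        # 8
    print(cycles_needed(dense, vmul_units=4))        # 4
```

With alternating instructions, two shared units cost a single cycle of skew (9 cycles against 8). With a solid run of vmuls they take twice as long as four independent units.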

There are definitely some things we would want to do about multiple
threads accessing the same (or nearby) memory locations.
From: Hugh Fisher
Date: Wednesday, April 19, 2006 - 7:44 pm

Four vmuls (actually dot products) in a row is very common
for matrix multiplies. The sample shaders I've got, from
the OpenGL Shading Language book and GPU Gems, are all very
math intensive. I doubt you're going to be able to share
ALUs between threads. On the other hand, condition/branch
logic probably could be.

But on the gripping hand any statistics from generation
1 and 2 shaders are going to be biased in favour of math
ops because that was before branches became widespread. So
it is possible that shader code will have an instruction
mix more like generic C/C++ over the next few years. I'd

You'll probably get some sequential access patterns across
threads rather than within them. If a horizontal span of
fragments is being done in parallel by 2/4/N threads, it's
quite likely (especially for a 2D GUI) that thread #0 will
need texel P+0, thread #1 P+1, ...
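
One way to see why that pattern matters: if the memory system returns several adjacent texels per access, neighbouring threads can often share one access. The burst length of 4 texels and the aligned-burst model below are assumptions for illustration only.

```python
def bursts_needed(base, threads, burst_len=4):
    """Distinct aligned bursts touched when thread #i reads texel base + i."""
    addresses = [base + t for t in range(threads)]
    return len({addr // burst_len for addr in addresses})


if __name__ == "__main__":
    print(bursts_needed(16, 4))  # 1: texels 16..19 share one aligned burst
    print(bursts_needed(18, 4))  # 2: texels 18..21 straddle two bursts
```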

Sheesh, I'm glad I'm a software person and don't have to
worry about designing and building this stuff :-)

-- 
	Hugh Fisher
	DCS, ANU
From: Dieter
Date: Wednesday, April 19, 2006 - 12:41 pm

For matrix multiplies I might suggest APL, but it's Greek to me.  :-)
From: nico
Date: Thursday, April 20, 2006 - 4:16 am

A "normal" CPU is designed to handle interrupts and context switches; that's a
huge constraint on a lot of possible optimisations. GPUs look more like DSPs

      Instruction    Inputs  Output   Description
      -----------    ------  ------   --------------------------------
      ABS            v       v        absolute value
      ADD            v,v     v        add
      ARL            s       a        address register load
      DP3            v,v     ssss     3-component dot product
      DP4            v,v     ssss     4-component dot product
      DPH            v,v     ssss     homogeneous dot product
      DST            v,v     v        distance vector
      EX2            s       ssss     exponential base 2
      EXP            s       v        exponential base 2 (approximate)
      FLR            v       v        floor
      FRC            v       v        fraction
      LG2            s       ssss     logarithm base 2
      LIT            v       v        compute light coefficients
      LOG            s       v        logarithm base 2 (approximate)
      MAD            v,v,v   v        multiply and add
      MAX            v,v     v        maximum
      MIN            v,v     v        minimum
      MOV            v       v        move
      MUL            v,v     v        multiply
      POW            s,s     ssss     exponentiate
      RCP            s       ssss     reciprocal
      RSQ            s       ssss     reciprocal square root
      SGE            v,v     v        set on greater than or equal
      SLT            v,v     v        set on less than
      SUB            v,v     v        subtract
      SWZ            v       v        extended swizzle
      XPD            v,v     v        cross product

That's mainly FP multiplication, so the design should keep the FMUL busy
every cycle. Or we could choose a smaller, 2-cycle FMUL and use more cores
(the compiled code shows a lot of MOV instructions in the meantime).

From: André Pouliot
Date: Thursday, April 20, 2006 - 6:46 pm

block.  Of those 5 units, one is dedicated to memory management (load/store,
register and data movement); the others form a single execution unit for
vector operations or 4 distinct units for scalar operations. Each unit
could do only a subset of the scalar operations; they don't all need to
be able to do the same ones, which also helps reduce the overall size of
each individual execution block.

With such an architecture, and since the code to run is rather small, it
will probably be possible to optimise the order of the operations so that
most of the work is done in parallel, since all the units work at the same
time. We just need to define a rather wide instruction memory on chip;
it doesn't need to be deep, since a first-generation shader program
couldn't exceed 255 instructions for a basic program. So one instruction
line feeds 5 operations at a time. It looks a little bit like a DSP
architecture. After that you could replicate the meta block many times,
depending on the performance you want. But with more than one block you
will need a kind of dispatcher (hardware, or software in the driver) to
divide the work. Since it is a small processor, if you don't have too many
dependencies between different instructions and you know the number of clocks
for execution, you could have multiple instructions executing at the same
time by pipelining the operations.

From: nico
Date: Friday, April 21, 2006 - 2:24 am

That's basically a LIW or VLIW instruction set. If you put many
operations in parallel, you also need a lot of register reads/writes per
cycle. Reading and writing a register bank is quite a slow operation
(the LEON 3 processor, a CPU with a 7-stage pipeline, uses a complete
cycle to do it). If you add more register ports, you slow down the
access time of the bank.
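
A rough way to count the ports, assuming every operation reads two
source registers and writes one destination (an assumption, not
something fixed by the design; the function name is hypothetical):

```python
def register_ports(slots, reads_per_op=2, writes_per_op=1):
    """Return (read_ports, write_ports) needed if every slot of an
    instruction line accesses a single shared register bank in one cycle."""
    return slots * reads_per_op, slots * writes_per_op


single_issue = register_ports(1)
five_wide = register_ports(5)
```

Under that assumption a 5-slot line needs five times the read and write
ports of a single-issue core on the same bank.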

So the speed-up must be higher than this loss. But to achieve high
throughput you need to fill the instruction slots well. That's hard
(see Intel's problems with the Itanium compiler).
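
To make the slot-filling problem concrete, here is a small sketch of a
greedy packer, assuming every operation takes one cycle and any slot can
run any operation. The function names and the dot-product example are
hypothetical; a real compiler scheduler is far more involved.

```python
def pack(ops, width):
    """Greedily pack operations into instruction lines.

    ops maps an operation name to the set of operation names it depends on.
    An operation may issue only after all its dependencies were issued in
    an earlier line. Returns a list of lines, each a list of names.
    """
    done = set()
    remaining = dict(ops)
    lines = []
    while remaining:
        ready = [name for name, deps in remaining.items() if deps <= done]
        if not ready:
            raise ValueError("dependency cycle")
        issued = ready[:width]
        lines.append(issued)
        for name in issued:
            del remaining[name]
        done.update(issued)
    return lines


def utilization(lines, width):
    """Fraction of instruction slots that hold a real operation."""
    return sum(len(line) for line in lines) / (len(lines) * width)


# A 4-component dot product: four multiplies, then an add tree.
dot4 = {
    "m0": set(), "m1": set(), "m2": set(), "m3": set(),
    "a01": {"m0", "m1"},
    "a23": {"m2", "m3"},
    "sum": {"a01", "a23"},
}
lines = pack(dot4, width=4)
used = utilization(lines, width=4)
```

Here the seven operations take three lines of four slots, so 5 of the 12
slots stay empty because of the dependencies in the add tree.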

In my previous proposal, I use a LIW trick for managing the Program
Counter, and use predicates which read another register bank. You could
also use multiple register banks to avoid the slowdown, but then you
will have problems exchanging data. This could add a few more MOV
instructions.

If you want more speed on a task for a given silicon technology, you could
use VLIW techniques, SIMD, scoreboarding, out-of-order execution, etc.
All of these are used in today's big CPUs.

But you could also go multi-CPU or multi-core. In traditional computing,
the killer is communication between cores. That's why a 1000-CPU
computer could be slower than a machine with 6 vector CPUs (from Cray or
NEC). In terms of raw power the 1000-CPU machine is faster, but on
real-world examples it is slower.

In shaders, there is no communication. So multicore is, from my point of
view, the better way to do it.
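
A minimal sketch of that idea, assuming each core runs the same shader
on its own slice of the input and never exchanges data with the others.
The cores are simulated sequentially here, and dispatch and run_shader
are hypothetical names, not an existing driver interface.

```python
def dispatch(num_items, num_cores):
    """Split item indices into contiguous, near-equal chunks, one per core."""
    base, extra = divmod(num_items, num_cores)
    chunks = []
    start = 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def run_shader(shader, inputs, num_cores):
    """Apply shader to every input; each simulated core handles one chunk."""
    outputs = [None] * len(inputs)
    for chunk in dispatch(len(inputs), num_cores):
        # A core reads and writes only its own chunk, so no data has to be
        # exchanged between cores.
        for i in chunk:
            outputs[i] = shader(inputs[i])
    return outputs


chunk_sizes = [len(c) for c in dispatch(10, 4)]
shaded = run_shader(lambda x: x * 2, [1, 2, 3], num_cores=2)
```

Because no chunk depends on another, adding cores only changes how the
work is divided, not the result.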

Nicolas Boulay


_______________________________________________
Open-graphics mailing list
Open-graphics@duskglow.com
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
From: nico
Date: Thursday, April 20, 2006 - 1:45 am

So you can't pipeline anything, or you will have a few cycles of latency for each
