On Fri, 2008-05-16 at 16:22 +0200, Arnd Bergmann wrote:
Ah, I see what you are getting at: that we would be collecting
profiling data on a CPU (i.e. a PPU thread) and SPU cycle profiling
data at the same time. No, due to hardware constraints on SPU cycle
profiling, the hardware cannot be configured to do both PPU profiling
and SPU CYCLE profiling at the same time.
I am working on SPU event profiling. The hardware will only support
event profiling on one SPU per node at a time. I will have to think
about it more, but my initial thought is that supporting PPU profiling
and SPU event profiling together will again not work due to hardware constraints.
That is a good thought. It should help to keep the timing more accurate
and prevent the issue mentioned below about processing multiple trace
buffers of data before processing the context switch info.
I did see the problem initially. The first version of
oprofile_add_value() did not take the CPU buffer argument. Rather,
oprofile_add_value() internally stored the data in the current CPU
buffer, found by calling smp_processor_id(). When processing the trace buffer,
the function would extract the PC value for each of the 8 SPUs on the
node, then store them in the current CPU buffer. It would do this for
all of the samples in the trace buffer, typically about 250 (the maximum
is 1024). So you would put 8 * 350 samples all into the same CPU buffer
and nothing in the other CPU buffers for that node. To ensure that
I don't drop any samples, I had to increase the size of each CPU buffer
more than when I distribute the samples across the two CPU buffers in
each node. I changed oprofile_add_value() to take a CPU buffer to better utilize
the CPU buffers and to avoid the ordering issue with the SPU context
switch data. I didn't do a detailed study to see exactly how much more
the CPU buffer size needed to be increased as the SPU context switch
ordering was the primary issue I was trying to fix. The default CPU
buffer size must be increased because data for four processors (SPUs) is
being stored in each CPU buffer, and each entry takes two
locations to store (the special escape code, then the actual data).
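
The sizing argument above can be sketched as a small calculation. This is only an illustration under stated assumptions: the helper names (spu_to_cpu_buf, entries_per_cpu_buf) are hypothetical and not part of the patch, and the mapping of SPUs 0-3 to the first CPU buffer of a node and SPUs 4-7 to the second is assumed purely for the example.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SPUS_PER_NODE     = 8, /* from the text: 8 SPUs per node */
    CPUS_PER_NODE     = 2, /* from the text: two CPU buffers per node */
    ENTRIES_PER_VALUE = 2  /* escape code, then the actual data */
};

/*
 * Hypothetical mapping (an assumption, not the patch): split a node's
 * SPUs evenly over that node's CPU buffers, four SPUs per buffer.
 */
static unsigned int spu_to_cpu_buf(unsigned int node, unsigned int spu)
{
    assert(spu < SPUS_PER_NODE);
    return node * CPUS_PER_NODE + spu / (SPUS_PER_NODE / CPUS_PER_NODE);
}

/*
 * Entries one CPU buffer has to absorb when a trace buffer holding
 * `samples` samples (each with one PC per SPU) is drained into
 * `bufs_used` CPU buffers of the node.
 */
static size_t entries_per_cpu_buf(size_t samples, unsigned int bufs_used)
{
    assert(bufs_used >= 1 && bufs_used <= CPUS_PER_NODE);
    return (size_t)SPUS_PER_NODE * samples * ENTRIES_PER_VALUE / bufs_used;
}
```

With the 8 * 350 figure used above, entries_per_cpu_buf(350, 1) gives the load when everything lands in the current CPU buffer, and entries_per_cpu_buf(350, 2) gives the load per buffer when the samples are spread over both.
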
Yes, this would be avoided by processing the trace buffer on a context
switch. We have the added benefit that it should help minimize the skew
between the data collection and context switch information.
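
A toy model of the ordering idea, not the kernel code: if the pending trace samples are drained into the event stream before the context switch record is appended, no sample taken under the old context can end up after the switch. All types and function names here (trace_buf, event_log, trace_record, on_context_switch) are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define CAP 32 /* arbitrary capacity chosen for the sketch */

enum rec_kind { REC_SAMPLE, REC_CTX_SWITCH };

struct rec {
    enum rec_kind kind;
    unsigned int spu;
    unsigned long value; /* PC for a sample, context id for a switch */
};

struct event_log { struct rec recs[CAP]; size_t len; };
struct trace_buf { struct rec recs[CAP]; size_t len; };

/* Queue one SPU PC sample in the (simulated) trace buffer. */
static int trace_record(struct trace_buf *tb, unsigned int spu,
                        unsigned long pc)
{
    assert(tb != NULL);
    if (tb->len >= CAP)
        return -1;
    tb->recs[tb->len].kind = REC_SAMPLE;
    tb->recs[tb->len].spu = spu;
    tb->recs[tb->len].value = pc;
    tb->len++;
    return 0;
}

/*
 * On a context switch: drain every pending sample first, then append
 * the switch record, so the log order matches the real event order.
 */
static int on_context_switch(struct trace_buf *tb, struct event_log *log,
                             unsigned int spu, unsigned long new_ctx)
{
    size_t i;

    assert(tb != NULL && log != NULL);
    if (log->len + tb->len + 1 > CAP)
        return -1; /* not enough room; nothing is modified */

    for (i = 0; i < tb->len; i++)
        log->recs[log->len++] = tb->recs[i];
    tb->len = 0;

    log->recs[log->len].kind = REC_CTX_SWITCH;
    log->recs[log->len].spu = spu;
    log->recs[log->len].value = new_ctx;
    log->len++;
    return 0;
}
```

The point of the sketch is only the ordering: samples recorded before the switch always precede the switch record in the log, and the trace buffer is empty afterwards.
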
Yes, forgot about that little detail, again. Argh!
If we flush on SPU context switches, we are just left with how best to
manage storing the data. I see two choices: allocating more memory so we
can store data into a per-SPU buffer and then figuring out how to flush the
data to the kernel buffer, or just increasing the per-CPU buffer size so
we can store all of the node's SPU data in a single CPU buffer. Either
way we use more memory. In the first approach, I would need to allocate
SPU arrays large enough to store data for the worst case, where the trace
buffer is full. This would be 8*1024 entries. Typically only 1/3 of that
would be needed, as the hrtimer is set up to go off when the buffer is
about 1/3 full. This gives some headroom in case it takes time to get the
timer function call scheduled. In the second case, doubling the size of
the per-CPU buffers to handle the typical case of the trace buffer being
1/3 full would correspond to allocating an additional 2*1024/3 = 682 entries.
This would take less memory than the first approach and be simpler to
implement.
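
The two figures above can be reproduced with a few lines of integer arithmetic. This is just a worked check of the numbers in the text; the function names are made up for the example and the entry counts ignore the escape-code overhead, as the text does here.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SPUS_PER_NODE = 8,    /* from the text */
    TRACE_BUF_MAX = 1024  /* maximum samples in the trace buffer */
};

/* First approach: per-SPU arrays sized for a completely full trace buffer. */
static size_t per_spu_worst_case_entries(void)
{
    return (size_t)SPUS_PER_NODE * TRACE_BUF_MAX;
}

/* Share of the worst case actually needed at a given fill fraction num/den. */
static size_t per_spu_entries_at_fill(size_t num, size_t den)
{
    assert(den != 0 && num <= den);
    return per_spu_worst_case_entries() * num / den;
}

/* Second approach: the additional per-CPU buffer entries, 2*1024/3. */
static size_t extra_cpu_buf_entries(void)
{
    return (size_t)2 * TRACE_BUF_MAX / 3;
}
```

Integer division truncates, which is why 2*1024/3 comes out as 682 rather than 682.67.
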
Given the code, it would be easy to measure what the minimum number of
buffers needed is with the current patch that spreads the entries evenly
across all of the buffers. Then it is just a one-line hack to put the
data all into the current CPU buffer and re-tune to find the minimum number
of CPU buffers. I will do the experiments, as it might be helpful to see
what the memory cost would be.
Yup, not a big deal. I had thought about doing it for completeness' sake
but then opted not to, figuring I would get dinged for unnecessarily
grabbing a lock and doing extra work. I will put it in. That is what I
get for trying to second-guess the reviewers. :-)