PHOTOOG Photography writings by Olivier Giroux


Stopped Dead in My Tracks: Report on a Sobering Experience

Lately I’ve been scaling up one of our fine-grained analysis tools to make it useful on application/game rendering data. It’s no small feat, because between a "simple GPU test" and "Far Cry" there’s at least a 10K~100K× difference in data volume. After several weeks of profiling, refactoring and outright redesigning for both raw speed and work avoidance, I’m now at the "leverage multiple cores" step of my performance plan.

First you need to know that the processing done by this tool is highly serial in nature.

There is a strict ordering in the data that cannot be violated. The tool in question is a causality analyzer of sorts, so if you change the order the data is processed in, you might as well just use a random-number generator.

Parallel design update

The idea: refactor the application into a processing pipeline, splitting the chained serial tasks into sub-domains with little or no feedback between them, then executing each in its own thread, linked by asynchronous buffers.

I identified 3 major sub-domains, annotated here with their costs (figures from the profiler):

20%: Find, read and decompress data from disk
30%: Interpret source data, search causal dependencies, assign causal dependencies
50%: Walk pre-computed dependency tree, produce inputs for visualization tool

My gut tells me this should be at least a little faster with a 2nd CPU core if I split the domains successfully. I’m not claiming 2x, but maybe 1.3~1.7x? Let’s see…
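For what it’s worth, a back-of-the-envelope bound (my own arithmetic, not a claim from the profiler): a perfect pipeline runs at the speed of its slowest stage, so with stage fractions of 0.2, 0.3 and 0.5 the ceiling is

```latex
S_{\max} \;=\; \frac{1}{\max_i f_i} \;=\; \frac{1}{0.50} \;=\; 2.0\times
```

which makes 1.3~1.7x a reasonable guess once imperfect overlap and buffering overhead are accounted for.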

Qualitative Study

To coarsely evaluate the success of my refactoring, I implemented a simple concurrency/contention trace in my model:

(. is input file I/O, + is a front-end packet, - is a back-end packet, ! means the back-end starved, and ? means the front-end stalled)


As you can see from all the -+? sequences, the process is back-end loaded in the heavier 50% task, as expected. So far so good.

Quantitative Study

As expected, ~45% of all CPU cycles are spent walking the pre-computed dependency tree. That figure used to be 50% before I added some thread-related overhead, so its share of the total shrank. Ok -- but that work now has a CPU core all to itself… the whole run should be ~1.9x faster. That’s what the data says.

So is it faster now?

Not at all. It runs at precisely the same speed as before. Actually, it’s ~5% slower with the overhead. It’s as if the work, in unchanged code, magically inflated as more resources became available.

I’m quite deflated by this.


I must conclude that I did not, in fact, make more resources available. Or rather, the CPU was not the resource I needed more of. I find myself with CPU to burn in the end.

With a bit more research, it appears I am suffering from congestion at the memory interface. My application was already memory-latency-bound before I started parallelizing it, and since I have not increased the availability of that resource, I got no speedup.

As I engaged others on this topic, someone said:

The lesson should be that memory accesses are at least as important as instructions, and you should make sure your data structures match your access patterns.

Which sounds to me like good general wisdom, with a difficult twist. The one thing that makes life hard for me is that some of my most important data structures come from the C++ STL. They were carefully chosen for their algorithmic guarantees and stunning implementation quality.

I have a lot to think about.

Filed under: Uncategorized
Comments (2) Trackbacks (0)
  1. It might be memory-bound now, but as time goes by memory bandwidth will grow, which will eventually make your parallelization useful.

    But there is a lesson to be learned here, I’ll be sure to keep it in mind.

  2. The ratio between memory speed and CPU speed has been getting worse every year for the last 20 years. That’s why we have L1/L2 caches, and why they have to keep growing in size.

    As CPUs become more parallel, this problematic trend will accelerate, because the different threads on the CPU are all running different software that needs different pages of memory.

    If you want to know: this is why GPUs are 100x faster than CPUs at some problems. It’s not because they have more math units (which they do), it’s because we go crazy on memory-coherency-milking hardware.

    There was an interesting paper published recently about multi-cores decreasing the performance of memory-bound applications (commentary by Jon ‘Hannibal’ Stokes).

    We’re also somewhat defenseless against this kind of problem. Non-algorithmic changes in the code, like changing the size of a temporary buffer, can make a huge difference in performance. For instance, by changing the asynchronous queue between my threads from 1 MB to 32 KB I get 15% better performance, even though my threads now collide and are forced to sleep more often.
