In the many-core (100+) era, if some prophecies are true, the automatic CPU-managed caches (L1 through LN) will be universally replaced by explicit, software-managed hierarchical stores. This model of explicit store control can be seen today in the Cell processor and in the CTM / CUDA programming interfaces (*). It’s not for the faint of heart.
In particular, you should know that none of your favorite languages were designed with this execution model in mind, so the only way to program these machines effectively is to talk to the hardware in its own control language (in assembly, or in some flavor of C with annotations, like CUDA). By comparison, to run plain C/C++ the compiler would have to inject DMA transfers at every pointer dereference, which would suck monumentally.
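To make the model concrete, here is a minimal sketch in plain C++ of what explicit store control looks like: the code stages data through a small "local store" in tiles, the way a Cell SPE program would DMA blocks in and out of main memory. The buffer size and the function are illustrative inventions, not any real platform API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical "local store" capacity: on a Cell SPE this would be the
// 256 KB on-chip memory; here it's just a small stack buffer.
constexpr std::size_t kLocalWords = 64;

// Scale every element of `data` by 2, staging each tile through the
// local buffer instead of touching main memory element by element.
void scale_by_two(std::vector<int>& data) {
    int local[kLocalWords];  // stand-in for the explicit local store
    for (std::size_t base = 0; base < data.size(); base += kLocalWords) {
        std::size_t n = std::min(kLocalWords, data.size() - base);
        std::copy_n(data.data() + base, n, local);          // "DMA in"
        for (std::size_t i = 0; i < n; ++i) local[i] *= 2;  // compute locally
        std::copy_n(local, n, data.data() + base);          // "DMA out"
    }
}
```

Note what a hardware cache gives you for free: the tiling, the transfers, and the choice of tile size all become the programmer’s problem here.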
Now pause for a moment to consider how badly these explicit, programmer-controlled local stores violate universally-good things like encapsulation… they require you to know the low-level layout of every data structure you’ll ever touch (a prerequisite for loading them into the store). Do you know how std::string is laid out? Do you want to?
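A small sketch of why this matters: on mainstream implementations, a long std::string keeps its characters on the heap, behind a pointer. Naively copying the object’s bytes into a local store would copy that pointer, not the text. The check below is illustrative and relies on implementation behavior (small-string thresholds vary by standard library), not on anything the standard guarantees.

```cpp
#include <cstdint>
#include <string>

// Returns true if the string's character data lives outside the string
// object itself (i.e. on the heap) — the case where a shallow byte copy
// into a local store would transfer a pointer, not the characters.
bool data_is_external(const std::string& s) {
    auto obj_begin = reinterpret_cast<std::uintptr_t>(&s);
    auto obj_end   = obj_begin + sizeof(s);
    auto chars     = reinterpret_cast<std::uintptr_t>(s.data());
    return chars < obj_begin || chars >= obj_end;
}
```

On libstdc++, libc++, and MSVC, a 100-character string exceeds the small-string buffer, so `data_is_external` reports true; the exact threshold is an implementation detail you’d have to know to DMA the thing correctly.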
Friendly Neighborhood Ninjas
In this hypothetical future, Code Ninjas (people who choose to write assembly) will write all the inner loops and dictate all the key data structures, because they’ll be the only ones capable of getting decent speed out of the hardware. That is, either Ninjas will write these loops themselves, or regular people will write them using Ninja-built code generators. Here I’m referring to software like self-tuning BLAS implementations.
But aren’t compilers just big Ninja code generators too? Maybe, but automatic parallelization research has mostly led nowhere. Well, it did show that automatic parallelization is virtually impossible.
Compilers = Ninja Kryptonite
Looking back, Ninjas wrote all the loops long ago: before the optimizing compiler, all high-performance programmers were Ninjas. Optimizing compilers and automatic caches changed this by allowing just about anyone to write inner loops and expect reasonable efficiency in return.
Hence the “free lunch” era brought a democratization of software development. Beyond this, we also got the chance to write nicer-looking, “cleaner” code. Ninja code is almost always easy to spot by its cryptic – i.e. unmaintainable – nature.
For the past two decades Ninjas have been relegated to writing rare isolated functions (e.g. memcpy, strcmp) or peripheral libraries (e.g. BLAS, zlib, jpeg, GIL) that each perform a very narrow task. General applications are built out of these libraries, but each application defines the domain in which they are put to use (particularly for “standard” libraries).
The Ninjas Strike Back
In some cases these libraries are the inner loops of the program (when large structures are passed to them), but in general the granularity at which they are applied is too fine for the TLP requirements of the future. The cost of a TLP version of memset would, on average, be greater than the benefit. Here’s the pickle then: rich domain knowledge is required to write the next layer of the application (the one calling the libraries) in a TLP-friendly manner.
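The memset point can be sketched directly: a threaded fill only pays off above some size threshold, below which thread startup swamps the work. The threshold value and the function name here are illustrative assumptions; a real implementation would tune them per machine.

```cpp
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical cutoff below which spawning threads costs more than it
// saves; the real number depends on the machine and would need tuning.
constexpr std::size_t kParallelThreshold = 1 << 20;  // 1 MiB

void tlp_memset(unsigned char* p, unsigned char value, std::size_t n) {
    if (n < kParallelThreshold) {  // small fill: thread startup dominates
        std::memset(p, value, n);
        return;
    }
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (n + nthreads - 1) / nthreads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        if (begin >= n) break;
        std::size_t len = std::min(chunk, n - begin);
        workers.emplace_back([=] { std::memset(p + begin, value, len); });
    }
    for (auto& w : workers) w.join();
}
```

Even this toy version shows the problem: the library can’t know the right threshold, and the caller can’t know the library’s internals, which is exactly why the decision needs domain knowledge from the layer above.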
From a development perspective, this calls for an entirely new layer of libraries: separate domain-specific Ninja frameworks for every domain. Whatever it is you’re doing today, in the future you will need a Ninja to write a library or code generator that sits between your code and your libraries.
Look to libraries that cater to narrow domains for examples of what these will look like: Direct3D, ODBC. There are some world-class Ninjas working on those libraries. But are any Ninjas working on your problem today? Are all problem domains receiving their fair share of Ninja attention?
Do you feel the need to hire/train Ninjas into your organization?
Alas, VC++ no longer has an inline assembler for x64 targets. Think how much harder it will be for the world to produce new Ninjas now. Long ago I started on my own nunchuck skills with the VC++ inline assembler, because doing the same in MASM was needlessly painful for a beginner – even today I far prefer inline assembly for its convenience.
(*) : I think CUDA rocks. I think you would be hard pressed to match the performance of an algorithm’s CUDA implementation with a multi-core CPU implementation. That said, it’s totally a Code Ninja platform today – we need to work on it a bit more to democratize it.