My idea was to stop doing stupid raytracing of spheres and planes and move forward to the real thing. I read a couple of papers on how real men do raytracing, and I quickly learnt about modern raytracers, with their kd-tree, BIH or BVH acceleration structures and their super fast raytracing power, which was a couple of million rays per second (per core) on modern CPUs. That was the time when some people started exploring raytracing on the GPU and reached a pretty decent 10 to 50 million rays per second. However, I wanted to keep doing it on the CPU, as GPUs at the time had no more than one gigabyte of memory, which was completely insufficient for the type of 3D models I wanted to render. In fact I was quite motivated to be able to move the medium-size and big models I had at work at the time, which were in the order of 100 to 200 million triangles.
Following the work of Wald and others, I quickly got results similar to theirs, and I could render these massive models at 20 frames per second on quad-core machines. Luckily, I also had access to higher performance computers with up to 32 cores under 64-bit Windows, so on those machines I could really push screen resolutions up. In fact I got perfectly linear scalability up to 64 cores, which I was very proud of, as that's not as trivial a thing to do as it seems.
I implemented my own API to write shaders for the tracer (light and surface shaders) as well as custom primitives. In the end the project was pretty nice, but it had one big problem, which it shared with all the other high performance raytracers at the time (and still today, as far as I know): it was very fast for primary and shadow rays, but very slow for the other types of rays needed for Monte Carlo effects (ambient occlusion, global illumination, fuzzy reflections, depth of field, etc).
The big issue with current raytracers is that they are designed for casting highly coherent rays only. This means that the tracer assumes that when casting a series of rays (individually or in a group) they will all follow similar paths through the acceleration structure (because they have the same origin and similar directions, so they hit more or less the same objects and nodes of the tree). That assumption is made so that some cache and memory related optimizations can be done, because, not that surprisingly, modern tracers are memory bandwidth bound, not computation bound. The thing is that when it comes to Monte Carlo the coherence is broken, because one usually uses random rays to do the lighting integration, so the caches get thrashed all the time and the performance of the raytracer simply drops a lot.
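To put a rough number on that loss of coherence, here is a small sketch that compares the angular spread of a 4x4 packet of camera rays with that of 16 cosine-weighted random rays of the kind used for ambient occlusion. Everything in it is an assumption made for illustration (the pinhole camera, the 640x480 resolution, the 60 degree field of view, the function names); it is not code from the raytracer described here, and it only measures how far apart the ray directions are, not the cache behaviour itself.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def primary_packet(px, py, size=4, width=640, height=480, fov_deg=60.0):
    """Directions of a size x size block of camera rays starting at pixel (px, py).

    Hypothetical pinhole camera looking down -Z; all parameters are illustrative.
    """
    half = math.tan(math.radians(fov_deg) / 2.0)
    aspect = width / height
    dirs = []
    for j in range(size):
        for i in range(size):
            x = (2.0 * (px + i + 0.5) / width - 1.0) * half * aspect
            y = (1.0 - 2.0 * (py + j + 0.5) / height) * half
            dirs.append(normalize((x, y, -1.0)))
    return dirs

def hemisphere_packet(count, rng):
    """Cosine-weighted random unit directions around +Z (surface normal assumed +Z)."""
    dirs = []
    for _ in range(count):
        u1, u2 = rng.random(), rng.random()
        r = math.sqrt(u1)
        phi = 2.0 * math.pi * u2
        dirs.append((r * math.cos(phi), r * math.sin(phi), math.sqrt(1.0 - u1)))
    return dirs

def max_spread_deg(dirs):
    """Largest angle, in degrees, between any two directions in the packet."""
    worst = 1.0
    for a in range(len(dirs)):
        for b in range(a + 1, len(dirs)):
            d = sum(x * y for x, y in zip(dirs[a], dirs[b]))
            worst = min(worst, d)
    return math.degrees(math.acos(max(-1.0, min(1.0, worst))))

rng = random.Random(1)  # fixed seed so the run is repeatable
coherent = max_spread_deg(primary_packet(320, 240))
incoherent = max_spread_deg(hemisphere_packet(16, rng))
print(f"primary packet spread:    {coherent:.2f} degrees")
print(f"hemisphere packet spread: {incoherent:.2f} degrees")
```

With these settings the camera packet stays within a fraction of a degree, while the random packet fans out over a large part of the hemisphere, so its rays are unlikely to visit the same nodes of the tree.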
Anyway, the screenshots below were made with that raytracer I made in 2005.
Ambient occlusion on the Atrium.
Realtime raytracing (2005).