Optimizing for Intel
Posted by Colin
I hear a lot of things said about Intel’s chips and how fast or slow they are. Personally, I don’t really care for the Intel vs. PowerPC debate. The PowerPC has always been the faster chip in theory. Faster in theory does not translate to real-world performance, though, and the Intel chips are faster right now, which is what really counts for my code.
What I find more interesting is the AMD vs. Intel debate. Some people think AMD chips run faster than Intel chips, and they would be right. Others think Intel is faster than AMD, and they’re also right. The key is very much in how you write your code. AMD chips are better at handling general code. But port your existing PowerPC code over the right way, and you can make it really fly on Intel, running faster than it would on an AMD system. And writing fast code makes your users happy.
The good news is that Intel chips take very well to the optimizations you’ve probably already made for PowerPC. Mac users have had to live with multiple processors for years now. If you’ve threaded your code for multiple CPUs, you will notice very good performance gains even on single-core Intel chips. For a few years now Intel has been pushing a technology called HyperThreading. Think of it as basically a virtual second core. By tricking Mac OS X into thinking it is running on a dual-core system, you get more efficient scheduling, and your multithreaded code runs faster. The gains can be astounding. When I ran CineBench on my P4 3.8 against a neighbor’s Athlon64 3800+, the Athlon beat my Pentium in the single-threaded bench. As soon as CineBench moved to its multithreaded mode, my Pentium thrashed the Athlon64. Granted, the Athlon64 3800+ is not the high end of AMD’s line anymore, but then again, my P4 3.8 isn’t the high end of Intel’s either, as the P4 Prescott has already been replaced with the Pentium D.
So the lesson is: thread your code. The speed gains are usually somewhere in the ballpark of 30%. If you’re running a DTK you can play with your HyperThreading performance by toggling it on and off in the CPU menu. One trick I have been using is to load any necessary resources at launch time in a second thread, and to set up locks so that the main thread will wait for the resources to finish loading if the user requests them early. This relies on the user not being quick enough to actually reach the functions that require these resources, and therefore never noticing the time it takes to initialize them.
The real kicker is that the performance gain from doing this on a HyperThreading system is so great that the resources load nearly instantly in the second thread thanks to the improved scheduling, much faster than they would have loaded in the main thread while the rest of the app was starting. Your app appears to launch instantly to the user (under half a bounce on my Intel system). Meanwhile, the bulk of the initialization happens in the second thread so quickly that the user can’t move through the UI fast enough to reach anything that would hit that lock and force the program to wait. By the time the user has any chance of accessing those resources, the second virtual CPU has already loaded them.
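To make the trick concrete, here is a minimal sketch of it in C using pthreads. It is not lifted from my app: the `Resources` struct, the placeholder payload, and the function names `resources_begin_loading` and `resources_get` are all made up for illustration. The idea is simply that the loader thread does its work and then flips a flag under a mutex, while anyone who asks for the resources early blocks on a condition variable until that flag is set.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define RESOURCE_COUNT 1024

/* Hypothetical resource bundle; stands in for whatever your app loads at launch. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready_cond;
    bool            ready;
    pthread_t       loader;
    int             data[RESOURCE_COUNT];  /* placeholder payload */
} Resources;

/* Runs on the second thread: do the slow work, then publish it. */
static void *load_resources(void *arg) {
    Resources *r = arg;

    /* The slow part happens without holding the lock. */
    for (size_t i = 0; i < RESOURCE_COUNT; i++) {
        r->data[i] = (int)(i * 2);
    }

    pthread_mutex_lock(&r->lock);
    r->ready = true;
    pthread_cond_broadcast(&r->ready_cond);  /* wake any early callers */
    pthread_mutex_unlock(&r->lock);
    return NULL;
}

/* Call once at launch. Returns 0 on success, like pthread_create. */
int resources_begin_loading(Resources *r) {
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->ready_cond, NULL);
    r->ready = false;
    return pthread_create(&r->loader, NULL, load_resources, r);
}

/* Call from the main thread when the user needs the resources.
   Blocks only if the loader has not finished yet. */
const int *resources_get(Resources *r) {
    pthread_mutex_lock(&r->lock);
    while (!r->ready) {
        pthread_cond_wait(&r->ready_cond, &r->lock);
    }
    pthread_mutex_unlock(&r->lock);
    return r->data;
}
```

In an app you would call `resources_begin_loading` as early as possible during launch and `resources_get` from whatever UI action needs the data. If the loader has already finished, `resources_get` only takes and releases the mutex, so the user never sees a wait.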
The other good news is that Intel has been transitioning from HyperThreading to dual-core processors. This means that instead of running on a second emulated core, you’ll be running on a real second core, making code built for HyperThreading even faster. You’ll have covered all the optimization bases for Intel. Even the forthcoming Pentium M that Apple is most likely to use is dual core, and at the very least all currently shipping Pentium chips support HyperThreading. The Pentium Extreme Edition even offers two cores, each with HyperThreading, for a total of 4 logical cores.
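Since the number of logical cores varies from chip to chip, it can help to ask the system how many there are instead of hard-coding two threads. The sketch below assumes a Unix-like system that supports the `_SC_NPROCESSORS_ONLN` query for `sysconf` (Mac OS X and Linux both do, although it is not part of the base POSIX standard). The function name `logical_cpu_count` is my own invention for this example.

```c
#include <unistd.h>

/* Returns the number of logical processors currently online.
   A HyperThreading core counts as two here. Falls back to 1
   if the system cannot answer the query. */
long logical_cpu_count(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return (n < 1) ? 1 : n;
}
```

You could use the result to decide how many worker threads to spawn, so the same code scales from a single HyperThreading core up to a dual-core chip with HyperThreading.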
The second set of optimizations is SSE, which I won’t cover in depth since I never explicitly use it. The Pentium is built much more like the G4 than the G5, gaining a lot of its performance from its special vector processing units. If you optimized your code with the Accelerate framework to get it running well on the G4, this is good news. The bad news is that if you wrote your code using AltiVec instructions instead of Accelerate.framework, you have to rewrite all of that code using SSE instructions.
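To give a rough idea of what such a rewrite looks like, here is a small sketch that adds two float arrays four elements at a time with SSE intrinsics from `<xmmintrin.h>`. This is an illustration only, not code from a real port. The function name `add_floats` is hypothetical, and the `__SSE__` check assumes a GCC-style compiler that defines that macro when SSE is enabled. On AltiVec the inner step would have been a `vec_add` on `vector float` values. Here it becomes `_mm_add_ps` on `__m128` values.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics */
#endif

/* Adds a[i] + b[i] into out[i] for i in [0, n). */
void add_floats(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;

#if defined(__SSE__)
    /* Vector path: four floats per iteration. The unaligned
       load/store variants are used so callers do not have to
       guarantee 16-byte alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif

    /* Scalar path: handles the leftover elements, or everything
       when SSE is not available. */
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

If you went through Accelerate instead, the same operation is a single `vDSP_vadd` call and the framework picks the right vector unit for you, which is exactly why the Accelerate route ports so much more easily.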