Photon Propagation Code in Assembly
Sources

<--asm-->

ppc.asm dat.asm ini.asm
pro.asm rot.asm gsx.asm
<--awk-->
geo.awk ice.awk rnd.awk
<--cxx-->
ppc.cxx Makefile ice.cxx
<--datafiles-->
highsafeprimesupto2tothe32.txt icemodel.dat
geo-f2k wv.dat icemodel.par
A previous version is also available.
The largest improvement is observed on the Pentium M (my laptop).
                      flasher   f2k muon
c++:                  1.00      1.00
asm:
Core i7 2.67 GHz:     1.38      2.92
Intel Xeon 2.4 GHz:   1.49      2.30
Intel Xeon 3.2 GHz:   1.64      2.51
Pentium M 2.0 GHz:    2.16      3.45
AMD:
Opteron 2.0-2.4 GHz:  1.60      2.33

Tareq's test run took 31.9 minutes on a Core i7 (32-bit asm). Since 4 threads can run simultaneously on a single CPU, this is reduced further to about 8 minutes. Compared with 1.22 minutes on a 9800 GT GPU, this is only a factor of about 6.5 slower. More recent GPUs could increase this factor by ~2.5x, to about 16x.
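The arithmetic behind these figures can be checked with a short C++ sketch. The function names are hypothetical, and the code assumes, as the text does, that the run time divides evenly across the 4 threads.

```cpp
#include <cassert>

// Wall-clock minutes when a single-thread run is split evenly over `threads`.
// Perfect scaling is an assumption carried over from the text.
double cpuMinutes(double singleThreadMinutes, int threads) {
    assert(threads > 0);
    return singleThreadMinutes / threads;
}

// How many times slower the CPU run is than the GPU run.
double slowdownVsGpu(double cpuMin, double gpuMin) {
    assert(gpuMin > 0.0);
    return cpuMin / gpuMin;
}

// Slowdown against a GPU that is `gain` times faster than the reference GPU.
double slowdownVsNewerGpu(double slowdown, double gain) {
    return slowdown * gain;
}
```

With the quoted inputs, cpuMinutes(31.9, 4) is just under 8 minutes, the slowdown against the 9800 GT is about 6.5, and a GPU 2.5 times faster raises that to about 16.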
There are a number of differences between the Assembly and c++ implementations of ppc:

calculation precision:
  c++: double precision everywhere
  Assembly: limited precision: mostly single precision, in some cases even lower (in the direction-vector normalization calculation)
wavelength dependence:
  c++: full 6-parameter ice model
  Assembly: tabulated in 10 nm bins
random number generator:
  c++: rand() of stdlib
  Assembly: 32-bit base multiply-with-carry, 2^23 different (normalized!) numbers
Several minor differences in conditional statements (start or end on an OM, do not exit OMs: only enter, etc.)
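For readers unfamiliar with the generator named above, here is a minimal C++ sketch of a multiply-with-carry step with base 2^32. It is not the code in ppc.asm: the struct name, the seeding rule, and the multiplier passed in by the caller are all assumptions made for illustration. The float conversion assumes that the count of 2^23 values corresponds to the 23-bit single-precision mantissa.

```cpp
#include <cassert>
#include <cstdint>

// Multiply-with-carry generator, base 2^32 (illustrative sketch only).
// The 64-bit state packs the current value x in the low word and the
// carry c in the high word.
struct Mwc32 {
    std::uint64_t state;       // (c << 32) | x
    std::uint32_t multiplier;  // hypothetical value supplied by the caller

    Mwc32(std::uint32_t seed, std::uint32_t a) : state(seed), multiplier(a) {
        assert(a != 0);
        if (state == 0) state = 1;  // x = 0, c = 0 would stay at zero forever
    }

    // One step: t = a*x + c; the low word of t is the new x, the high word
    // the new carry. The product cannot overflow 64 bits.
    std::uint32_t next() {
        state = static_cast<std::uint64_t>(multiplier) * (state & 0xffffffffu)
              + (state >> 32);
        return static_cast<std::uint32_t>(state);
    }

    // Keep the top 23 bits and scale by 2^-23, giving one of 2^23 equally
    // spaced single-precision values in [0, 1).
    float nextFloat() {
        return static_cast<float>(next() >> 9) * (1.0f / 8388608.0f);
    }
};
```

A generator could be created as `Mwc32 rng(12345u, 4294957665u);`, where both numbers are arbitrary example values, and then queried with `rng.nextFloat()`. The quality and period of such a generator depend on the multiplier, which this sketch does not attempt to validate.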

Despite these differences, the two versions agree very closely.

See the main ppc page. A compiled static executable is also available.