Photon Propagation Code for the GPU
Sources

c++:
  ini.cxx  Makefile  f2k.cxx

cu:
  ppc.cu  src  pro.cu

datafiles:
  highsafeprimesupto2tothe32.txt  icemodel.dat  geo-f2k  wv.dat  icemodel.par

fast (revised) version of c++ ppc:
  ppc.cxx  Makefile  ice.cxx
Performance comparison on a Core i7 2.67 GHz (details):

                 flasher f2k    muon
  c++:               1.00       1.00
  fast c++:          1.33       1.87
  Assembly:          2.39       3.43
  GTX 295 GPU:       142.       263.
  SimProd GPU:       414.       435.

The Assembly numbers have improved compared to the previous version used in this study. However, the 64-bit code appears to run faster when compiled and run on the newer computer (cudatest), and is taken as the new 1.0 reference. On a 1.296 GHz GeForce GTX 295 GPU, Tareq's test run takes 18.69 seconds, 91.5 times faster than the Assembly code on a single CPU node. i3mcml achieves a comparable level of performance on this GPU.
The GPU version of ppc is very similar in implementation to the ppc in Assembly and to the "fast c++" version listed in the above tables.

The agreement between the two versions is very good.

The following is a GPU resources usage summary (v13):
  • 62496 bytes of the 64k GPU constant (cached) memory are used to hold the geometry of up to 5200 in-ice sensors and several constants.
  • 16296 bytes of the 16k GPU shared per-multiprocessor memory are used to hold geometry cell-association look-up tables (10x10x20 cells), absorption and scattering coefficients in up to 180 layers (34 ice tables are calculated for different wavelengths and are loaded in different execution blocks, possibly simultaneously on different multiprocessors), as well as some constants and pointers to input/output structures.
  • The program uses only 46 registers per thread and supports running up to 320 threads on a single multiprocessor. With little impact on speed, register usage can be reduced to 40 (by commenting out "#define ACCL1"), thus supporting up to 384 threads per multiprocessor.
  • The program uses 0 bytes of the slower local memory.
Main ppc page. Readme file.