Photon Propagation Code for the GPU
Sources

c++:
  ini.cxx  Makefile  f2k.cxx

cu:
  ppc.cu  src  pro.cu

datafiles:
  highsafeprimesupto2tothe32.txt  icemodel.dat  geo-f2k  wv.dat  icemodel.par

fast (revised) version of c++ ppc:
  ppc.cxx  Makefile  ice.cxx
Performance comparison on a Core i7 2.67 GHz (details):

                 flasher f2k    muon
  c++:               1.00       1.00
  fast c++:          1.33       1.87
  Assembly:          2.39       3.43
  GTX 295 GPU:       142.       263.
  SimProd GPU:       414.       435.

The Assembly numbers have improved compared to the previous version used in this study. However, the 64-bit code appears to run faster when compiled and run on the newer computer (cudatest), and is taken as the new 1.0 reference. On a 1.296 GHz GeForce GTX 295 GPU, Tareq's test run takes 18.69 seconds, 91.5 times faster than the Assembly code on a single CPU node. i3mcml achieves a comparable level of performance on this GPU.
The GPU version of ppc is very similar in implementation to the ppc in Assembly and to the "fast c++" version listed in the above tables.

The agreement between the two versions is very good.

The following is a GPU resources usage summary (v13):
  • 62496 bytes of the 64k GPU constant (cached) memory are used to hold the geometry of up to 5200 in-ice sensors and several constants.
  • 16296 bytes of the 16k GPU shared per-multiprocessor memory are used to hold geometry cell-association look-up tables (10x10x20 cells), absorption and scattering coefficients in up to 180 layers (34 ice tables are calculated for different wavelengths and are loaded in different execution blocks, possibly simultaneously on different multiprocessors), as well as some constants and pointers to input/output structures.
  • The program uses only 46 registers per thread and supports running up to 320 threads on a single multiprocessor. With little impact on speed, register usage can be reduced to 40 (by commenting out "#define ACCL1"), thus supporting up to 384 threads per multiprocessor.
  • The program uses 0 bytes of the slower local memory.
Main ppc page. Readme file.