Photon Propagation Code for the GPU
Sources

  • c++: ini.cxx, Makefile, f2k.cxx
  • cu: ppc.cu, src, pro.cu
  • data files (SPICE MIE): rnd.txt, cfg.txt, icemodel.dat, geo-f2k, wv.dat, icemodel.par, tilt.dat, tilt.par, as.dat
  • fast version of ppc: simply compile with "make cpu"
The previous version of ppc is also available.
Performance comparison on a Core i7 at 2.67 GHz (speeds relative to the "fast c++" version, which is taken as 1.00):

            flasher    f2k muon
Original    1/1.85     1/2.47
fast c++    1.00       1.00
Assembly    1.24       1.34
GTX 295     140.       123.

Assembly numbers improved compared to the previous version used in this study. The GPU code compiled for the CPU (the new "fast c++") is taken as the new 1.0 reference. These tests were run on the cudatest computer. On a 1.296 GHz GeForce GTX 295 GPU, Tareq's test run takes 18.22 seconds, 91.9 times faster than the Assembly code on one CPU node (consistent with the f2k muon column above: 123/1.34 ≈ 92). i3mcml achieves a comparable level of performance on this GPU.
The GPU version of ppc is very similar in implementation to the ppc in Assembly and to the "fast c++" version listed in the table above.
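Since the "fast c++" build is the GPU code compiled for the CPU, a single source has to build both ways. The sketch below shows one common way to achieve this by defining the CUDA qualifiers away when the file is compiled by a plain C++ compiler; the macro block and the sample function are an illustration of the technique, not necessarily the exact mechanism used in ppc.

#include <math.h>

// When compiled by nvcc, __CUDACC__ is defined and the CUDA keywords keep
// their meaning; when compiled by a plain C++ compiler ("make cpu" style),
// they are defined away so the same code builds as ordinary host functions.
#ifndef __CUDACC__
#define __device__
#define __host__
#define __global__
#define __constant__
#endif

__constant__ float c_vacuum = 0.299792458f;   // speed of light, m/ns

// Example of code shared verbatim between the GPU and CPU builds:
// sample an exponentially distributed propagation step.
__device__ float propagation_step(float rnd, float scattering_length)
{
    return -scattering_length * logf(rnd);
}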

The agreement between the two versions is very strong.

The following is a GPU resource usage summary (v27):
  • 62460 bytes of the 64k GPU constant (cached) memory are used to hold the geometry of up to 5200 in-ice sensors and several constants.
  • 9592 bytes of the 16k per-multiprocessor GPU shared memory are used to hold the geometry cell-association look-up tables (21x19 cells), absorption and scattering coefficients in up to 180 ice layers (33 ice tables are calculated for different wavelengths and are loaded in different execution blocks, possibly simultaneously on different multiprocessors), ice tilt data (from 6 dust logs), and some constants and pointers to input/output structures (a CUDA sketch of this memory layout follows the list).
  • The program uses 37 registers per thread and supports running up to 384 threads on a single multiprocessor (37 × 384 = 14208, which fits within the 16384 registers available per multiprocessor on GT200-class GPUs such as the GTX 295).
  • The program uses 0 bytes of the slower local memory.
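To make the memory budget above concrete, here is a minimal CUDA sketch of how detector geometry can live in constant memory while each execution block stages the ice table for its own wavelength into shared memory. The structure names, array sizes, and kernel are assumptions made for this sketch, not code taken from ppc.

#include <cuda_runtime.h>

struct sensor_t { float x, y, z; };         // one in-ice sensor (illustrative layout)
struct ice_t    { float abs, sca; };        // absorption/scattering per ice layer

#define NSENSORS 5200                       // up to 5200 in-ice sensors
#define NLAYERS   180                       // up to 180 ice layers
#define NWAVES     33                       // ice tables for 33 wavelengths

// Geometry and a few constants fit in the 64k cached constant memory:
// 5200 sensors * 12 bytes = 62400 bytes, close to the 62460 bytes quoted above.
__constant__ sensor_t d_geo[NSENSORS];

__global__ void propagate(const ice_t* ice_all, int nlayers)
{
    // Each block copies the table for its own wavelength bin (blockIdx.x)
    // from global memory into the 16k per-multiprocessor shared memory.
    __shared__ ice_t ice[NLAYERS];
    const ice_t* src = ice_all + blockIdx.x * nlayers;
    for (int i = threadIdx.x; i < nlayers; i += blockDim.x) ice[i] = src[i];
    __syncthreads();

    // ... photon propagation would go here, reading d_geo[] and ice[] ...
}

int main()
{
    static sensor_t geo[NSENSORS];          // would be filled from geo-f2k in practice
    cudaMemcpyToSymbol(d_geo, geo, sizeof(geo));   // upload geometry once

    ice_t* d_ice;
    cudaMalloc(&d_ice, NWAVES * NLAYERS * sizeof(ice_t));
    propagate<<<NWAVES, 384>>>(d_ice, NLAYERS);    // 384 threads per block
    cudaDeviceSynchronize();
    cudaFree(d_ice);
    return 0;
}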