Photon Propagation Code for IceSim

ppc (Photon Propagation Code) is an icetray module that creates Cherenkov photons along muon tracks, secondary cascades, and other interactions, and propagates them through ice with layered scattering and absorption (with layer tilt as surveyed by the dust loggers) until they are absorbed or hit an OM. Thus, ppc replaces photonics, photonics-interface, and hit-constructor.

The icetray version of ppc shares code with a stand-alone version, which is used for ice-properties studies (fits to flasher data) and quick checks on f2k muon data.

ppc has several substantial advantages over the established photonics-based simulation:

The basic version is enabled by default when compiling the icetray version of ppc. It is also possible to compile the GPU-accelerated version, which runs on CUDA-capable GPUs (recent NVidia video cards, series 8000 and up), or the CPU-only build of the GPU-accelerated code; both live in the ppc-gpu directory of the module.

The following table compares run times of the 3 ppc variants available within the icetray module (processing of 1 file of set 2972):

basic (default)           12h  1m 37s
CPU-only of ppc-gpu        3h  8m 38s
GPU version of ppc-gpu     0h  3m 45s

Here is a detailed summary:
basic:
real    721m37.433s
user    718m51.836s
sys     0m32.642s
     
CPU-only:
real    188m37.917s
user    187m21.671s
sys     0m18.441s
     
GPU:
Device time: 78644.6 [ms]
real    3m44.556s
user    2m20.385s
sys     0m2.896s

As indicated above, only 78.6 seconds are spent on the actual photon propagation; most of the rest of the CPU time (~140 seconds of user time) is spent in the other simulation modules (I3PMTSimulator, I3DOMsimulator, I3SMTrigger, etc.). Only a small portion of the CPU time (~10 seconds) is spent in the ppc module itself.
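The speedups implied by the timings above can be verified with a short calculation; all numbers below are taken directly from the quoted "real"/"user" lines, nothing else is assumed:

```python
# Sanity-check the per-file speedups implied by the quoted timings.

def seconds(minutes, secs):
    """Convert an m:s timing (as printed by `time`) to seconds."""
    return minutes * 60 + secs

basic_real = seconds(721, 37.433)     # basic version, 12h 1m 37s
cpu_only_real = seconds(188, 37.917)  # CPU-only build of ppc-gpu
gpu_real = seconds(3, 44.556)         # GPU version of ppc-gpu
gpu_user = seconds(2, 20.385)         # CPU time of the GPU run
device_time = 78.6446                 # GPU photon propagation, in seconds

# Per-file speedups of the two accelerated variants over the basic version:
print(round(basic_real / cpu_only_real, 1))  # 3.8  (CPU-only port)
print(round(basic_real / gpu_real, 1))       # 192.8  (GPU version)

# Of the ~140 s of CPU (user) time in the GPU run, only ~10 s belong to
# ppc itself; the rest goes to the other simulation modules.
print(round(gpu_user, 1))  # 140.4
```

This also shows why the GPU run's wall-clock time (~224.6 s) exceeds the device time (78.6 s): the other modules dominate once photon propagation is accelerated.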

The ppc-gpu was tested on the cudatest computer, which has 6 GPUs and 4 CPU cores (capable of running 8 threads). Details of the execution times on this computer are given in the Appendix. In summary, this computer can process a neutrino-generator file 127 times faster than an average CPU node used for the equivalent photonics-based production.

Given that neutrino-generator alone, which is excluded from this simulation chain (it is pre-calculated on a cluster of CPUs), takes ~7.5 times longer per file than processing of the file by ppc-gpu and the rest of the simulation chain, the cudatest computer can be well matched with over 45 CPU nodes.
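The node-matching figure follows from simple arithmetic, assuming one simulation chain is run per GPU (an assumption; the text does not spell this out):

```python
# If neutrino-generator takes ~7.5x longer per file (on a CPU node) than the
# downstream chain takes on one GPU slot of cudatest, then each of the 6 GPUs
# can consume files as fast as ~7.5 CPU nodes produce them.
nugen_to_chain_ratio = 7.5
gpus = 6
matched_cpu_nodes = nugen_to_chain_ratio * gpus
print(matched_cpu_nodes)  # 45.0
```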

The 6 GPUs of the cudatest computer were used at only ~35% capacity, as ppc-gpu has to wait for the other modules to finish before processing more photons. Thus, the acceleration factor could be improved even further by one of the following techniques:
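The ~35% figure is consistent with the single-file run quoted earlier, where the device is busy for only about a third of the wall-clock time (this back-of-the-envelope check assumes the single-file run is representative of production):

```python
# GPU utilization in the single-file run: device time over wall-clock time.
device_time = 78.6446         # seconds of actual GPU photon propagation
wall_clock = 3 * 60 + 44.556  # total "real" time of the GPU run, in seconds
utilization = device_time / wall_clock
print(round(utilization, 2))  # 0.35
```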

For the first quarter of 2010, NVidia has promised to release CUDA-capable video cards that will be more than 2 times faster than the existing hardware (as installed in the cudatest computer). From the information currently available, servers built around both current- and next-generation GPUs should have a similar performance/price ratio. Either way, custom-built computers appear to be much more cost-efficient than the pre-configured servers (which also need a host CPU system) sold by NVidia.

Appendix
run times:

Files of sets 1540 and 2972 processed with ppc-gpu are available in /data/ana/IC40/ppc/.