.. IceCube DNN reconstruction

.. _train_model:

Train Model
***********

Now that we have created our training data, we can move on to training the
neural network. We will have to perform two steps: create a data
transformation model and then train the neural network. The necessary scripts
for these steps are located in the main directory of the |dnn_reco| software
package.

As previously mentioned, we need to define settings in our central
configuration file. We will copy and edit a template configuration file for
the purpose of this tutorial.

.. code-block:: bash

    # Define the directory where we will store the training configuration file
    export CONFIG_DIR=$DNN_HOME/configs/training

    # create the configuration directory
    mkdir --parents $CONFIG_DIR

    # copy config template over to our newly created directory
    cp $DNN_HOME/repositories/dnn_reco/configs/tutorial/getting_started.yaml $CONFIG_DIR/

We now need to edit the keys ``training_data_file``, ``trafo_data_file``,
``validation_data_file``, and ``test_data_file``, so that they point to the
paths of our training data. To train our model we are going to use all hdf5
files ending in 0 for the validation set and all other files for training.
The transformation model will be built from the same files as we use for
training. We can make these changes by hand or by executing the following
command, which will replace the string '{insert_DNN_HOME}' with our
environment variable $DNN_HOME:

.. code-block:: bash

    sed -i -e 's,{insert_DNN_HOME},'"$DNN_HOME"',g' $CONFIG_DIR/getting_started.yaml

Keep in mind that you have to point to a different path if you are using the
data in ``/data/user/mhuennefeld/DNN_reco/tutorials/training_data``, or if
your data is located elsewhere.

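
For orientation, the relevant keys in ``getting_started.yaml`` could then look
roughly like the sketch below. The paths are placeholders, and the value
format shown here (lists of glob patterns) is an assumption; follow whatever
structure the copied template already uses and only swap in the location of
your own files:

.. code-block:: yaml

    # Hypothetical paths - adjust them to wherever your hdf5 files are located
    training_data_file: ['/path/to/training_data/*[1-9].hdf5']   # all files not ending in 0
    trafo_data_file: ['/path/to/training_data/*[1-9].hdf5']      # same files as for training
    validation_data_file: ['/path/to/training_data/*0.hdf5']     # all files ending in 0
    test_data_file: ['/path/to/training_data/*0.hdf5']
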

Cross-Check input data
======================

For convenience, there is a script that will count the number of events
provided in the input files defined in the keys ``training_data_file``,
``trafo_data_file``, ``validation_data_file``, and ``test_data_file``.
If you have a rough idea of how many events to expect, or if you simply want
to check the stats, you can run:

.. code-block:: bash

    # cd into the dnn_reco directory
    cd $DNN_HOME/repositories/dnn_reco/dnn_reco

    # count number of events
    python count_number_of_events.py $CONFIG_DIR/getting_started.yaml

to count the number of events that are found for the provided keys.
The output will look something like this:

.. code-block:: bash

    [...]
    ===============================
    = Completed Counting Events:  =
    ===============================
    Found 214559 events for 'test_data_file'
    Found 214559 events for 'validation_data_file'
    Found 1928037 events for 'training_data_file'
    Found 1928037 events for 'trafo_data_file'

For advanced users: one can add filters to apply when loading input data via
the ``filter_*`` keys in the configs. The file counting currently does *not*
take these filters into consideration, i.e. it counts all events available in
the files.

Create Data Transformation Model
================================

The training files that we will use in this tutorial were created with the
``pulse_summmary_clipped`` input format (see :ref:`Create Training Data` for
more info), which means that we reduced the pulses of each DOM to the
following summary values:

1. Total DOM charge
2. Charge within 500 ns of first pulse
3. Charge within 100 ns of first pulse
4. Relative time of first pulse (relative to total time offset)
5. Charge weighted quantile with q = 0.2
6. Charge weighted quantile with q = 0.5 (median)
7. Relative time of last pulse (relative to total time offset)
8. Charge weighted mean pulse arrival time
9. Charge weighted std of pulse arrival time

The input tensor which is fed into our network therefore has the shape
(-1, 10, 10, 60, 9) for the main IceCube array and (-1, 8, 60, 9) for the
DeepCore strings.

It is helpful to transform the input data as well as the labels. A common
transformation is to normalize the data to have a mean of zero and a standard
deviation of one. Additionally, the logarithm should be applied to features
and labels that span several decades. The software framework includes a data
transformer class that takes care of all of these transformations. All that
is necessary is to define the settings of the transformer class in the
configuration file. We are going to highlight a few options in the following:

``trafo_data_file``:
    Defines the files that will be used to compute the mean and standard
    deviation. Usually we will keep this the same as the files used for
    training the neural network (``training_data_file``).

``trafo_num_jobs``:
    This defines the number of CPU workers that will be used in parallel to
    load the data.

``trafo_num_batches``:
    The number of batches of size ``batch_size`` to iterate over. We should
    make sure that we compute the mean and standard deviation over enough
    events.

``trafo_model_path``:
    Path to which the transformation model will be saved.

``trafo_normalize_dom_data`` / ``trafo_normalize_label_data`` / ``trafo_normalize_misc_data``:
    If true, the input data per DOM, the labels, and the miscellaneous data
    will be normalized to have a mean of zero and a standard deviation of one.

``trafo_log_dom_bins``:
    Defines whether or not the logarithm should be applied to the input data
    of each DOM. This can either be a bool, in which case the logarithm is
    applied to the whole input vector if set to True, or a bool for each
    input feature. The provided configuration file applies the logarithm to
    the first three input features. You are free to change this as you wish.

``trafo_log_label_bins``:
    Defines whether or not to apply the logarithm to the labels. This can be
    a bool, a list of bools, or a dictionary in which you can define this for
    specific labels. If a dictionary is passed, the default value is False,
    i.e. the logarithm will not be applied to any labels that are not
    contained in the dictionary.

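
As a sketch, these options might be set along the following lines in
``getting_started.yaml``. The values are purely illustrative (the copied
template already contains working defaults), and the entry for
``trafo_log_label_bins`` is just an example of log-transforming a single
label:

.. code-block:: yaml

    # Illustrative values only - keep the defaults from the template unless
    # you have a reason to change them.
    trafo_num_jobs: 10              # CPU workers loading data in parallel
    trafo_num_batches: 100          # lower this (e.g. to 20) if you have few files
    trafo_model_path: '../data/trafo_models/my_tutorial_trafo_model.npy'
    trafo_normalize_dom_data: True
    trafo_normalize_label_data: True
    trafo_normalize_misc_data: True
    # apply the logarithm only to the first three DOM input features (the charges)
    trafo_log_dom_bins: [True, True, True, False, False, False, False, False, False]
    # dictionary form: labels not listed here default to False
    trafo_log_label_bins:
        EnergyVisible: True
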

Once we are certain that we filled in the correct values, we can create the
data transformation model. This step needs to process data as defined in the
``trafo_data_file`` key, because the mean and standard deviation depend on
the data.

.. code-block:: bash

    # cd into the dnn_reco directory
    cd $DNN_HOME/repositories/dnn_reco/dnn_reco

    # create the transformation model
    python create_trafo_model.py $CONFIG_DIR/getting_started.yaml

.. note::
    If you only created one training file, you will not have enough training
    data to generate 100 batches of 32 events. As a result, the above will
    fail with a ``StopIteration`` error. You will either have to process a
    few more training data files, or lower the number of batches that you
    would like to use to create the transformation model. You can do this by
    setting the ``trafo_num_batches`` key in
    ``$CONFIG_DIR/getting_started.yaml`` to a lower value such as 20.

Upon successful completion this should print:

.. code-block:: php

    =======================================
    = Successfully saved trafo model to:  =
    =======================================
    '../data/trafo_models/dnn_reco_11883_tutorial_trafo_model.npy'

Train Neural Network Model
==========================

The network architecture that will be used in this tutorial is the
``GeneralIC86CNN`` architecture, which is defined in the module
``dnn_reco.modules.models.general_IC86_cnn``. In our ``getting_started.yaml``
configuration file, we defined a smaller convolutional neural network with
4 convolutional layers for the upper and 8 convolutional layers for the lower
DeepCore part, while 8 convolutional layers are performed over the main
IceCube array. Every convolutional layer uses 10 kernels. The output tensors
of these three convolutional blocks are then concatenated and fed into a
fully connected sub network of 2 layers. Additionally, we define a second
fully connected sub network of 2 layers that is used to predict the
uncertainties on each of the reconstructed quantities.

You may change the architecture by modifying the settings below ::

    #----------------------
    # NN Model Architecture
    #----------------------

in the configuration file. You can also define your own neural network
architecture by changing the key ``model_class`` to point to your newly
defined NN class. Note that this class must inherit from the ``BaseModel``
class in the ``dnn_reco.modules.models.base_model`` module.

During training, we can provide weights for each of the labels. That way we
can force the training to focus on the labels that we care about. In this
tutorial we will focus on reconstructing the visible energy in the detector
(``EnergyVisible``), while also providing a smaller weight to the primary
energy of the neutrino (``PrimaryEnergy``). For throughgoing muons,
``EnergyVisible`` is the energy of the muon as it enters the detector. For
starting muons, it is the energy deposited by the cascade plus the energy of
the outgoing muon. There are several ways to define the weights for all
labels. The key ``label_weight_initialization`` defines the default weight
for the labels, and we can specify the weight of certain variables with the
``label_weight_dict`` key.

.. note::
    If certain variables are included in the logarithm/exponential
    transformation of the data transformer, but not trained, e.g. their
    weights are set to zero, then the values for these variables can drift
    out of bounds, leading to NaNs. If this happens, you can set the weights
    of the affected variables to very small positive values such as 0.00001.

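
A minimal sketch of such a weighting in the configuration file could look
like this (the numerical values are only an illustration):

.. code-block:: yaml

    # Illustrative weights: focus on the visible energy, keep a small weight
    # on the primary neutrino energy, and ignore all other labels.
    label_weight_initialization: 0.0
    label_weight_dict:
        EnergyVisible: 1.0
        PrimaryEnergy: 0.1
        # If a zero-weighted label is log-transformed and drifts towards NaNs,
        # give it a tiny positive weight such as 0.00001 (see the note above).
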

Other important settings for the training procedure are the ``batch_size``
and the choice of loss functions and minimizers, which are defined in the
``model_optimizer_dict``. Here, we will use a Gaussian likelihood as the loss
function for the prediction and uncertainty estimate. The structure of the
``model_optimizer_dict`` setting is a bit complicated, but it is very
powerful. We can define as many optimizers with as many loss functions as we
like. A few basic loss functions are already implemented in
``dnn_reco.modules.loss``. Amongst others, these include the Mean Squared
Error (MSE) and cross-entropy for classification tasks. Similar to the NN
model, you can utilize custom loss functions by adjusting the ``loss_class``
key to point to your custom loss function. Other, more advanced features are
available, such as defining learning rate schedulers, but these are not
covered in this tutorial.

Sometimes the Gaussian likelihood can be quite sensitive, especially when the
values are initially random. Limiting the value range of the uncertainty
output can help, or one can start with a more robust loss function such as
MSE or the Tukey loss (https://arxiv.org/abs/1505.06606), which is more
robust to outliers. A learning rate of 0.001 with the Adam optimizer is
almost always a good choice.

To start training we run:

.. code-block:: bash

    # If on a system with multiple GPUs, we can define the GPU device that we
    # want to use by setting CUDA_VISIBLE_DEVICES to the device number.
    # In this case, we will run on GPU 0.
    CUDA_VISIBLE_DEVICES=0 python train_model.py $CONFIG_DIR/getting_started.yaml

.. note::
    Running this on one of the cobalts should work, but will be extremely
    slow. In addition, tensorflow will distribute the workload on all CPUs it
    can find. This can be changed, but isn't currently a setting for the
    training (just for the I3Module). Hence, we can run this for a few
    iterations on the cobalts for debugging purposes, but it shouldn't run
    for longer amounts of time. When debugging, make sure to keep an eye on
    the usage via ``htop`` to ensure that the cluster remains usable for
    others. Training on a GPU is highly recommended. NPX isn't well suited
    for training, since the job ideally needs 1 GPU in addition to multiple
    CPUs, which may be difficult to obtain on NPX. Reducing the number of
    requested CPUs may help. In this case, the number of worker jobs for the
    data input pipeline should be reduced by setting the ``num_jobs`` key in
    the configuration. More info on how to run this in an interactive GPU
    session is provided :ref:`further below <train_model_interactive_gpu>`.
    It is recommended to run this on other resources, if available.

This will run for ``num_training_iterations`` iterations or until we kill the
process via ``ctrl + c``. The current model is saved every ``save_frequency``
(default value: 500) iterations, so you may abort and restart at any time.
Every call of ``train_model.py`` will keep track of the number of training
iterations as well as the configuration options used. This means that you do
not have to keep track yourself. Moreover, the currently installed python
packages and the git revision are logged. This information will be exported
together with the model, to ensure reproducibility.

The keys ``model_checkpoint_path`` and ``log_path`` define where the model
checkpoints and the tensorboard log files will be saved. The
``model_checkpoint_path`` also defines the path from which the weights of the
neural network will be recovered in a subsequent call to ``train_model.py``
if ``model_restore_model`` is set to True. If you wish to start from scratch,
you can set ``model_restore_model`` to False or manually delete the
checkpoint and log directory of your model. In order not to get models mixed
up, you should make sure that each of your trained models has a unique name
as defined in the key ``unique_name``. The easiest way to achieve this is to
have a separate configuration file for each of your models.

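
For illustration, the bookkeeping keys discussed above could be set along
these lines (all paths and values are placeholders, not the template
defaults):

.. code-block:: yaml

    # Give each trained model its own unique name, ideally via a separate
    # configuration file per model.
    unique_name: 'getting_started_model_01'
    model_checkpoint_path: '../data/checkpoints/getting_started_model_01/model'
    log_path: '../data/logs/getting_started_model_01'
    model_restore_model: True       # set to False to start training from scratch
    num_training_iterations: 1000000
    save_frequency: 500             # save a checkpoint every 500 iterations
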

.. note::
    Many more configuration options are available, of which some are
    documented in :ref:`Configuration Options`. The software framework is
    meant to provide high flexibility. Therefore you can easily swap out
    modules and create custom ones. We have briefly touched on the option to
    create your own neural network architecture here, as well as the option
    to add custom loss functions. More information on the exchangeable
    modules is provided in :ref:`Code Documentation`.

.. _train_model_interactive_gpu:

Running in interactive GPU session
==================================

Although not ideal, it is possible to run this on NPX. Here we will show how
to obtain an interactive GPU session with 4 CPUs and 6 GB of RAM. We will
then start the training in this interactive session.

First, we need to ask for an interactive job. For this we must log on to the
submit node (submitter.icecube.wisc.edu). Then we will define our
requirements and submit the request via:

.. code-block:: bash

    condor_submit -i -a request_cpus=4 -a request_gpus=1 -a request_memory=6GB

This may take a while, depending on how busy the cluster is. Reducing the
number of requested CPUs and RAM may help to get a free slot quicker. In this
case, the input data pipeline must be adjusted to use fewer workers and
possibly a smaller input queue (a minimal example is sketched at the end of
this section). If the job suddenly closes, this is often due to using more
memory than requested.

When we have successfully obtained a job, we can activate the environment and
start training:

.. code-block:: bash

    # Recreate environment variable
    export DNN_HOME=/data/user/${USER}/DNN_tutorial

    # load virtual environment (we don't need the icecube env for this)
    eval $(/cvmfs/icecube.opensciencegrid.org/py3-v4.3.0/setup.sh)
    source ${DNN_HOME}/py3-v4.3.0_tensorflow2.14/bin/activate

    # add paths to CUDA installation so that we can use the GPU
    export CUDA_HOME=/data/user/mhuennefeld/software/cuda/cuda-11.8
    export PATH=$PATH:${CUDA_HOME}/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${CUDA_HOME}/lib64

    # we may need to turn file locking off
    export HDF5_USE_FILE_LOCKING='FALSE'

    # go into directory
    cd $DNN_HOME/repositories/dnn_reco/dnn_reco

    # now we can start training
    # condor will have already set `CUDA_VISIBLE_DEVICES` to the
    # appropriate GPU that is meant for us. Therefore, we do not
    # need to prepend this as done further above in the tutorial.
    python train_model.py $DNN_HOME/configs/training/getting_started.yaml

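
As mentioned above, a minimal, illustrative adjustment of the input pipeline
for such a small interactive slot is to reduce the number of worker processes
in ``getting_started.yaml`` (the value below is only an example):

.. code-block:: yaml

    # fewer worker processes for the data input pipeline than on a dedicated
    # GPU machine; the input queue size can be reduced via its corresponding
    # key in the config as well
    num_jobs: 2
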