Testing DAQ

The IceCube DAQ from January 2005 to ???

Standard Operating Procedures

Mark Krasberg, Feb 7, 2005

This document briefly describes standard operating procedures for running the IceCube Testing DAQ, which was integrated at the South Pole in January/February of 2005. Common problems and their solutions are outlined.

  • Pre-Requisites
  • MOAT and lcchain.py and STF
  • Running DomCal
  • Generating steering files
  • Domhub-app Configuration File
  • Starting a run
  • What happens during a run?
  • Stopping a run
  • Taking multimon data
  • How to tell if a run is good
    Pre-Requisites

    Both the DOMhub and the DOMs are expected to be pre-installed with the necessary software.

    The DOMhub installation procedure (see http://docushare.icecube.wisc.edu/docushare/dsweb/View/Collection-979 ) consists of an automated "minimal-install" process, followed by a secondary post-installation process. The initial process uses a specialized kickstart file to install the operating system (Red Hat 9), at the end of which a "minimal install" script runs. The secondary process installs additional files from a large zip file.

    The DOM software installation is also a two-step process. It consists of uploading the chosen release to the DOMs, followed by uploading the latest version of DomCal to the DOMs.

    To upload the chosen release: (takes about 15 minutes)

    Issue command 'ps -ef' to verify that no programs are running which might hold on to the DOM device ports (look for and terminate programs like domterm and dtsx and domhub-app).
    Issue command 'ldall release.hex'. At the time of this writing the chosen releases could be found in /home/testdaq/releases/release308 and /home/testdaq/releases/release-pole-fb-01. The latter release has all the functionality of release308, and additionally includes flasherboard support.
    Issue command ' echo "0" > /proc/driver/domhub/blocking '
    Issue command ' off all '

    To upload DomCal: (takes about 3 minutes)
    Issue command 'on all'
    Issue command 'gotoiceboot'
    Issue command 'upload_domcal'. This will upload the file /home/testdaq/domcal5.bin.gz to all accessible DOMs.

    MOAT and lcchain.py and STF

    There are two main aspects of commissioning a DOM. The first is a series of low-level tests of the communications, using a program called MOAT. The second is a series of higher-level tests which cover, for example, the local coincidence signals and the functionality of the mainboard, the PMT, and the flasherboard. One such test you need to run as part of DOM commissioning is lcchain.py, which is present in the bin directory of each DOMhub.

    To run lcchain.py:

    Issue command 'ps -ef' to verify that no programs are running which might hold on to the DOM device ports (look for and terminate programs like domterm and domserv and dtsx and domhub-app).
    Issue command 'off all'
    Issue command 'on all'
    Issue command 'gotoiceboot'
    Issue command 'dtsxall'

    lcchain.py arguments indicate the top and/or bottom DOM in the chain you wish to test.

    for example:
    lcchain.py -h localhost -s 001 -e 710
    The above command will test the local coincidence signals for a complete DOMhub. "001" represents card 0, wire pair 0, DOM B (the topmost DOM on the DOMhub), and "710" represents card 7, wire pair 1, DOM A (the bottommost DOM on the DOMhub).
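    The three-character position codes used above can be decoded mechanically. A minimal shell sketch (the helper name decode_pos is hypothetical, not part of the DAQ tools):

```shell
# decode_pos CODE
# CODE is three characters: DOR card, wire pair, DOM (0 = A, 1 = B),
# following the lcchain.py convention described above.
decode_pos() {
    pos=$1
    card=${pos%??}                  # first character: DOR card
    pair=${pos#?}; pair=${pair%?}   # middle character: wire pair
    case ${pos#??} in               # last character: DOM A or B
        0) dom=A ;;
        1) dom=B ;;
    esac
    echo "card $card, wire pair $pair, DOM $dom"
}

decode_pos 001   # card 0, wire pair 0, DOM B (the topmost DOM)
decode_pos 710   # card 7, wire pair 1, DOM A (the bottommost DOM)
```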

    IceTop has a special set of "circular" LC hookups. An example of how to use lcchain.py on an IceTop station (if you are logged into sps-ithub-cont01) is:

    lcchain.py -i -h localhost -s 000

    The above command will test all DOMs within the station, which spans DOR cards 0 and 1. To test the next station you would run lcchain.py again with the command:

    lcchain.py -i -h localhost -s 200

    Another set of higher level tests which you need to do both as part of DOM commissioning and then again at regular intervals thereafter is to run the Simple Test Framework (STF). To run STF on a DOMhub: (takes about 30 minutes)
    Issue command 'ps -ef' to verify that no programs are running which might hold on to the DOM device ports (look for and terminate programs like domterm and dtsx and domhub-app).
    Issue command 'off all'
    Issue command 'source setclasspath /home/testdaq/work-stf/'
    Issue command 'java icecube.daq.stf.STF'

    A box will appear (if it does not, you probably cannot connect to the database on sps-dbs).

    Use pull-down menu 'Connect --> Open direct DOR session'
    This will turn on all DOMs and put them into the "STF" state.
    Use pull-down menu 'File --> Load tests'
    Click once on the directory "All-tests" (do not double click)
    Use pull-down menu 'Start --> Select all DOMs'
    Use pull-down menu 'Start --> Select all Tests'
    Use pull-down menu 'Start --> Run'

    You now have to answer three questions:
    How many iterations? Answer between 1 and 5
    Enter DOM temperature. Put in your best guess
    Test integrated DOMs. Answer "Y".

    STF will now open up a much larger window where you can see a grid of green (for passing) and red (for failing) boxes as the tests are run on each DOM.

    You can double click on the boxes to get information about individual tests. Additionally, there is a file on sps-stringproc01 called /mnt/data/testdaq/useful/mysql.txt which shows you some example queries you could use to extract STF information from the database at a later date.

    If a particular test fails for a DOM then that test should be repeated several times to verify whether or not the DOM has a problem.

    If you have trouble running STF because of a badly communicating DOM, you can try the following (no guarantees here; note that you may also want to verify that the flash download was successful - e.g. by putting the DOM into iceboot):
    Issue command "on all"
    Turn off the wire pair that is giving you problems. If card 4, wire pair 0 is the culprit, the command would be "off 4 0".
    Then try to run STF again. If that still doesn't work, repeat the above, followed by "gotoiceboot", and then run STF one more time.

    Running DomCal


    DomCal actually runs on the DOM itself and performs important calibrations (timing, pedestal pattern, high voltage). It communicates its results to the surface, where they can be saved as xml files. Additionally, the DomCal surface interface automatically overwrites the FAT-->domtune (MySQL) database with whatever high voltage DomCal has determined gives a gain of 10^7 for the DOM in question.

    Prior to running DomCal, DOMs should already be at their nominal operating temperature. If they are not, it is necessary to warm them up. Do this by turning them on and putting them in iceboot mode (they draw more current in iceboot mode than in configboot mode, and hence they warm up faster). A cold DOM probably takes about two hours to warm up properly.

    To run DomCal: (takes about 30 minutes)

    On each DOMhub for which you wish to calibrate DOMs:
    Issue command "ps -ef" to verify that no programs are running which might hold on to the DOM device ports (look for and terminate programs like domterm and dtsx and domhub-app).
    Issue command "off all"
    Issue command "on all"
    Issue command "gotoiceboot"
    Issue command "dtsxall"

    Then, from the appropriate string processor:
    Issue command "cd /mnt/data/testdaq/domcal"
    Issue command "nohup java icecube.daq.domcal.DOMCal DOMHUB-MACHINE-NAME 5000 64 /mnt/data/testdaq/domcal/ calibrate dom calibrate hv &". At the pole, DOMHUB-MACHINE-NAME is either sps-ichub-cont01, sps-ichub-cont02 or sps-ithub-cont01. Feel free to run DomCal on multiple DOMhubs in parallel.
    At the end (use "ps -ef" to see if DomCal is still running), make sure that all DomCal files have been successfully updated in the /mnt/data/testdaq/domcal output directory. Do "ls -altr" and "ls | grep domcal -c" to verify this.

    Problems? Check /mnt/data/testdaq/domcal/nohup.out. Database access errors can cause DomCal to hang. Problem PMTs can cause individual HV calibrations to fail. If you suspect an HV calibration problem, try running DomCal on one DOM without the argument "calibrate hv" - and make sure you get the port numbers right!
    It is important that there be one DomCal file for each DOM that the string processor is going to take data with. Hence, my advice is to have up-to-date DomCal files for all DOMs on sps-stringproc01 (the InIce string processor), and DomCal files for only IceTop DOMs on sps-icetop01 (the IceTop string processor). The up-to-date DomCal files must reside in the directory /mnt/data/testdaq/domcal on each machine.
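    The file-count check above ("ls | grep domcal -c") can be wrapped in a small helper. A sketch (the function name count_domcal and the expected count are purely illustrative):

```shell
# count_domcal DIR EXPECTED
# Reports whether DIR holds EXPECTED files whose names contain "domcal"
# (mirrors the 'ls | grep domcal -c' check described above).
count_domcal() {
    dir=$1
    expected=$2
    count=$(ls "$dir" | grep -c domcal)
    if [ "$count" -eq "$expected" ]; then
        echo "OK: $count DomCal files"
    else
        echo "WARNING: found $count DomCal files, expected $expected"
    fi
}
```

    For example, "count_domcal /mnt/data/testdaq/domcal 60" would check for one file per DOM on a string processor taking data with 60 DOMs.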

    Generating Steering Files

    Steering files for 60 DOMs are typically about 200 kBytes. It is somewhat impractical to write these by hand. There is a program called autogen-wrapper which will generate several different types of steering files in the directory from which you run it (so be careful where that is). Then you can select the steering files you want to use, and make whatever small modifications you deem necessary.

    To run autogen-wrapper:

    On each DOMhub for which you wish to include DOMs in your runs:
    Issue command "ps -ef" to verify that no programs are running which might hold on to the DOM device ports (look for and terminate programs like domterm and dtsx and domhub-app).
    Issue command "off all"
    Issue command "on all"
    Issue command "gotoiceboot"
    Issue command "dtsxall"

    Then, from the appropriate string processor:
    Edit the file /mnt/data/testdaq/bin/autogen-wrapper to include the DOMhubs for which you wish to generate steering files.
    Make any modifications you need to make at the bottom of /mnt/data/testdaq/bin/autogen-steering-LC in order to make a flasherboard steering file for the DOM of your choice.
    Issue command "autogen-wrapper" (remember to be careful about which directory you are in when you run it!)

    Now you are ready to modify the steering file parameters. "executionTime" is a typical parameter you may wish to modify - I advise you not to set the executionTime to be less than 45 seconds, because of configuration timing issues. Also, a 10 minute (600 second) run can produce a very large .hit file (400MBytes plus). If the .hit file is going to be copied over the satellite, I advise you not to use executionTimes above around 600 seconds.
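    The figures quoted above (a 600-second run producing a 400+ MByte .hit file) imply a rate of roughly two-thirds of a MByte per second. A back-of-the-envelope helper, purely illustrative, since actual rates depend on trigger settings and DOM count:

```shell
# hit_size_mb SECONDS
# Rough .hit file size estimate in MBytes, scaled linearly from the
# ~400 MBytes / 600 seconds figure quoted above. Illustrative only.
hit_size_mb() {
    echo $(( $1 * 400 / 600 ))
}

hit_size_mb 600   # 400
hit_size_mb 60    # 40
```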

    Domhub-app Configuration

    Domhub-app is the application which runs on the DOMhub, and is the interface between the DOMs and the data-collector (which typically runs on the string processor). The Domhub-app configuration file is /usr/local/etc/dh.properties. This file exists on each domhub.
    One aspect of dh.properties which you will likely need to make use of is the "exclude-a-DOM" feature. A badly communicating DOM can cause all TestDAQ runs to fail, and in this case it might be necessary to exclude one or more DOMs from all runs. For example, if you wanted to exclude the DOMs on DOR card 3, wire pair 1, you would change
    ignoreDOM=000
    to
    ignoreDOM=31A,31B
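    The edit can be scripted with sed. This sketch operates on a scratch copy, so the real /usr/local/etc/dh.properties on the DOMhub is left alone:

```shell
# Rewrite the ignoreDOM line in a scratch copy of dh.properties,
# excluding both DOMs on DOR card 3, wire pair 1 (as in the example above).
cfg=$(mktemp)
printf 'ignoreDOM=000\n' > "$cfg"

sed -i 's/^ignoreDOM=.*/ignoreDOM=31A,31B/' "$cfg"
grep '^ignoreDOM=' "$cfg"   # ignoreDOM=31A,31B
```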

    Starting a Run

    Create a directory to hold the steering files. (eg /mnt/data/testdaq/special_steering_files/iniceicetop/ )
    Put the steering files you wish to run into this directory.
    Edit /mnt/data/testdaq/bin/automate to point to this directory
    Edit /mnt/data/testdaq/bin/automate to point to the hubs you wish to use. Note that sps-ichub-cont01 and sps-ichub-dat01 are the same machine, but the latter name represents a much faster ethernet card (Gbit/s instead of 10Mbit/s).

    On each DOMhub for which you wish to include DOMs in your runs:
    Issue command "ps -ef" to verify that no programs are running which might hold on to the DOM device ports (look for and terminate programs like domterm and dtsx and domhub-app).
    Issue command "off all"
    Issue command "ready". You should see the following line at the bottom of domhubapp.log, which should be tailed on each hub as a result of the "ready" command (therefore, Ctrl-C will not stop domhub-app):

    [main] DEBUG (DOMHub.java:123) - Waiting for RMI method calls

    Then, from the appropriate string processor:
    Issue command "ps -ef" to verify that no programs are running which might be in conflict. You do not want multiple copies of "automate" or "testdaq" running.
    Issue command "go"

    InIce runs should be taken from SPS-STRINGPROC01
    IceTop runs should be taken from SPS-ICETOP01
    Combined InIce-IceTop runs should be taken from SPS-STRINGPROC01. Combined runs should have the phrase "IniceIcetop" in the steering file name.

    What happens during a run?

    "go" starts a program called "automate" which is put into the background. /mnt/data/testdaq/nohup.out contains the script output of "automate", which is often a good starting point if you think things are not working properly.
    "automate" calls a program called "increment_run_number.pl" which increments a counter in /usr/local/etc/.run_number
    A directory called /mnt/data/testdaq/outputXXXXXXX is created. "XXXXXXX" represents the newly assigned run number. This directory can be found on the string processor's local disk (for optimal file transfer).
    Testdaq starts up, and selects the first steering file (by alphabet) to use (TestDAQ will loop through all steering files indefinitely). You should see the DOMs powering up in the tailed domhubapp.log files.
    A file called /mnt/data/testdaq/outputXXXXXXX/testdaq.log is created. This file is extremely useful for diagnosing problems - especially grep for the keywords "ERROR" and "Exception".
    File names which are associated with the string processor name, the run number, and the steering file name are created in /mnt/data/testdaq/outputXXXXXXX. The TestDAQ file name extensions are ".hit", ".mon", ".tcal", ".extern" and ".xml" (the ".xml" file is the parsed steering file). An example file name is SPS-DAQ-STRINGPROC01_run0000499_LocalCoincidence-IniceIcetop-ATWD0.hit
    Data is accumulated. At the end of the run a directory on the big terabyte disk is created to store the run output. Additionally, a program called "background_it.pl" is "nohup"ped and "nice"d. The responsibilities of "background_it.pl" include:
    copy the TestDAQ data files onto the newly created directory on the large multi-terabyte disk. The data is arranged according to the naming convention /data/exp/IceCube/year/TestDAQ/month/data/...
    copy the TestDAQ log file to the directory above and rename it to $header.testdaqcontrollog, where $header includes the string processor name and also the run number.
    copy the DomCal .xml files from /mnt/data/testdaq/domcal to the above directory, and create a zip archive of them.
    Create a file called "TestDAQcompleted.txt" in the above directory to signal that TestDAQ has completed.
    run the DataQualityTest program on the files still in the local output directory.
    copy the results of the DataQualityTest program to the directory on the multi-terabyte disk.
    run the Monolith program on the data files still in the local directory. Monolith selects different configuration parameters depending on which machine the data was taken from (SPS-STRINGPROC01 or SPS-ICETOP01) and also on what the steering file name includes (for example, "DarkNoise" or "LocalCoincidence" or "Flasherboard", and also "IniceIcetop").
    copy the output of Monolith to the directory on the multi-terabyte disk.
    run the TinyDAQ program on the data files still in the local directory.
    copy the output of TinyDAQ to the directory on the multi-terabyte disk.
    finally, tar up the directory on the multi-terabyte disk and put the tarball into the spade directory, along with a .sem file, for pickup by the satellite file transfer program.
    Marc Hellwig will likely modify this part of the DAQ so that less information makes it to the north - for example, not every .hit file will make it into the final .tar file.
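    The TestDAQ file naming convention described above can be unpacked programmatically. A sketch (the helper name parse_hitname is hypothetical):

```shell
# parse_hitname FILE
# Splits a TestDAQ file name of the form
#   <string-processor>_run<number>_<steering-file-name>.<ext>
# into its three components, per the convention described above.
parse_hitname() {
    name=${1%.*}            # strip the extension (.hit, .mon, .tcal, ...)
    proc=${name%%_run*}     # string processor name
    rest=${name#*_run}
    run=${rest%%_*}         # run number
    steering=${rest#*_}     # steering file name
    echo "$proc $run $steering"
}

parse_hitname SPS-DAQ-STRINGPROC01_run0000499_LocalCoincidence-IniceIcetop-ATWD0.hit
# SPS-DAQ-STRINGPROC01 0000499 LocalCoincidence-IniceIcetop-ATWD0
```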

    Stopping a Run

    Runs should first be stopped from the String Processor, then from each of the DOMhubs.
    There are two ways to stop a run:

    To stop a run cleanly from the string processor, issue the command "stoptestdaqclean". This will kill the program "automate". This means that no new runs will be started. The current run will go to completion (unless either domhubapp or the DOMs are shut down prematurely on any of the DOMhubs). Stopping a run this way also makes it possible to later start up runs again without having to issue any commands on any of the DOMhubs. It is necessary to ensure that data-accumulation for the current run has finished before you try to start up any new runs (use "ps -ef").

    To stop all activity on the string processor immediately (including ongoing post-run processing via the program "background_it.pl"), issue the command "stoptestdaq". In this case, it is also necessary to issue the command "stoptestdaq" on each of the DOMhubs, since the DOMs will likely be left in a strange state as a result of the abrupt end to data-taking.

    It is usually desirable to leave the DOMs powered up with their high voltage on (for temperature and high voltage stability). If not taking TestDAQ data, then it is suggested you take multimon data instead.

    How to tell if a run is good

    Since this document was written, automated scripts have been put in place which detect errors and automatically send out email when they occur.

    TestDAQ runs need to be monitored regularly, especially since there has not yet been much experience running TestDAQ at the pole. Initially you should watch (via "tail -f") the domhubapp.log file on each DOMhub, the testdaq.log file on the string processor, the nohup.out file in /mnt/data/testdaq, and the size of the files in the output directory on the local disk. You should see the size of the files increasing (especially the .hit file), and you should not see any obvious errors in either the domhubapp.log file or the testdaq.log file.

    When the run is over, you should look at the .dataqualitylog when it becomes available. Look at the hit rates for each DOM to see if they make sense. Look for the message "Requested DOM not present in hit stream" - if you haven't explicitly turned this DOM off in the exclude-DOM section of dh.properties, then something is likely wrong.

    If you want to scan through a bunch of runs then cd to /mnt/data/testdaq/latest_data and issue the following commands:

    grep -i error */*.testdaqcontrollog

    grep -i exception */*.testdaqcontrollog

    grep -i requested */*.dataqualitylog
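    The three scans can be combined into one helper. A sketch (the name scan_runs is hypothetical; it simply wraps the greps above):

```shell
# scan_runs DIR
# Runs the three log scans above over every run directory under DIR,
# printing any matching lines. Missing log files are silently skipped.
scan_runs() {
    dir=$1
    grep -i error     "$dir"/*/*.testdaqcontrollog 2>/dev/null
    grep -i exception "$dir"/*/*.testdaqcontrollog 2>/dev/null
    grep -i requested "$dir"/*/*.dataqualitylog    2>/dev/null
    return 0
}
```

    For example, "scan_runs /mnt/data/testdaq/latest_data".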

    The major problem to look out for is the case where a run hangs - either while data is being accumulated or else when the run is stopping. In either of these cases TestDAQ will need to be shut down and restarted. The sooner you catch a problem like this the better.

    Taking multimon data

    Multimon is a program which monitors DOM scaler rates, temperatures, pressures and high voltage settings. The scaler rate is read out approximately once per second. You should run this program when you are not taking TestDAQ data. This way the DOMs will maintain their nominal operating temperature, and the high voltage will not be shut off for long periods of time.

    On each DOMhub for which you wish to include DOMs in your multimon runs:
    Issue command "ps -ef" to verify that no programs are running which might hold on to the DOM device ports (look for and terminate programs like domterm and dtsx and domhub-app).
    Issue command "off all"
    Issue command "on all"
    Issue command "gotoiceboot"
    Issue command "dtsxall"

    On the appropriate string processor issue the command:
    multimon-wrapper-icecube (for both inice hubs)
    or
    multimon-wrapper-icetop
    (multimon-wrapper-icecube1 and multimon-wrapper-icecube2 will only attempt to connect to one of the inice hubs, which may be the appropriate command depending on what is going on at the time).

    Note that it is necessary to shut down multimon (use the "ps -ef" and "kill" commands) and also to power off the DOMs before they can be utilized for some other purpose.

    The output of multimon is directed to /mnt/data/testdaq/monitoring/icecube or /mnt/data/testdaq/monitoring/icetop.

    Multimon for the inice hubs should be run on SPS-STRINGPROC01, while Multimon for the icetop hub should be run on SPS-ICETOP01.

    The command "java -jar mmdisplay.jar" invokes a nice tool for looking at the output of multimon (invoke from /mnt/data/testdaq on the string processors).