Operator's Guide to the South Pole Archival and Data Exchange (SPADE) Program

Author: Cindy Mackenzie (mackenzie@icecube.wisc.edu)
Update: 17 January 2005

A) Launching SPADE

As of the writing of this Guide (Jan 17th 2005), there was no automated way to start SPADE on boot-up of sps-sattx. This is expected to be set up by Experiment Control developers. For now and for future reference, this is the way to start and stop SPADE at the command line.

Log on to sps-sattx and become the jboss user. This sets up the following two aliases to start and stop JBoss, which in turn starts and stops SPADE:

    jboss
    nojboss

The command "jboss &" starts JBoss in background, and starts SPADE but leaves it in a "startup pause" state. You must explicitly tell SPADE to start running, via the operator's interface (more information below). You will see two running processes owned by the jboss user. JBoss also sends a great deal of logging output to the console. The logging information, in a slightly different form, is also sent to the file:

    /mnt/local/icecube/jboss/server/iceboss0/log/server.log

The command "nojboss" shuts down JBoss and SPADE.

B) Accessing the SPADE operator's interface

The operator's interface is the JBoss "JMX Console", which is running on port 8080 on sps-sattx. To access it, from your workstation use a command similar to the following. This command logs you on to sps-access, and tunnels port 8080 on sps-sattx to a convenient port on your local machine (here I've used 8082):

    ssh -L 8082:sps-sattx:8080 mackenci@sps-access

Once this command logs you on to sps-access, open a browser page on your workstation to localhost:8082. On this initial screen you will see a number of links under the heading "icecube". The last of this group:

    component=SPADE,host=localhost

is the link to the default page centrally providing all configuration options and functions for SPADE. However, JBoss does not furnish information about the names of parameters or their datatypes.
For this information, see one of the "acme-aspect" pages listed in the group prior to the last link. Each one of these pages represents a subset of the "component=SPADE" page and performs the same functions; however, the methods and configuration parameters are nicely labeled and described. The trick is to figure out which sub-page contains the function that you are interested in. The names of the links provide a breakdown between "configure," "control," and "monitor" aspects, for various subsystems of the program. These subsystems are:

    admin -- general or miscellaneous functionality
    archive -- taping
    email -- transfers to the north via email
    registry -- functions related to the database list of files for pickup
    scp -- transfers to the north via scp
    tdrss -- transfers to the north via TDRSS
    verify -- confirmation of transfers by receipt of emails from the north

You will want to investigate the contents of these pages. The vast majority of the configuration parameters are "set and forget", and you will never need to touch them. Many of the functions or "operations" are useful for debugging purposes only. A very few are critically important, and those are the ones which are numbered below (roughly in order of importance). The entire list of operations and configuration parameters will be described in a reference section to be added to this guide later.

1) Operation: startup

Invoking this method will start SPADE from its "startup pause". You will receive confirmation.

2) Operation: showSubsystemRunState

This shows the run state of the program as a whole and all its subsystems. On a "startup pause", SPADE will be "STARTING" and all subsystems will be "SHUT DOWN". After being started, all subsystems will go to a "RUNNING" state. The only other state you might see would be "SUSPENDED", if you invoke the "suspend" method from the interface or (less likely) the program puts itself into a SUSPENDED state after encountering a critical error.

3) Operations: suspend and resume

As the names imply, these cause SPADE to suspend and resume all operations, including fetching files, taping and sending them. If the log is spewing errors or some other catastrophe seems imminent, it is wise to suspend SPADE until the situation can be assessed.

NOTE: Suspending the program can be a VERY time-consuming process, because of the tape drives. On a suspend command, the drives will attempt to finish their current taping job, then come to a stop, and if a large file is being taped, this can take a while. It is not unreasonable for you to look at an hourglass cursor for up to five minutes after clicking "suspend." A wait much beyond this period (depending on the size of files being taped) is a sign of a serious problem, for example a tape drive that is not responding, so that SPADE cannot enter a suspended state. Watch the log output. You will have to make a judgement as to when it becomes necessary to shut down JBoss (and therefore SPADE) at the command line.

4) Operation: shutdown

This operation is intended to be part of later functionality. For now, to shut down SPADE, go to the command line and type "nojboss".

5) Operation: restartTapeDrive

When a tape drive encounters a critical error or needs a tape to be changed, it will send an alert email to the winterovers. The content of the email is generic, and more specific information about the action required is found in the log. When a tape drive error occurs, the drive is put into a STOPPED state until the operator takes the action requested. The operator must then go to the JMX console and explicitly restart the drive. The first parameter to the method is the drive name, always specified as the "nstN" form rather than the "stN" form, for example, "/dev/nst0". The second parameter is the name of the tape server, for example, "sps-tape01".

NOTE: Like the suspend function, restarting a tape drive can be a lengthy operation.
The tape must be rewound to the start and the first "header" archive read to identify the tape; then it is wound forward to the end of data. Allow several minutes to restart any tape drive.

6) Operation: updateRegistry

This is used when registering new sources of data with SPADE. More information below.

7) Operation: showActiveAlerts

This page describes the error conditions which SPADE has encountered within the past 48 hours. ERROR-level alerts are more severe than WARN-level alerts, and usually (but not always) require operator intervention of some kind. WARN-level alerts indicate an unusual event that generally doesn't require operator intervention. More information on any alert can be found in the JBoss server log; look up the alert by the date and time reported on this page.

8) Operation: signOffAlert

An alert which has been "signed off" has been acknowledged by the operator and will no longer appear on the Active Alerts page. The single parameter is the number of the alert from the Active Alerts page. This is intended as a convenience to clean up the alerts page, and has no other functionality.

9) "Show" Operations

The additional "show" operations are information pages only:

    showArchiveQueueStatus -- the allocation of tape drives among archive queues, and their current run status
    showFileRegistry -- the contents of the database registry table, all the sources of data which are known to SPADE
    showNetworkStatus -- the single parameter is the number of hours of network connectivity results to show
    showPerformanceHistoryByCategory -- most recent SPADE throughput by data category in table form, will become graphical in a later version of SPADE
    showPerformanceHistoryByPriority -- table of throughput by data priority (the means of transfer)
    showQueuedFiles -- all files currently waiting to be taped on all archive queues

C) Adding data sources

SPADE picks up and transfers files from data sources listed in its "file registry".
In a future version of SPADE, the registry will be maintained via a GUI, but for this first version it is a somewhat manual process. New data sources are added to the registry by an XML document ("registry XML") which is provided to SPADE by one of two methods: (1) a file containing the XML document is placed in a specific directory and then SPADE is instructed to read it; or (2) an email containing registry XML is mailed directly to SPADE.

For method (1), log on to sps-sattx and become the spade user. Change to the directory containing the archive of the registry contents:

    /mnt/data/spade/registry_files/archive

Each file in this directory represents a data source already registered with SPADE. In addition, the "register_test*" files are useful as templates for creating new registry XML files. There is one "test" XML file as an example of each priority of transfer (recall that the lowest priority, here called "raw", is not transferred to the north but only archived on tape):

    register_test01.raw.xml
    register_test02.tdrss.xml
    register_test03.scp.xml
    register_test04.email.xml

Choose a file of the desired transfer priority and copy it to a new file in this directory. The filename must begin with the "register_" prefix, but the remainder of the name should be descriptive to identify the data being registered. Edit the new file and modify the fields as appropriate for your data source. Here are the field values from the register_test02.tdrss.xml file:

    Filename_Prefix         test02.tdrss
    Binary_Suffix           .dat
    Semaphore_Suffix        .sem
    File_Host               sps-fpmaster
    File_Directory          /mnt/local/spade/testdata
    Priority                1 > Transfer via TDRSS
    Category                icwebcam1
    Make_Local_Copy         false
    Gzip_Local_Copy         false
    Local_Copy_Destination  localhost:/tmp
    Data_Description        this dataset is transferred by tdrss
    Sensor_Name             ICECUBE > IceCube

Briefly, to explain how SPADE receives files from data sources: SPADE polls registered directories on hosts for files to appear in pairs. Data producers should move a data file first into the directory, then atomically create a "semaphore file" to indicate that the data file is ready for pickup.
The semaphore file may be zero length, but if it contains any text, that text becomes a part of the metadata which accompanies the outbound data file. Semaphore files and data files must have the same filename, except for the extension. This filename begins with an unchanging prefix, ideally followed by a date/time stamp to make each filename unique. The length of the filename, not counting the extension, must be 31 characters or fewer. This filename (less the extension) becomes the primary identifier of the data in the data warehouse.

If the filename (less the extension) follows this convention for the date and timestamp:

    someprefix_yyyymmdd_hhmmss_sss

then the outgoing metadata will contain the given year, month, day, hour, minute, and second as the "start of data" date/time. The "sss" at the end of the filename may be any number of digits (as long as the overall length stays within the 31-character limit) and represents a span, in seconds, of the data contained in the file. The "end of data" date/time will be calculated and saved in the metadata as the start date/time plus the seconds span. Note that the first underscore character signifies the beginning of the date portion of the filename; no underscore should appear in the prefix.

With this in mind, the registry XML fields are as follows:

    Filename_Prefix -- the fixed prefix for all data and semaphore filenames produced at this source.

    Binary_Suffix -- extension for the binary or data file, can be any number of characters. The "dot" should be included.

    Semaphore_Suffix -- extension for the semaphore file, same rules as for the binary extension.

    File_Host -- hostname of the data source computer. The spade user must exist on this host, with the keys set up appropriately so that spade can connect by key-based authentication. Use sps-fpmaster as a source for the spade user's keys.

    File_Directory -- directory of the data source. Note that this directory must be writable by the spade user, because it deletes the files after they are fetched.

    Priority -- the mode of transfer for the data. The contents of this field are strictly controlled; the options are:

        "0 > Archive only"
        "1 > Transfer via TDRSS"
        "2 > Transfer via scp"
        "3 > Transfer as email attachment"

    Leave off the quotes when entering the priority into the XML. Apart from the quotes, the string should be entered exactly as shown.

    Category -- the general classification of data. These are the available categories:

        unclassified  raw  filtered  calibration  monitoring
        sps-expcont  sps-evbuilder  sps-fpmaster  sps-fpslave01  sps-fpslave02
        sps-stringproc01  sps-sattx
        hole21  hole29  hole30  hole39
        icwebcam1  icwebcam2  icwebcam3  icwebcam4
        TestDAQ

    Make_Local_Copy, Gzip_Local_Copy, Local_Copy_Destination -- self-explanatory. The destination must be accessible for scp by the spade user, using key-based authentication. The copy that is made will be a tarfile or a gzipped tarfile containing two files: the original data file received, and the metadata file created by SPADE to accompany the data.

    Data_Description -- 80 characters, will be saved as part of the permanent data warehouse record for this data.

    Sensor_Name -- the project producing the data. The relevant options are (again, leave off the quotes):

        "ICECUBE > IceCube"
        "ICETOP > IceTop"
        "EHWD > Enhanced Hot Water Drill"

Once you have completed editing the registry XML file, copy the file one directory level up, to:

    /mnt/data/spade/registry_files

Go to the operator's console page, and invoke "updateRegistry". You will see the results of the registry addition on screen and in the log. Any errors in the registry XML are shown; the parser error syntax can be cryptic, but you should be able to tell what you need to fix.
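
The filename convention described earlier in this section can be checked before a new data source starts producing files. The following is a minimal Python sketch and is not part of SPADE: the function names are hypothetical, the name is assumed to be passed without its extension, and the date and time parts are assumed to be exactly 8 and 6 digits.

```python
from datetime import datetime, timedelta

# Limit on the filename length, not counting the extension (from this guide).
MAX_NAME_LENGTH = 31


def fits_length_limit(name):
    """Return True if the name (without extension) is within the 31-character limit."""
    return len(name) <= MAX_NAME_LENGTH


def data_time_span(name):
    """Return (start, end) datetimes for a name like someprefix_yyyymmdd_hhmmss_sss.

    Returns None when the name does not follow the date/time convention.
    """
    parts = name.split("_")
    # The prefix may not contain an underscore, so exactly four parts are expected.
    if len(parts) != 4:
        return None
    prefix, date_part, time_part, span_part = parts
    if not prefix or len(date_part) != 8 or len(time_part) != 6:
        return None
    if not (date_part.isdigit() and time_part.isdigit() and span_part.isdigit()):
        return None
    try:
        start = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
        span_seconds = int(span_part)
    except ValueError:
        # Digits were present but did not form a valid date, time, or span.
        return None
    # "End of data" is the start date/time plus the span in seconds.
    end = start + timedelta(seconds=span_seconds)
    return start, end


if __name__ == "__main__":
    name = "webcam_20050117_123000_600"  # hypothetical name, extension removed
    print(fits_length_limit(name))
    print(data_time_span(name))
```

A name that returns None here can still be picked up by SPADE; it simply would not carry the "start of data" and "end of data" values derived from the filename.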
For a formal definition of the XML document, see the XML schema file:

    /mnt/data/spade/registry_files/FileRegistry.xsd

Briefly, this is how to update the registry using method (2), email directly to SPADE. Create an email with the message body containing the registry XML. The email must be sent as plain text, and there should not be any extra characters (such as a signature) in the body. Moreover, the email client must not wrap the lines of XML. Give the email subject line the same name as you would use for a registry XML file: the prefix "register_" followed by the descriptive name of what you're registering. Send the email to spademail@amanda.spole.gov. SPADE will read the email on its "frequent tasks" timer, which runs every 10 minutes by default.

D) Archive queues and tape labeling scheme

SPADE is designed to have a flexible taping system, where tape drives are assigned to different types of data but can be re-assigned or de-assigned fairly easily. There are two main archives being created, of "raw" and "filtered" data, and potentially more than one copy of any given file can be made. SPADE has the concept of an "archive queue", a combination of data type and copy number. For year one, there will only be one copy of data made, so the names of the two archive queues in use are:

    RAW-copy1
    FILTERED-copy1

Multiple tape drives can be assigned to a pool for use by an archive queue. SPADE will cycle the taping jobs between the various drives in the pool for any given archive queue. For year one, there are two drives assigned to each archive queue:

    RAW-copy1
        sps-tape01:/dev/nst0
        sps-tape01:/dev/nst1

    FILTERED-copy1
        sps-tape02:/dev/nst0
        sps-tape02:/dev/nst1

So sps-tape01 can be thought of as the "raw" tape server, and sps-tape02 as the "filtered" tape server.

The labels to be put on the tapes include the year, a tape index number that resets to 1 each year, and the archive queue (data type and copy number).
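
To make the label layout concrete, here is a minimal Python sketch that builds a label from the year, the tape index and the archive queue name. It is an illustration only: the function name is hypothetical, and the pattern tape_<year>_<index>_<TYPE>_copy<N> is inferred from the year-one labels listed in this section, not taken from SPADE's own code.

```python
def tape_label(year, index, archive_queue):
    """Build a tape label such as tape_2005_1_RAW_copy1.

    archive_queue is a queue name such as "RAW-copy1"; the hyphen in the
    queue name becomes an underscore in the label (inferred from the
    examples in this guide).
    """
    data_type, copy_name = archive_queue.split("-", 1)
    return "tape_{}_{}_{}_{}".format(year, index, data_type, copy_name)


if __name__ == "__main__":
    # Hypothetical next tape for the raw archive queue.
    print(tape_label(2005, 5, "RAW-copy1"))
```

The label that SPADE actually expects is always the one reported in the log, so treat this only as a way to read the labels, not to generate them.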
The initial tape labels corresponding to the above drives for year one are:

    RAW-copy1
        sps-tape01:/dev/nst0 -- tape_2005_1_RAW_copy1
        sps-tape01:/dev/nst1 -- tape_2005_3_RAW_copy1

    FILTERED-copy1
        sps-tape02:/dev/nst0 -- tape_2005_2_FILTERED_copy1
        sps-tape02:/dev/nst1 -- tape_2005_4_FILTERED_copy1

The index number is assigned sequentially as new tapes are needed, so the label of the next blank tape depends on which archive queue fills a tape first. It will be either:

    tape_2005_5_RAW_copy1

or:

    tape_2005_5_FILTERED_copy1

When a tape becomes full, it is ejected and an email is sent to the winterovers list. The log will have information on what label should be on the tape just ejected, and what label to put on the new tape. After labeling and inserting the new tape, restart the tape drive by invoking "restartTapeDrive" from the operator's console with the drive name and host name.

If a tape drive needs to be taken out of the available pool for SPADE's use, for example if some manual taping needs to be done, then invoke the operation "makeTapeDriveNotAvailable" from the console page, again with the drive name and host. Invoke "makeTapeDriveAvailable" to put the drive back in the pool for use.

One last general note about the tape drives. A separate application (called "tapeserver," logically) runs on both sps-tape01 and sps-tape02. It is a slave, taking direction from SPADE about what to do with the drive. If a drive is giving problems, or can't be restarted, consulting the tapeserver logs can be helpful. They are located on both sps-tape01 and sps-tape02 at:

    /mnt/data/spade/tapeserver/logs

There is one log per tape drive served. If you see errors referring to "RMI", or there seems to be no communication occurring between SPADE and the tapeserver applications, the best bet is to reboot the tape server, which will restart the tapeserver application.

E) The most likely alerts

SPADE does its best to recover from error conditions and continue on.
Some WARN-level messages are a part of normal functioning of the system, but generally speaking, ERROR-level messages need attention.

Typical alerts which should not cause concern:

"Unable to send file pair by scp; pair will be tried again."
When the satellite sets for the day, SPADE might be in the middle of transferring a file by scp. The transfer might time out or simply fail, and you will see this WARN-level message in the log. It is not a concern; the scp will be attempted again when the satellite comes up.

"A registry table entry marked as unused has been referenced."
A data source registered with SPADE can be "turned off", if it will no longer be used, to prevent SPADE from polling for data. This WARN-level alert, in the unlikely event that you see it, will most likely refer to test data sources that were registered when SPADE was installed, which have since been turned off. SPADE's normal cleanup activities might deal with some of the test data, which would cause this warning to be triggered.

"Length of remote file name exceeded limit for DIF Entry_ID. Name truncated."
A file fetched from a data source has a filename longer than 31 characters (not counting the extension). SPADE must truncate the filename to follow the rules for a metadata file in Directory Interchange Format (DIF).

Alerts which SHOULD cause concern:

"Unexpected error when executing a database query."
This could indicate a connectivity problem with the database server, sps-dbs. Under normal circumstances, database queries are designed NOT to fail.

"Unable to delete remote file(s)."
The spade user apparently does not have write access to a directory where it is fetching files. It will continue to fetch these files until they can be deleted, resulting in multiple occurrences of this error.

"Unable to scp file(s) from remote host to local directory."
SPADE cannot scp a file from a remote host. This could be a network problem.

"Unable to scp local file(s) to remote host."
Again, probably a connectivity issue, since SPADE cannot send the file this time.

"A system component has encountered a fatal error and suspended processing!"
This alert would indicate a catastrophic error, causing SPADE to suspend all operations. Just clicking "resume" on the operator's interface might not solve the problem. The most likely cause of a suspend processing error is a disk being full. The log should be examined before attempting to restart SPADE.

For reference, these are all the possible alerts that could appear on the Active Alerts page. Remember, these are all generic messages; for more specific information, consult the JBoss log at /mnt/local/icecube/jboss/server/iceboss0/log/server.log. Note that only ERROR-level alerts from the TAPE subsystem send email to the winterovers.

| subsystem | alert_level | alert_name   | alert_description
+-----------+-------------+--------------+---------------------------------------------------------------------------------
| UNKNOWN   | ERROR       | Error        | Unknown error was reported, no handler, processing suspended.
| DISK      | ERROR       | DiskIO       | Disk I/O error found; the disk could be full.
| TAPE      | ERROR       | Serialize    | Unable to serialize file pair object for later archival.
| TAPE      | ERROR       | Deserialize  | Unable to deserialize a saved file pair for archival.
| TAPE      | ERROR       | Offline      | A tape drive is offline or busy, please check the log and make the drive ready.
| TAPE      | WARN        | Compression  | The tape compression could not be set, using defaults.
| TAPE      | ERROR       | Application  | The tape server application returned an error, please check the logs.
| TAPE      | WARN        | EOFFound     | Unexpected EOF returned when reading archive. Possible file loss!
| TAPE      | ERROR       | WrongTape    | Wrong tape found in a drive. Please check the log and load the correct tape.
| TAPE      | ERROR       | BadHeader    | Bad header file found on a tape. Please load a blank tape.
| TAPE      | ERROR       | Mismatch     | Tape contents don't match the database record. Possible data loss!
| TAPE      | ERROR       | Header       | Tape header file could not be written to a blank tape.
| TAPE      | ERROR       | ChangeTape   | A tape must be changed, please check the log for the drive and label details.
| TAPE      | ERROR       | Handling     | Unexpected error returned when reading or handling tape; check the log.
| TAPE      | ERROR       | Unknown      | A tape drive did not respond as expected. Please check the log.
| TAPE      | WARN        | NotFound     | Unable to locate file on tape for verification; will tape it again.
| TAPE      | WARN        | Checksum     | Checksum verification failed for taped file; will tape it again.
| TAPE      | WARN        | Cleanup      | The remote application was unable to delete obsolete data files.
| TAPE      | ERROR       | NotTaped     | A file could not be taped after numerous attempts. Please check the log.
| DB        | ERROR       | Registry     | File registry not prepared to handle this file pair.
| DB        | ERROR       | Priority     | Invalid priority value found in registry.
| DB        | ERROR       | Category     | Invalid category value found in registry.
| DB        | ERROR       | DataError    | Unable to retrieve needed information from the database.
| DB        | ERROR       | DbError      | Unexpected error when executing a database query.
| DB        | WARN        | Registry     | A registry table entry marked as unused has been referenced.
| MAIL      | ERROR       | Send         | Unable to send outgoing email.
| MAIL      | ERROR       | Unrecognized | Incoming email not recognized; please check the email file on disk.
| MAIL      | ERROR       | Receive      | Unable to receive incoming email.
| XML       | ERROR       | Registry     | Unable to process registry XML file.
| XML       | ERROR       | Verification | Unable to process verification XML file.
| SSH       | ERROR       | Connection   | Unexpected SSH connection problem, see log for details.
| SSH       | WARN        | HostKey      | Remote host key unrecognized, has been added to known_hosts.
| SSH       | ERROR       | Delete       | Unable to delete remote file(s).
| SSH       | ERROR       | List         | Unable to list remote directory with sftp.
| SSH       | ERROR       | ScpFrom      | Unable to scp file(s) from remote host to local directory.
| SSH       | ERROR       | ScpTo        | Unable to scp local file(s) to remote host.
| PAIR      | WARN        | Filename     | Length of remote file name exceeded limit for DIF Entry_ID. Name truncated.
| PAIR      | WARN        | Fetch        | Remote files did not appear to be a well-defined pair. Could not fetch.
| PAIR      | ERROR       | Fetch        | Fetch/setup of pair halted due to previous errors.
| PAIR      | ERROR       | Send         | Processing/sending of pair halted due to previous errors.
| PAIR      | ERROR       | Verify       | Verification of pair halted due to previous errors.
| PAIR      | WARN        | Resend       | Pair has met or exceeded the send limit and will not be re-sent.
| PAIR      | ERROR       | Missing      | A file or files corresponding to the pair was not found where expected.
| PAIR      | WARN        | Outbox       | Pair could not be sent and has been moved to the outbox directory.
| TDRSS     | FATAL       | File         | TDRSS outbound file not found; possible data loss! Processing suspended.
| TDRSS     | WARN        | File         | TDRSS outbound file was an unexpected length; possible data corruption!
| TDRSS     | ERROR       | Scp          | TDRSS outbound file could not be copied to TDRSS server!
| TDRSS     | ERROR       | Send         | Exception caught when attempting to send file pair by satellite.
| SCP       | ERROR       | Serialize    | Unable to serialize file pair object for later scp.
| SCP       | ERROR       | Deserialize  | Unable to deserialize saved file pair for scp.
| SCP       | WARN        | Send         | Unable to send file pair by scp; pair will be tried again.
| NET       | WARN        | RemoteHost   | Unable to perform DNS lookup on remote host.
| SYSTEM    | ERROR       | Suspend      | A system component has encountered an error and suspended processing!
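
When reviewing many alerts at once, it can help to separate the ones that will have sent email to the winterovers from the ones that will not. The following Python sketch is a hypothetical helper, not part of SPADE: it assumes alert rows have been copied as pipe-delimited text in the four-column layout of the table above, and it applies the rule stated in this section that only TAPE subsystem, ERROR-level alerts send email.

```python
FIELDS = ("subsystem", "alert_level", "alert_name", "alert_description")


def parse_alert_row(line):
    """Split one pipe-delimited table row into a dict keyed by column name."""
    text = line.strip()
    if text.startswith("|"):
        text = text[1:]
    # Split on the first three separators only, so the description stays whole.
    values = [value.strip() for value in text.split("|", 3)]
    if len(values) != len(FIELDS):
        raise ValueError("expected 4 columns: %r" % line)
    return dict(zip(FIELDS, values))


def sends_email(alert):
    """Apply the guide's rule: only TAPE, ERROR-level alerts send email."""
    return alert["subsystem"] == "TAPE" and alert["alert_level"] == "ERROR"


if __name__ == "__main__":
    rows = [
        "| TAPE | ERROR | ChangeTape | A tape must be changed, please check the log for the drive and label details.",
        "| SCP | WARN | Send | Unable to send file pair by scp; pair will be tried again.",
    ]
    for row in rows:
        alert = parse_alert_row(row)
        print(alert["alert_name"], sends_email(alert))
```

Alerts for which sends_email returns False appear only on the Active Alerts page and in the JBoss log, so they need to be checked there.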