The SCAP 2014 panel had this to say about IceProd 2 development:
Comments
The development of IceProd is clearly driven by needs of the simulation production and the analysis. The committee was curious to what extent existing workflow management tools could achieve a similar outcome, particularly the tools developed and used by the LHC experiments and supported by the Open Science Grid. The development of IceProd2 would be a convenient opportunity to reevaluate the needs and requirements for an in-house development that draws on scarce effort.
Recommendations
The collaboration should evaluate potential savings from migrating IceProd to a more commonly utilized tool.
So, let's study these other tools and see how well they fit IceCube.
As a reminder, here are the IceProd goals/requirements:
We don't want to control storage, beyond knowing where we put our own files.
CMS actually has a chain of systems: ProdRequest, ProdManager, and ProdAgent.
The Request system (ProdRequest) acts as a frontend application for (user) production request submissions into the production system; the Production Manager (ProdManager) manages these user requests, performing accounting and allocating work to a collection of Production Agents (ProdAgents). The agents ask for work when the resources they manage are available, and they handle submissions, possible errors, and resubmissions while performing local cataloguing operations.
We probably care most about ProdAgent as the tool that does most of the work.
ProdAgent runs above the T2 level, but there can be multiple ProdAgents per T1. It submits to multiple T2 resources via HTCondor-G.
This is like using a steamshovel to remove a pebble from the ground. The tool is significantly bigger than is necessary, and we don't have the infrastructure for it. But we can use some of the design and downstream components to help improve anything we create.
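The pull model the agents use is worth keeping in mind, though. Here is a rough sketch of the idea in Python; the manager URL, endpoints, and helper functions are all made up for illustration and are not the real CMS API:

```python
# Sketch of the pull model used by the CMS chain: an agent asks the manager
# for work only when the resources it manages have free slots.
# MANAGER_URL, the endpoint paths, and the JSON shapes are hypothetical.
import time
import requests

MANAGER_URL = "https://prodmanager.example.org/api"   # hypothetical endpoint

def free_slots():
    """Placeholder: ask the local batch system how many slots are idle."""
    return 10

def submit_to_batch(job):
    """Placeholder: hand the job to HTCondor-G (or a local scheduler)."""
    print("submitting", job["id"])

def agent_loop():
    while True:
        slots = free_slots()
        if slots > 0:
            # Pull: request at most `slots` jobs from the manager.
            resp = requests.get(MANAGER_URL + "/work", params={"max": slots})
            for job in resp.json().get("jobs", []):
                submit_to_batch(job)
        time.sleep(60)

if __name__ == "__main__":
    agent_loop()
```

The useful bit is that work allocation is pull-based: the manager only hands out work when an agent says it has capacity, so it never needs to know about site internals.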
PanDA is a little more general than ProdAgent. Some basic facts:
Direct use of HTCondor-G (or optionally GLite)
Use of pilot jobs (PanDA does not submit pilots, it just uses them; a rough sketch of the pilot side follows below):
"An independent subsystem manages the delivery of pilot jobs to worker nodes"
pre-staging of input data and transfer of output data
"Minimum site requirements are a grid computing element or local batch queue to receive pilots, outbound http support, and remote data copy support using grid data movement tools."
"Authentication and authorization is based on grid certificates, with the job submitter required to hold a grid proxy and VOMS role that is authorized for PanDA usage"
"Allocation of job sets to sites is followed by the dispatch of corresponding input datasets to those sites, handled by a data service interacting with the ATLAS distributed data management system"
PanDA is better than CMS's ProdAgent, but still probably bigger than necessary. It's more intertwined with the ATLAS file transfer system than we'd like, and it would require significant setup of T2s in order to use. The DB size and the Oracle requirement are worrying.
Designed as an HTC PaaS (platform as a service). Has connectors to all our grid types, with fairly easy extensibility. Decent web API with sign-on using grid certificates.
Also has a data management system. It's interesting, though it shows its early-2000s roots.
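For a flavor of the certificate-based web API, here is roughly what a submission call could look like from Python. The host, path, and payload are hypothetical; the only point is that authentication rides on the grid certificate/key:

```python
# Hedged sketch of calling a REST-style web API that authenticates with a
# grid certificate. The URL and JSON payload are made up; only the
# certificate-auth pattern is the point.
import requests

resp = requests.post(
    "https://wms.example.org/api/jobs",          # hypothetical endpoint
    json={"executable": "run_sim.sh", "site": "ANY"},
    cert=("/home/user/.globus/usercert.pem",     # client certificate
          "/home/user/.globus/userkey.pem"),     # and private key
    verify="/etc/grid-security/certificates",    # CA bundle directory
)
print(resp.status_code, resp.json())
```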
This is the best of the LHC crowd for supporting medium-size experiments. We could probably have it up and running in a year, though adoption at smaller sites might be a problem. It's highly configurable (lots of knobs), but that cuts both ways: it can be hard to set up correctly.
Not my first choice, but a viable alternative.
The local machine (a workstation) submits pilot glidein jobs to clusters via ssh (you must have an account on each cluster). It also runs an HTCondor server that uses those glideins.
The cluster needs a shared filesystem, except for HTCondor grid universe jobs.
The worker nodes need to be able to reach the submit host (the local workstation), except for HTCondor grid universe jobs.
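A back-of-the-envelope sketch of the ssh part, assuming placeholder host names and a hypothetical glidein_start.sh that joins our HTCondor pool:

```python
# Sketch of the ssh-glidein idea described above: from the local workstation,
# ssh into each cluster (where we already have an account) and queue a glidein
# that calls back to the workstation's HTCondor collector. Host names, the
# collector address, and glidein_start.sh are placeholders.
import subprocess

CLUSTERS = ["user@cluster1.example.edu", "user@cluster2.example.edu"]
COLLECTOR = "mydesk.example.edu:9618"   # hypothetical local HTCondor pool

for host in CLUSTERS:
    # Copy the glidein startup script over, then queue it in the cluster's
    # local batch system (qsub here; could be sbatch, condor_submit, ...).
    subprocess.check_call(["scp", "glidein_start.sh", host + ":"])
    subprocess.check_call([
        "ssh", host,
        "qsub -v COLLECTOR=%s glidein_start.sh" % COLLECTOR,
    ])
```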
This is something we could investigate as middleware for submitting from one machine to many clusters.
IPython can now do clustered computing.
Supports PBS variants, but not HTCondor (yet).
Allows large amounts of Python computing to be driven from a single IPython notebook.
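For a flavor of what that looks like, here is a toy example using the IPython.parallel Client. It assumes a cluster of engines is already running (e.g. started with ipcluster, or via a PBS profile), and the function is just a stand-in for real work:

```python
# Toy example of the IPython cluster model: connect to already-running
# engines and map work across them from a notebook or script.
from IPython.parallel import Client   # module path in IPython 2/3;
                                      # later releases split it into ipyparallel

rc = Client()                     # connect to the running cluster
view = rc.load_balanced_view()    # schedule tasks wherever engines are free

def simulate(seed):
    # Stand-in for a real computation; imports run on the engine.
    import random
    random.seed(seed)
    return sum(random.random() for _ in range(10000))

results = view.map_sync(simulate, range(100))
print(len(results), "tasks done")
```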
This is a fancy toy that Jakob might use.