Grid Middleware Evaluation

Motivation

The SCAP 2014 panel had this to say about IceProd 2 development:

Comments

The development of IceProd is clearly driven by needs of the simulation production and the analysis. The committee was curious to what extent existing workflow management tools could achieve a similar outcome, particularly the tools developed and used by the LHC experiments and supported by the Open Science Grid. The development of IceProd2 would be a convenient opportunity to reevaluate the needs and requirements for an in-house development that draws on scarce effort.

Recommendations

The collaboration should evaluate potential savings from migrating IceProd to a more commonly utilized tool.

So, let's look at these other tools and see how well they fit IceCube's needs.

As a reminder, here are the IceProd goals/requirements:

  • Distributed, no single point of failure.
  • Easy to install, lightweight.
  • Easy way to run a single job manually for debugging.
  • Must be able to do DAG jobs, such that we can run on a CPU, then a GPU, then a CPU as parts of the same job (see the HTCondor DAG sketch below).
  • Scalable up to 100,000 cores or more.
  • Track "all the things" relating to grid jobs (what ran, where, when, why, and how; with as much detail as possible).
  • Cleaned up website, with much faster page loads.

We don't desire to control storage, other than knowing where we put our own files.
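
On the DAG requirement above, a minimal sketch of what the CPU-GPU-CPU chain could look like as a plain HTCondor DAG (the file and job names are made up; note that plain DAGMan expresses the job structure, but by itself it stays within a single pool rather than spanning separate CPU and GPU clusters):

    # workflow.dag -- one simulation job as a three-step chain
    JOB  generate   generate_cpu.sub
    JOB  propagate  propagate_gpu.sub
    JOB  filter     filter_cpu.sub
    PARENT generate  CHILD propagate
    PARENT propagate CHILD filter

    # propagate_gpu.sub -- the GPU step requests a GPU slot
    executable   = propagate.sh
    request_gpus = 1
    queue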

CMS - ProdAgent

ProdAgent in Depth

CMS actually has a chain of systems: ProdRequest, ProdManager, and ProdAgent.

The Request system (ProdRequest) acts as a frontend application for user production-request submissions into the production system. The Production Manager (ProdManager) manages these requests, performing accounting and allocating work to a collection of Production Agents (ProdAgents). The agents ask for work when the resources they manage are available, and they handle submissions, possible errors, and resubmissions while performing local cataloguing operations.

We probably care most about ProdAgent as the tool that does most of the work.

ProdAgent runs above the T2 level, but there can be multiple ProdAgents per T1. It submits to multiple T2 resources via HTCondor-G.
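
For context, HTCondor-G submission is just an HTCondor grid-universe job pointed at a remote gatekeeper; a minimal sketch (the gatekeeper address and script name here are made up):

    universe      = grid
    grid_resource = gt5 gatekeeper.t2-site.example.edu/jobmanager-pbs
    executable    = run_sim.sh
    output        = job.out
    error         = job.err
    log           = job.log
    queue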

Pros/Cons

  • ProdAgent has a good structure (which we could learn from). In particular, it uses autonomous components and asynchronous, persistent messaging.
  • We should also consider adopting some of the downstream components CMS uses to get jobs running on sites, such as the HTCondor JobRouter (see the config sketch after this list).
  • The main downside is that it has high site requirements; in particular, it relies heavily on Tier-2 and local storage and on the CMS data transfer tool PhEDEx. We don't have any of that infrastructure, and don't really need it.
  • It also seems to require local merging of job output files into larger dataset output files, which doesn't fit most of our workflow. Additionally, there doesn't seem to be any support for DAG jobs.
  • One (not insurmountable) obstacle is that ProdAgent consumes an entire machine by itself (I've seen specs of 48 GB of memory and several CPU cores). It probably also requires root access for some of the installation. These are two things we've been avoiding in IceCube on philosophical grounds.
  • A concern is that there is only one ProdRequest and one ProdManager, which means two single points of failure and potential bottlenecks.
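
As a reference point for the JobRouter idea, a rough sketch of what a route configuration looks like (the site names, resources, and limits here are made up):

    # condor_config snippet: route idle vanilla jobs to remote grid resources
    JOB_ROUTER_ENTRIES = \
      [ name = "Site_A_PBS"; \
        GridResource = "gt5 gatekeeper.site-a.example.edu/jobmanager-pbs"; \
        MaxIdleJobs = 20; \
      ] \
      [ name = "Site_B_HTCondor"; \
        GridResource = "condor cm.site-b.example.edu cm.site-b.example.edu"; \
      ]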

Conclusion

This is like using a steamshovel to remove a pebble from the ground. The tool is significantly bigger than is necessary, and we don't have the infrastructure for it. But we can use some of the design and downstream components to help improve anything we create.

ATLAS - PanDA

About

PanDA is a little more general than ProdAgent. Some basic facts:

  • Direct use of HTCondor-G (or optionally gLite)

  • Use of pilot jobs (PanDA does not submit pilots, just uses them; see the sketch after this list):

    "An independent subsystem manages the delivery of pilot jobs to worker nodes"
  • Pre-staging of input data and transfer of output data

  • "Minimum site requirements are a grid computing element or local batch queue to receive pilots, outbound http support, and remote data copy support using grid data movement tools."

  • "Authentication and authorization is based on grid certificates, with the job submitter required to hold a grid proxy and VOMS role that is authorized for PanDA usage"

  • "Allocation of job sets to sites is followed by the dispatch of corresponding input datasets to those sites, handled by a data service interacting with the ATLAS distributed data management system"

Pros/Cons

  • Use of HTCondor-G and gLite is good.
  • Using pilots is more the direction we want to go.
  • Supports user jobs as well as production data.
  • Similar to CMS, the main downside is that it has high site requirements. It relies heavily on Tier-2 and local storage, and on the ATLAS data transfer and storage federation. We don't have any of that infrastructure, and don't really need it.
  • Can probably run HTCondor DAGs, but fully distributed DAGs are missing. There is talk of adding a Meta-Task concept as a group of related tasks (CHEP 2013), though this is more like related datasets.
  • A separate pilot submission system is a little annoying; we'd prefer it were built in.
  • Doing GridFTP-style certificate validation for HTTP transport sounds like a recipe for trouble when things don't work.
  • Their production DB is around 6 TB, which needs a big server, and it runs on Oracle. (Note that JEDI data would grow 2-3 TB/year without Oracle's licensed compression.)

Conclusion

PanDA is better than CMS's ProdAgent, but still probably bigger than necessary. It's more intertwined with the ATLAS file transfer system than we'd like, and would require significant setup of T2s in order to use it. The DB size and Oracle requirement are worrying.

LHCb - DIRAC

About

Designed as an HTC PaaS (platform as a service). It has connectors to all our grid types, with fairly easy extensibility, and a decent web API with sign-on using grid certificates.

It also has a data management system, which is interesting, though it shows its early-2000s roots.

Pros/Cons

  • Decent API for making wrappers (see the sketch after this list).
  • Integrated pilot submission.
  • I like their web stack (Nginx + Tornado), since that's exactly what I use for IceProd2.
  • Central queues are hosted in one place, so they are a single point of failure.
  • It basically takes over a machine (requires root), then installs several services (MySQL, etc.) and manages them internally.
  • Has modules, but no mention of executing them on different machines. Looks like a single JDL with one input and one output sandbox per job.
  • Uses a user-switching program (gLExec) on worker nodes that must be installed by the cluster admin.
  • The data management system uses pseudo-Unix commands. It would have been a lot nicer to use FUSE and the actual commands.
  • I can't find it mentioned anywhere, but several of the commands look RHEL/SL-only.
  • I personally dislike their web portal. It uses frames, and no one has used frames in webapps in about a decade. They also created their own in-browser window manager, which invites lots of pain.
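
For a feel of the job API and the single input/output sandbox model, a minimal sketch of submitting a job through DIRAC's Python interface (method names are as I understand them from the DIRAC user documentation; the script and file names are made up):

    from DIRAC.Core.Base import Script
    Script.parseCommandLine()  # DIRAC scripts initialize the framework this way

    from DIRAC.Interfaces.API.Dirac import Dirac
    from DIRAC.Interfaces.API.Job import Job

    j = Job()
    j.setName("icecube-test-job")
    j.setExecutable("run_sim.sh", arguments="--nevents 1000")
    j.setInputSandbox(["run_sim.sh", "config.json"])   # one input sandbox per job
    j.setOutputSandbox(["job.log", "output.i3"])       # one output sandbox per job
    print(Dirac().submitJob(j))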

Conclusion

This is the best of the LHC crowd for supporting medium-size experiments. We could probably have it up and running in a year, though adoption at smaller sites might be a problem. It's highly configurable (lots of knobs), though that might also hurt it, since it is difficult to set up correctly.

Not my first choice, but a viable alternative.

OSG - BOSCO

About

A local workstation submits pilot (glidein) jobs to clusters via ssh (you must have an account on each cluster). It also runs a local HTCondor instance which uses those glideins.

The cluster needs a shared filesystem, except for HTCondor grid-universe jobs.

The worker nodes need access back to the submit host (the local workstation), again except for HTCondor grid-universe jobs.
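
A rough sketch of the workflow, assuming an account on a remote PBS cluster (the hostnames and file names are made up):

    # Add and test the remote cluster from the local workstation:
    #   bosco_cluster --add youruser@cluster.example.edu pbs
    #   bosco_cluster --test youruser@cluster.example.edu
    #
    # Then submit through the local HTCondor instance as a grid-universe job:
    universe            = grid
    grid_resource       = batch pbs youruser@cluster.example.edu
    executable          = run_sim.sh
    transfer_executable = true
    output              = job.out
    error               = job.err
    log                 = job.log
    queue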

Pros/Cons

  • Can submit to many clusters at once.
  • DAGMan can do DAGs.
  • Nothing to go from a dataset configuration to many jobs, or to monitor them.

Conclusion

This is something we could investigate as middleware for submitting from one machine to many clusters.

IPython Cluster

About

IPython can now do cluster computing (IPython.parallel).

It supports PBS variants, but not HTCondor (yet).

It allows large amounts of Python computing to be driven from a single IPython notebook.
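
A minimal sketch of what this looks like with IPython.parallel (the engine count and the work function are arbitrary):

    # Start a local cluster first (a PBS profile works similarly):
    #   ipcluster start -n 4
    from IPython.parallel import Client

    rc = Client()      # connect to the running controller
    view = rc[:]       # a view over all engines
    view.execute("import numpy as np")

    # Map work across the engines and collect the results.
    results = view.map_sync(lambda x: x ** 2, range(16))
    print(results)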

Pros/Cons

  • Really easy for analyzers to run numpy in parallel on a cluster.
  • Nothing to go from a dataset configuration to many jobs, or to monitor them.
  • Nothing about DAGs, though you could write a wrapper and get fancy yourself. Not sure if this would work with GPUs in the first place.
  • Another process on the worker or head node could snoop on data in the ZeroMQ sockets.

Conclusion

This is a fancy toy that Jakob might use.