Data access and analysis with Panda/pathena
Page under construction
The Panda processing system supports full-scale production as well as regional, group, and individual user production and distributed analysis; it uses Grid or local farm-based resources (OSG). Panda works reliably and fast: job dispatching is done within the system without a resource broker, and jobs are executed only after the input data are staged. Each user has a personal web page (created after the first job submission or registration), which keeps track of that user's jobs and input/output datasets. Output files can be accessed with DQ2.
Panda is available at BNL and other US sites, and will become available at CCIN2P3/Lyon.
Selected references
- Distributed analysis on Panda
- Panda monitor -- Portal page
Data access
- Dataset browser -- access to a large set of data on the BNL server
- Request subset of dataset -- to be transferred to the BNL server (submit Savannah report)
Data analysis
- Distributed analysis with Panda
provides a lightweight framework for job submission with good user support.
It is fully integrated with the data management tool DQ2: outputs (logs, Ntup, …) are DQ2 datasets
accessible through the Panda browser, so no bookkeeping is needed.
Extensive browsing and monitoring of job status is provided.
A tutorial explains how to submit jobs from different Grids.
- "pathena" is an adaptation of the Athena command line, which is processed via Panda.
It processes user submissions in two steps:
- Build step: gather user code, store it and ship it to the processing site.
- Run step: run N Athena jobs with user designated input and output datasets.
- The user retrieves the output dataset via the DQ2 tools (dq2_get).
- Jobs can be re-submitted using the already built job (libDS): only new data files and failed or killed jobs will be re-processed.
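The re-submission rule above (only new data files and failed or killed jobs are re-processed) can be illustrated with a minimal Python sketch. This is not Panda's implementation: the `SubJob` record, the status labels, the file names, and the assumption of one input file per subjob are all hypothetical and chosen only to make the selection logic concrete.

```python
from dataclasses import dataclass


@dataclass
class SubJob:
    """Hypothetical record of one run-step job from an earlier submission."""
    input_file: str
    status: str  # assumed labels: "finished", "failed", "killed"


def files_to_reprocess(dataset_files, previous_subjobs):
    """Return the input files a re-submission would have to process again.

    A file is selected if it is new (no earlier subjob used it) or if the
    earlier subjob that used it failed or was killed.
    """
    status_by_file = {job.input_file: job.status for job in previous_subjobs}
    selected = []
    for name in dataset_files:
        status = status_by_file.get(name)  # None means the file is new
        if status is None or status in ("failed", "killed"):
            selected.append(name)
    return selected


# Illustrative data only: file4.root was added to the dataset after the first run.
previous = [
    SubJob("file1.root", "finished"),
    SubJob("file2.root", "failed"),
    SubJob("file3.root", "killed"),
]
current = ["file1.root", "file2.root", "file3.root", "file4.root"]
print(files_to_reprocess(current, previous))
```

In this sketch the already finished file is skipped, which mirrors why re-using the built job (libDS) avoids repeating work that has already succeeded.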
Selected commands
- pathena
- The pathena command and its options: all options are optional, but at least one jobOption file has to be provided.
> pathena
    --inDS InputDataset
    --outDS OutputDataset
    --minDS MinimumBiasDataset
    --cavDS CavernDataset
    --split N
    --site SiteName
    --nfiles N
    --nFilesPerJob N
    --nEventsPerJob N {deprecated}
    --nSkipFiles N
    --libDS LibraryDataset
    --long
    --blong
    -p bootstrap
    -c command
    <jobOption1.py> [<jobOption2.py> …]
- pathena_util
- pathena_util accesses the database that keeps records of recent jobs (last 3 days);
when a command is called without an argument, the latest or all jobIDs are used.
> pathena_util
>>> help()
>>> help( <cmd> )                 ex: help(show)
>>> show( <JobID> )
>>> kill( <JobID> )               ex: kill(1) {kill job with jobID = 1}
>>> retry( <JobID> )              {only failed subjobs will be retried}
>>> select( 'pandaParam=xxx' )    {use any Panda job parameter, wildcard character "*" accepted}
>>> get( <JobID> )                {get job record}
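As an illustration of how the pathena option list fits together, the following Python sketch assembles a command line from a few of the options named above and enforces the one stated rule, that at least one jobOption file is given. The helper `build_pathena_command` and its parameter names are hypothetical and not part of pathena; only the flags themselves come from this page, and the sketch builds the command without running it.

```python
import shlex


def build_pathena_command(job_options, in_ds=None, out_ds=None, lib_ds=None,
                          site=None, n_files_per_job=None, use_long=False):
    """Assemble a pathena argument list (hypothetical helper, does not run it)."""
    if not job_options:
        raise ValueError("at least one jobOption file has to be provided")

    argv = ["pathena"]
    # Options that take a value are added only when the caller supplied one.
    valued_options = (
        ("--inDS", in_ds),
        ("--outDS", out_ds),
        ("--libDS", lib_ds),
        ("--site", site),
        ("--nFilesPerJob", n_files_per_job),
    )
    for flag, value in valued_options:
        if value is not None:
            argv += [flag, str(value)]
    if use_long:
        argv.append("--long")  # flag without a value, as in the option list
    # jobOption files come last, as in the synopsis.
    argv += list(job_options)
    return argv


# Placeholder dataset names taken from the synopsis, not real datasets.
argv = build_pathena_command(
    ["jobOption1.py"],
    in_ds="InputDataset",
    out_ds="OutputDataset",
    n_files_per_job=5,
)
print(shlex.join(argv))  # shlex.join requires Python 3.8 or later
```

Keeping the arguments as a list rather than a single string avoids quoting problems if the list is later passed to a process launcher.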
Reporting problems
- Savannah -- report any problems or tasks
- Savannah -- report bugs
- FAQ -- check here before submitting a bug report
Tutorials
See also
- Parameters of Panda displays
- PanDA/DDM integration, T. Wenaus, Sept 2006
- The PanDA Production and Distributed Analysis System -- explains the system, many links for administrators, few links for users (see above). Consult this reference to understand Panda's architecture and implementation.
- Analysis tutorial -- TWiki LyonAnalysisTutorial1206, Feb 2007