Releases: PanDAWMS/pilot3
3.13.0.23
- Maxtime-related server update problem, walltime grace buffer and log verbosity
- Added intrinsic grace buffer before walltime limit: 120s for all jobs, 300s for jobs with walltime > 6 hours, to allow clean pilot shutdown and reliable final server update
- Reduced "time since job start" log verbosity from every loop iteration to once per minute
- Fixed a condition that can lead to lost heartbeat when maxtime is reached
- Reported by R. Walker
- Discussed in ATLASPANDA-1671
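The grace buffer rule above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and it assumes that the 300 s buffer replaces (rather than adds to) the 120 s buffer for jobs longer than six hours.

```python
SIX_HOURS_S = 6 * 3600  # threshold for the larger buffer, in seconds


def walltime_grace_buffer(walltime_s: int) -> int:
    """Return the grace buffer in seconds (hypothetical helper, not pilot code)."""
    # Assumption: 300 s replaces the default 120 s for walltime > 6 hours
    return 300 if walltime_s > SIX_HOURS_S else 120


def effective_time_limit(walltime_s: int) -> int:
    """Time after which shutdown should start, leaving room for the final server update."""
    return max(0, walltime_s - walltime_grace_buffer(walltime_s))
```

With these assumptions, a 7-hour job would begin shutting down 300 s before its walltime limit, while a 1-hour job would do so 120 s before.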
- Pilot not terminating quickly after maxtime event
- Fixed a 10-minute stall in pilot shutdown following a MAXTIME (batch system time limit) event
- Added corresponding unit test
- Reported by R. Walker
- Apptainer /tmp bind-mount for --contain mode
- When apptainer --contain / -c mode is active, the pilot now creates a tmp/ subdirectory in the job workdir and bind-mounts it over /tmp inside the container
- This replaces the default 64 MB apptainer tmpfs with workdir-backed space, preventing out-of-space errors for payloads (e.g. user jobs compiling code) that write significant data to /tmp
- Requested by A. de Silva and R. Walker
- Discussed in ATLASPANDA-1686
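A minimal sketch of the bind-mount idea, assuming the container options are assembled as a list of arguments. The helper name is hypothetical; only the apptainer options --contain and --bind are taken as given.

```python
import os


def contain_tmp_bind_args(workdir: str) -> list:
    """Create <workdir>/tmp and return apptainer options that mount it over /tmp.

    Hypothetical helper for illustration; not the pilot's actual function.
    """
    tmp_dir = os.path.join(workdir, "tmp")
    os.makedirs(tmp_dir, exist_ok=True)  # safe to call more than once
    # --bind src:dest makes the workdir-backed directory visible as /tmp
    return ["--contain", "--bind", f"{tmp_dir}:/tmp"]
```

The returned options would be spliced into the apptainer exec command line, so that payload writes to /tmp land in the job workdir instead of the small default tmpfs.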
- OIDC token refresh not taking effect
- When PANDA_AUTH_TOKEN is a bare filename, the refreshed token was written into the pilot's CWD rather than overwriting the original token file, and OIDC_REFRESHED_AUTH_TOKEN was stored as a bare name rather than an absolute path. Subsequent calls to locate_token would then find the original stale token earlier in the search path, so the token appeared never to refresh. Fixed by resolving the token name to its full path via locate_token before writing, ensuring the correct file is overwritten in-place and the env var holds an absolute path
- Reported by R. Walker
- Discussed in ATLASPANDA-1695
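The fix can be illustrated with the sketch below. It is not the pilot's code: resolve_token_path is a hypothetical stand-in for locate_token, and the search path is passed in explicitly as an assumption.

```python
import os


def resolve_token_path(token_name: str, search_dirs: list) -> str:
    """Return the absolute path of the token file (stand-in for locate_token)."""
    if os.path.isabs(token_name):
        return token_name
    for directory in search_dirs:
        candidate = os.path.join(directory, token_name)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    # Nothing found: fall back to the current working directory
    return os.path.abspath(token_name)


def write_refreshed_token(token_name: str, new_token: str, search_dirs: list) -> str:
    """Overwrite the original token file in place and export its absolute path."""
    path = resolve_token_path(token_name, search_dirs)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(new_token)
    os.environ["OIDC_REFRESHED_AUTH_TOKEN"] = path  # absolute path, not a bare name
    return path
```

Resolving the path before writing means a later lookup finds the refreshed token at the same location where the stale one used to be.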
- Alternative stage-out fixes
- Fix for altStageOut=off (sent by PanDA server for e.g. pre-merged jobs) being ignored when the queue has allow_altstageout=True set - job-level explicit off now always takes priority over queue-level settings
- Restored altTransferred=lfn1,lfn2,... reporting in jobMetrics for files transferred to an alternative destination during alt stage-out failover
- Discussed in ATLASPANDA-1670
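The precedence rule and the metric format can be sketched as below. Both function names are hypothetical, and how values other than "off" are treated at job level is an assumption (here they simply defer to the queue setting).

```python
def use_alt_stageout(job_alt_stageout, queue_allow_altstageout) -> bool:
    """Decide whether alternative stage-out may be used (illustrative only)."""
    if job_alt_stageout == "off":
        return False  # explicit job-level off always wins
    # Assumption: any other job-level value defers to the queue setting
    return bool(queue_allow_altstageout)


def alt_transferred_metric(lfns) -> str:
    """Format the altTransferred=lfn1,lfn2,... entry for jobMetrics."""
    return "altTransferred=" + ",".join(lfns) if lfns else ""
```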
- Remote I/O test failing with "Argument list too long"
- Remote file open verification now writes TURLs to a file (turls.txt) instead of passing them on the command line when the input list exceeds 500 files, avoiding Argument list too long failures seen with ~1000-file merge jobs
- Requested by R. Walker
- Discussed in ATLASPANDA-1681
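A rough sketch of the switch between command-line and file-based TURL passing. The option names --turls and --turls-file are hypothetical placeholders, as is the helper itself; only the 500-file threshold and the turls.txt file name come from the note above.

```python
import os

MAX_CLI_TURLS = 500  # above this, TURLs are written to a file instead


def turl_arguments(turls: list, workdir: str) -> list:
    """Return arguments for the file-open script (hypothetical option names)."""
    if len(turls) <= MAX_CLI_TURLS:
        return ["--turls", ",".join(turls)]
    path = os.path.join(workdir, "turls.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(turls) + "\n")  # one TURL per line
    return ["--turls-file", path]
```

Passing a single file path keeps the command line short regardless of how many input files a merge job has.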
- Bug in send_worker_status
- Fixed a KeyError crash in send_worker_status() when HARVESTER_WORKER_ID or HARVESTER_ID environment variables are not set by the site (e.g. some non-Harvester sites). The worker_id key is now initialised to None before the conversion attempt, so missing variables are handled gracefully instead of crashing the pilot
- Reported by F. Barreiro
- Discussed in ATLASPANDA-1701
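The pattern behind this fix looks roughly like the following. The helper name and the returned dictionary layout are hypothetical, and the assumption is that the "conversion attempt" is an integer conversion of HARVESTER_WORKER_ID.

```python
import os


def harvester_ids(env=None) -> dict:
    """Collect Harvester identifiers without failing when variables are unset."""
    env = os.environ if env is None else env
    worker_id = None  # initialised first, so the key always exists
    raw = env.get("HARVESTER_WORKER_ID")
    if raw is not None:
        try:
            worker_id = int(raw)  # assumption: the worker id is sent as an integer
        except ValueError:
            worker_id = None
    return {"worker_id": worker_id, "harvester_id": env.get("HARVESTER_ID")}
```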
- Parsing directIO errors from stdout
- When a job uses direct access and an XRootD file-open fails inside the payload after the pilot's pre-flight check has passed, the pilot now scans payload.stdout for known error patterns (TNetXNGFile open errors, Operation expired, No servers available, Unable to open ROOT file, and related XRootD strings). On a match, the job is classified as STAGEINFAILED(1099) with the first matched line recorded as error diagnostics, replacing the previously uninformative UNKNOWNPAYLOADFAILURE or PAYLOADEXECUTIONFAILURE codes. The scan is restricted to remoteIO jobs and only fires when no more specific error (OOM, disk full, setup failure, etc.) has already been identified
- Requested by R. Walker
- Discussed in ATLASPANDA-1690
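The scan described above can be sketched as follows. The regular expressions are illustrative approximations of the listed messages, not the pilot's actual pattern list, and the function name and signature are hypothetical.

```python
import re

STAGEINFAILED = 1099

# Illustrative patterns only; the real list may differ
DIRECTIO_PATTERNS = [
    re.compile(r"TNetXNGFile"),
    re.compile(r"Operation expired"),
    re.compile(r"No servers available"),
    re.compile(r"Unable to open ROOT file"),
]


def scan_stdout_for_directio_error(lines, is_remote_io, existing_error_code=0):
    """Return (error_code, diagnostics) for the first matching line, else None."""
    # Only remote I/O jobs are scanned, and a more specific error takes precedence
    if not is_remote_io or existing_error_code:
        return None
    for line in lines:
        if any(pattern.search(line) for pattern in DIRECTIO_PATTERNS):
            return STAGEINFAILED, line.strip()
    return None
```

In practice the lines would come from iterating over payload.stdout, so that large files do not need to be read into memory at once.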
- Allow job/task to choose io protocol
- Added support for protocol selection in direct I/O mode, controlled by the existing transfertype job parameter
- Full details in ATLASPANDA-1646
- Requested by R. Walker
- Pilot now scans the payload stdout for cling JIT memory allocation errors
- A new error code was created, 1387: “Failed to allocate memory for transform execution (cling JIT failure)”
- Requested by R. Walker
- Discussed in ATLASPANDA-1703
- ePIC
- Enabled memory monitoring using prmon
- Enabled job metrics
- Implemented removal of redundant files prior to log file creation
- Support for publishing slurm job logs via web server for perlmutter jobs
- Changes to pilot id reporting (adding job id)
- Note: This is also of interest to ATLAS (D. Benjamin is testing)
- Code improvements
- Refactorings of functions in the https module by Claude
- Conversion of docstrings in several pilot modules using Claude Code (ongoing project)
- Fixed spurious job failures on worker nodes with unmapped UIDs
- Pilot can now handle “getpwuid(): uid not found” errors (a site infrastructure problem, but not handled well by the pilot until now)
- Reported by K. De via the BNL Site Status Report (2026-04-17)
- Remote file open
- Increased timeout from 600 s to 900 s for the remote file open script after seeing timeouts very close to the limit
- Now also verifying whether the file open actually succeeded despite a timeout, which can happen when the open completes very close to the time limit (in which case the timeout is ignored)
- Added timeouts to df, os.walk and heartbeat file writes to prevent the monitor thread from freezing indefinitely if e.g. an NFS mount stalls
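For external commands such as df, one way to bound the wait is the timeout argument of subprocess.run, as in the sketch below. The helper is hypothetical and covers only the subprocess case; in-process calls such as os.walk or file writes need a different mechanism, which is not shown here.

```python
import subprocess


def run_with_timeout(cmd: list, timeout_s: float):
    """Run a command and return its stdout, or None if it does not finish in time."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=False
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging
        return None
    return proc.stdout
```

A caller could use run_with_timeout(["df", "-mP", path], 30) and treat None as "disk space unknown" rather than blocking the monitoring loop.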
- Documentation
- Pilot error codes updated and fully documented (149 error codes)
- Memory monitoring described
- Direct access described
- Queuedata fields used by the pilot
- Sphinx documentation (long broken) is being restored; docstrings processed in several modules with Claude Code assistance
Code contributions from X. Zhao, P. Nilsson.
Claude and Claude Code were used to document and improve the pilot code in this version.
3.12.4.1
- Patch for wrong end time
- Same problem as with start time reporting in the previous pilot release, now resolved
3.12.3.1
- Hot patch for fixing a problem with start_time
- The function used, datetime.fromtimestamp(ts), interprets the epoch in the local system timezone. On non-UTC worker nodes (e.g. JST/UTC+9 at TOKYO) this produced a start_time 9 hours ahead of the correct UTC value, which the PanDA server then used to overwrite the correct start time it had already recorded. Fixed by passing tz=timezone.utc.
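The difference is easy to reproduce with the standard library. In this sketch the helper name and the output format string are assumptions; only the use of tz=timezone.utc reflects the fix described above.

```python
from datetime import datetime, timezone


def utc_start_time(ts: float) -> str:
    """Format an epoch timestamp as a UTC time string, independent of the node's timezone."""
    # Without tz=..., fromtimestamp() would convert to the local timezone instead
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


print(utc_start_time(0))  # 1970-01-01 00:00:00 on any worker node
```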
3.12.1.99
- The pilot now supports protocol preference via the job’s transferType field for stage-in transfers
- If set to file, root, or davs, the pilot will prioritize replicas matching that schema during copy-to-scratch, while preserving existing fallback logic
- Direct access handling is unaffected
- Requested by R. Walker to alleviate mount tests at DESY
- Discussed in ATLASPANDA-1646
- Related change: Updated adler32 checksum algorithm for read-only filesystems
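The replica prioritization can be pictured as a stable sort on the URL scheme. This is a sketch under assumptions: the function name is hypothetical and replicas are represented as plain URL strings.

```python
from urllib.parse import urlparse


def prioritize_replicas(replica_urls: list, preferred_scheme=None) -> list:
    """Move replicas with the preferred scheme to the front, keeping relative order."""
    if not preferred_scheme:
        return list(replica_urls)
    # sorted() is stable: matching replicas (key False) come first, the rest
    # follow in their original order, so existing fallback logic still applies
    return sorted(replica_urls, key=lambda url: urlparse(url).scheme != preferred_scheme)
```

Because non-matching replicas are only moved back and never dropped, the transfer can still fall back to them if the preferred protocol fails.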
- Resilience
- Added retry with exponential backoff to list_replicas() to handle transient Rucio streaming/connection errors
- Also using the Rucio iterator explicitly instead of using list() to reduce streaming-related failures
- Example failure where this new code would have been useful: 7039823132 (1099, “Failed to stage-in file: [ChunkedEncodingError(ProtocolError('Response ended prematurely'))]:failed to transfer files using copytools=['rucio']”)
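Retry with exponential backoff follows a common pattern, sketched below with a generic callable standing in for the list_replicas() call. The helper name, default attempt count and base delay are assumptions, not the pilot's actual values.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay
```

The wrapped callable would consume the replica iterator item by item inside func(), so that a streaming error in the middle of the response triggers a clean retry of the whole listing.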
- PanDA API migration
- All PanDA server function calls and corresponding responses have been updated
- Related changes: Updated all internal usages of several job parameters for type changes, refactored server call functions, etc
- Discussed in JIRA ticket ATLASPANDA-1562
- New Kubernetes Executor workflow
- Development in progress for SKA
- Pilot and payload run in different pods in this workflow
- Refactored EIC user to ePIC user
- Added resource-specific payload setup (especially for NERSC)
- Improvements
- Various code cleanups (incl. documentation and command execution updates) using OpenAI Codex and Claude Code
- Corrected intersect calculation in analytics package so that it can handle large x values (such as timestamps)
3.11.5.1
- Changed default pilot walltime grace from 1% to 0%
3.11.4.1
- Now setting ‘protocol’ in Rucio trace report if not set already
- Extracted from ‘url’ field, which then must have been set
- Example job: 6996615722
- Discussed in ATLASPANDA-1485
3.11.3.9
- Worker node map
- To facilitate the introduction of worker node maps on ND, the pilot will from now on add the worker node JSON to the job report, reported to the server
- In case the job report does not exist, it will be created (with only the worker node map)
- Corresponding JIRA ticket: ATLASPANDA-1600
- Maxwdir
- Fixed problem with divider used to scale maxwdir (now explicitly using PQ.corecount, which is also verified to be set)
- Reported in ATLASPANDA-1575
3.11.2.19
- Worker node map update
- Now also reporting PanDA queue name
- If the environment variables storageLimitMiB and storageRequestMiB are set, the local space check returns the former instead of the df value
- “The df call reflects the node filesystem, not the pod’s quota, which is why we’re seeing evictions around ~24 Gi (our 20 Gi request + 4 Gi margin)”
- Requested by Eduardo Bach (NET2)
- Discussed in ATLASPANDA-1566
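A sketch of the local space check with the environment override. The function name is hypothetical, and requiring both variables to be set is one reading of the note above; shutil.disk_usage stands in for the df call.

```python
import os
import shutil

MIB = 1024 * 1024


def local_space_bytes(path=".", env=None) -> int:
    """Return available local space in bytes, preferring the pod's storage limit."""
    env = os.environ if env is None else env
    limit = env.get("storageLimitMiB")
    # Assumption: the override applies only when both variables are present
    if limit and env.get("storageRequestMiB"):
        return int(limit) * MIB
    # Fallback: free space of the filesystem holding `path` (what df would report)
    return shutil.disk_usage(path).free
```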
- Explicitly avoiding panda token file in debug mode
- To prevent token from being exposed by tail command
- Reported by M. Borodin
- Chirp
- Pilot now only writes job and pilot IDs to the ClassAd
- Discussed in ATLASPANDA-1518
- Cgroups
- Updated cgroup creation logic to account for recent HTCondor changes
- Pilot now creates sub-cgroups under the job’s .slice instead of the .scope, avoiding permission-denied errors
- Fully backwards compatible with older HTCondor versions
- Requested by R. Walker
- Work dir size checks now take number of cores into consideration
- Discussed in ATLASPANDA-1575
- To look out for: a possible increase in 1104 errors
3.11.1.15
- Condor chirp
- When available, condor_chirp is used to set job id and current job state (retrieved, starting, running, finished, failed) in the ClassAd
- Sites interested in using this feature must make sure that condor_chirp is locally available (ideally in default installation location /usr/bin)
- If the command is not in the standard PATH, the pilot will attempt to locate the path using the condor_config_val command
- This method is e.g. used on MWT2
- Also, “want_io_proxy = true” must be set in the job submit jdl to allow the command to update the ClassAd
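Setting a ClassAd attribute with condor_chirp could look roughly like this. The helper and the attribute name in the usage note are hypothetical; the fallback lookup via condor_config_val is left out, and the sketch only builds the command without running it.

```python
import shutil


def build_chirp_command(attribute: str, value: str, chirp_path=None):
    """Return the condor_chirp argument list for a string attribute, or None."""
    executable = chirp_path or shutil.which("condor_chirp")
    if executable is None:
        return None  # condor_chirp not found in PATH; skip the update
    # The value is a ClassAd expression, so string values need embedded quotes
    return [executable, "set_job_attr", attribute, f'"{value}"']
```

A call such as build_chirp_command("PandaJobState", "running") would then be passed to subprocess.run, ideally with a timeout.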
- Pilot now exits after queuedata download if PQ.status is set to ‘offline’
- Previously, pilot only checked whether the queue was active (using PQ.state)
- Added new error code 1386, PANDAQUEUENOTONLINE (used internally)
- Pilot returns exit code 83 to the wrapper in this case
- A problem with exception handling prevented alternative stage-out from working properly in a case seen at SARA-MATRIX; this should now be fixed
- Discussed in ATLASPANDA-1547
- Improvements to event service error handling
- Added GitHub Action for testing code for circular imports
- Using newly developed tool: https://pypi.org/project/circular-import-detector/
- Plugin added for the SKA collaboration
Code contributions from W. Guan, P. Nilsson
3.11.0.29
- Memory checks for different resource types
- Discussed in JIRA ticket ATLASPANDA-1051 (“Pilot RSS kill threshold should depend on subresource, not just PQ.maxrss”)
- Same memory limits are used for cgroups as for different resource types
- All subprocesses - including payload - are in their own cgroup (“subprocesses”)
- Tested in production at MWT2
- Discussed in JIRA ticket ATLASPANDA-1251 (“Implement memory restrictions on just the pilot payload using cgroup v2”)
- Resolved JIRA ticket ATLASPANDA-1483 (“Traceback in pilot for remoteio timeout”)
- An exception caught in the setup command could indicate a worker node issue or a severe apptainer issue, but that is not the case for remote file open, which should therefore not result in a non-zero pilot exit code
- Example job: 6794333602
- JIRA ticket ATLASPANDA-1496 (“Add retries for pilot stage-out”)
- Added --stageout-attempts pilot argument
- Likely needs some further development
- Debugging issue with OIDC token seen with long running job at MWT2
- Now decoding the token after refresh to print out its issue and expiry times
- Worker node map
- Pilot now adds /usr/sbin to PATH if missing (needed to find lspci command at BNL_GPU)
- Discussed in JIRA ticket ATLASPANDA-1368
- Internal improvements
- Refactoring of main pilot module
- Pylint updates - Current average score across all 215 pilot modules: 9.77/10
- Added GitHub Action workflow using new tool VeriCode, capable of running different linters and other tools
- Currently, it verifies that no individual pilot module has a pylint score of less than 8 (it would fail the PR)
- More info at https://pypi.org/project/vericode/