Releases · PanDAWMS/pilot3

22 Apr 08:22

PalNilsson

3.13.0.23

6288c57

3.13.0.23 Latest

Maxtime related server update problem, walltime grace buffer and log verbosity
- Added an intrinsic grace buffer before the walltime limit (120 s by default, 300 s for jobs with walltime > 6 hours) to allow a clean pilot shutdown and a reliable final server update
- Reduced "time since job start" log verbosity from every loop iteration to once per minute
- Fixed a condition that can lead to lost heartbeat when maxtime is reached
- Reported by R. Walker
- Discussed in ATLASPANDA-1671
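
The grace buffer rule above can be sketched as follows. This is only an illustration of the stated thresholds; the function names and the assumption that walltime is given in seconds are hypothetical and not taken from the pilot code.

```python
SIX_HOURS_S = 6 * 3600  # threshold named in the release note


def grace_buffer(walltime_s: int) -> int:
    """Return the grace buffer in seconds (hypothetical helper).

    120 s by default, 300 s for jobs with a walltime above 6 hours.
    """
    return 300 if walltime_s > SIX_HOURS_S else 120


def effective_limit(walltime_s: int) -> int:
    """Time after which shutdown should begin, never below zero."""
    return max(0, walltime_s - grace_buffer(walltime_s))
```

With these definitions, a 2-hour job would start shutting down 120 s before its limit, and an 8-hour job 300 s before.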
Pilot not terminating quickly after maxtime event
- Fixed a 10-minute stall in pilot shutdown following a MAXTIME (batch system time limit) event
- Added corresponding unit test
- Reported by R. Walker
Apptainer /tmp bind-mount for --contain mode
- When apptainer --contain / -c mode is active, the pilot now creates a tmp/ subdirectory in the job workdir and bind-mounts it over /tmp inside the container
- This replaces the default 64 MB apptainer tmpfs with workdir-backed space, preventing out-of-space errors for payloads (e.g. user jobs compiling code) that write significant data to /tmp
- Requested by A. de Silva and R. Walker
- Discussed in ATLASPANDA-1686
OIDC token refresh not taking effect
- When PANDA_AUTH_TOKEN is a bare filename, the refreshed token was written into the pilot's CWD rather than overwriting the original token file, and OIDC_REFRESHED_AUTH_TOKEN was stored as a bare name rather than an absolute path. Subsequent calls to locate_token would then find the original stale token earlier in the search path, so the token appeared never to refresh. Fixed by resolving the token name to its full path via locate_token before writing, ensuring the correct file is overwritten in-place and the env var holds an absolute path
- Reported by R. Walker
- Discussed in ATLASPANDA-1695
Alternative stage-out fixes
- Fix for altStageOut=off (sent by PanDA server for e.g. pre-merged jobs) being ignored when the queue has allow_altstageout=True set - job-level explicit off now always takes priority over queue-level settings
- Restored altTransferred=lfn1,lfn2,... reporting in jobMetrics for files transferred to an alternative destination during alt stage-out failover
- Discussed in ATLASPANDA-1670
Remote I/O test: "Argument list too long"
- Remote file open verification now writes TURLs to a file (turls.txt) instead of passing them on the command line when the input list exceeds 500 files, avoiding Argument list too long failures seen with ~1000-file merge jobs
- Requested by R. Walker
- Discussed in ATLASPANDA-1681
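
A minimal sketch of the switch described above, assuming the file is written to the job workdir. The function name and the `--turls` / `--turls-file` options are hypothetical placeholders, not the actual options of the remote file open script; only the 500-file threshold and the turls.txt file name come from the release note.

```python
import os

MAX_CLI_TURLS = 500  # threshold from the release note


def build_turl_args(turls, workdir, threshold=MAX_CLI_TURLS):
    """Return command-line arguments for the file open check (hypothetical)."""
    if len(turls) <= threshold:
        # Short lists still fit comfortably on the command line.
        return ["--turls", ",".join(turls)]
    # Long lists go to a file to stay below the kernel's argument size limit.
    path = os.path.join(workdir, "turls.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(turls) + "\n")
    return ["--turls-file", path]
```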
Bug in send_worker_status
- Fixed a KeyError crash in send_worker_status() when HARVESTER_WORKER_ID or HARVESTER_ID environment variables are not set by the site (e.g. some non-Harvester sites). The worker_id key is now initialised to None before the conversion attempt, so missing variables are handled gracefully instead of crashing the pilot
- Reported by F. Barreiro
- Discussed in ATLASPANDA-1701
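
The pattern behind this fix can be sketched as below. The helper name, the returned dictionary layout and the integer conversion are assumptions made for illustration; only the two environment variable names and the "initialise worker_id to None first" idea come from the release note.

```python
import os


def get_worker_ids(env=None):
    """Collect Harvester ids without raising if the variables are unset."""
    env = os.environ if env is None else env
    # Initialise first, so the key exists even if the conversion fails.
    data = {"worker_id": None, "harvester_id": env.get("HARVESTER_ID")}
    try:
        data["worker_id"] = int(env["HARVESTER_WORKER_ID"])
    except (KeyError, ValueError):
        pass  # non-Harvester site or malformed value: keep None
    return data
```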
Parsing directIO errors from stdout
- When a job uses direct access and an XRootD file-open fails inside the payload after the pilot's pre-flight check has passed, the pilot now scans payload.stdout for known error patterns (TNetXNGFile open errors, Operation expired, No servers available, Unable to open ROOT file, and related XRootD strings). On a match, the job is classified as STAGEINFAILED(1099) with the first matched line recorded as error diagnostics, replacing the previously uninformative UNKNOWNPAYLOADFAILURE or PAYLOADEXECUTIONFAILURE codes. The scan is restricted to remoteIO jobs and only fires when no more specific error (OOM, disk full, setup failure, etc.) has already been identified
- Requested by R. Walker
- Discussed in ATLASPANDA-1690
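
The scan described above might look roughly like this. The regular expressions are illustrative guesses based on the strings named in the note, and the function signature is hypothetical; the pilot's actual pattern list and call sites are not shown in these release notes.

```python
import re

STAGEINFAILED = 1099  # error code named in the release note

# Illustrative patterns only; the real list may differ.
DIRECTIO_PATTERNS = [
    re.compile(r"TNetXNGFile.*[Oo]pen"),
    re.compile(r"Operation expired"),
    re.compile(r"No servers available"),
    re.compile(r"Unable to open ROOT file"),
]


def scan_for_directio_error(lines, is_remote_io, current_error=0):
    """Return (error_code, diagnostics) for the first matching line, else None."""
    # Only remote I/O jobs are scanned, and a more specific error wins.
    if not is_remote_io or current_error:
        return None
    for line in lines:
        if any(pattern.search(line) for pattern in DIRECTIO_PATTERNS):
            return STAGEINFAILED, line.strip()
    return None
```

Returning the first matching line keeps the diagnostics short while still pointing at the earliest failure in the payload output.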
Allow job/task to choose io protocol
- Added support for protocol selection in direct I/O mode, controlled by the existing transfertype job parameter
- Full details in ATLASPANDA-1646
- Requested by R. Walker
Pilot is now scanning the payload stdout for cling JIT memory allocation errors
- A new error code was created, 1387: “Failed to allocate memory for transform execution (cling JIT failure)”
- Requested by R. Walker
- Discussed in ATLASPANDA-1703
ePIC
- Enabled memory monitoring using prmon
- Enabled job metrics
- Implemented removal of redundant files prior to log file creation
- Support for publishing Slurm job logs via web server for Perlmutter jobs
  - Changes to pilot id reporting (adding job id)
  - Note: This is also of interest to ATLAS (D. Benjamin is testing)
Code improvements
- Refactorings of functions in the https module by Claude
- Conversion of docstrings in several pilot modules using Claude Code (ongoing project)
- Fixed spurious job failures on worker nodes with unmapped UIDs
  - Pilot can now handle “getpwuid(): uid not found” errors (a site infrastructure problem that the pilot did not handle well until now)
  - Reported by K. De via the BNL Site Status Report (2026-04-17)
- Remote file open
  - Increased timeout from 600 to 900s for remote file open script after seeing timeouts very close to limit
  - Now also verifying whether the file open succeeded in spite of a timeout, which can happen when the open completes very close to the time limit (in which case the timeout is ignored)
- Added timeouts to df, os.walk and heartbeat file writes to prevent the monitor thread from freezing indefinitely if e.g. an NFS mount stalls
- Documentation
  - Pilot error codes updated and fully documented (149 error codes)
    - https://github.com/PanDAWMS/pilot3/wiki/Error-codes
  - Memory monitoring described
    - https://github.com/PanDAWMS/pilot3/wiki/Memory-monitoring
  - Direct access described
    - https://github.com/PanDAWMS/pilot3/wiki/Direct-access
  - Queuedata fields used by the pilot
    - https://github.com/PanDAWMS/pilot3/wiki/Queuedata
  - Sphinx documentation (long broken) is being restored; docstrings processed in several modules with Claude Code assistance

Code contributions from X. Zhao, P. Nilsson.

Claude and Claude Code were used to document and improve the pilot code in this version.

01 Apr 12:24

PalNilsson

3.12.4.1

9e35794

3.12.4.1

Patch for wrong end time
- Same problem as with start time reporting in the previous pilot release, now resolved

31 Mar 16:24

PalNilsson

3.12.3.1

1edb2ca

3.12.3.1

Hot patch for fixing a problem with start_time
- The previously used call datetime.fromtimestamp(ts) converts the epoch timestamp to the local system timezone. On non-UTC worker nodes (e.g. JST/UTC+9 at TOKYO) this produced a start_time 9 hours ahead of the correct UTC value, which the PanDA server then used to overwrite the correct start time it had already recorded. Fixed by passing tz=timezone.utc.
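
The difference between the two calls can be shown with the standard library alone. The helper name below is hypothetical; the `tz=timezone.utc` argument is the fix named in the note.

```python
from datetime import datetime, timezone


def utc_start_time(ts: float) -> datetime:
    """Convert an epoch timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Without tz, the result is a naive datetime in the node's local timezone,
# so its wall-clock value depends on where the job runs.
local_naive = datetime.fromtimestamp(0)

# With tz=timezone.utc, the result is the same on every worker node.
utc_aware = utc_start_time(0)
print(utc_aware.isoformat())  # 1970-01-01T00:00:00+00:00
```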

24 Mar 10:06

PalNilsson

3.12.1.99

777bb99

3.12.1.99

The pilot now supports protocol preference via the job’s transferType field for stage-in transfers
- If set to file, root, or davs, the pilot will prioritize replicas matching that scheme during copy-to-scratch, while preserving existing fallback logic
- Direct access handling is unaffected
- Requested by R. Walker to alleviate mount tests at DESY
- Discussed in ATLASPANDA-1646
- Related change: Updated adler32 checksum algorithm for read-only filesystems
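
One way to express "prefer a scheme but keep the fallback order" is a stable sort, sketched below. The function name and the idea of representing replicas as plain URL strings are assumptions for illustration; the pilot's replica objects and sorting code are not described in these notes.

```python
from urllib.parse import urlparse

SUPPORTED_SCHEMES = ("file", "root", "davs")  # values listed in the release note


def prioritize_replicas(replica_urls, preferred):
    """Move replicas with the preferred scheme first, keeping relative order."""
    if preferred not in SUPPORTED_SCHEMES:
        return list(replica_urls)  # no preference: leave the order untouched
    # sorted() is stable, so non-matching replicas keep their fallback order.
    return sorted(replica_urls, key=lambda url: urlparse(url).scheme != preferred)
```

Because nothing is removed from the list, the remaining replicas are still available as fallbacks if the preferred one fails.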
Resilience
- Added retry with exponential backoff to list_replicas() to handle transient Rucio streaming/connection errors
  - Also using the Rucio iterator explicitly instead of using list() to reduce streaming-related failures
  - Example failure where this new code would have been useful: 7039823132 (1099, “Failed to stage-in file: [ChunkedEncodingError(ProtocolError('Response ended prematurely'))]:failed to transfer files using copytools=['rucio']”)
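
Retry with exponential backoff, as mentioned above, can be sketched generically. The helper and its parameters (number of attempts, base delay) are hypothetical; the values actually used around list_replicas() are not stated in the release note.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with doubling delays."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so that the delay schedule can be checked in a unit test without actually waiting.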
PanDA API migration
- All PanDA server function calls and corresponding responses have been updated
- Related changes: Updated all internal usages of several job parameters for type changes, refactored server call functions, etc
- Discussed in JIRA ticket ATLASPANDA-1562
New Kubernetes Executor workflow
- Development in progress for SKA
- Pilot and payload run in different pods in this workflow
Refactored EIC user to ePIC user
- Added resource-specific payload setup (especially for NERSC)
Improvements
- Various code cleanups (incl. documentation and command execution updates) using OpenAI Codex and Claude Code
- Corrected intersect calculation in analytics package so that it can handle large x values (such as timestamps)

10 Mar 15:12

PalNilsson

3.11.5.1

a090538

3.11.5.1

Changed default pilot walltime grace from 1% to 0%

02 Feb 14:16

PalNilsson

3.11.4.1

501879f

3.11.4.1

Now setting ‘protocol’ in Rucio trace report if not set already
- Extracted from the ‘url’ field, which must therefore be set
- Example job: 6996615722
- Discussed in ATLASPANDA-1485

14 Jan 10:10

PalNilsson

3.11.3.9

126508a

3.11.3.9

Worker node map
- To facilitate the introduction of worker node maps on ND, the pilot will from now on add the worker node JSON to the job report, reported to the server
- In case the job report does not exist, it will be created (with only the worker node map)
- Corresponding JIRA ticket: ATLASPANDA-1600
Maxwdir
- Fixed problem with divider used to scale maxwdir (now explicitly using PQ.corecount, which is also verified to be set)
- Reported in ATLASPANDA-1575

03 Dec 16:03

PalNilsson

3.11.2.19

8d7d1f9

3.11.2.19

Worker node map update
- Now also reporting PanDA queue name
If environment variables storageLimitMiB (and storageRequestMiB) are set, the local space check returns the former instead of the df value
- “The df call reflects the node filesystem, not the pod’s quota, which is why we’re seeing evictions around ~24 Gi (our 20 Gi request + 4 Gi margin)”
- Requested by Eduardo Bach (NET2)
- Discussed in ATLASPANDA-1566
Explicitly avoiding panda token file in debug mode
- To prevent token from being exposed by tail command
- Reported by M. Borodin
Chirp
- Pilot now only writes job and pilot IDs to ClassAd
- Discussed in ATLASPANDA-1518
Cgroups
- Updated cgroup creation logic to account for recent HTCondor changes
- Pilot now creates sub-cgroups under the job’s .slice instead of the .scope, avoiding permission-denied errors
- Fully backwards compatible with older HTCondor versions
- Requested by R. Walker
Work dir size checks now take number of cores into consideration
- Discussed in ATLASPANDA-1575
- To look out for: a possible increase in 1104 errors

05 Nov 09:56

PalNilsson

3.11.1.15

32d96ac

3.11.1.15

Condor chirp
- When available, condor_chirp is used to set job id and current job state (retrieved, starting, running, finished, failed) in the ClassAd
- Sites interested in using this feature must make sure that condor_chirp is locally available (ideally in default installation location /usr/bin)
- If the command is not in the standard PATH, the pilot will attempt to locate the path using the condor_config_val command
  - This method is e.g. used on MWT2
- Also, “want_io_proxy = true” must be set in the job submit jdl to allow the command to update the ClassAd
Pilot now exits after queuedata download if PQ.status is set to ‘offline’
- Previously, pilot only checked whether the queue was active (using PQ.state)
- Added new error code 1386, PANDAQUEUENOTONLINE (used internally)
- Pilot returns exit code 83 to the wrapper in this case
A problem with exception handling prevented alternative stage-out from working properly in a case seen at SARA-MATRIX; this should now be fixed
- Discussed in ATLASPANDA-1547
Improvements to event service error handling
Added GitHub Action for testing code for circular imports
- Using newly developed tool: https://pypi.org/project/circular-import-detector/
Plugin added for the SKA collaboration

Code contributions from W. Guan, P. Nilsson

07 Oct 07:00

PalNilsson

3.11.0.29

dd6382e

3.11.0.29

Memory checks for different resource types
- Discussed in JIRA ticket ATLASPANDA-1051 (“Pilot RSS kill threshold should depend on subresource, not just PQ.maxrss”)
Same memory limits are used for cgroups as for different resource types
- All subprocesses - including payload - are in their own cgroup (“subprocesses”)
- Tested in production at MWT2
- Discussed in JIRA ticket ATLASPANDA-1251 (“Implement memory restrictions on just the pilot payload using cgroup v2”)
Resolved JIRA ticket ATLASPANDA-1483 (“Traceback in pilot for remoteio timeout”)
- An exception was caught in the setup command which could indicate a worker node issue / severe apptainer issue, but for remote file open that is not the case and should not result in a non-zero pilot exit code.
- Example job: 6794333602
JIRA ticket ATLASPANDA-1496 (“Add retries for pilot stage-out”)
- Added --stageout-attempts pilot argument
- Likely needs some further development
Debugging issue with OIDC token seen with long running job at MWT2
- Now decoding the token after refresh to print out its issue and expiry times
Worker node map
- Pilot now adds /usr/sbin to PATH if missing (needed to find lspci command at BNL_GPU)
- Discussed in JIRA ticket ATLASPANDA-1368
Internal improvements
- Refactoring of main pilot module
- Pylint updates - Current average score across all 215 pilot modules: 9.77/10
- Added GitHub Action workflow using new tool VeriCode, capable of running different linters and other tools
  - Currently, it verifies that no individual pilot module has a pylint score of less than 8 (it would fail the PR)
  - More info at https://pypi.org/project/vericode/
