Fix contacting remote scheduler in jobs and add GDB to KingMaker environment#94

Merged: nshadskiy merged 3 commits into main from fix-remote-scheduler-in-jobs on May 9, 2026
@moritzmolch

Contributor

Inside HTCondor jobs, the LAW task tries to contact the remote scheduler at an address that is not reachable. This leads to errors like the one in [1], which are raised after the sample has already been processed successfully. The fix forces the task to use the local scheduler.
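
For illustration only, the idea of forcing the local scheduler can be sketched as a small argument rewrite. This is not the code from this PR (the diff is not shown here): `force_local_scheduler` is a hypothetical helper, and the example command line is assumed. The only real names are luigi's `--local-scheduler`, `--scheduler-host` and `--scheduler-port` options.

```python
def force_local_scheduler(args):
    """Return a copy of a luigi/law argument list that uses the local scheduler.

    Hypothetical helper: drops any remote scheduler address and makes sure
    the --local-scheduler flag is present.
    """
    remote_opts = ("--scheduler-host", "--scheduler-port")
    remote_prefixes = tuple(opt + "=" for opt in remote_opts)

    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This is the value belonging to a removed "--opt value" pair.
            skip_next = False
            continue
        if arg in remote_opts:
            # "--scheduler-host localhost" form: drop option and its value.
            skip_next = True
            continue
        if arg.startswith(remote_prefixes):
            # "--scheduler-port=50642" form: drop the single token.
            continue
        cleaned.append(arg)

    if "--local-scheduler" not in cleaned:
        cleaned.append("--local-scheduler")
    return cleaned


# Assumed example command line, using the task and branch seen in the log [1].
job_args = [
    "law", "run", "CROWNRun", "--branch", "14",
    "--scheduler-host", "localhost", "--scheduler-port", "50642",
]
local_args = force_local_scheduler(job_args)
print(" ".join(local_args))
```

With the local scheduler, the worker inside the job no longer needs an HTTP connection to `localhost:50642`, so the `/api/add_task` call from the traceback in [1] would not be made.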

In addition, gdb has been added as a requirement to the KingMaker minimal standalone environment.

[1] Example log:

WARNING: Failed connecting to remote scheduler 'http://localhost:50642'
NoneType: None
WARNING: Failed pinging scheduler
──────────────────────────────────────────────────────────── Finished CROWNRun ─────────────────────────────────────────────────────────────
WARNING: Failed connecting to remote scheduler 'http://localhost:50642'
NoneType: None
WARNING: Failed connecting to remote scheduler 'http://localhost:50642'
NoneType: None
WARNING: Failed connecting to remote scheduler 'http://localhost:50642'
NoneType: None
WARNING: Failed connecting to remote scheduler 'http://localhost:50642'
NoneType: None
WARNING: Failed connecting to remote scheduler 'http://localhost:50642'
NoneType: None
WARNING: Failed connecting to remote scheduler 'http://localhost:50642'
NoneType: None
WARNING: Failed pinging scheduler
ERROR: Uncaught exception in luigi
Traceback (most recent call last):
  File "/opt/conda/envs/env/lib/python3.12/site-packages/urllib3/connection.py", line 204, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/opt/conda/envs/env/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/env/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/urllib3/connectionpool.py", line 493, in _make_request
    conn.request(
  File "/opt/conda/envs/env/lib/python3.12/site-packages/urllib3/connection.py", line 500, in request
    self.endheaders()
  File "/opt/conda/envs/env/lib/python3.12/http/client.py", line 1333, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/envs/env/lib/python3.12/http/client.py", line 1093, in _send_output
    self.send(msg)
  File "/opt/conda/envs/env/lib/python3.12/http/client.py", line 1037, in send
    self.connect()
  File "/opt/conda/envs/env/lib/python3.12/site-packages/urllib3/connection.py", line 331, in connect
    self.sock = self._new_conn()
                ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/urllib3/connection.py", line 219, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: HTTPConnection(host='localhost', port=50642): Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/env/lib/python3.12/site-packages/requests/adapters.py", line 644, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/urllib3/util/retry.py", line 535, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=50642): Max retries exceeded with url: /api/add_task (Caused by NewConnectionError("HTTPConnection(host='localhost', port=50642): Failed to establish a new connection: [Errno 111] Connection refused"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/env/lib/python3.12/site-packages/luigi/rpc.py", line 185, in _fetch
    response = scheduler_retry(self._fetcher.fetch, full_url, body, self._connect_timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/tenacity/__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
          ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/tenacity/__init__.py", line 185, in reraise
    raise self.last_attempt.result()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/opt/conda/envs/env/lib/python3.12/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/luigi/rpc.py", line 131, in fetch
    resp = self.session.post(full_url, data=body, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/requests/adapters.py", line 677, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=50642): Max retries exceeded with url: /api/add_task (Caused by NewConnectionError("HTTPConnection(host='localhost', port=50642): Failed to establish a new connection: [Errno 111] Connection refused"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/env/lib/python3.12/site-packages/luigi/retcodes.py", line 75, in run_with_retcodes
    worker = luigi.interface._run(argv).worker
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/luigi/interface.py", line 217, in _run
    return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/job_3elezc2vXwtn/law/law/patches.py", line 94, in _schedule_and_run
    return _schedule_and_run_orig(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/luigi/interface.py", line 177, in _schedule_and_run
    success &= worker.run()
               ^^^^^^^^^^^^
  File "/srv/job_3elezc2vXwtn/law/law/patches.py", line 91, in run
    return run_orig(self)
           ^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/luigi/worker.py", line 1239, in run
    self._handle_next_task()
  File "/opt/conda/envs/env/lib/python3.12/site-packages/luigi/worker.py", line 1140, in _handle_next_task
    self._add_task(worker=self._id,
  File "/srv/job_3elezc2vXwtn/law/law/patches.py", line 182, in _add_task
    return _add_task_orig(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/luigi/worker.py", line 638, in _add_task
    self._scheduler.add_task(*args, **kwargs)
  File "/opt/conda/envs/env/lib/python3.12/site-packages/luigi/scheduler.py", line 114, in rpc_func
    return self._request('/api/{}'.format(fn_name), actual_args, **request_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/luigi/rpc.py", line 198, in _request
    page = self._fetch(url, body)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/env/lib/python3.12/site-packages/luigi/rpc.py", line 187, in _fetch
    raise RPCError(
luigi.rpc.RPCError: Errors (3 attempts) when connecting to remote scheduler 'http://localhost:50642'
task exit code: 60
07/05/2026 11:12:55.475264861 (CEST)
execution of branch 14 failed (exit code 60), stop job

@nshadskiy

Contributor

The current container was not able to run on local resources because /ceph was not mounted in it. Moritz pushed a fix for that.

@tvoigtlaender tvoigtlaender left a comment

Contributor

Looks good

@nshadskiy nshadskiy merged commit 73f1f7b into main May 9, 2026
1 check passed
@nshadskiy nshadskiy deleted the fix-remote-scheduler-in-jobs branch May 9, 2026 20:05
3 participants