Local install by Donaim · Pull Request #2069 · cfe-lab/Kive

Donaim · 2026-05-28T18:22:23Z

This PR adds a utils/dev script that makes it much easier to run and test Kive locally after the repository has been cloned.
This PR also:

- adds a CI job that tests the utils/dev functionality,
- fixes the existing CI job that runs various Kive tests,
- implements small improvements in the Ansible scripts and Kive tests.

- Ensure the controller host alias 'head' resolves locally before Slurm reconfigure runs
- Make the slurm reconfigure handler resilient to transient controller startup races
- Update dependency package list for Ubuntu 24.04 availability and required Ansible modules
- Add a slurm_worker compatibility role that aliases to slurm_node

- Switch Ansible interpreter settings to Python 3 and shorten SSH control path
- Update setup-dev-env user/firewall tasks for Ubuntu and include slurm_builder
- Expand dev environment variables for Python 3.10, PostgreSQL 16, firewall interfaces, Slurm build inputs, and mail/backup settings

- Replace legacy get-pip bootstrap with ensurepip + pip upgrade flow
- Add missing git dependency for source checkout workflows
- Correct remote requirements file copy behavior and mod_wsgi Python target
- Rework firewall enablement to keep SSH access while applying UFW rules
- Parameterize PostgreSQL versioned paths/services and grant schema privileges for migrations
- Use remote slurp-based SSH key exchange for postgres/barman authorized_keys

- Introduced `validate-vm` and `test-api` subcommands in `checks.py` for instance validation and API checks.
- Updated `build-and-test.yml` to include calls for the new commands during CI smoke tests.
- Registered new subcommands in the entry point for command-line access.

- Refactor get_bridge_cidr to simplify command execution
- Add get_existing_network_parent to retrieve configured parent interface
- Update runner to infer host interface for existing instances
- Introduce HTTP request handling and validation in checks.py
- Implement API probe functionality for improved connectivity checks

Remove _remove_bridges and _remove_nftables function definitions and their calls from run_purge. These resources are no longer created by the build-vm path — VM networking uses existing Incus managed networks. Keep: instance deletion, workspace/workdir cleanup, port forward cleanup (for any stale socat processes), and registry removal.

… IPs
smoke-local-install now defaults to VM mode (--instance-type vm) instead of container. Container mode is still available explicitly for CI.
- Add --vm-network NAME to select an existing Incus managed network.
- Remove hardcoded vm_cidr and vm_ip from _build_vm_args: the VM uses DHCP.
- Set no_web_proxy=True for VM mode (no host forwarding needed); container mode keeps no_web_proxy=False for CI compatibility.
- Pass instance_type through to test-api args so the API probe can fall back to incus-exec based probing for VM mode.

…precated
Update build-vm CLI defaults to match the new networking model:
- --vm-network: default '' (auto-detect), help says 'existing Incus managed network'
- --vm-cidr: default '' (deprecated), VM uses DHCP
- --vm-ip: default '' (deprecated), VM uses DHCP

When test-api cannot find a host-reachable API endpoint (no socat/forward), it now falls back to running the probe script inside the VM via incus exec:
1. Push a Python API probe script to /var/tmp/_kive_api_probe_<pid>.py
2. Execute it with 'incus exec INSTANCE -- python3 /var/tmp/...'
3. Parse JSON results
4. Clean up the temp file

Also refactors result checking into the shared helper _check_api_probe_results(). Adds an --instance-type CLI arg to test-api for early VM detection.
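The four fallback steps above can be sketched as follows. This is only an illustration under stated assumptions: `probe_commands` and `parse_probe_output` are hypothetical names, not the helpers in `checks.py`, and the sketch builds the command lines without running them.

```python
import json
import os
from typing import Dict, List, Optional


def probe_commands(instance: str, local_script: str,
                   pid: Optional[int] = None) -> Dict[str, List[str]]:
    """Build the push / run / cleanup command lines (hypothetical helper)."""
    suffix = os.getpid() if pid is None else pid
    remote = f"/var/tmp/_kive_api_probe_{suffix}.py"
    return {
        # 'incus file push SRC INSTANCE/PATH' copies a host file into the instance.
        "push": ["incus", "file", "push", local_script, f"{instance}{remote}"],
        "run": ["incus", "exec", instance, "--", "python3", remote],
        "cleanup": ["incus", "exec", instance, "--", "rm", "-f", remote],
    }


def parse_probe_output(stdout: str) -> dict:
    """Parse the JSON object the probe script is assumed to print on stdout."""
    result = json.loads(stdout)
    if not isinstance(result, dict):
        raise ValueError("probe output must be a JSON object")
    return result
```

Each command list could be handed to `subprocess.run`, with the cleanup step placed in a `finally` block so the temp file is removed even when the probe fails.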

…tests
Remove all tests that referenced the removed custom bridge/NAT stack:
- ensure_owned_bridge, _setup_host_nat, _ensure_ip_forward, _create_owned_bridge
- _ensure_existing_forward_accept, _get_nft_json_list, _find_forward_chains_with_drop_policy
- _remove_forward_rules, _remove_bridges, _remove_nftables
- socat/port-forward _ensure_web_port_forward tests
- host diagnostics (_collect_host_diagnostics) tests

Add new tests:
- choose_existing_vm_network: prefers incusbr0, falls back to single, returns requested, fails on missing/multiple/no networks
- get_managed_networks: parsing and error handling
- ensure_vm_nic: adds with network=, noop on correct config, fails on conflict, replaces stale kive-devel-br parent, accepts profile NICs
- no_bridge_nft_or_iptables_in_network_code: verifies clean removal
- purge does not call ip link or nft delete
- VM mode calls choose_existing_vm_network, no host-local forwarding

…d bridges
The core bug: incusbr0 exists as an unmanaged host bridge (managed: false), but the code tried to attach it with 'network=incusbr0', which only works for Incus-managed networks. For unmanaged bridges the correct syntax is 'nictype=bridged parent=incusbr0'.

Fix:
- Add VmNicTarget(name, managed) dataclass to represent the selected NIC target.
- choose_existing_vm_network() now returns VmNicTarget with the correct managed flag from incus network list output.
- choose_existing_vm_network() handles managed=False bridges (unmanaged host bridges like incusbr0), preferring incusbr0 regardless of managed flag.
- ensure_vm_nic() accepts VmNicTarget and emits network=NAME when managed is True, or nictype=bridged parent=NAME when managed is False.
- Idempotency checks match network= for managed targets and parent= for unmanaged targets.
- Rename get_managed_networks → get_existing_bridges to reflect that it returns all bridges (managed and unmanaged).
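A minimal sketch of the managed/unmanaged distinction described above. `VmNicTarget(name, managed)` comes from the commit message; the helper names `nic_device_args` and `nic_add_command`, and the device name `eth0` used in the test, are assumptions for illustration rather than the PR's actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VmNicTarget:
    name: str      # bridge or network name, e.g. "incusbr0"
    managed: bool  # True when Incus manages the network


def nic_device_args(target: VmNicTarget) -> List[str]:
    """Return the key=value options for the NIC device (hypothetical helper)."""
    if target.managed:
        # Incus-managed networks are attached by network name.
        return [f"network={target.name}"]
    # Unmanaged host bridges need the bridged nictype and a parent interface.
    return ["nictype=bridged", f"parent={target.name}"]


def nic_add_command(instance: str, device: str, target: VmNicTarget) -> List[str]:
    """Build an 'incus config device add' command line for the NIC."""
    return ["incus", "config", "device", "add", instance, device, "nic",
            *nic_device_args(target)]
```

Keeping the managed flag on the target lets the idempotency check compare the same field (`network` or `parent`) that the add command would write.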

Update all network selection and NIC attachment tests to use VmNicTarget:
- choose_existing_vm_network tests: check both .name and .managed on result
- Add test for incusbr0 unmanaged (managed=False) preference
- Add test for requested unmanaged bridge returns managed=False
- ensure_vm_nic tests: managed target emits network=, unmanaged emits nictype=bridged parent=, noop checks match correct field per mode
- Replaces old get_managed_networks references with get_existing_bridges
- Runner tests mock choose_existing_vm_network to return VmNicTarget

Before provisioning, verify the selected NIC target can actually provide IPv4 connectivity to the VM:
- For managed Incus networks: check that 'incus network get NAME ipv4.address' returns a non-empty value (Incus provides DHCP+NAT automatically).
- For unmanaged host bridges: check that 'ip -4 addr show dev NAME' has a global-scope inet address (meaning the bridge is connected upstream).

If the check fails, fail early with a precise message before the VM is started, rather than letting provisioning proceed with a VM that has only link-local IPv6 and no default route.
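The two checks can be illustrated on captured command output. This is a sketch, not the PR's implementation: `managed_has_ipv4` and `bridge_has_global_ipv4` are hypothetical names, and the inputs are assumed to be the stdout of the two commands quoted above.

```python
def managed_has_ipv4(ipv4_address_value: str) -> bool:
    """Managed network: usable when 'incus network get NAME ipv4.address' is non-empty."""
    return bool(ipv4_address_value.strip())


def bridge_has_global_ipv4(ip_addr_output: str) -> bool:
    """Unmanaged bridge: look for an 'inet ... scope global' line in 'ip -4 addr show' output."""
    for line in ip_addr_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != "inet":
            continue  # skips 'inet6' lines and the interface header
        if "scope" in fields:
            idx = fields.index("scope") + 1
            if idx < len(fields) and fields[idx] == "global":
                return True
    return False
```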

Replace the single 'cannot open TCP connection' error with a staged preflight that fails at the first missing dependency:
1. No IPv4 address on enp5s0 → 'no IPv4 DHCP lease was received'
2. IPv4 exists but no default route → 'no default route'
3. IPv4 + route but raw IP (1.1.1.1:53) fails → 'cannot reach internet by IP'
4. Raw IP works but DNS (archive.ubuntu.com) fails → 'DNS resolution is broken'
5. DNS works but TCP egress fails → 'cannot open TCP connection'

Each stage has a distinct error message pointing to the likely cause.
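The staged ordering can be expressed as a small lookup that reports the first failing stage. The messages are the ones listed above; the stage keys and the `first_failure` function are assumptions made for this sketch, and the real preflight runs inside the guest.

```python
from typing import Dict, Optional

# Ordered stages: each later stage only matters if all earlier ones passed.
STAGES = [
    ("has_ipv4", "no IPv4 DHCP lease was received"),
    ("has_default_route", "no default route"),
    ("raw_ip_ok", "cannot reach internet by IP"),
    ("dns_ok", "DNS resolution is broken"),
    ("tcp_ok", "cannot open TCP connection"),
]


def first_failure(results: Dict[str, bool]) -> Optional[str]:
    """Return the message for the first failed stage, or None if all passed."""
    for key, message in STAGES:
        if not results.get(key, False):
            return message
    return None
```

Because a missing key counts as a failure, a probe that stops early still yields the message for the earliest unmet dependency.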

…out IPv4
Add 6 tests:
- managed with IPv4 address configured → True
- managed without IPv4 address → False
- unmanaged bridge with global IPv4 → True (mocked _bridge_has_ipv4)
- unmanaged bridge without global IPv4 → False (mocked _bridge_has_ipv4)
- _bridge_has_ipv4 with global 'inet' line → True
- _bridge_has_ipv4 with only link-local IPv6 → False

…face detection
Auto-detection now only considers Incus-managed networks (managed=True). Unmanaged host bridges like incusbr0 are never auto-selected because they may lack DHCP/NAT/outbound internet. If no managed bridge exists, fail with a message telling the user to use --vm-network NAME explicitly.

Explicit --vm-network requests still accept unmanaged bridges but log a warning: 'Kive cannot guarantee DHCP, NAT, or outbound internet on unmanaged host bridges.'

Guest network preflight now uses dynamic interface detection via the default route instead of hardcoding 'enp5s0' (which broke container mode in CI). The preflight script sets IFACE from the interface named in the default route rather than from a fixed name, so it works for both VMs (enp5s0) and containers (eth0).
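A sketch of the selection policy described above, written against a plain list of `{"name", "managed"}` records. `choose_network` is a hypothetical stand-in for choose_existing_vm_network(), and picking the first managed network when several exist is an assumption of this sketch, not something the commit message states.

```python
import logging
from typing import Dict, List

log = logging.getLogger("kive-dev-sketch")


def choose_network(networks: List[Dict], requested: str = "") -> Dict:
    """Pick a network: explicit request wins, otherwise managed networks only."""
    by_name = {net["name"]: net for net in networks}
    if requested:
        if requested not in by_name:
            raise RuntimeError(f"network {requested!r} not found")
        chosen = by_name[requested]
        if not chosen["managed"]:
            log.warning("Kive cannot guarantee DHCP, NAT, or outbound internet "
                        "on unmanaged host bridges.")
        return chosen
    managed = [net for net in networks if net["managed"]]
    if not managed:
        raise RuntimeError("no Incus-managed network found; "
                           "pass --vm-network NAME explicitly")
    return managed[0]  # assumption: first managed network wins
```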

…flight
Update tests for new auto-select policy:
- auto-select fails when only unmanaged bridges (managed=False) exist
- auto-select prefers a managed bridge over unmanaged incusbr0
- requested unmanaged bridge still accepted (returns managed=False)
- check_vm_network_target_usable always returns True for unmanaged (guest preflight is authoritative)

Remove old test that expected auto-select to pick unmanaged incusbr0.

Two changes to the provision polling loop:
1. Once /var/lib/kive-provision/started has been observed, transport failures (incus exec EOF/timeouts during Ansible) are treated as transient: the loop logs and continues rather than aborting.
2. After 'started' is seen, the poll interval increases from 2s to 10s to reduce pressure on Incus during long-running Ansible provisioning.

Fast-fail on 5 consecutive transport failures is preserved for the early phase before any marker has been observed (container mode only).

…isks
Change ensure_profile_with_root_disk() so --root-size is not silently ignored when the profile already has a root disk with a smaller size.

New behavior:
- Root disk with no explicit size → set to requested size.
- Root disk with parsable size smaller than requested → update to requested size.
- Root disk with parsable size >= requested → no-op (already sufficient).
- Root disk with unparseable size → log warning and skip (keep existing).

Adds _parse_gib() helper for comparing size strings like '10GiB'.
Adds _incus_profile_device_set() helper wrapping the incus CLI command.
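The four cases above reduce to a small decision function. This is a sketch under assumptions: `parse_gib` only approximates what _parse_gib() might accept (a number followed by `GiB`), and `root_disk_action` with its return labels is invented for illustration.

```python
import re
from typing import Optional


def parse_gib(size: str) -> Optional[float]:
    """Parse strings like '10GiB' into a number of GiB; None if unparseable."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*GiB\s*", size)
    return float(match.group(1)) if match else None


def root_disk_action(existing: Optional[str], requested: str) -> str:
    """Decide what to do with an existing root disk size (hypothetical labels)."""
    want = parse_gib(requested)
    if want is None:
        raise ValueError(f"cannot parse requested size {requested!r}")
    if not existing:
        return "set"    # no explicit size: set to the requested size
    have = parse_gib(existing)
    if have is None:
        return "skip"   # unparseable: warn and keep the existing size
    return "grow" if have < want else "noop"
```

Only the "set" and "grow" outcomes would need a call to the incus profile device helper; "noop" and "skip" leave the profile untouched.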

Raise the default root disk from 10GiB to 60GiB across all configuration sources to avoid 'No space left on device' during PostgreSQL/Slurm setup. The 10GiB default was too small for Ubuntu packages + Slurm + Singularity + Python 3.10/deadsnakes + virtualenv + source copy + user data.

Updated:
- BuildVmConfig.root_size: 10GiB → 60GiB
- utils/dev build-vm --root-size default/help: 10GiB → 60GiB
- smoke-local-install _build_vm_args root_size: 10GiB → 60GiB

Update all test references from 10GiB to 60GiB (the new default).
- test_existing_root_disk_is_idempotent: now tests matching 60GiB → no-op.
- test_log_when_existing_root_skipped: uses matching 60GiB for skip message.
- Adds test_existing_root_disk_enlarged_when_too_small: verifies that an existing root disk with size: 10GiB gets updated to 60GiB via 'incus profile device set' when the requested size is larger.

Two fixes that together make locally-submitted Slurm jobs actually start:

1. CoresPerSocket=4 (was implicitly 1). The template defaults sockets/cores_per_socket/threads_per_core to 1, giving CPUs=4 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1. Slurm rejects this because CPUs != Sockets * CoresPerSocket * ThreadsPerCore, resets CPUs to 1, and marks the node INVALID_REG.
   Fix: explicitly set sockets=1 cores_per_socket=4 threads_per_core=1 so that CPUs == Sockets * CoresPerSocket * ThreadsPerCore (4 == 1*4*1).
2. RealMemory=7000 (was 8000). The VM has an 8GiB Incus limit but slurmd reports about 7927 MB. Configuring RealMemory=8000 makes Slurm drain the node with: 'Low RealMemory (reported:7927 < configured:8000)'
   Fix: RealMemory=7000 leaves headroom below what slurmd reports. Slurm accepts nodes whose reported memory is >= configured RealMemory, so 7000 is safe (7927 >= 7000).

Also removes the unused typo'd 'slurmnodes' variable.
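Both conditions are simple arithmetic and can be checked before a node definition ever reaches Slurm. The sketch below uses the numbers from the commit message; the function names are hypothetical and are not part of the Ansible role or its tests.

```python
def topology_is_consistent(cpus: int, sockets: int,
                           cores_per_socket: int, threads_per_core: int) -> bool:
    """CPUs must equal Sockets * CoresPerSocket * ThreadsPerCore."""
    return cpus == sockets * cores_per_socket * threads_per_core


def memory_has_headroom(configured_mb: int, reported_mb: int) -> bool:
    """The node is accepted when slurmd reports at least the configured RealMemory."""
    return reported_mb >= configured_mb


# Values from the commit message: the old config fails, the new one passes.
old_ok = topology_is_consistent(4, 1, 1, 1) and memory_has_headroom(8000, 7927)
new_ok = topology_is_consistent(4, 1, 4, 1) and memory_has_headroom(7000, 7927)
```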

After Ansible sets up Slurm, run a lightweight sanity check before marking provisioning as 'done':
1. scontrol show node head: fail if the node is DOWN/DRAIN/INVALID_REG.
2. sinfo -Nel: log partition/node state for diagnostics.
3. sbatch --partition=debug --wrap='true': submit a tiny job and poll sacct for up to 30s. Fail if the job stays PENDING or enters an unexpected state (prevents the 'PartitionConfig' / 'Nodes required are DOWN, DRAINED' regression).

On failure, print slurm.conf NodeName/PartitionName lines, scontrol, sinfo, squeue, and tails of slurmd/slurmctld logs.

…s) and RealMemory headroom
Replace the single 'memory: 8000' assertion with four tests:
- test_dev_env_vars_slurm_node_topology_valid: checks cpus, memory, cores_per_socket, sockets, threads_per_core are set correctly.
- test_dev_env_vars_realmemory_has_headroom: RealMemory < 8000 (VM limit) to avoid Slurm draining the node when reported memory is slightly less than configured.
- test_dev_env_vars_cpus_equal_sockets_times_cores: verifies CPUs == Sockets * CoresPerSocket * ThreadsPerCore for the default node.
- test_dev_env_vars_no_longer_has_typo_slurmnodes: verifies cleanup.

The sbatch + sacct polling approach was unreliable: sacct can return empty output immediately after sbatch because Slurm accounting hasn't recorded the job yet. The readiness check treated empty output as an unexpected error state and failed. Replace with a synchronous srun --partition=debug --nodes=1 --ntasks=1 /bin/true inside a 120s timeout --foreground. srun blocks until the job completes, so there is no race with accounting. Keep the scontrol+sinfo diagnostics before the test. On failure, print scontrol, sinfo, squeue, sacct, slurm.conf node lines, and slurmd/slurmctld log tails.
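The synchronous check can be sketched as a command line plus an exit-code classifier. The srun options and the 120s `timeout --foreground` wrapper come from the text above; the function names are assumptions, and the sketch relies on GNU `timeout` exiting with status 124 when the limit is hit.

```python
from typing import List


def srun_smoke_command(partition: str = "debug", limit: str = "120s") -> List[str]:
    """Build the blocking smoke-test command (hypothetical helper)."""
    return ["timeout", "--foreground", limit,
            "srun", f"--partition={partition}", "--nodes=1", "--ntasks=1",
            "/bin/true"]


def classify_exit(returncode: int) -> str:
    """Map the wrapper's exit status to a readiness verdict."""
    if returncode == 0:
        return "ok"
    if returncode == 124:  # GNU timeout: the command ran past the limit
        return "timed-out"
    return "failed"
```

Since srun blocks until the job finishes, the exit status alone is enough for a verdict and no accounting query is involved.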

…started
The provision watcher was aborting after 5 consecutive transport failures in container mode even after /var/lib/kive-provision/started had been observed. During long-running Ansible/Slurm provisioning, incus exec can become temporarily unavailable without meaning provisioning failed.

Changes to maybe_provision_instance():
1. Track saw_started, which is set when a successful probe returns 'started'.
2. Before saw_started: fast-fail on 5 consecutive transport failures (container mode only, VM mode already tolerated them).
3. After saw_started: transport failures are logged but never abort. Continue polling until done/failed/stuck/timeout.
4. Poll interval increases from 2s to 10s after saw_started to reduce pressure on Incus during long-running provisioning.
5. Do not run _cloud_init_diagnostics() on transient post-start probe failures, only on real terminal states (failed marker, stuck, timeout, pre-start abort).

New tests:
- container_aborts_on_pre_start_transport_failures
- container_does_not_abort_on_post_start_transport_failures
- container_post_start_eventual_failed_marker
- provision_does_not_abort_when_started_seen_then_transport_fails (explicit regression for the CI pattern)
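The polling policy can be modelled with a scripted sequence of probe results. This is a simplified sketch of container-mode behavior, not maybe_provision_instance() itself: `watch_provision` is a hypothetical name, `None` stands for a transport failure, an empty string for "no marker yet", and the stuck/timeout handling is reduced to running out of probes.

```python
from typing import Callable, Iterable, Optional


def watch_provision(probes: Iterable[Optional[str]],
                    sleep: Callable[[int], None] = lambda seconds: None,
                    max_early_failures: int = 5) -> str:
    """Consume probe results until a terminal marker is seen."""
    saw_started = False
    failures = 0
    for result in probes:
        if result is None:  # transport failure (e.g. incus exec EOF or timeout)
            failures += 1
            if not saw_started and failures >= max_early_failures:
                raise RuntimeError("transport failures before 'started' was seen")
        else:
            failures = 0
            if result == "started":
                saw_started = True
            elif result == "done":
                return "done"
            elif result == "failed":
                raise RuntimeError("provisioning reported 'failed'")
        # Poll slowly once provisioning is known to be running.
        sleep(10 if saw_started else 2)
    raise TimeoutError("no terminal marker observed")
```

Passing `sleep=intervals.append` with a list records the chosen intervals, which makes the 2s to 10s switch easy to assert in a test.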

Add 4 new tests:
- container_aborts_on_pre_start_transport_failures: 5 consecutive transport failures before 'started' still abort (container mode).
- container_does_not_abort_on_post_start_transport_failures: 10+ transport failures after 'started' are tolerated; loop continues.
- container_post_start_eventual_failed_marker: after 'started' is observed, a 'failed' marker still raises with proper diagnostics.
- provision_does_not_abort_when_started_seen_then_transport_fails: explicit regression test for the CI pattern: 'started' observed 3 times, then 10 transport failures ('probe timed out'), then 'done'. Must not raise RuntimeError.

Donaim added 30 commits May 27, 2026 16:03

Add development TLS certificate assets for Kive server role

51ca50c

Include self-signed key/certificate files required by the kive_server role in local VM provisioning flows.

Setup kivedevel

bc14f18

Improve code quality

1487ec5

Add logging support and improve command output in build_vm.py

d8fb69f

Fix import statement in entrypoint.py and add default password hash in build_vm.py

e2a3220

Refactor entrypoint.py to improve argument parsing and structure main function

a7d318e

Enhance command execution in build_vm.py with improved output handling and logging

bad4d8f

Replace print statements with logger calls for improved logging in build_vm.py

5b2f74a

Add verbose logging level support in configure_logging function

51d7533

Add instance type argument to main function and adjust VM creation logic

bc1ded1

Add smoke test workflow for build-vm and enhance file copying logic

73198d5

Update workspace disk handling and default working directory path

6668093

Fix rsync exclude path in _handle_workspace_disk function

2a5c736

Fix build-vm command path in smoke test workflow and update canvas test to skip image diff

62ed44a

Update incus admin initialization with storage backend and enhance canvas test connector retrieval

595c4fa

Enable debug mode for build-vm smoke test command

1a57c5a

Refactor Incus initialization in build-vm smoke test to use preseed configuration and simplify command execution

ad9b8fe

Update connector movement logic in canvas tests to use rendered coordinates and enhance validation checks

554d50d

Remove 'check=False' from subprocess calls in build_vm.py for improved error handling

57fa7c7

Refactor workspace disk check in main function to simplify error handling

f57f3c8

Enhance workspace disk handling by adding support for container instances and refactoring rsync logic

addd2f1

Refactor build-vm command in CI workflow to simplify sudo execution

39071c2

Refactor kivedevel package

1ab7bd6

better structure

7771cbf

Add HTTP request handling and API server checks to validate instance connectivity

7e348be

Donaim added 30 commits June 25, 2026 23:11

entrypoint: add --purge global alias that delegates to purge subcommand

ed367be

utils/dev --purge now works as a shortcut for utils/dev purge. Add --purge as a global optional argument that constructs a minimal argparse Namespace and calls run_purge directly with debug-level logging.
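A minimal argparse sketch of such an alias. Everything here is illustrative: `run_purge` is a stand-in that only reports what it was called with, the `log_level` attribute is a hypothetical way to carry the debug-level setting, and the INFO level for the plain subcommand is an assumption.

```python
import argparse
from typing import List


def run_purge(args: argparse.Namespace) -> str:
    """Stand-in for the real purge routine; reports the log level it received."""
    return f"purged (log_level={args.log_level})"


def main(argv: List[str]) -> str:
    parser = argparse.ArgumentParser(prog="utils/dev")
    # Global flag acting as a shortcut for the 'purge' subcommand.
    parser.add_argument("--purge", action="store_true",
                        help="shortcut for the purge subcommand")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.add_parser("purge")
    ns = parser.parse_args(argv)
    if ns.purge:
        # Build a minimal Namespace and call the purge routine directly.
        return run_purge(argparse.Namespace(log_level="DEBUG"))
    if ns.command == "purge":
        return run_purge(argparse.Namespace(log_level="INFO"))
    parser.print_help()
    return ""
```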

more permissible ppa

d87a565

Revert "more permissible ppa"

74d1b12

This reverts commit d87a565.

Revert "provision: don't abort on transport failures after 'started' marker seen"

38f21f9

This reverts commit c42f3fa.

Revert "tests: auto-select fails on unmanaged bridges; interface-agnostic preflight"

7142895

This reverts commit e3cfc8e.

Revert "network: auto-select only managed networks; cloud-init: dynamic interface detection"

5fe7f01

This reverts commit cdc5d68.

Revert "tests: check_vm_network_target_usable - managed/unmanaged, with/without IPv4"

4bedf0d

This reverts commit 623fcf0.

Revert "provision: improve guest network preflight with staged diagnostics"

2f43945

This reverts commit 03fe512.

Local install #2069
Donaim wants to merge 204 commits into master from local-install
