Local install #2069
Open
Donaim wants to merge 204 commits into
- Ensure the controller host alias 'head' resolves locally before Slurm reconfigure runs
- Make the slurm reconfigure handler resilient to transient controller startup races
- Update dependency package list for Ubuntu 24.04 availability and required Ansible modules
- Add a slurm_worker compatibility role that aliases to slurm_node

- Switch Ansible interpreter settings to Python 3 and shorten SSH control path
- Update setup-dev-env user/firewall tasks for Ubuntu and include slurm_builder
- Expand dev environment variables for Python 3.10, PostgreSQL 16, firewall interfaces, Slurm build inputs, and mail/backup settings

- Replace legacy get-pip bootstrap with ensurepip + pip upgrade flow
- Add missing git dependency for source checkout workflows
- Correct remote requirements file copy behavior and mod_wsgi Python target
- Rework firewall enablement to keep SSH access while applying UFW rules
- Parameterize PostgreSQL versioned paths/services and grant schema privileges for migrations
- Use remote slurp-based SSH key exchange for postgres/barman authorized_keys
Include self-signed key/certificate files required by the kive_server role in local VM provisioning flows.
…st to skip image diff
…nvas test connector retrieval
…onfiguration and simplify command execution
…inates and enhance validation checks
…nces and refactoring rsync logic
- Introduced `validate-vm` and `test-api` subcommands in `checks.py` for instance validation and API checks.
- Updated `build-and-test.yml` to include calls for the new commands during CI smoke tests.
- Registered new subcommands in the entry point for command-line access.

- Refactor get_bridge_cidr to simplify command execution
- Add get_existing_network_parent to retrieve configured parent interface
- Update runner to infer host interface for existing instances
- Introduce HTTP request handling and validation in checks.py
- Implement API probe functionality for improved connectivity checks
Remove _remove_bridges and _remove_nftables function definitions and their calls from run_purge. These resources are no longer created by the build-vm path — VM networking uses existing Incus managed networks. Keep: instance deletion, workspace/workdir cleanup, port forward cleanup (for any stale socat processes), and registry removal.
… IPs
smoke-local-install now defaults to VM mode (--instance-type vm) instead of container. Container mode is still available explicitly for CI.
- Add --vm-network NAME to select an existing Incus managed network.
- Remove hardcoded vm_cidr and vm_ip from _build_vm_args; the VM uses DHCP.
- Set no_web_proxy=True for VM mode (no host forwarding needed); container mode keeps no_web_proxy=False for CI compatibility.
- Pass instance_type through to test-api args so the API probe can fall back to incus-exec based probing for VM mode.
utils/dev --purge now works as a shortcut for utils/dev purge. Add --purge as a global optional argument that constructs a minimal argparse Namespace and calls run_purge directly with debug-level logging.
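A minimal sketch of how a global `--purge` flag can share a code path with a `purge` subcommand. The parser layout, the `dispatch` helper, and the fields placed on the `Namespace` are assumptions for illustration; only `run_purge` and the debug-level logging come from the commit message.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: a global --purge flag plus a purge subcommand."""
    parser = argparse.ArgumentParser(prog="dev")
    parser.add_argument("--purge", action="store_true",
                        help="shortcut for the purge subcommand")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.add_parser("purge")
    return parser


def dispatch(argv, run_purge):
    """Route both spellings to the same run_purge callable."""
    args = build_parser().parse_args(argv)
    if args.purge:
        # The shortcut builds a minimal Namespace and enables debug logging.
        logging.getLogger().setLevel(logging.DEBUG)
        return run_purge(argparse.Namespace(command="purge"))
    if args.command == "purge":
        return run_purge(args)
    return None
```

Passing `run_purge` in as a parameter keeps the sketch testable without touching any real instances.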
…precated
Update build-vm CLI defaults to match the new networking model:
- --vm-network: default '' (auto-detect), help says 'existing Incus managed network'
- --vm-cidr: default '' (deprecated), VM uses DHCP
- --vm-ip: default '' (deprecated), VM uses DHCP
When test-api cannot find a host-reachable API endpoint (no socat/forward), it now falls back to running the probe script inside the VM via incus exec:
1. Push a Python API probe script to /var/tmp/_kive_api_probe_<pid>.py
2. Execute it with 'incus exec INSTANCE -- python3 /var/tmp/...'
3. Parse JSON results
4. Clean up the temp file
Also refactors result checking into the _check_api_probe_results() shared helper.
Adds an --instance-type CLI arg to test-api for early VM detection.
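The four fallback steps can be sketched as below. This is an illustration under assumptions, not the code in `checks.py`: `probe_commands` and `run_probe_in_vm` are hypothetical names, the cleanup via `rm -f` is a guess at how the temp file is removed, and the injectable `run` parameter exists only so the flow can be exercised without Incus.

```python
import json
import os
import subprocess


def probe_commands(instance: str, local_script: str, pid: int) -> dict:
    """Build the incus commands for push, execute, and cleanup."""
    remote = f"/var/tmp/_kive_api_probe_{pid}.py"
    return {
        # 'incus file push SRC INSTANCE/PATH' copies a file into the guest.
        "push": ["incus", "file", "push", local_script, f"{instance}{remote}"],
        "exec": ["incus", "exec", instance, "--", "python3", remote],
        "cleanup": ["incus", "exec", instance, "--", "rm", "-f", remote],
    }


def run_probe_in_vm(instance: str, local_script: str, run=subprocess.run):
    """Push the probe, run it in the guest, parse its JSON, always clean up."""
    cmds = probe_commands(instance, local_script, os.getpid())
    run(cmds["push"], check=True)
    try:
        done = run(cmds["exec"], check=True, capture_output=True, text=True)
        return json.loads(done.stdout)
    finally:
        # Cleanup is best effort; a failure here should not mask the result.
        run(cmds["cleanup"], check=False)
```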
…tests
Remove all tests that referenced the removed custom bridge/NAT stack:
- ensure_owned_bridge, _setup_host_nat, _ensure_ip_forward, _create_owned_bridge
- _ensure_existing_forward_accept, _get_nft_json_list, _find_forward_chains_with_drop_policy
- _remove_forward_rules, _remove_bridges, _remove_nftables
- socat/port-forward _ensure_web_port_forward tests
- host diagnostics (_collect_host_diagnostics) tests
Add new tests:
- choose_existing_vm_network: prefers incusbr0, falls back to single, returns requested, fails on missing/multiple/no networks
- get_managed_networks: parsing and error handling
- ensure_vm_nic: adds with network=, noop on correct config, fails on conflict, replaces stale kive-devel-br parent, accepts profile NICs
- no_bridge_nft_or_iptables_in_network_code: verifies clean removal
- purge does not call ip link or nft delete
- VM mode calls choose_existing_vm_network, no host-local forwarding
…d bridges
The core bug: incusbr0 exists as an unmanaged host bridge (managed: false) but the code tried to attach it with 'network=incusbr0', which only works for Incus-managed networks. For unmanaged bridges the correct syntax is 'nictype=bridged parent=incusbr0'.
Fix:
- Add VmNicTarget(name, managed) dataclass to represent the selected NIC target.
- choose_existing_vm_network() now returns VmNicTarget with the correct managed flag from incus network list output.
- choose_existing_vm_network() handles managed=False bridges (unmanaged host bridges like incusbr0), preferring incusbr0 regardless of managed flag.
- ensure_vm_nic() accepts VmNicTarget and emits network=NAME when managed is True, or nictype=bridged parent=NAME when managed is False.
- Idempotency checks match network= for managed targets and parent= for unmanaged targets.
- Rename get_managed_networks → get_existing_bridges to reflect that it returns all bridges (managed and unmanaged).
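To make the managed/unmanaged distinction concrete, here is a small sketch. `VmNicTarget(name, managed)` is named in the commit; the helpers `nic_device_args`, `nic_add_command`, and `nic_already_matches` are hypothetical, and the real `ensure_vm_nic()` may be structured quite differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class VmNicTarget:
    name: str
    managed: bool


def nic_device_args(target: VmNicTarget) -> List[str]:
    """Device options: network= for managed networks, bridged parent= otherwise."""
    if target.managed:
        return [f"network={target.name}"]
    return ["nictype=bridged", f"parent={target.name}"]


def nic_add_command(instance: str, device: str, target: VmNicTarget) -> List[str]:
    """Full 'incus config device add' invocation for a NIC device."""
    return ["incus", "config", "device", "add", instance, device, "nic",
            *nic_device_args(target)]


def nic_already_matches(existing: Dict[str, str], target: VmNicTarget) -> bool:
    """Idempotency check: compare the field that applies to this target kind."""
    key = "network" if target.managed else "parent"
    return existing.get(key) == target.name
```

For example, `nic_device_args(VmNicTarget("incusbr0", False))` yields the `nictype=bridged parent=incusbr0` form that the commit describes for unmanaged bridges.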
Update all network selection and NIC attachment tests to use VmNicTarget:
- choose_existing_vm_network tests: check both .name and .managed on result
- Add test for incusbr0 unmanaged (managed=False) preference
- Add test for requested unmanaged bridge returns managed=False
- ensure_vm_nic tests: managed target emits network=, unmanaged emits nictype=bridged parent=, noop checks match correct field per mode
- Replaces old get_managed_networks references with get_existing_bridges
- Runner tests mock choose_existing_vm_network to return VmNicTarget
Before provisioning, verify the selected NIC target can actually provide IPv4 connectivity to the VM:
- For managed Incus networks: check 'incus network get NAME ipv4.address' returns a non-empty value (Incus provides DHCP+NAT automatically).
- For unmanaged host bridges: check 'ip -4 addr show dev NAME' has a global-scope inet address (meaning the bridge is connected upstream).
If the check fails, fail early with a precise message before the VM is started, rather than letting provisioning proceed with a VM that has only link-local IPv6 and no default route.
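The unmanaged-bridge check boils down to scanning `ip -4 addr show dev NAME` output for a global-scope `inet` line. The sketch below shows one way to do that on captured text; `has_global_ipv4` is a hypothetical name and not necessarily how `_bridge_has_ipv4` is written.

```python
def has_global_ipv4(ip_addr_output: str) -> bool:
    """Return True if any 'inet' line in the output has 'scope global'."""
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # Only IPv4 address lines count; 'inet6' lines are ignored.
        if not fields or fields[0] != "inet" or "scope" not in fields:
            continue
        idx = fields.index("scope")
        if idx + 1 < len(fields) and fields[idx + 1] == "global":
            return True
    return False
```

Working on the captured text rather than calling `ip` directly keeps the parsing logic easy to test with canned output.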
Replace the single 'cannot open TCP connection' error with a staged preflight that fails at the first missing dependency:
1. No IPv4 address on enp5s0 → 'no IPv4 DHCP lease was received'
2. IPv4 exists but no default route → 'no default route'
3. IPv4 + route but raw IP (1.1.1.1:53) fails → 'cannot reach internet by IP'
4. Raw IP works but DNS (archive.ubuntu.com) fails → 'DNS resolution is broken'
5. DNS works but TCP egress fails → 'cannot open TCP connection'
Each stage has a distinct error message pointing to the likely cause.
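The staging logic can be illustrated as an ordered table of checks where the first failing stage determines the message. The result keys (`has_ipv4`, `has_default_route`, and so on) and the `first_failure` helper are hypothetical; the messages are the ones listed in the commit.

```python
from typing import Dict, Optional

# Ordered stages: each one only makes sense if all earlier ones passed.
STAGES = [
    ("has_ipv4", "no IPv4 DHCP lease was received"),
    ("has_default_route", "no default route"),
    ("raw_ip_ok", "cannot reach internet by IP"),
    ("dns_ok", "DNS resolution is broken"),
    ("tcp_ok", "cannot open TCP connection"),
]


def first_failure(results: Dict[str, bool]) -> Optional[str]:
    """Return the message for the earliest failed stage, or None if all pass."""
    for key, message in STAGES:
        if not results.get(key, False):
            return message
    return None
```

A missing key is treated as a failure, so a probe that never ran cannot be mistaken for a pass.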
…out IPv4
Add 6 tests:
- managed with IPv4 address configured → True
- managed without IPv4 address → False
- unmanaged bridge with global IPv4 → True (mocked _bridge_has_ipv4)
- unmanaged bridge without global IPv4 → False (mocked _bridge_has_ipv4)
- _bridge_has_ipv4 with global 'inet' line → True
- _bridge_has_ipv4 with only link-local IPv6 → False
…face detection
Auto-detection now only considers Incus-managed networks (managed=True). Unmanaged host bridges like incusbr0 are never auto-selected because they may lack DHCP/NAT/outbound internet. If no managed bridge exists, fail with a message telling the user to use --vm-network NAME explicitly.
Explicit --vm-network requests still accept unmanaged bridges but log a warning: 'Kive cannot guarantee DHCP, NAT, or outbound internet on unmanaged host bridges.'
Guest network preflight now uses dynamic interface detection via the default route instead of hardcoding 'enp5s0' (which broke container mode in CI). The preflight script sets IFACE from the device of the default route rather than a fixed name, which works for both VMs (enp5s0) and containers (eth0).
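Detecting the guest interface from the default route could look roughly like the following. This is a Python illustration of the idea only; the actual preflight is a script run inside the guest, and `default_route_iface` is a hypothetical helper that parses text in the format printed by `ip -4 route show default`.

```python
from typing import Optional


def default_route_iface(route_output: str) -> Optional[str]:
    """Return the device named on the first 'default ... dev IFACE' line."""
    for line in route_output.splitlines():
        fields = line.split()
        if fields[:1] == ["default"] and "dev" in fields:
            idx = fields.index("dev")
            if idx + 1 < len(fields):
                return fields[idx + 1]
    # No default route at all: the caller should report that as its own error.
    return None
```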
…flight
Update tests for new auto-select policy:
- auto-select fails when only unmanaged bridges (managed=False) exist
- auto-select prefers a managed bridge over unmanaged incusbr0
- requested unmanaged bridge still accepted (returns managed=False)
- check_vm_network_target_usable always returns True for unmanaged (guest preflight is authoritative)
Remove old test that expected auto-select to pick unmanaged incusbr0.
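The auto-select policy these tests describe can be sketched as a pure function over the list of known bridges. `choose_network` is a hypothetical stand-in for `choose_existing_vm_network()`, and the tie-breaking for several managed networks (prefer `incusbr0`, otherwise fail) is an assumption carried over from the earlier test descriptions.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

log = logging.getLogger(__name__)


@dataclass(frozen=True)
class VmNicTarget:
    name: str
    managed: bool


def choose_network(bridges: List[VmNicTarget],
                   requested: Optional[str] = None) -> VmNicTarget:
    """Explicit requests may be unmanaged; auto-select only considers managed."""
    if requested:
        for bridge in bridges:
            if bridge.name == requested:
                if not bridge.managed:
                    log.warning("Kive cannot guarantee DHCP, NAT, or outbound "
                                "internet on unmanaged host bridges.")
                return bridge
        raise RuntimeError(f"network {requested!r} not found")
    managed = [b for b in bridges if b.managed]
    if not managed:
        raise RuntimeError("no Incus-managed network found; "
                           "use --vm-network NAME explicitly")
    if len(managed) == 1:
        return managed[0]
    for bridge in managed:
        if bridge.name == "incusbr0":
            return bridge
    raise RuntimeError("multiple managed networks; use --vm-network NAME")
```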
Two changes to the provision polling loop:
1. Once /var/lib/kive-provision/started has been observed, transport failures (incus exec EOF/timeouts during Ansible) are treated as transient: the loop logs and continues rather than aborting.
2. After 'started' is seen, the poll interval increases from 2s to 10s to reduce pressure on Incus during long-running Ansible provisioning.
Fast-fail on 5 consecutive transport failures is preserved for the early phase before any marker has been observed (container mode only).
This reverts commit d87a565.
…marker seen" This reverts commit c42f3fa.
…stic preflight" This reverts commit e3cfc8e.
…ic interface detection" This reverts commit cdc5d68.
…ith/without IPv4" This reverts commit 623fcf0.
…stics" This reverts commit 03fe512.
…isks
Change ensure_profile_with_root_disk() so --root-size is not silently ignored when the profile already has a root disk with a smaller size.
New behavior:
- Root disk with no explicit size → set to requested size.
- Root disk with parsable size smaller than requested → update to requested size.
- Root disk with parsable size >= requested → no-op (already sufficient).
- Root disk with unparseable size → log warning and skip (keep existing).
Adds _parse_gib() helper for comparing size strings like '10GiB'.
Adds _incus_profile_device_set() helper wrapping the incus CLI command.
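The four cases above map onto a small decision function. The sketch below is an assumption about how such a comparison could work: `parse_gib` only understands plain `<number>GiB` strings (the real `_parse_gib()` may accept more formats), and `root_disk_action` is a hypothetical helper that returns a label instead of calling Incus.

```python
import re
from typing import Optional


def parse_gib(size: str) -> Optional[float]:
    """Parse strings like '10GiB' into a number of GiB; None if not parsable."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*GiB\s*", size)
    return float(match.group(1)) if match else None


def root_disk_action(existing: Optional[str], requested: str) -> str:
    """Decide what to do with an existing root disk size."""
    want = parse_gib(requested)
    if want is None:
        raise ValueError(f"cannot parse requested size {requested!r}")
    if not existing:
        return "set"               # no explicit size on the root disk
    have = parse_gib(existing)
    if have is None:
        return "skip-unparseable"  # log a warning and keep the existing value
    return "set" if have < want else "noop"
```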
Raise the default root disk from 10GiB to 60GiB across all configuration sources to avoid 'No space left on device' during PostgreSQL/Slurm setup. The 10GiB default was too small for Ubuntu packages + Slurm + Singularity + Python 3.7/deadsnakes + virtualenv + source copy + user data.
Updated:
- BuildVmConfig.root_size: 10GiB → 60GiB
- utils/dev build-vm --root-size default/help: 10GiB → 60GiB
- smoke-local-install _build_vm_args root_size: 10GiB → 60GiB
Update all test references from 10GiB to 60GiB (the new default).
- test_existing_root_disk_is_idempotent: now tests matching 60GiB → no-op.
- test_log_when_existing_root_skipped: uses matching 60GiB for skip message.
- Adds test_existing_root_disk_enlarged_when_too_small: verifies that an existing root disk with size: 10GiB gets updated to 60GiB via 'incus profile device set' when the requested size is larger.
Two fixes that together make locally-submitted Slurm jobs actually start:
1. CoresPerSocket=4 (was implicitly 1). The template defaults sockets/cores_per_socket/threads_per_core to 1, giving CPUs=4 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1. Slurm rejects this because CPUs != Sockets * CoresPerSocket * ThreadsPerCore, resets CPUs to 1, and marks the node INVALID_REG.
   Fix: explicitly set sockets=1 cores_per_socket=4 threads_per_core=1 so that CPUs == Sockets * CoresPerSocket * ThreadsPerCore (4 == 1*4*1).
2. RealMemory=7000 (was 8000). The VM has an 8GiB Incus limit but slurmd reports about 7927 MB. Configuring RealMemory=8000 makes Slurm drain the node with: 'Low RealMemory (reported:7927 < configured:8000)'
   Fix: RealMemory=7000 leaves headroom below what slurmd reports. Slurm accepts nodes whose reported memory is >= configured RealMemory, so 7000 is safe (7927 >= 7000).
Also removes the unused typo'd 'slurmnodes' variable.
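Both conditions are simple arithmetic and can be checked before a node definition is written out. The helper names below are hypothetical; the numbers are the ones quoted in the commit message.

```python
def topology_consistent(cpus: int, sockets: int,
                        cores_per_socket: int, threads_per_core: int) -> bool:
    """CPUs must equal Sockets * CoresPerSocket * ThreadsPerCore."""
    return cpus == sockets * cores_per_socket * threads_per_core


def memory_has_headroom(configured_mb: int, reported_mb: int) -> bool:
    """Reported memory must be at least the configured RealMemory."""
    return reported_mb >= configured_mb


# Old defaults: 4 CPUs declared, but 1 * 1 * 1 == 1, so the node is rejected.
old_topology_ok = topology_consistent(4, 1, 1, 1)
# New values: 1 * 4 * 1 == 4 matches CPUs=4.
new_topology_ok = topology_consistent(4, 1, 4, 1)
# slurmd reports about 7927 MB: 8000 is too high, 7000 leaves headroom.
old_memory_ok = memory_has_headroom(8000, 7927)
new_memory_ok = memory_has_headroom(7000, 7927)
```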
After Ansible sets up Slurm, run a lightweight sanity check before marking provisioning as 'done':
1. scontrol show node head: fail if node is DOWN/DRAIN/INVALID_REG.
2. sinfo -Nel: log partition/node state for diagnostics.
3. sbatch --partition=debug --wrap='true': submit a tiny job and poll sacct for up to 30s. Fail if the job stays PENDING or enters an unexpected state (prevents the 'PartitionConfig' / 'Nodes required are DOWN, DRAINED' regression).
On failure, print slurm.conf NodeName/PartitionName lines, scontrol, sinfo, squeue, and tails of slurmd/slurmctld logs.
…s) and RealMemory headroom
Replace the single 'memory: 8000' assertion with four tests:
- test_dev_env_vars_slurm_node_topology_valid: checks cpus, memory, cores_per_socket, sockets, threads_per_core are set correctly.
- test_dev_env_vars_realmemory_has_headroom: RealMemory < 8000 (VM limit) to avoid Slurm draining the node when reported memory is slightly less than configured.
- test_dev_env_vars_cpus_equal_sockets_times_cores: verifies CPUs == Sockets * CoresPerSocket * ThreadsPerCore for the default node.
- test_dev_env_vars_no_longer_has_typo_slurmnodes: verify cleanup.
The sbatch + sacct polling approach was unreliable: sacct can return empty output immediately after sbatch because Slurm accounting hasn't recorded the job yet. The readiness check treated empty output as an unexpected error state and failed.
Replace with a synchronous srun --partition=debug --nodes=1 --ntasks=1 /bin/true wrapped in timeout --foreground with a 120s limit. srun blocks until the job completes, so there is no race with accounting.
Keep the scontrol+sinfo diagnostics before the test. On failure, print scontrol, sinfo, squeue, sacct, slurm.conf node lines, and slurmd/slurmctld log tails.
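As an illustration of the synchronous check, the command line can be assembled and run as below. In the PR this runs inside the guest during provisioning; the Python wrapper, `readiness_command`, and `slurm_ready` are hypothetical, and the injectable `run` parameter is there only so the sketch can be exercised without Slurm.

```python
import subprocess
from typing import List


def readiness_command(timeout_s: int = 120, partition: str = "debug") -> List[str]:
    """srun blocks until the job finishes; timeout bounds the total wait."""
    return ["timeout", "--foreground", f"{timeout_s}s",
            "srun", f"--partition={partition}", "--nodes=1", "--ntasks=1",
            "/bin/true"]


def slurm_ready(run=subprocess.run) -> bool:
    """True when the tiny job ran to completion with exit status 0."""
    # GNU timeout exits with status 124 when the time limit is hit,
    # so any non-zero status is treated as "not ready".
    return run(readiness_command()).returncode == 0
```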
…started
The provision watcher was aborting after 5 consecutive transport failures in container mode even after /var/lib/kive-provision/started had been observed. During long-running Ansible/Slurm provisioning, incus exec can become temporarily unavailable without meaning provisioning failed.
Changes to maybe_provision_instance():
1. Track saw_started, set when a successful probe returns 'started'.
2. Before saw_started: fast-fail on 5 consecutive transport failures (container mode only, VM mode already tolerated them).
3. After saw_started: transport failures are logged but never abort. Continue polling until done/failed/stuck/timeout.
4. Poll interval increases from 2s to 10s after saw_started to reduce pressure on Incus during long-running provisioning.
5. Do not run _cloud_init_diagnostics() on transient post-start probe failures, only on real terminal states (failed marker, stuck, timeout, pre-start abort).
New tests:
- container_aborts_on_pre_start_transport_failures
- container_does_not_abort_on_post_start_transport_failures
- container_post_start_eventual_failed_marker
- provision_does_not_abort_when_started_seen_then_transport_fails (explicit regression for the CI pattern)
Add 4 new tests:
- container_aborts_on_pre_start_transport_failures: 5 consecutive
transport failures before 'started' still abort (container mode).
- container_does_not_abort_on_post_start_transport_failures: 10+
transport failures after 'started' are tolerated; loop continues.
- container_post_start_eventual_failed_marker: after 'started' is
observed, a 'failed' marker still raises with proper diagnostics.
- provision_does_not_abort_when_started_seen_then_transport_fails:
explicit regression test for the CI pattern — 'started' observed
3 times, then 10 transport failures ('probe timed out'), then
'done'. Must not raise RuntimeError.
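The polling behaviour these tests cover can be sketched as a loop with an injected probe and sleep function. This is a simplified illustration, not `maybe_provision_instance()` itself: `TransportError`, `watch_provision`, and the `max_polls` bound are hypothetical, and the sketch omits the container/VM distinction, stuck detection, and diagnostics.

```python
class TransportError(Exception):
    """Hypothetical error for an incus exec EOF or probe timeout."""


def watch_provision(probe, sleep, max_polls=1000, fail_fast_after=5):
    """Poll until 'done'; tolerate transport failures once 'started' was seen."""
    saw_started = False
    consecutive_failures = 0
    for _ in range(max_polls):
        try:
            state = probe()
        except TransportError:
            state = None
            consecutive_failures += 1
            # Fast-fail only applies before any 'started' marker was observed.
            if not saw_started and consecutive_failures >= fail_fast_after:
                raise RuntimeError("transport failed before provisioning started")
        else:
            consecutive_failures = 0
        if state == "started":
            saw_started = True
        elif state == "done":
            return "done"
        elif state == "failed":
            raise RuntimeError("provisioning failed")
        # Slower polling once provisioning is known to be running.
        sleep(10 if saw_started else 2)
    raise TimeoutError("provisioning did not finish")
```

Feeding it three `started` results, ten transport failures, and then `done` reproduces the regression scenario: the loop returns `done` without raising.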
This PR adds a utils/dev script that makes it much easier to run and test Kive locally after the repository has been cloned.
This PR also:
utils/dev functionality,