Cache libvirt domain state #819

Open
ben-grande wants to merge 17 commits into QubesOS:main from ben-grande:cache-running

Conversation

@ben-grande

@ben-grande ben-grande commented Jun 1, 2026

Contributor

The state changes rarely, but querying it is a blocking call that can take considerable time, especially when looping through the state of multiple domains.

For: QubesOS/qubes-issues#10569
For: QubesOS/qubes-issues#9902


I didn't run OpenQA locally; I only experimented with a few things:

  • start
  • shutdown
  • kill
  • pause
  • unpause
  • checking Domains Widget
  • restart qubesd
  • qubesd-query -e dom0 admin.vm.CurrentState QUBE
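
The idea in the description (state changes rarely, querying is slow and blocking) can be illustrated with a minimal sketch. This is not the PR's code: `DomainStateCache`, `slow_query`, and `on_lifecycle_event` are hypothetical names, and the slow backend is a plain function standing in for a libvirt call.

```python
from typing import Callable, Dict


class DomainStateCache:
    """Cache per-domain state and refresh it only when an event reports a change."""

    def __init__(self, query: Callable[[str], str]) -> None:
        self._query = query  # slow, blocking lookup (stand-in for libvirt)
        self._states: Dict[str, str] = {}

    def get(self, name: str) -> str:
        # Hit the slow backend only on a cache miss.
        if name not in self._states:
            self._states[name] = self._query(name)
        return self._states[name]

    def on_lifecycle_event(self, name: str, new_state: str) -> None:
        # Hypothetical event hook: store the state that the event reports.
        self._states[name] = new_state

    def invalidate(self, name: str) -> None:
        # Drop the entry so the next get() queries the backend again.
        self._states.pop(name, None)


calls = []


def slow_query(name: str) -> str:
    # Records each call so the number of backend queries can be inspected.
    calls.append(name)
    return "Running"


cache = DomainStateCache(slow_query)
states = [cache.get("work") for _ in range(3)]  # one backend query, three reads
cache.on_lifecycle_event("work", "Halted")      # event updates the cache directly
```

With this shape, a loop over many domains pays for the blocking query at most once per domain, and correctness depends on every state change being delivered as an event.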

@ben-grande

Contributor Author

PipelineRetryFailed

@ben-grande

Contributor Author

I didn't run OpenQA locally; I only experimented with a few things: start, shutdown, kill, and checking the Domains Widget.

@ben-grande

Contributor Author

PipelineRetryFailed

@codecov

codecov Bot commented Jun 1, 2026


Codecov Report

❌ Patch coverage is 43.11927% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.12%. Comparing base (9196c9a) to head (367ae38).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| qubes/vm/qubesvm.py | 31.42% | 48 Missing ⚠️ |
| qubes/device_protocol.py | 0.00% | 5 Missing ⚠️ |
| qubes/vm/adminvm.py | 20.00% | 4 Missing ⚠️ |
| qubes/ext/pci.py | 87.50% | 3 Missing ⚠️ |
| qubes/app.py | 33.33% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #819      +/-   ##
==========================================
- Coverage   70.42%   70.12%   -0.30%     
==========================================
  Files          61       61              
  Lines       14143    14174      +31     
==========================================
- Hits         9960     9940      -20     
- Misses       4183     4234      +51     

| Flag | Coverage Δ |
|---|---|
| unittests | 70.12% <43.11%> (-0.30%) ⬇️ |


@ben-grande ben-grande force-pushed the cache-running branch 2 times, most recently from 5eff83f to aa2e823 on June 1, 2026 21:26

@ben-grande

Contributor Author

PipelineRetryFailed

@ben-grande

Contributor Author

PipelineRetryFailed

@ben-grande ben-grande marked this pull request as draft June 2, 2026 07:55
@ben-grande

Contributor Author

Marked as draft, as there are still some caching issues with _power_state.

@marmarek

marmarek commented Jun 2, 2026

Member

If necessary (or simpler), it's IMO okay to use the cache only for is_running() (which is used frequently) but still do an active check for get_power_state(). It could also be a safety valve for cases when the cache is not reliable (like during startup or shutdown).
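
A minimal sketch of that split, under stated assumptions: `FakeBackend`, `Domain`, and `on_state_event` are hypothetical stand-ins rather than the qubes-core-admin classes, and "running" is simplified to "any state other than Halted".

```python
class FakeBackend:
    """Stand-in for a slow hypervisor query; counts how often it is called."""

    def __init__(self, state: str) -> None:
        self.current = state
        self.calls = 0

    def state(self) -> str:
        self.calls += 1
        return self.current


class Domain:
    def __init__(self, backend: FakeBackend) -> None:
        self._backend = backend
        self._running = None  # None means "not cached yet"

    def get_power_state(self) -> str:
        # Always an active check; it also refreshes the cached flag.
        state = self._backend.state()
        self._running = state != "Halted"
        return state

    def is_running(self) -> bool:
        # Cheap path: only query the backend when nothing is cached.
        if self._running is None:
            self.get_power_state()
        return self._running

    def on_state_event(self, running: bool) -> None:
        # Hypothetical hook for an event callback to update the cache.
        self._running = running


backend = FakeBackend("Running")
dom = Domain(backend)
checks = [dom.is_running() for _ in range(100)]  # a single backend query
```

In this sketch a stale cached flag is corrected the next time get_power_state() is called, which is the "safety valve" behaviour described above.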

@ben-grande

Contributor Author

> If necessary (or simpler), it's IMO okay to use the cache only for is_running() (which is used frequently) but still do an active check for get_power_state(). It could also be a safety valve for cases when the cache is not reliable (like during startup or shutdown).

I'm experimenting with the cache a bit more and it's getting better, but if I can't get it to a hundred percent, I won't cache the power state.

@ben-grande

ben-grande commented Jun 2, 2026

Contributor Author

From my tests, it seems to be working. My nemesis, OpenQA, should try me. Do you know a subset of tests that would be interesting to run before including this PR in the full run?

Looking at this recent test:

  • system_tests_suspend (soft-failed)
  • system_tests_usbproxy
  • system_tests_devices
  • system_tests_gui_tools
  • system_tests_guivm_gui_interactive

Line to use once this PR is no longer a draft:

openQArun TEST=system_tests_suspend,system_tests_usbproxy,system_tests_devices,system_tests_gui_tools,system_tests_guivm_gui_interactive

@marmarek

marmarek commented Jun 2, 2026

Member

I'd also include at least one of the guivm tests.

@ben-grande ben-grande force-pushed the cache-running branch 4 times, most recently from 9538d21 to 6011ca3 on June 2, 2026 21:30

@ben-grande

Contributor Author

PipelineRetryFailed

@ben-grande

Contributor Author

The last CI run failed on Fedora 43 due to:

======================================================================
ERROR: qubes.tests.api_admin/TC_00_VMs/test_150_pool_info
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python3.14/contextlib.py", line 85, in inner
    return func(*args, **kwds)
  File "/home/gitlab-runner/builds/QubesOS/qubes-core-admin/qubes/tests/api_admin.py", line 76, in setUp
    super().setUp()
    ~~~~~~~~~~~~~^^
  File "/home/gitlab-runner/builds/QubesOS/qubes-core-admin/qubes/tests/__init__.py", line 516, in setUp
    self.loop = asyncio.get_event_loop()
                ~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/lib64/python3.14/asyncio/events.py", line 715, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
                       % threading.current_thread().name)
RuntimeError: There is no current event loop in thread 'MainThread'.

So it finally reached this repo.
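
For context, the traceback above shows asyncio.get_event_loop() raising when no loop is set on Python 3.14. One common way to avoid depending on an implicit loop in test setup is to create one explicitly. The sketch below is an assumption about a possible workaround, not the fix applied in this repository; `LoopTestCase` is a hypothetical class name.

```python
import asyncio
import unittest


class LoopTestCase(unittest.TestCase):
    def setUp(self):
        super().setUp()
        # Create the loop explicitly instead of relying on get_event_loop()
        # to create one implicitly.
        self.loop = asyncio.new_event_loop()
        asyncio.set_event_loop(self.loop)
        self.addCleanup(self._close_loop)

    def _close_loop(self):
        # Unset and close the loop so tests do not leak it into each other.
        asyncio.set_event_loop(None)
        self.loop.close()

    def test_loop_runs(self):
        async def answer():
            return 42

        self.assertEqual(self.loop.run_until_complete(answer()), 42)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(LoopTestCase)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```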

@ben-grande

Contributor Author

PipelineRetryFailed

@ben-grande

Contributor Author

openQArun TEST=system_tests_suspend,system_tests_usbproxy,system_tests_devices,system_tests_gui_tools,system_tests_guivm_gui_interactive

@qubesos-bot

qubesos-bot commented Jun 3, 2026


OpenQA test summary

Complete test suite and dependencies: https://openqa.qubes-os.org/tests/overview?distri=qubesos&version=4.3&build=202606030906-devel&flavor=pull-requests

Test run included the following:

New failures, excluding unstable

Compared to: https://openqa.qubes-os.org/tests/overview?distri=qubesos&version=4.3&build=2026050504-devel&flavor=update

  • system_tests_gui_tools
    • qui_widgets_update: unnamed test (unknown)

    • qui_widgets_update: Failed (test died)
      # Test died: no candidate needle with tag(s) 'qubes-update-finish' ...

    • qui_widgets_update: unnamed test (unknown)

Failed tests

4 failures
  • system_tests_gui_tools
    • qui_widgets_update: unnamed test (unknown)

    • qui_widgets_update: Failed (test died)
      # Test died: no candidate needle with tag(s) 'qubes-update-finish' ...

    • qui_widgets_update: unnamed test (unknown)

  • system_tests_usbproxy
    • [unstable] TC_20_USBProxy_core3_fedora-43-xfce: test_090_attach_stubdom (error)
      qubes.exc.QubesVMError: Cannot connect to qrexec agent for 120 seco...

Fixed failures

Compared to: https://openqa.qubes-os.org/tests/176874#dependencies

1 fixed
  • system_tests_usbproxy
    • system_tests: wait_serial (wait serial expected)
      # wait_serial expected: qr/h3uXO-\d+-/...

Unstable tests

Details
  • system_tests_suspend

    suspend/Failed (2/5 times with errors)
    • job 179086 # Test died: command 'qvm-run -p sys-net true' timed out at /usr/li...
    • job 179098 # Test died: command 'qvm-run -p sys-net true' timed out at /usr/li...
    suspend/Failed (3/5 times with errors)
    • job 178918 # Test died: command '! qvm-check sys-usb || qvm-run -p sys-usb tru...
    • job 179097 # Test died: command '! qvm-check sys-usb || qvm-run -p sys-usb tru...
    • job 179101 # Test died: command '! qvm-check sys-usb || qvm-run -p sys-usb tru...
    suspend/wait_serial (2/5 times with errors)
    suspend/wait_serial (3/5 times with errors)
    • job 178918 # Command: ! qvm-check sys-usb || qvm-run -p sys-usb true...
    • job 179097 # Command: ! qvm-check sys-usb || qvm-run -p sys-usb true...
    • job 179101 # Command: ! qvm-check sys-usb || qvm-run -p sys-usb true...
    suspend/wait_serial (2/5 times with errors)
    suspend/wait_serial (3/5 times with errors)
    • job 178918 # wait_serial expected: "lspci; echo 2E8vz-\$?-"...
    • job 179097 # wait_serial expected: "lspci; echo 2E8vz-\$?-"...
    • job 179101 # wait_serial expected: "lspci; echo 2E8vz-\$?-"...
  • system_tests_suspend@hw1

    suspend/Failed (2/5 times with errors)
    • job 179086 # Test died: command 'qvm-run -p sys-net true' timed out at /usr/li...
    • job 179098 # Test died: command 'qvm-run -p sys-net true' timed out at /usr/li...
    suspend/Failed (3/5 times with errors)
    • job 178918 # Test died: command '! qvm-check sys-usb || qvm-run -p sys-usb tru...
    • job 179097 # Test died: command '! qvm-check sys-usb || qvm-run -p sys-usb tru...
    • job 179101 # Test died: command '! qvm-check sys-usb || qvm-run -p sys-usb tru...
    suspend/wait_serial (2/5 times with errors)
    suspend/wait_serial (3/5 times with errors)
    • job 178918 # Command: ! qvm-check sys-usb || qvm-run -p sys-usb true...
    • job 179097 # Command: ! qvm-check sys-usb || qvm-run -p sys-usb true...
    • job 179101 # Command: ! qvm-check sys-usb || qvm-run -p sys-usb true...
    suspend/wait_serial (2/5 times with errors)
    suspend/wait_serial (3/5 times with errors)
    • job 178918 # wait_serial expected: "lspci; echo 2E8vz-\$?-"...
    • job 179097 # wait_serial expected: "lspci; echo 2E8vz-\$?-"...
    • job 179101 # wait_serial expected: "lspci; echo 2E8vz-\$?-"...
  • system_tests_usbproxy

    system_tests/Fail (3/5 times with errors)
    • job 179921 Tests qubes.tests.extra failed (exit code 1), details reported sepa...
    • job 180877 Tests qubes.tests.extra failed (exit code 1), details reported sepa...
    • job 181248 Tests qubes.tests.extra failed (exit code 1), details reported sepa...
    system_tests/Failed (3/5 times with errors)
    • job 179921 # Test died: Some tests failed at qubesos/tests/system_tests.pm lin...
    • job 180877 # Test died: Some tests failed at qubesos/tests/system_tests.pm lin...
    • job 181248 # Test died: Some tests failed at qubesos/tests/system_tests.pm lin...
    TC_00_USBProxy_fedora-43-xfce/test_000_attach_detach (1/5 times with errors)
    • job 181248 qubes.exc.QubesVMError: Cannot connect to qrexec agent for 120 seco...
    TC_20_USBProxy_core3_fedora-43-xfce/test_000_list (1/5 times with errors)
    • job 181248 qubes.exc.QubesVMError: Cannot connect to qrexec agent for 120 seco...
    TC_20_USBProxy_core3_fedora-43-xfce/test_080_attach_existing_policy (1/5 times with errors)
    • job 181248 qubes.exc.QubesVMError: Cannot connect to qrexec agent for 120 seco...
    TC_20_USBProxy_core3_fedora-43-xfce/test_090_attach_stubdom (2/5 times with errors)
    • job 179921 qubes.exc.QubesVMError: Cannot connect to qrexec agent for 120 seco...
    • job 180877 qubes.exc.QubesVMError: Cannot connect to qrexec agent for 120 seco...

Performance Tests

Performance degradation:

No issues

Remaining performance tests:

No remaining performance tests

@ben-grande

Contributor Author
  • system_tests_gui_tools

    • qui_widgets_update: unnamed test (unknown)
    • qui_widgets_update: Failed (test died)
      # Test died: no candidate needle with tag(s) 'qubes-update-finish' ...
    • qui_widgets_update: unnamed test (unknown)

"Next and previous results" shows that this also happened 3 days ago, so it is unrelated to this PR: https://openqa.qubes-os.org/tests/182011#step/qui_widgets_update/18.

  • system_tests_guivm_gui_interactive

    • update_guivm: wait_serial (unknown)
      # Command: (set -o pipefail; qubesctl --all --show-output state.hig...
    • update_guivm: Failed (test died + timed out)
      # Test died: command '(set -o pipefail; qubesctl --all --show-outpu...
    • update_guivm: wait_serial (unknown)
      # Command: curl --form upload=@/var/log/libvirt/libxl/libxl-driver....

It also failed recently, and qubesctl-sys-gui.log was not uploaded. The video shows it froze at that stage.

  • system_tests_usbproxy

    • TC_20_USBProxy_core3_debian-13-xfce: test_090_attach_stubdom (error)
      qubes.exc.QubesVMError: Cannot connect to qrexec agent for 120 seco...

Doesn't seem related to this PR.


I will restart the failed tests.

@ben-grande ben-grande force-pushed the cache-running branch 3 times, most recently from bf795ff to 367ae38 on June 4, 2026 13:30

@marmarek

marmarek commented Jun 6, 2026

Member

PipelineRetry

ben-grande added 10 commits June 9, 2026 13:55

Useful when debugging to know what's happening. I thought of logging the
pretty name of the detail, but that got really big, and I think it's out
of scope for the Qubes OS project and in scope for the python-libvirt
package.

The state changes rarely, but querying it can take considerable
blocking time, which gets worse when looping through the state of
multiple domains.

For: QubesOS/qubes-issues#10569
For: QubesOS/qubes-issues#9902

This data only changes once when the domain is running.

This function reimports the same information over and over, which is
time consuming when looping through multiple domains. Reduces listing
USB devices while there is no cache from 2.8s to 0.5s.

Method 'startswith' adds considerable overhead.

The libvirt API can be quite slow on the "listAllDevices" and "XMLDesc"
methods, which is aggravated in loops.
Assuming that hotplug is unsupported.
Decreases the time to create the cache by ~10ms.

The "get_vm_stats()" and "xs()" already check if Xen is supported.
It's not completely Xen agnostic because then it would not be possible
to query stubdomains, as libvirt is not aware of them. On the other
hand, non-Xen has one more API method working.

When not using Xen, construct the dictionary as "xc.domain_getinfo()"
would, so the info loop can consume from both inputs.

Each call was recording 1ms; now the calls are below that.

Although it doesn't help reduce the time to get the XML, as XMLDesc() is
super slow, this helps clean up the code a bit.

Skip PVH when listing attached PCI devices, as they can't possibly be
there.
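
The commit note about building a dictionary shaped like the output of "xc.domain_getinfo()" can be illustrated roughly as follows. This is a sketch under assumptions: the key names (`domid`, `mem_kb`, `online_vcpus`, `cpu_time`), the sample tuple layout, and both helper functions are illustrative and are not taken from the PR.

```python
from typing import Dict, List, Tuple

# (domid, memory in KiB, online vCPUs, CPU time): made-up sample layout,
# standing in for values gathered through a non-Xen API.
Sample = Tuple[int, int, int, int]


def as_getinfo_entries(samples: List[Sample]) -> List[Dict[str, int]]:
    """Convert samples into dicts keyed like the Xen ones (assumed keys)."""
    return [
        {"domid": domid, "mem_kb": mem_kb, "online_vcpus": vcpus, "cpu_time": cpu}
        for domid, mem_kb, vcpus, cpu in samples
    ]


def summarize(info: List[Dict[str, int]]) -> Dict[int, Dict[str, int]]:
    # One loop consumes either input, because both share the same keys.
    return {
        entry["domid"]: {
            "memory_kb": entry["mem_kb"],
            "cpu_time": int(entry["cpu_time"]),
            "vcpus": max(entry["online_vcpus"], 1),
        }
        for entry in info
    }


xen_like = [{"domid": 1, "mem_kb": 4096, "online_vcpus": 2, "cpu_time": 1000}]
non_xen = as_getinfo_entries([(1, 4096, 2, 1000)])
```

Normalizing at the edge keeps the consuming loop free of backend-specific branches.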
3 participants