Failed VR health check #8177

@MartinEmrich

ISSUE TYPE
  • (Maybe) Bug report
  • Improvement Request
COMPONENT NAME
VR, Management Server?
CLOUDSTACK VERSION
4.18.1.0
CONFIGURATION

Single Management server, 2 Shared Networks with non-HA VR each.

OS / ENVIRONMENT

Management Server: CentOS 7
Hypervisor: XCP-ng 8.5

SUMMARY

Every now and then, we get an e-mail from the management server like this:

Health checks failed: 2 failing checks on router r-6849-VM / b8679d51-5f2b-4dfa-913c-4c73506dee2e

STEPS TO REPRODUCE

Unknown, but I suspect it might happen during VM provisioning?

EXPECTED RESULTS

The email should state the reason for the (supposed) failure.

ACTUAL RESULTS

It only says that health checks failed. We notice no real service failure, and at first had no idea what was wrong.

Only after grepping through the management server logs did I notice these two lines:

2023-11-02 08:57:22,320 INFO  [c.c.n.r.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:ctx-33ca5294) (logid:2e04f053) Found health check com.cloud.network.dao.RouterHealthCheckResultVO@28fabd51- check type: advanced,check name: dhcp_check.py, check result: false, check last update: Thu Nov 02 08:50:02 CET 2023, details: Missing elements in dhcphosts.txt -
1e:00:5a:00:21:8d 10.12.96.9 test-ci-12 which took running duration (ms) 85.2541923523
2023-11-02 08:57:22,322 INFO  [c.c.n.r.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:ctx-33ca5294) (logid:2e04f053) Found health check com.cloud.network.dao.RouterHealthCheckResultVO@6d617c4c- check type: advanced,check name: dns_check.py, check result: false, check last update: Thu Nov 02 08:50:02 CET 2023, details: Missing entries for VMs in /etc/hosts -
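
For anyone triaging the same alert, here is a rough sketch of how the failed checks could be pulled out of the management server log without manual grepping. The regular expression is inferred only from the two sample lines above, not from the CloudStack source, so the field layout is an assumption; the helper name `failed_checks` is hypothetical. It also reads only the first line of `details`, so a continuation line such as the `1e:00:5a:00:21:8d ...` entry above is not captured.

```python
import re
import sys

# Field layout inferred from the two "Found health check" sample lines above;
# this is an assumption, not a documented CloudStack log format.
HEALTH_CHECK = re.compile(
    r"check type: (?P<type>\w+),\s*"
    r"check name: (?P<name>[\w.]+),\s*"
    r"check result: (?P<result>true|false),\s*"
    r"check last update: (?P<updated>.+?),\s*"
    r"details: (?P<details>.*)"
)


def failed_checks(lines):
    """Yield (name, last_update, details) for each failed check found in lines."""
    for line in lines:
        match = HEALTH_CHECK.search(line)
        if match and match.group("result") == "false":
            yield (
                match.group("name"),
                match.group("updated"),
                match.group("details").strip(),
            )


if __name__ == "__main__":
    # Usage: python failed_checks.py <path-to-management-server-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for name, updated, details in failed_checks(log):
            print(f"{name} (last update {updated}): {details}")
```

Run against the log excerpt above, this would list `dhcp_check.py` and `dns_check.py` with their first detail line, which is roughly the information I would have liked to see in the alert e-mail.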

A bit above those lines, I found this (good thing we have DEBUG log level on all the time); I shortened it to the relevant part:

2023-11-02 08:57:22,300 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:ctx-33ca5294) (logid:2e04f053) Parsing and updating DB health check data for router: 6849 with data: {"basic":{"lastRun <SNIP> "message": "Missing entries for VMs in /etc/hosts -\n10.12.96.9 test-ci-12", "success": "false"}, <SNIP>

That VM was just being created. Maybe the health check ran during VM creation and got an inconsistent snapshot of the situation?
