
VR: fix expunging vm will remove dhcp entries of another vm in VR #4627

Merged: yadvr merged 1 commit into apache:4.14 from ustcweizhou:4.14-dhcp-entry-issue on Feb 5, 2021

Conversation

@ustcweizhou
Contributor

@ustcweizhou ustcweizhou commented Jan 27, 2021

Description

This PR fixes an issue where expunging a VM removes the DHCP entries of another VM in the VR.

Steps to reproduce the issue

(1) Create three VMs (wei-001, wei-002 and wei-003) and start them.
Assume that wei-001 has the biggest IP when the addresses are ordered as strings rather than numerically (for example, as strings, 192.168.0.99 > 192.168.0.100).

(2) Check /etc/cloudstack/dhcpentry.json and /etc/dhcphosts.txt in the VR.
They have entries for all three VMs.

(3) Stop wei-002 and wei-003, then restart the VR (or restart the network with cleanup).
Check /etc/cloudstack/dhcpentry.json and /etc/dhcphosts.txt in the VR.
They have entries for wei-001 only (as wei-002 and wei-003 are stopped).

(4) Expunge wei-002. When it is done,
check /etc/cloudstack/dhcpentry.json and /etc/dhcphosts.txt in the VR.
The last entry ("id": "dhcpentry") is removed.

(5) Expunge wei-003. When it is done,
check /etc/cloudstack/dhcpentry.json and /etc/dhcphosts.txt in the VR.
The entry for wei-001 is removed, and its IP is removed from /etc/dhcphosts.txt.
The VR health check then fails at dhcp_check.py and dns_check.py.

This does not always happen, as the items in /etc/cloudstack/dhcpentry.json are ordered by IP address as strings.
If the last item is the DHCP entry for a running VM, its DHCP info will be removed from dnsmasq.
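The string-vs-numeric ordering pitfall described above can be sketched as follows (illustrative only, not the VR code; the addresses are examples):

```python
# Illustrative sketch of the ordering pitfall described above (not the VR code).
# Sorting IPv4 addresses as strings puts "192.168.0.99" AFTER "192.168.0.100",
# so "the last entry" in a string-ordered list is not the numerically highest IP.
import ipaddress

ips = ["192.168.0.100", "192.168.0.99", "192.168.0.3"]

as_strings = sorted(ips)                             # lexicographic order
as_numbers = sorted(ips, key=ipaddress.IPv4Address)  # numeric order

print(as_strings)  # ['192.168.0.100', '192.168.0.3', '192.168.0.99']
print(as_numbers)  # ['192.168.0.3', '192.168.0.99', '192.168.0.100']
```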

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Steps to reproduce the issue

(1) Create two VMs, wei-001 and wei-002, and start them.

(2) Check /etc/cloudstack/dhcpentry.json and /etc/dhcphosts.txt in the VR.
They have entries for both wei-001 and wei-002.

(3) Stop wei-002, and restart the VR (or restart the network with cleanup).
Check /etc/cloudstack/dhcpentry.json and /etc/dhcphosts.txt in the VR.
They have entries for wei-001 only (as wei-002 is stopped).

(4) Expunge wei-002. When it is done,
check /etc/cloudstack/dhcpentry.json and /etc/dhcphosts.txt in the VR.
They no longer have an entry for wei-001.
The VR health check fails at dhcp_check.py and dns_check.py.
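A check of the kind dhcp_check.py performs can be sketched as follows (a minimal illustration, not the actual script; it assumes dnsmasq's "mac,ip,hostname[,lease]" line format for /etc/dhcphosts.txt, and the MAC and IP values are made up):

```python
# Minimal sketch of verifying a VM's dnsmasq host entry (illustrative, not the
# actual dhcp_check.py). Assumes dhcphosts.txt lines look like
# "mac,ip,hostname,lease", which is dnsmasq's --dhcp-hostsfile format.
def has_dhcp_entry(dhcphosts_lines, hostname):
    """Return True if any line carries a DHCP entry for the given hostname."""
    for line in dhcphosts_lines:
        fields = line.strip().split(",")
        if len(fields) >= 3 and fields[2] == hostname:
            return True
    return False

# Example state after step (4): wei-002 was expunged, wei-001 should remain.
lines = ["02:00:4c:1f:00:01,10.1.1.2,wei-001,infinite"]
print(has_dhcp_entry(lines, "wei-001"))  # True: entry for wei-001 survives
print(has_dhcp_entry(lines, "wei-002"))  # False: wei-002 was expunged
```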
@weizhouapache
Member

@rhtyd @DaanHoogland @shwstppr
this looks like a major issue.

@yadvr yadvr added this to the 4.14.1.0 milestone Jan 28, 2021
@shwstppr
Contributor

@blueorangutan package

@blueorangutan

@shwstppr a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@yadvr yadvr requested review from nvazquez and shwstppr January 28, 2021 08:16
@blueorangutan

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2611

@shwstppr
Contributor

I'm not able to reproduce the issue on KVM or VMware environments. Can someone else verify? @vladimirpetrov @nvazquez @rhtyd

@weizhouapache
Member

I'm not able to reproduce the issue on a KVM and a VMware environment. Can someone else verify @vladimirpetrov @nvazquez @rhtyd

@shwstppr did you follow the steps in the description? In which step did you get a different result?

@shwstppr
Contributor

@weizhouapache in final step. After expunging second VM, VR still had an entry for the first VM.
Also just for info, I was testing with an isolated network.

@weizhouapache
Member

@weizhouapache in final step. After expunging second VM, VR still had an entry for the first VM.
Also just for info, I was testing with an isolated network.

@shwstppr that is strange. Which CloudStack version did you test?

@yadvr
Member

yadvr commented Feb 1, 2021

@shwstppr can you test Wei's steps to reproduce the issue against the latest 4.14 branch?

@weizhouapache
Member

@shwstppr can you test against Wei's steps to reproduce the issue against latest 4.14 branch

@rhtyd @shwstppr
I tested with 4.15; it cannot be reproduced. Strange.
I have closed this PR and will double-check with 4.14.

@weizhouapache
Member

@shwstppr I updated the steps in the description.
The issue can be reproduced.

@weizhouapache weizhouapache reopened this Feb 1, 2021
@shwstppr
Contributor

shwstppr commented Feb 2, 2021

@weizhouapache will test and update

@yadvr
Member

yadvr commented Feb 2, 2021

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos7 ✖centos8 ✖debian. JID-2631

@shwstppr
Contributor

shwstppr commented Feb 2, 2021

Not sure why, but I'm still not able to reproduce this.
My VMs were:
VM-001 - IP: 10.1.1.206
VM-002 - IP: 10.1.1.157
VM-003 - IP: 10.1.1.117

After I stopped VM-002 & VM-003, restarted network with cleanup and then expunged VM-002, the last entry in /etc/cloudstack/dhcpentry.json was not removed.
Later, on expunging VM-003, /etc/dhcphosts.txt still had an entry for VM-001

@blueorangutan

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2639

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3478)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42266 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4627-t3478-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Smoke tests completed. 83 look OK, 0 have error(s)
Only the failed test results are shown below:

Test Result Time (s) Test File

@weizhouapache
Member

not sure why but I'm still not able to reproduce this.
My VMs were like,
VM-001 - IP: 10.1.1.206
VM-002 - IP: 10.1.1.157
VM-003 - IP: 10.1.1.117

After I stopped VM-002 & VM-003, restarted network with cleanup and then expunged VM-002, the last entry in /etc/cloudstack/dhcpentry.json was not removed.
Later, on expunging VM-003, /etc/dhcphosts.txt still had an entry for VM-001

@shwstppr strange.
I will test with CloudStack 4.16.

@shwstppr
Contributor

shwstppr commented Feb 3, 2021

@weizhouapache I was testing 4.14, #4150 to be specific

@weizhouapache
Member

weizhouapache commented Feb 3, 2021

@weizhouapache I was testing 4.14, #4150 to be specific

@shwstppr I have tested with 4.14, 4.15 and 4.16; this issue can be reproduced.
I use Ubuntu for testing.
Could you please try again? It is better to create 4 VMs (2 running, 2 stopped) for testing.

Correction: it looks like the IPs are randomly removed from /etc/cloudstack/dhcpentry.json, not always the last IP.
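One way an entry for the wrong VM can disappear is sketched below. This is a hypothetical illustration of the failure mode, not the actual VR code or the fix; the entry data and the "index 0" history are made up:

```python
# Hypothetical sketch of the failure mode: removing a DHCP entry by list
# position (after the list has been re-ordered) can drop another VM's entry,
# while removing by matching the VM's identity is safe. Entry data is made up.
entries = [
    {"host_name": "wei-001", "ipv4_address": "192.168.0.100"},
    {"host_name": "wei-003", "ipv4_address": "192.168.0.99"},
]

# Unsafe: the caller remembers that wei-003 sat at index 0 before a re-sort,
# but after string-ordering ("192.168.0.100" < "192.168.0.99") index 0 now
# holds wei-001, so wei-001's entry is the one that gets dropped.
by_index = [e for i, e in enumerate(entries) if i != 0]
print([e["host_name"] for e in by_index])  # ['wei-003'] -> wei-001 was lost

# Safe: remove by matching the expunged VM's identity.
by_identity = [e for e in entries if e["host_name"] != "wei-003"]
print([e["host_name"] for e in by_identity])  # ['wei-001']
```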

@yadvr
Member

yadvr commented Feb 4, 2021

@shwstppr can you test with Ubuntu 18.04 and see if that helps? @weizhouapache is the env adv zone or adv zone with SG, or some other permutation?

@weizhouapache
Member

@shwstppr can you test with Ubuntu 18.04 and see if that helps? @weizhouapache is the env adv zone or adv zone with SG, or some other permutation?

@rhtyd @shwstppr I use an advanced zone with isolated networks.
I will test with shared networks.

@weizhouapache
Member

@shwstppr can you test with Ubuntu 18.04 and see if that helps? @weizhouapache is the env adv zone or adv zone with SG, or some other permutation?

@rhtyd @shwstppr I use advanced zone with isolated networks.
I will test with shared networks.

Tested with a shared network in an advanced zone; it has the same issue.

@shwstppr
Contributor

shwstppr commented Feb 4, 2021

@weizhouapache @rhtyd I'm able to reproduce the issue with 4.14 with Ubuntu 18.04 mgmt and hosts. Testing the fix now.


@shwstppr shwstppr left a comment


Tested with PR packages against an Ubuntu 18.04 mgmt server and 2x Ubuntu 18.04 hosts.
/etc/dhcphosts.txt continued to have an entry for the remaining VM.

[Screenshot from 2021-02-04 18-18-53]

A fresh health check returned success for all tests.

[Screenshot from 2021-02-04 18-21-28]

@yadvr yadvr merged commit d62d5c6 into apache:4.14 Feb 5, 2021