
Clean up inactive iscsi sessions when VMs get moved due to crashes #3819

Merged

DaanHoogland merged 4 commits into apache:4.13 from syed:iscsi-session-cleanup-master on Jan 30, 2020

Conversation

@skattoju4 (Contributor) commented Jan 17, 2020

Description

Previously, iSCSI sessions would not be cleaned up on KVM when

  • VMs crash or are rebooted on another host
  • a hypervisor host crashes and its HA-enabled VMs are restarted elsewhere

This leads to long boot times on the hypervisor host and to session overloading of the NetApp SolidFire nodes; each node can hold a maximum of 700 iSCSI sessions. The iSCSI daemon on the hypervisor host holds on to sessions that are never cleaned up: it retries the connection every time, keeping the unused sessions open, and only an explicit logout from the volume solves the problem. This change adds a kvm.storage.IscsiStorageCleanupMonitor class that scans for and cleans up inactive iSCSI sessions (a minimal sketch follows below).
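
The sketch below assumes only the structure visible in the review snippets quoted later in this thread; CLEANUP_INTERVAL_SEC, diskStatusMap, and the class name come from the PR, while the run-loop scaffolding and the commented scan/logout steps are illustrative, not the merged implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the cleanup loop; the scan and logout steps are
// outlined as comments and are illustrative, not the merged code.
public class IscsiStorageCleanupMonitor implements Runnable {
    private static final int CLEANUP_INTERVAL_SEC = 60; // check every X seconds

    // iSCSI disk path -> was the disk seen attached to a running domain?
    private final Map<String, Boolean> diskStatusMap = new HashMap<>();

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(CLEANUP_INTERVAL_SEC * 1000L);
                // 1. Presume every known iSCSI disk is inactive this pass.
                diskStatusMap.replaceAll((path, active) -> false);
                // 2. Walk all running libvirt domains and flip the disks
                //    they reference back to active (see the listDomains()
                //    snippet quoted in the review below).
                // 3. Log out of the iSCSI session behind each disk that is
                //    still marked inactive, releasing it on the target.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```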

Steps to reproduce:

  1. Check that a VM is running and that sessions to the NetApp SolidFire nodes exist: iscsiadm -m session -P 0
  2. Crash the KVM host (e.g. power it off; be sure VMs are running on it) or destroy a virtual machine (virsh destroy vmname)
  3. Run iscsiadm -m session -P 0 again; you will see sessions that should no longer be there

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

This has been tested by observing that inactive iscsi sessions are cleaned up when VMs are removed.

@andrijapanicsb (Contributor)

Hm - when does this happen exactly @skattoju4?
I recall that Mike T. implemented a fix for the same issue when VMs are stopped inside the guest OS (originally the iSCSI session would not be removed).

@svenvogel (Contributor) commented Jan 17, 2020 via email

@svenvogel (Contributor)

@syed @skattoju4 Thanks for your work! We tested it in our production environment and it works fine.

@svenvogel (Contributor)

@syed can you review please?

@skattoju4 changed the title from "Clean up inactive iscsi sessions when VMs are removed" to "Clean up inactive iscsi sessions when VMs get moved due to crashes" on Jan 17, 2020
@syed (Contributor) commented Jan 22, 2020

LGTM 👍

@DaanHoogland added this to the 4.14.0.0 milestone on Jan 23, 2020
@DaanHoogland (Contributor)

Apart from the license, do you need this in 4.13?

@svenvogel (Contributor)

@DaanHoogland for me we don't need it in 4.13. Do you see anything that means we need it?

@DaanHoogland (Contributor)

@svenvogel you marked it as a bug, that's why I asked, nothing else ... ;)

@GabrielBrascher (Member) left a comment

Thanks for the PR @skattoju4. I left a question and a minor observation. Overall it looks good (+0.99 🙂 )


```java
public class IscsiStorageCleanupMonitor implements Runnable {
    private static final Logger s_logger = Logger.getLogger(IscsiStorageCleanupMonitor.class);
    private static final int CLEANUP_INTERVAL_SEC = 60; // check every X seconds
```
Inline comment from @GabrielBrascher (Member):

I am OK with the way it is, but I would like to raise the following question: would it be worth externalizing CLEANUP_INTERVAL_SEC as a global settings variable (config key)?
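
A hedged sketch of that suggestion, following CloudStack's usual ConfigKey pattern; the setting name, category, and default below are assumptions for illustration, not an existing key:

```java
import org.apache.cloudstack.framework.config.ConfigKey;

public class IscsiCleanupSettings {
    // Hypothetical setting: the name "kvm.storage.iscsi.cleanup.interval",
    // the category, and the default are illustrative only.
    public static final ConfigKey<Integer> IscsiSessionCleanupInterval = new ConfigKey<>(
            "Advanced", Integer.class,
            "kvm.storage.iscsi.cleanup.interval", "60",
            "Interval in seconds between scans for inactive iSCSI sessions on KVM hosts",
            true); // dynamic: may be changed without an agent restart
}
```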

```java
            diskStatusMap.put(disk.getDiskPath(), true);
            s_logger.debug("active disk found by cleanup thread" + disk.getDiskPath());
        }
    }
```
Inline comment from @GabrielBrascher (Member):

I would recommend creating a few methods (e.g. checkIfIscsSessionBelongAnyVm(Connect), etc.), which would allow documenting (Javadoc) and unit testing (JUnit).
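
As an illustration of that refactoring (not the merged code), the domain scan quoted above could be extracted into a method roughly like the following, which would be straightforward to document and unit test:

```java
import java.util.Map;

import org.libvirt.Connect;
import org.libvirt.Domain;
import org.libvirt.LibvirtException;

import com.cloud.hypervisor.kvm.resource.LibvirtDomainXMLParser;
import com.cloud.hypervisor.kvm.resource.LibvirtVMDef;

// Fragment of IscsiStorageCleanupMonitor; illustrative extraction only.
// Marks every disk that belongs to a running libvirt domain as active,
// so only orphaned sessions stay flagged for cleanup.
private void markDisksOfRunningDomainsActive(Connect conn, Map<String, Boolean> diskStatusMap)
        throws LibvirtException {
    for (int domId : conn.listDomains()) {
        Domain dm = conn.domainLookupByID(domId);
        LibvirtDomainXMLParser parser = new LibvirtDomainXMLParser();
        parser.parseDomainXML(dm.getXMLDesc(0));
        for (LibvirtVMDef.DiskDef disk : parser.getDisks()) {
            if (diskStatusMap.containsKey(disk.getDiskPath())) {
                diskStatusMap.put(disk.getDiskPath(), true);
            }
        }
    }
}
```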

@GabrielBrascher (Member) commented Jan 23, 2020

@DaanHoogland @svenvogel @andrijapanicsb considering that it is a bug, it would be awesome to have it in the 4.13 branch (aiming at the 4.13.1 release) and then forward it to master (aiming at 4.14).

@svenvogel (Contributor) commented Jan 24, 2020

@GabrielBrascher @DaanHoogland @andrijapanicsb Gabriel - you are right. we should move it to the 4.13 branch and then forward it to 4.14 👍

@skattoju4 can you change this?

@skattoju4 changed the base branch from master to 4.13 on January 27, 2020 19:50
@skattoju4 force-pushed the iscsi-session-cleanup-master branch 2 times, most recently from cc3402f to a750ce0, on January 27, 2020 20:17
@skattoju4 changed the base branch from 4.13 to master on January 27, 2020 20:19
@skattoju4 changed the base branch from master to 4.13 on January 27, 2020 20:20
@skattoju4 force-pushed the iscsi-session-cleanup-master branch from a750ce0 to 932a98b on January 27, 2020 21:01
Comment on lines +77 to +94
```java
// check if they belong to any VM
int[] domains = conn.listDomains();
s_logger.debug(String.format("found %d domains", domains.length));
for (int domId : domains) {
    Domain dm = conn.domainLookupByID(domId);
    final String domXml = dm.getXMLDesc(0);
    final LibvirtDomainXMLParser parser = new LibvirtDomainXMLParser();
    parser.parseDomainXML(domXml);
    List<LibvirtVMDef.DiskDef> disks = parser.getDisks();

    //check the volume map. If an entry exists change the status to True
    for (final LibvirtVMDef.DiskDef disk : disks) {
        if (diskStatusMap.containsKey(disk.getDiskPath())) {
            diskStatusMap.put(disk.getDiskPath(), true);
            s_logger.debug("active disk found by cleanup thread" + disk.getDiskPath());
        }
    }
}
```
Inline comment from @DaanHoogland (Contributor):

Suggested change:

```java
checkDiskStatusMap();
```

```java
for (final LibvirtVMDef.DiskDef disk : disks) {
    if (diskStatusMap.containsKey(disk.getDiskPath())) {
        diskStatusMap.put(disk.getDiskPath(), true);
        s_logger.debug("active disk found by cleanup thread" + disk.getDiskPath());
```
Inline comment from @DaanHoogland (Contributor):

Space needed in the log message before the path.
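
That is, the fixed line would read:

```java
s_logger.debug("active disk found by cleanup thread " + disk.getDiskPath());
```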

```java
            }
        }
    }
}
```
Inline comment from @DaanHoogland (Contributor):

Suggested change:

```java
disconnectPhysicalDisks();
```

@DaanHoogland (Contributor) left a comment

No real issues in the code, but I would like to see some more partitioning of the logic into methods.

@yadvr closed this on Jan 29, 2020
@yadvr reopened this on Jan 29, 2020
@DaanHoogland (Contributor)

Thanks for the changes @skattoju3. Looks a bit cleaner.

@DaanHoogland dismissed their stale review on January 29, 2020 11:40

changes look good

@DaanHoogland (Contributor)

@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖ centos6 ✔ centos7 ✔ debian. JID-714

@DaanHoogland (Contributor)

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-853)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30196 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3819-t853-kvm-centos7.zip
Smoke tests completed. 77 look OK, 0 have error(s)

@DaanHoogland merged commit 6baa598 into apache:4.13 on Jan 30, 2020
@skattoju4 (Contributor, Author)

> Thanks for the changes @skattoju3. Looks a bit cleaner.

Thanks for the feedback :)

@simon-a-james

Would this fix cause a problem with hypervisors that use iSCSI for their system disk?

It seems that when IscsiStorageCleanupMonitor runs, it kills the iSCSI connection for the hypervisor's iSCSI system disk, which sends the system disk into a read-only state. Is there a way to stop this from running? I don't use iSCSI for KVM guests.

@skattoju4 (Contributor, Author)

Yes, this seems to be the case; it is currently a limitation. iSCSI disks that are not attached to VMs will be marked inactive and the monitor will attempt to clean them up. One solution could be to make this feature toggleable via a global setting (a sketch follows below).
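
For instance, a hedged sketch of such a toggle, reading an on/off flag from the agent's properties file before the monitor thread is started; the property name iscsi.session.cleanup.enabled is an assumption here, not an existing key:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class IscsiCleanupToggle {
    // Hypothetical property name, shown only to illustrate the proposed toggle.
    private static final String CLEANUP_ENABLED_KEY = "iscsi.session.cleanup.enabled";

    public static void maybeStartCleanupMonitor() throws IOException {
        Properties agentProps = new Properties();
        try (FileInputStream in = new FileInputStream("/etc/cloudstack/agent/agent.properties")) {
            agentProps.load(in);
        }
        // Default to enabled so the current behaviour is preserved.
        if (Boolean.parseBoolean(agentProps.getProperty(CLEANUP_ENABLED_KEY, "true"))) {
            new Thread(new IscsiStorageCleanupMonitor(), "iscsi-session-cleanup").start();
        }
    }
}
```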

@simon-a-james

Thanks for the update. If I just use the 4.11 agent, will that skip the iSCSI cleanup so the HVs can work normally again, or is this a management server thing?

@skattoju4 (Contributor, Author)

It's part of the agent. Using an older agent should do the trick for now.

