server: Fix update capacity for hosts take long time if there are many service offerings #4623

Merged: yadvr merged 1 commit into apache:4.14 from ustcweizhou:4.14-optimize-update-capabity on Feb 4, 2021

@ustcweizhou
Contributor

Description

This PR fixes an issue where updating host capacity takes a long time when there are many service offerings.

Steps to reproduce the issue:

(1) Create 10000 service offerings (via DB changes or CloudMonkey).

(2) Check the total time of the periodic capacity check in CloudStack.

Without this patch, it takes 2.5 seconds (2 hosts):

2021-01-15 16:10:12,793 DEBUG [c.c.a.AlertManagerImpl] (CapacityChecker:ctx-5d5f3b3b) (logid:f5eb68ba) Running Capacity Checker ...
2021-01-15 16:10:15,287 DEBUG [c.c.a.AlertManagerImpl] (CapacityChecker:ctx-5d5f3b3b) (logid:f5eb68ba) Done running Capacity Checker ...

With this patch, it takes 1.3 seconds (2 hosts):

2021-01-15 16:12:43,604 DEBUG [c.c.a.AlertManagerImpl] (CapacityChecker:ctx-a2a7f3f1) (logid:f7e0a4c5) Running Capacity Checker ...
2021-01-15 16:12:44,927 DEBUG [c.c.a.AlertManagerImpl] (CapacityChecker:ctx-a2a7f3f1) (logid:f7e0a4c5) Done running Capacity Checker ...

With 100 hosts, the total time is reduced from over 100 seconds to around 10 seconds.
This significantly reduces the execution time of the Prometheus exporter.
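
The elapsed times quoted above can be checked directly from the log timestamps. The following is a minimal Java sketch, assuming the timestamp layout is `yyyy-MM-dd HH:mm:ss,SSS` as it appears in the quoted log lines. The class and method names (`CapacityCheckTiming`, `elapsedMillis`) are illustrative and not part of CloudStack.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class CapacityCheckTiming {
    // Timestamp layout at the start of the quoted log lines (assumed from the excerpt).
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the number of milliseconds between two log timestamps. */
    static long elapsedMillis(String start, String end) {
        LocalDateTime from = LocalDateTime.parse(start, LOG_TIME);
        LocalDateTime to = LocalDateTime.parse(end, LOG_TIME);
        return Duration.between(from, to).toMillis();
    }

    public static void main(String[] args) {
        // "Running Capacity Checker" and "Done running Capacity Checker" timestamps.
        long withoutPatch = elapsedMillis("2021-01-15 16:10:12,793", "2021-01-15 16:10:15,287");
        long withPatch = elapsedMillis("2021-01-15 16:12:43,604", "2021-01-15 16:12:44,927");
        System.out.println("without patch: " + withoutPatch + " ms"); // prints 2494 ms
        System.out.println("with patch: " + withPatch + " ms");       // prints 1323 ms
    }
}
```

These values match the rounded 2.5 and 1.3 seconds reported for the 2-host setup.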

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

…y service offerings

Steps to reproduce the issue:

(1) Create 10000 service offerings (via the DB changes below or CloudMonkey).

```
-- Helper procedure for the reproduction: bulk-creates test service offerings.
DROP PROCEDURE IF EXISTS cloud.insert_service_offering;

DELIMITER $$
CREATE PROCEDURE cloud.insert_service_offering()
BEGIN
  -- Number of offerings still to create.
  DECLARE count INT DEFAULT 10000;
  -- Start from the next free id; COALESCE covers an empty disk_offering table.
  SET @offeringid = (SELECT COALESCE(MAX(id), 0) + 1 FROM disk_offering);

  WHILE count > 0 DO
    -- The disk_offering row and the service_offering row share the same id.
    INSERT INTO disk_offering (id,name,uuid,display_text,disk_size,type,created) values (@offeringid,'test-offering-wei',uuid(), 'test-offering-wei',0,'Service',now());
    INSERT INTO service_offering (id,cpu,speed,ram_size) values (@offeringid, 1, 500,256);
    SET @offeringid = @offeringid + 1;
    SET count = count - 1;
  END WHILE;
END $$
DELIMITER ;

-- Run the procedure; the output below is from the author's environment.
mysql> CALL cloud.insert_service_offering();
Query OK, 0 rows affected (2 min 30.85 sec)
```

(2) Check the total time of the periodic capacity check in CloudStack.

Without this patch, it takes 2.5 seconds (2 hosts):
```
2021-01-15 16:10:12,793 DEBUG [c.c.a.AlertManagerImpl] (CapacityChecker:ctx-5d5f3b3b) (logid:f5eb68ba) Running Capacity Checker ...
2021-01-15 16:10:15,287 DEBUG [c.c.a.AlertManagerImpl] (CapacityChecker:ctx-5d5f3b3b) (logid:f5eb68ba) Done running Capacity Checker ...
```

With this patch, it takes 1.3 seconds (2 hosts):
```
2021-01-15 16:12:43,604 DEBUG [c.c.a.AlertManagerImpl] (CapacityChecker:ctx-a2a7f3f1) (logid:f7e0a4c5) Running Capacity Checker ...
2021-01-15 16:12:44,927 DEBUG [c.c.a.AlertManagerImpl] (CapacityChecker:ctx-a2a7f3f1) (logid:f7e0a4c5) Done running Capacity Checker ...
```

With 100 hosts, the total time is reduced from over 100 seconds to around 10 seconds.

yadvr added this to the 4.14.1.0 milestone on Jan 27, 2021

@yadvr
Member

yadvr commented Jan 27, 2021

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2607

@yadvr
Member

yadvr commented Feb 1, 2021

@blueorangutan test

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

yadvr requested a review from shwstppr on February 1, 2021 08:59

@blueorangutan

Trillian test result (tid-3457)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 36001 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4623-t3457-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_nic.py
Smoke tests completed. 81 look OK, 2 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_03_deploy_and_upgrade_kubernetes_cluster | Failure | 258.93 | test_kubernetes_clusters.py |
| test_01_nic | Error | 50.06 | test_nic.py |

@yadvr
Member

yadvr commented Feb 2, 2021

@ustcweizhou can you review the test_01_nic failure?

@weizhouapache
Member

> @ustcweizhou can you review the test_01_nic failure?

@rhtyd I ran the test on 4.14 and 4.15, and it succeeded on both.
Can you re-kick the test?

@yadvr
Member

yadvr commented Feb 2, 2021

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@shwstppr
Contributor

shwstppr commented Feb 2, 2021

Running the failed test manually @weizhouapache @rhtyd

@blueorangutan

Packaging result: ✔centos7 ✖centos8 ✔debian. JID-2633

Contributor

DaanHoogland left a comment

Looks like a sensible optimisation: fetch the offerings once before looping over hosts instead of re-fetching them for each host. I think only regression tests are needed for this one.
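
The pattern described in this review comment can be sketched in isolation. The Java example below is a simplified illustration under stated assumptions, not the code changed in this PR: `OfferingStore`, `updateHost`, `updateAllRefetching`, and `updateAllFetchOnce` are hypothetical names, and the offering list is reduced to plain strings. It only shows how hoisting the fetch out of the host loop changes the number of fetches from one per host to one in total.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CapacityUpdateSketch {

    /** Hypothetical stand-in for the offering lookup; counts full-list fetches. */
    static class OfferingStore {
        private final List<String> offerings;
        int fetches = 0;

        OfferingStore(List<String> offerings) {
            this.offerings = offerings;
        }

        List<String> listAll() {
            fetches++;
            return new ArrayList<>(offerings);
        }
    }

    /** Placeholder for the per-host work; returns how many offerings it was given. */
    static int updateHost(long hostId, List<String> offerings) {
        return offerings.size();
    }

    /** Before: every host re-fetches the full offering list. */
    static void updateAllRefetching(List<Long> hostIds, OfferingStore store) {
        for (long hostId : hostIds) {
            updateHost(hostId, store.listAll());
        }
    }

    /** After: fetch once, then reuse the same list for every host. */
    static void updateAllFetchOnce(List<Long> hostIds, OfferingStore store) {
        List<String> offerings = store.listAll();
        for (long hostId : hostIds) {
            updateHost(hostId, offerings);
        }
    }

    public static void main(String[] args) {
        List<Long> hosts = Arrays.asList(1L, 2L, 3L);
        OfferingStore before = new OfferingStore(Arrays.asList("small", "medium"));
        OfferingStore after = new OfferingStore(Arrays.asList("small", "medium"));
        updateAllRefetching(hosts, before);
        updateAllFetchOnce(hosts, after);
        System.out.println("fetches when re-fetching per host: " + before.fetches); // prints 3
        System.out.println("fetches when fetching once: " + after.fetches);         // prints 1
    }
}
```

With many offerings, each fetch is expensive, so the saving grows with the number of hosts, which is consistent with the timings reported in the description.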

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@shwstppr
Contributor

shwstppr commented Feb 2, 2021

> @ustcweizhou can you review the test_01_nic failure?

Failing test verified manually:

==== Marvin Init Started ====

=== Marvin Parse Config Successful ===

=== Marvin Setting TestData Successful===

==== Log Folder Path: /marvin/MarvinLogs/Feb_02_2021_14_27_56_UX96H4. All logs will be available here ====

=== Marvin Init Logging Successful===

==== Marvin Init Successful ====
=== TestName: test_01_nic | Status : SUCCESS ===

@blueorangutan

Trillian test result (tid-3474)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 47885 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4623-t3474-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Smoke tests completed. 82 look OK, 1 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_01_deploy_kubernetes_cluster | Failure | 3602.88 | test_kubernetes_clusters.py |
| test_03_deploy_and_upgrade_kubernetes_cluster | Failure | 251.15 | test_kubernetes_clusters.py |
| test_08_deploy_and_upgrade_kubernetes_ha_cluster | Failure | 166.53 | test_kubernetes_clusters.py |
| ContextSuite context=TestKubernetesCluster>:teardown | Error | 760.79 | test_kubernetes_clusters.py |

yadvr merged commit 78f73c1 into apache:4.14 on Feb 4, 2021

6 participants