Commit c3b77cb

Fix host stuck in connecting state (apache#8502)
Multiple PRs are seeing test failures in test_vm_life_cycle.py because no host is available for VM migration: apache#8438 (comment), apache#8433 (comment), apache#7344 (comment).

While debugging I noticed that hosts get stuck in the Connecting state because the MS is waiting for the agent's response to the ReadyCommand. Since we take a lock on connection and disconnection, restarting the agent doesn't work. To recover, we have to restart the MS or wait for ~1 hour (the default timeout). On the agent side, the thread is stuck waiting for a response from the Script execution.

To reproduce, run smoke/test_vm_life_cycle.py (the TestSecuredVmMigration test class, to be specific). Once the tests are complete, you will notice that some hosts are stuck in the Connecting state, and restarting the agent fails due to the named lock. Locks on the DB can be checked using the query below:

    SELECT * FROM performance_schema.metadata_locks INNER JOIN performance_schema.threads ON THREAD_ID = OWNER_THREAD_ID WHERE PROCESSLIST_ID <> CONNECTION_ID() \G;

This PR adds a wait for the ready command and a timeout to the Script execution, so that the thread doesn't get stuck and the named lock in the database is released.
1 parent 3936f7c commit c3b77cb

3 files changed

Lines changed: 3 additions & 1 deletion

engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java

Lines changed: 1 addition & 0 deletions
@@ -596,6 +596,7 @@ protected AgentAttache notifyMonitorsOfConnection(final AgentAttache attache, fi

 final Long dcId = host.getDataCenterId();
 final ReadyCommand ready = new ReadyCommand(dcId, host.getId(), NumbersUtil.enableHumanReadableSizes);
+ready.setWait(60);
 final Answer answer = easySend(hostId, ready);
 if (answer == null || !answer.getResult()) {
 // this is tricky part for secondary storage
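
The effect of giving a command a bounded wait can be illustrated outside CloudStack. The following is a minimal sketch, not the `easySend` implementation: `BoundedWait` and `awaitAnswer` are hypothetical names, the pending answer is modelled as a plain `CompletableFuture<String>`, and the wait is assumed to be in seconds (the diff does not state the unit of `setWait(60)`). It returns `null` when no answer arrives in time, mirroring the `answer == null` check in the hunk above, so the caller can fail fast instead of blocking while holding a lock.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWait {
    /**
     * Hypothetical helper (not part of CloudStack): waits at most
     * waitSeconds for a pending answer and returns null on timeout,
     * failure, or interruption.
     */
    public static String awaitAnswer(CompletableFuture<String> pending, long waitSeconds) {
        try {
            return pending.get(waitSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Give up on the answer so the calling thread is not stuck forever.
            pending.cancel(true);
            return null;
        } catch (ExecutionException e) {
            return null;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for callers further up the stack.
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        // An answer that is already available is returned immediately.
        System.out.println(awaitAnswer(CompletableFuture.completedFuture("ready"), 1));
        // An answer that never arrives yields null after roughly one second.
        System.out.println(awaitAnswer(new CompletableFuture<>(), 1));
    }
}
```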

engine/storage/volume/src/main/java/org/apache/cloudstack/storage/datastore/provider/DefaultHostListener.java

Lines changed: 1 addition & 0 deletions
@@ -121,6 +121,7 @@ private NicTO createNicTOFromNetworkAndOffering(NetworkVO networkVO, NetworkOffe
 public boolean hostConnect(long hostId, long poolId) throws StorageConflictException {
 StoragePool pool = (StoragePool) this.dataStoreMgr.getDataStore(poolId, DataStoreRole.Primary);
 ModifyStoragePoolCommand cmd = new ModifyStoragePoolCommand(true, pool);
+cmd.setWait(60);
 final Answer answer = agentMgr.easySend(hostId, cmd);

 if (answer == null) {

plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtReadyCommandWrapper.java

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ private boolean hostSupportsUefi(boolean isUbuntuHost) {
 cmd = "dpkg -l ovmf";
 }
 s_logger.debug("Running command : " + cmd);
-int result = Script.runSimpleBashScriptForExitValue(cmd);
+int result = Script.runSimpleBashScriptForExitValue(cmd, 60, false);
 s_logger.debug("Got result : " + result);
 return result == 0;
 }
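
The idea behind adding a timeout to the script call can be sketched with the standard `Process` API. This is not CloudStack's `Script` class: `BoundedScript` and `runForExitValue` are hypothetical names, the command is run through `/bin/sh` rather than bash, the timeout is assumed to be in seconds (the diff does not state the unit of the `60` argument), and returning `-1` on timeout is a convention chosen for this example only.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class BoundedScript {
    /**
     * Hypothetical helper (not CloudStack's Script class): runs a shell
     * command and returns its exit value, or -1 if it fails to start,
     * is interrupted, or does not finish within timeoutSeconds.
     */
    public static int runForExitValue(String command, long timeoutSeconds) {
        try {
            Process process = new ProcessBuilder("/bin/sh", "-c", command)
                    .redirectErrorStream(true)
                    // Discard output so a chatty command cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // The command overran its budget: kill it and report failure
                // instead of leaving the calling thread blocked.
                process.destroyForcibly();
                return -1;
            }
            return process.exitValue();
        } catch (IOException e) {
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(runForExitValue("exit 0", 60));
        System.out.println(runForExitValue("sleep 5", 1));
    }
}
```

With a bound like this, a hung package query makes the check fail rather than stall, which lets the thread return and release whatever it was holding.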
