HA considerations: Shutdown & startup


Note & disclaimer: High Availability configurations are not officially supported by Eucalyptus, and anything dealing with DB syncs should be taken with extreme caution. 

 

If you are running HA and need to shut down your components (OS upgrade, maintenance, etc.), there is a best-practice order of operations to follow. It minimizes the chance of the wrong DB coming up or of creating a split-brain situation in your cloud.

Shutting down

The first thing that you want to do is stop all Eucalyptus services on the secondary components. Once all secondary components are down, make sure the cloud is still functioning with only the primary components active. It should be; euca-describe-services should show the secondary components as stopped instead of disabled.

Make sure the DB is stopped on the secondary CLC; a split-brain can occur if something hangs before you bring down the primary components. Confirm that the DB is not running via:

ps aux | grep postgres
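Note that a plain grep over the process list also matches the grep command itself, so the output is never empty. A minimal sketch of a stricter check follows; the function name no_postgres_in is hypothetical and not part of Eucalyptus. It reads a process listing on stdin and succeeds only when no line other than a grep line mentions postgres.

```shell
#!/bin/sh
# Hypothetical helper (not part of Eucalyptus).
# Reads a process listing on stdin, drops any line that contains "grep"
# (the grep process itself), and returns 0 only if none of the
# remaining lines mentions "postgres".
no_postgres_in() {
    ! grep -v 'grep' | grep -q 'postgres'
}

# Usage on the secondary CLC:
#   ps aux | no_postgres_in && echo "DB is stopped"
```

Because it drops every line containing the string grep, a process whose name or arguments happen to include that string would also be ignored; for a quick manual check this is usually acceptable.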

Once the secondary components are confirmed stopped, bring down the primary components. Run the same checks you used on the secondary components to ensure the cloud is completely down, and then perform whatever maintenance is necessary.

 

Starting the cloud back up

When you bring the components back up, it's important to bring them up in the appropriate order. Bring up the primary CLC first, and let it come up completely, to the point where it's showing ENABLED and is responding to API calls (euca-describe-services). At that point, bring up the primary CC/SC and NC (and Walrus, assuming it wasn't started with the CLC). Again, watch euca-describe-services until the cloud is completely operational; test instances, check volumes, availability, etc. Once the cloud is fully functional and you are certain it's running as it should be, go ahead and start the secondary CLC.
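Waiting for the primary CLC to report ENABLED can be scripted rather than watched by hand. The following is a rough sketch, not an official tool: wait_for_state is a hypothetical helper, and it assumes that each line of euca-describe-services output contains the service type and its state as separate whitespace-delimited words. The pattern used to pick out the CLC line (eucalyptus in the usage comment) is also an assumption; adjust it to match what your installation prints.

```shell
#!/bin/sh
# Hypothetical helper (not part of Eucalyptus).
# wait_for_state CMD PATTERN STATE [TRIES] [DELAY]
# Runs CMD repeatedly until a line matching PATTERN also contains STATE
# as a whole word. Returns 0 on success, 1 after TRIES failed attempts.
wait_for_state() {
    cmd=$1
    pattern=$2
    state=$3
    tries=${4:-30}    # default: 30 attempts
    delay=${5:-10}    # default: 10 seconds between attempts
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Errors from CMD (e.g. the CLC not answering yet) count as "not ready".
        if $cmd 2>/dev/null | grep "$pattern" | grep -qw "$state"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage on the primary CLC (pattern "eucalyptus" is an assumption):
#   wait_for_state euca-describe-services eucalyptus ENABLED 60 10 \
#       && echo "primary CLC is ENABLED"
```

A script like this only confirms the reported state; it does not replace testing instances and volumes before starting the secondary CLC.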

There will be a period of time where you lose communication with the CLC as the DBs sync between them; don't worry. As long as you brought the systems up carefully and in the correct order, what's happening is that the secondary CLC is purging its DB and syncing to the primary; this is a good thing.

After the sync completes, your cloud should respond to API requests again. The primary components should be ENABLED and the secondary components DISABLED, waiting for a time when they need to step in.
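For a quick look at the end state, the service states can be tallied from the command output. This is only an illustrative sketch: summarize_states is a hypothetical helper, and it assumes that ENABLED and DISABLED appear as standalone words somewhere on each service line of euca-describe-services output, without depending on a particular column.

```shell
#!/bin/sh
# Hypothetical helper (not part of Eucalyptus).
# Reads service listing text on stdin and prints how many times the
# words ENABLED and DISABLED occur as whitespace-separated fields.
summarize_states() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                if ($i == "ENABLED" || $i == "DISABLED")
                    n[$i]++
        }
        END { printf "ENABLED=%d DISABLED=%d\n", n["ENABLED"], n["DISABLED"] }
    '
}

# Usage:
#   euca-describe-services | summarize_states
```

In a healthy HA pair you would expect each primary component to contribute to the ENABLED count and each secondary to the DISABLED count; any other state is worth investigating by reading the full output.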

Please forward us any questions or concerns that you have. Thank you!

 
