COPING WITH MEDIUM/LARGE VAXCLUSTERS                      August 1992
------------------------------------

This small collection of VAXcluster utilities is referred to in the
DECUS Australia 1992 Symposium Proceedings.  It forms part of the
approach adopted by the High Frequency Radar Division (HFRD) of the
Defence Science Technology Organisation (DSTO) for managing its 60+
member CI/NI VAXcluster.

The utilities have not been generalized in any way for inclusion in the
Software Collection, so they may contain some HFRD-specifics or
idiosyncrasies (the CLUSTER_SHUTDOWN.COM procedure certainly does, see
below).

All source programs contain brief commentary at the beginning.  See
these for additional information.

Usual disclaimer:  Provided "as-is", no warranty, explicit or implied.

Internet e-mail contact:  daniel@hfrd.dsto.gov.au

Although an endeavour will be made to reply, correct bugs, etc.,
lengthy correspondence will not generally be possible.


CONFGR
------

The main utility is CONFGR, which allows the configuration of the
SYSTARTUP_V5.COM, SYSHUTDWN.COM, DECW$PRIVATE_SERVER_SETUP.COM and
MODPARAMS.DAT files.  Refer to the included Bookreader and/or
PostScript document for further information.  This document has been
"plucked" directly from the HFRD VMS Management Bookreader shelf and
not generalized in any way for inclusion, so the few references to the
HFRD cluster will have to be taken in context.


STARTSYNC
---------

This utility allows the number of satellite systems booting
concurrently to be controlled.  It was the author's first attempt at
using cluster-wide locking, so some crudities (even after only six
months' reflection) will have to be overlooked.
The STARTSYNC.EXE image should be on each system disk:

   (disk:[VMS$COMMON.SYSEXE]STARTSYNC.EXE)

The following should be at the beginning of SYLOGICALS.COM on each
system disk:

   (disk:[VMS$COMMON.SYSMGR]SYLOGICALS.COM)

   $ STARTSYNC = "$SYS$SYSTEM:STARTSYNC"
   $ STARTSYNC /WAIT=60 /LIMIT=10

The "/LIMIT=n" number should be the maximum number of systems the
cluster can support booting concurrently.  The "/WAIT=n" number is the
longest time, in minutes, that the system will be blocked before
continuing on regardless.


MSCPAVL
-------

Blocks further startup until all supplied MSCP server systems are
members of the cluster.  Times out and continues regardless after a
supplied interval ("/WAIT=n").

The systems are divided into two categories: non-satellites (MSCP
servers or voting members) and satellites (non-MSCP servers and
non-voting members).

This program operates by working through a comma-separated list of
groups of "concatenated" system names, checking each group for
availability.  If all systems in a group are available it exits.  If
not all are available it checks the next group in the list in the same
way.  If none of the groups has all systems available it waits one
minute, then tries the list again.

The following example:

   /SATELLITES="DAJAV+SERV2+DBOOT,DAMSC+SERV2+DBOOT"

would be read as "(DAJAV and SERV2 and DBOOT) or (DAMSC and SERV2 and
DBOOT)".

The first group in the list is reported as being the preferred option,
and hence should always consist of all those systems required under a
normal/optimal situation, with any following groups being the minimum
functional combinations of servers.

Alternatively, the process-level logical names MSCPAVL$$SATELLITES and
MSCPAVL$$NONSATELLITES can be defined to provide the information (to
work around excessively long command lines).

Qualifier for controlling non-satellites:

   /NONSATELLITES="DAJAV+DAMSC+SERV2+DBOOT"


CLEMBERS
--------

Displays a list of current members of the cluster.  Information on
expected votes, current votes, etc., is also included.
Members that contribute votes to the cluster have their vote count
appended in parentheses to the displayed system name (e.g.
"DAMSC(1)").

The "/CONTINUOUS" qualifier clears the screen, displays the list, and
updates it every 5 seconds.  Ctrl/C is required to abort this utility.

CLEMBERS is used by the CLUSTER_SHUTDOWN.COM procedure.


CLUSTER_SHUTDOWN.COM
--------------------

Initially HFRD shut all systems in the cluster down at the one time,
using the CLUSTER_SHUTDOWN qualifier.  This was generally a lengthy
exercise, taking more than forty-five minutes, and often fraught with
frustration towards the end: all the CI systems would have long since
disabled user activity, precluding investigation of why some laggard
VAXstation 2000 was still processing and disrupting the whole affair.

A two-phase shutdown has since been implemented.  First, a shutdown of
all satellites.  Second, a shutdown of non-satellites using the
CLUSTER_SHUTDOWN qualifier.  This has resulted in far more satisfactory
performance.  The satellites generally take between fifteen and twenty
minutes to all shut down, the rest less than five.  Because some nodes
are still processing during the first phase, problems can be
investigated and progress monitored.

The cluster shutdown utility menu has three items.

   1. Shutdown satellite systems
   2. Monitor cluster membership
   3. Shutdown remaining systems

For HFRD's cluster, where not all MSCP serving systems are voting
systems (the NI-connected VAXserver 3500s), a satellite is defined as a
system having SYSGEN parameter VOTES set to 0 and MSCP_SERVE_ALL set
to 0.

The cluster shutdown utility uses SYSMAN cluster-wide to create a
detached process on candidate systems to do the shutdown (batch is no
good, for obvious reasons).

After the first phase creation of the detached processes, cluster
membership can be monitored continuously (using a small utility).  When
only the non-satellite systems remain in the cluster the second phase
can be initiated.
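
The satellite test described above can be sketched in C, the language
of the other utilities in this collection.  This is only an
illustration of the rule (VOTES set to 0 and MSCP_SERVE_ALL set to 0),
not code taken from CLUSTER_SHUTDOWN.COM, which is a DCL procedure.
The "struct member" record and both function names are hypothetical,
and the sketch assumes the two parameter values have already been
obtained for each member by some other means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record: one entry per cluster member.  The fields mirror
   the SYSGEN parameters named in the text; the struct itself is an
   assumption made for this sketch. */
struct member {
    const char *name;      /* system name */
    int votes;             /* SYSGEN VOTES */
    int mscp_serve_all;    /* SYSGEN MSCP_SERVE_ALL */
};

/* A satellite, per the HFRD rule: contributes no votes and does not
   MSCP serve its disks. */
bool is_satellite(const struct member *m)
{
    assert(m != NULL);
    return m->votes == 0 && m->mscp_serve_all == 0;
}

/* Count the satellites still present in a membership list.  When this
   reaches zero only non-satellites remain, which is the condition the
   text gives for starting the second phase. */
size_t satellites_remaining(const struct member *list, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (is_satellite(&list[i]))
            count++;
    }
    return count;
}
```

Under this rule a non-voting system that still MSCP serves (VOTES 0,
MSCP_SERVE_ALL non-zero) is not counted as a satellite, so it is left
for the second phase along with the voting members.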

This procedure HAS HFRD-SPECIFICS, in particular the directories:

   SITE$MANAGER:   (the location of cluster-common VMS management
                   images and procedures)
   SITE$WORK:      (a scratch area)

These would need to be modified.


FILES
-----

   AAAREADME.TXT;1              14
   BUILD_CLEMBERS.COM;1          1
   BUILD_CONFGR.COM;1            1
   BUILD_MSCPAVL.COM;1           1
   BUILD_STARTSYNC.COM;1         1
   CLEMBERS.C;1                 18
   CLEMBERS.EXE;1                7
   CLEMBERS.OBJ;1                7
   CLUSTER_SHUTDOWN.COM;1       19
   CONFGR.C;1                   94
   CONFGR.DECW$BOOK;1          218
   CONFGR.EXE;1                 37
   CONFGR.LINE;1               217
   CONFGR.OBJ;1                 46
   CONFGR.PS;1                 385
   CONFGR_CONTENTS.LINE;1        7
   CONFGR_CONTENTS.PS;1         90
   MSCPAVL.C;1                  39
   MSCPAVL.EXE;1                15
   MSCPAVL.OBJ;1                14
   STARTSYNC.C;1                56
   STARTSYNC.EXE;1              22
   STARTSYNC.OBJ;1              23

   Total of 23 files, 1332 blocks.