<<< DISK$DATA:[NOTES$LIBRARY]VAX_VMS.NOTE;1 >>>
                               -< SIG VAX/VMS >-
================================================================================
Note 1354.0               SPIRIT/PROCESS INACTIFS                     29 replies
DECUSF::ROUSSEL_P                                    16 lines  27-MAY-1991 11:24
--------------------------------------------------------------------------------
Hello,

We have a LAVC made up of a VAX 8350 (boot node) and two MicroVAXes
(satellites). We installed SPIRIT version 2.3 to kill idle processes.
This software is "offered" by Digital but not supported...

We have a small problem: SPIRIT works on the boot node and on the
MicroVAX 3400, but not on the MicroVAX II. We have tried everything,
including installing it in the common or the node-specific directories...
same result!

Can anyone give us some tips? Is the WATCHDOG software (DECUS ref. V00146)
equivalent? Are there others?

Thanks
================================================================================
Note 1354.1               SPIRIT/PROCESS INACTIFS                       1 of 29
DECUSF::FAUCONNET_A "Alain, SIG Graph & messagerie"  12 lines  27-MAY-1991 12:05
                           -< Watchdog or Watcher >-
--------------------------------------------------------------------------------
There is a Watchdog here that I wrote (in Fortran) and that B. Perrot has
considerably improved. You will find it in VAXF90B:WDOG022.*

There is also Watcher, probably much more capable but "heavier", in
VMS:WATCHER*.*

PS: Spirit is full of bugs. In particular, it does not log out processes
running FMS or SMG applications (such as Notes). As soon as I have a little
time I will replace it with something else here on DECUSF.

PPS: does your office door lock properly? I nearly got lynched the day I
tried to put this into service at my site!
================================================================================
Note 1354.2               SPIRIT/PROCESS INACTIFS                       2 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"     6 lines  27-MAY-1991 16:29
                 -< YAWN (Yet Another Watchdog Nobody Wants) >-
--------------------------------------------------------------------------------
I have a cluster-wide watchdog (if you have VMS >= 5.2), written in C.
You only have to run ONE copy in the whole cluster, and you can say things
like "exclude users A and B" or "exclude nodes A and B" or "exclude images
OA$MAIN and XYZ". It is fairly low-overhead (in my opinion). I haven't
posted it up to now because there are so many floating around; let me know
if this might meet your requirements.
================================================================================
Note 1354.3               SPIRIT/PROCESS INACTIFS                       3 of 29
DECUSF::WERZ_P "Pascal WERZ, MagneTech, Orsay."       4 lines  27-MAY-1991 17:16
--------------------------------------------------------------------------------
Why not post it, so that we can compare the respective merits of WDOG and
yours? I am interested.

pw
================================================================================
Note 1354.4               SPIRIT/PROCESS INACTIFS                       4 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"    11 lines  27-MAY-1991 17:39
                                  -< Enjoy >-
--------------------------------------------------------------------------------
OK, posted in PUB:[VMS]DOG.BCK and .BCK_Z.

As usual with this type of program, I accept NO responsibility for anything
valuable it kills due to bugs, etc. However, it has been fairly extensively
tested in a 6-node cluster under VMS V5.3-1 and seems to work OK; there are
no known bugs at this time, as they say. That should be enough of a
challenge for you...

PLEASE give me feedback (even/especially negative) about this program.
(BUT: I KNOW it's not very well commented, documented, etc...)
================================================================================
Note 1354.5               SPIRIT/PROCESS INACTIFS                       5 of 29
DECUSF::DIAKONOFF_N "Responsable programmathèque"     3 lines  27-MAY-1991 19:13
                            -< WatchDog from VMS: >-
--------------------------------------------------------------------------------
We use the WatchDog found in VMS: in a LAVC configuration. Everything works
correctly, with no problems, and each machine is configured differently.
================================================================================
Note 1354.6               SPIRIT/PROCESS INACTIFS                       6 of 29
DECUSF::BERGER_JP "J-Ph Berger Aerospatiale Tls"      2 lines  28-MAY-1991 09:50
                          -< Watcher if DECwindows >-
--------------------------------------------------------------------------------
If you have DECwindows, use Watcher instead. I believe it is also under
VMS:; if not, I can put it there. It works very well.
================================================================================
Note 1354.7               SPIRIT/PROCESS INACTIFS                       7 of 29
DECUSF::ROUSSEL_P                                     2 lines  28-MAY-1991 10:05
                                  -< spirit >-
--------------------------------------------------------------------------------
Thanks, I will try all of this soon and will probably drop SPIRIT...
================================================================================
Note 1354.8               SPIRIT/PROCESS INACTIFS                       8 of 29
DECUSF::LEGUEVA_A "Alex Leguevaques FOCEPY Auxerre"   6 lines  28-MAY-1991 11:59
            -< ...Vax without a NANNY...Child without his mother >-
--------------------------------------------------------------------------------
I use NANNY here (from the program library) and am happy with it.
Admittedly the environment is very simple (a standalone MicroVAX II), so I
don't know whether it would suit you.

Besides removing idle processes, NANNY monitors free disk space, can send
weekly or daily messages, etc.
================================================================================
Note 1354.9               SPIRIT/PROCESS INACTIFS                       9 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"     7 lines  28-MAY-1991 13:20
                        -< Linguistic footnote to .8 >-
--------------------------------------------------------------------------------
In English, a NANNY means a "nourrice", not to be confused with one's "Nan",
which is a word in some parts of the UK for grandmother.

However, the sense here is probably that of the somewhat pejorative verb
"to nanny", meaning to fuss over, spoil, etc, a child. E.g., "Don't nanny
me!" (exclamation of a fifteen-year-old girl being warned not to stay out
too late).
================================================================================
Note 1354.10              SPIRIT/PROCESS INACTIFS                      10 of 29
DECUSF::WERZ_P "Pascal WERZ, MagneTech, Orsay."       6 lines  28-MAY-1991 14:49
                 -< DOG is barking, the caravan passes... >-
--------------------------------------------------------------------------------
About Brown's DOG: it seems rather vicious to me. Processes are killed
without warning, and I thought I glimpsed a problem with subprocesses. Does
it handle them? (I.e. does it spare processes that have subprocesses?
Otherwise, so much for DEBUG sessions...)

pw
================================================================================
Note 1354.11              SPIRIT/PROCESS INACTIFS                      11 of 29
DECUSF::ROUSSEL_P                                     7 lines  28-MAY-1991 16:32
                               -< instructions >-
--------------------------------------------------------------------------------
I fetched Watcher and Brown's DOG with psicopy. How do I start them?
In particular, how do I recover the saveset from the .bck_z file?
Watcher's link.com procedure refers to extensions that are not present in
the saveset; what do the .b32 extensions correspond to?

Signed: a beginner finding his way...
================================================================================
Note 1354.12              SPIRIT/PROCESS INACTIFS                      12 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"    41 lines  28-MAY-1991 17:27
                     -< This is how I trained my doggie >-
--------------------------------------------------------------------------------
To .10: either you have misread the code or I uploaded a corrupted
version... DOG has (approximately) the following characteristics:

- it wakes every DOG_WAKE_INTERVAL (default 0:5:0 == 5 minutes).
- if a process has been idle for DOG_WARN_COUNT intervals (default 2) then
  the user of the process will be warned by SYS$BRKTHRUW that a process on
  a given terminal is about to be killed. This may cause some confusion if
  the user has logged in twice and forgotten about the idle process. I do
  not apologise for this.
- if a process has been idle for DOG_KILL_COUNT intervals (default 3) then
  the process is killed.
- if DOG_KILL_COUNT < DOG_WARN_COUNT then there will be no warning. There
  is no check to prevent this. This is a feature for those who want it.
  If the counts are equal... I can't remember what happens.
- processes with subprocesses are not killed. If you spawn a subprocess
  and go home it will therefore take 2 * DOG_KILL_COUNT intervals before
  you are finally logged out.
- only interactive mode processes are affected.
- the definition of "idle" is:
  - normally, no CPU, buffered I/O, or direct I/O at all: ANY usage counts
    as activity.
  - during the interval after a warning, an amount of CPU or buffered I/Os
    or direct I/Os less than (or equal to, can't remember) DOG_CPU_DELTA,
    DOG_BUF_IO_DELTA, or DOG_DIR_IO_DELTA, respectively. Defaults are 6
    (centiseconds) CPU, 10 buffered I/Os, 2 direct I/Os. This is to
    compensate for the resources used by software such as TPU to update
    the screen after a breakthrough write.
- users, nodes, and/or images may be immune, as described in .2.
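
[Illustration: the rules listed above can be sketched in C roughly as
follows. This is not DOG's source code. All type and function names are
hypothetical; only the DOG_* parameters and their defaults come from the
list. The "<=" comparison, the kill-before-warn ordering when the counts
are equal, and resetting the counter while a process has subprocesses are
assumptions made to fit the described behaviour.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sketch only: every type and function name is hypothetical.
   Only the DOG_* parameters and their default values come from the note. */

enum dog_action { DOG_NONE, DOG_WARN, DOG_KILL };

struct dog_config {
    int  warn_count;    /* DOG_WARN_COUNT   (default 2 intervals)      */
    int  kill_count;    /* DOG_KILL_COUNT   (default 3 intervals)      */
    long cpu_delta;     /* DOG_CPU_DELTA    (default 6 centiseconds)   */
    long buf_io_delta;  /* DOG_BUF_IO_DELTA (default 10 buffered I/Os) */
    long dir_io_delta;  /* DOG_DIR_IO_DELTA (default 2 direct I/Os)    */
};

/* Resources consumed by one process since the previous wake-up. */
struct dog_usage { long cpu; long buf_io; long dir_io; };

/* Per-process state kept between wake-ups. */
struct dog_proc {
    bool interactive;       /* only interactive processes are affected */
    int  subprocess_count;  /* processes with subprocesses are spared  */
    int  idle_intervals;    /* consecutive idle wake-ups seen so far   */
    bool warned;            /* a warning was sent and not yet cleared  */
};

struct dog_config dog_default_config(void)
{
    struct dog_config cfg = { 2, 3, 6, 10, 2 };
    return cfg;
}

static bool dog_is_idle(const struct dog_config *cfg,
                        const struct dog_proc *p,
                        const struct dog_usage *used)
{
    if (p->warned) {
        /* After a warning, tolerate the cost of redrawing the screen.
           "<=" is an assumption: the note hesitates between < and <=. */
        return used->cpu <= cfg->cpu_delta
            && used->buf_io <= cfg->buf_io_delta
            && used->dir_io <= cfg->dir_io_delta;
    }
    /* Normally, any usage at all counts as activity. */
    return used->cpu == 0 && used->buf_io == 0 && used->dir_io == 0;
}

/* Called once per process at every DOG_WAKE_INTERVAL. */
enum dog_action dog_tick(const struct dog_config *cfg,
                         struct dog_proc *p,
                         const struct dog_usage *used)
{
    assert(cfg != NULL && p != NULL && used != NULL);

    /* Assumption: an exempt or active process starts counting from 0. */
    if (!p->interactive || p->subprocess_count > 0
        || !dog_is_idle(cfg, p, used)) {
        p->idle_intervals = 0;
        p->warned = false;
        return DOG_NONE;
    }

    p->idle_intervals++;
    if (p->idle_intervals >= cfg->kill_count)
        return DOG_KILL;    /* tested first: kill < warn gives no warning */
    if (p->idle_intervals == cfg->warn_count) {
        p->warned = true;
        return DOG_WARN;
    }
    return DOG_NONE;
}
```

[In this sketch, with the default settings, a session that does nothing
is warned on the second wake-up and killed on the third, even if the
warning itself costs a few centiseconds of screen redraw.]
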

If it doesn't behave like this, please let me know (I'm sure you will if it
kills enough innocent victims).

To .11: I don't have any .B32 files. This is the extension of a BLISS
source file (BLISS == the language much of VMS is written in). You run DOG
by unpacking the saveset (see many, many other notes for how to decompress
the .BCK_Z version; otherwise, just take the .BCK version, which is 100% the
same) and executing DOG.COM (I hope I put that in there...). Don't forget
to set up your own preferences in DOG.COM first.
================================================================================
Note 1354.13              SPIRIT/PROCESS INACTIFS                      13 of 29
DECUSF::WERZ_P "Pascal WERZ, MagneTech, Orsay."       7 lines  29-MAY-1991 08:53
                       -< It's a pit bull, isn't it? >-
--------------------------------------------------------------------------------
> To .10: either you have misread the code or I uploaded a corrupted
> version... DOG has (approximately) the following characteristics:

I did not change the default values, and I nearly got lynched...
I am going to change those values...

pw
================================================================================
Note 1354.14              SPIRIT/PROCESS INACTIFS                      14 of 29
DECUSF::BERGER_JP "J-Ph Berger Aerospatiale Tls"      3 lines  31-MAY-1991 11:16
                                  -< Yes? >-
--------------------------------------------------------------------------------
> .11

What is the problem with link.com? .B32 is BLISS-32: Watcher is written in
BLISS and is also available in MACRO, generated by the B32 compiler.
================================================================================
Note 1354.15              SPIRIT/PROCESS INACTIFS                      15 of 29
DECUSF::ROUSSEL_P                                      1 line  31-MAY-1991 13:47
                              -< B32 compiler >-
--------------------------------------------------------------------------------
Desperately seeking a B32 compiler...
================================================================================
Note 1354.16              SPIRIT/PROCESS INACTIFS                      16 of 29
DECUSF::BERGER_JP "J-Ph Berger Aerospatiale Tls"       1 line  31-MAY-1991 14:19
                           -< what about MACRO? >-
--------------------------------------------------------------------------------
For Watcher you don't need BLISS; MACRO is enough...
================================================================================
Note 1354.17              SPIRIT/PROCESS INACTIFS                      17 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"     9 lines   5-JUN-1991 10:22
                              -< DOG BYTLM note >-
--------------------------------------------------------------------------------
Small note for anyone using DOG. (Hi guys, let's hear from you, even if
just to tell me why you stopped using it...)

The procedure DOG.COM starts DOG with BUFFER_LIMIT=20480. Under some
circumstances DOG appears to freeze or crash; the crash is caused by
insufficient BYTLM and I suppose the freeze might be as well. So try
doubling this BUFFER_LIMIT value. People with PQL_MBYTLM set to a high
value may not need to do this.
================================================================================
Note 1354.18              SPIRIT/PROCESS INACTIFS                      18 of 29
DECUSF::WERZ_P "Pascal WERZ, MagneTech, Orsay."       8 lines   5-JUN-1991 12:29
                           -< I killed my DOG... >-
--------------------------------------------------------------------------------
I removed DOG and put WATCHDOG back (which my users are used to, against
their will...), mainly because the 'warning: user xxx inactive' messages
arrive on every terminal logged in under that username. Here we have quite
a few workstations (all with several DECterms). When a user is working and
receives the warning, he takes it badly... and doesn't even try to find out
which DECterm is responsible: he comes to see me and threatens me with the
worst tortures...

pw
================================================================================
Note 1354.19              SPIRIT/PROCESS INACTIFS                      19 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"     8 lines   5-JUN-1991 16:12
                        -< Limitation of $BRKTHRUW >-
--------------------------------------------------------------------------------
I would love to be able to use BRKTHRUW to write to a single terminal on
another node. If DOG were a single-node program then this would be easy.
However, I am not going to produce a single-node version just to solve this
problem, which I agree is likely to be annoying to a certain type of user
(but possibly useful to others).

Perhaps an alternative would be to send an operator message. I think I
might be able to achieve the same effect with a REPLY. Will look...
================================================================================
Note 1354.20              SPIRIT/PROCESS INACTIFS                      20 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"     6 lines  22-OCT-1991 16:03
                          -< Help wanted quickly >-
--------------------------------------------------------------------------------
Quick! DOG has just frozen on our cluster (running on an 8350; there is
another 8350, a 6310, and some uVAXen) after 45 days.

Can anyone suggest what we should look for (with SDA) to see what is causing
this freeze? The program is still running.
================================================================================
Note 1354.21              SPIRIT/PROCESS INACTIFS                      21 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"      1 line  22-OCT-1991 17:08
                                   -< PS >-
--------------------------------------------------------------------------------
PS to above: current PC of process corresponds to EXE$SYNCH + 0000000C.
================================================================================
Note 1354.22              SPIRIT/PROCESS INACTIFS                      22 of 29
DECUSF::FOUCHET_F "François FOUCHET - CMT"            7 lines  22-OCT-1991 19:22
                                -< An idea >-
--------------------------------------------------------------------------------
That is the wait address for "W"-type system services (GETJPIW, for
example). In my opinion you should look at the event flag wait mask. If you
land on event flag zero, check in the code that you are not, for example,
issuing a "W"-type system service in an AST without specifying an event
flag number (in that case event flag zero is used by default). Several
services may be trying to share that event flag, which can cause problems.
================================================================================
Note 1354.23              SPIRIT/PROCESS INACTIFS                      23 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"    15 lines  23-OCT-1991 09:36
                               -< Precisions >-
--------------------------------------------------------------------------------
It is waiting for EVF 0, which I use for almost all my system service calls.

The program has NO ASTs and (as far as I know) does no asynchronous
processing. Maybe I have a call which should end in W and doesn't; I will
check the source again.

Normally the program hibernates for 5 minutes between calls; it calls the C
routine sleep(), which puts it into HIB. The current process is in LEF, with
EF wait mask FFFFFFFE (I guess you invert this to see what it's waiting
for).

At 5 minutes per wait, the program has run its main loop about
45 * 24 * (60 / 5) = about 13000 times before this hangup.

Is there any relatively painless way to trace back the stack to my program
in SDA?
================================================================================
Note 1354.24              SPIRIT/PROCESS INACTIFS                      24 of 29
DECUSF::FOUCHET_F "François FOUCHET - CMT"            8 lines  23-OCT-1991 21:18
                         -< Walking in the fog ... >-
--------------------------------------------------------------------------------
If you don't use ASTs, I don't really see...

You wouldn't by any chance be using GETJPI without the W, with a
cluster-wide information request?

If so, do you remember to look at the contents of the IOSB returned by
GETJPI?

Have you tried looking at the process quotas under SDA (in case the poor
thing is running short)?
================================================================================
Note 1354.25              SPIRIT/PROCESS INACTIFS                      25 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"    32 lines  24-OCT-1991 10:00
                             -< SDA stuff here >-
--------------------------------------------------------------------------------
I always call $getjpiW, and I always say

        status = sys$getjpiw(etc, &iosb, etc);
        if (status == SS$_NORMAL)
        {
            status = iosb.status;
        }

Here is the SH PROC from SDA:

VAX/VMS V5.4-2 -- System Dump Analysis                   24-OCT-1991 09:57:55.60
Process index: 002F   Name: WatchDOG   Extended PID: 2060142F
Process status: 00140001   RES,PHDRES,LOGIN

PCB address               80544F20   JIB address                80AF0DE0
PHD address               815CD600   Swapfile disk address      00000000
Master internal PID       0028002F   Subprocess count                  0
Internal PID              0028002F   Creator internal PID       00000000
Extended PID              2060142F   Creator extended PID       00000000
State                          LEF   Termination mailbox            0000
Current priority                 8   AST's enabled                  KESU
Base priority                    3   AST's active                   NONE
UIC                 [00001,000004]   AST's remaining                  24
Mutex count                      0   Buffered I/O count/limit      18/18
Waiting EF cluster               0   Direct I/O count/limit        18/18
Starting wait time        1C001C1C   BUFIO byte count/limit  40320/40800
Event flag wait mask      FFFFFFFE   # open files allowed left        14
Local EF cluster 0        E0000000   Timer entries allowed left        8
Local EF cluster 1        80000000   Active page table count           0
Global cluster 2 pointer  00000000   Process WS page count           228
Global cluster 3 pointer  00000000   Global WS page count             60
================================================================================
Note 1354.26              SPIRIT/PROCESS INACTIFS                      26 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"     6 lines  28-OCT-1991 11:22
                        -< Too late, the dog died >-
--------------------------------------------------------------------------------
Perhaps something is going wrong in cluster-wide system services. While
the old DOG has been sitting there (waiting for helpful suggestions from
this messagerie), we were running a new copy, which has been disconnecting
people after bizarre intervals (2 minutes to 2 hours).

We have now killed the old DOG (humanely!) and things seem to be back to
normal.
================================================================================
Note 1354.27              SPIRIT/PROCESS INACTIFS                      27 of 29
DECUSF::FOUCHET_F "François FOUCHET - CMT"            6 lines   1-NOV-1991 00:13
                            -< Belated idea ... >-
--------------------------------------------------------------------------------
Sorry for the delay, I didn't have time to log in this week...

An idea comes to mind looking at the SDA output. There seems to be an I/O
still outstanding (BUFIO byte count/limit), but it does not show up in the
BIO count/limit. Could it come from a broadcast to a hung terminal (for
example, a $BRKTHRU without a timeout)?
================================================================================
Note 1354.28              SPIRIT/PROCESS INACTIFS                      28 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"     2 lines   4-NOV-1991 14:53
                          -< Thanks for the idea >-
--------------------------------------------------------------------------------
You may have something... I don't currently use a timeout on $BRKTHRUW.
Will try it...
================================================================================
Note 1354.29              SPIRIT/PROCESS INACTIFS                      29 of 29
DECUSF::BROWN_N "Nick BROWN, Conseil de l'Europe"     2 lines   4-NOV-1991 18:00
               -< Fixed, let's see if that was the problem >-
--------------------------------------------------------------------------------
Well, I added a 10 second timeout to my $BRKTHRUW call.
The new version is in PUB:[VMS]DOG.BCK* for anyone using DOG.