.PAGE SIZE 62, 60 .RIGHT MARGIN 60 .CENTER ^&DATATRIEVE and RMS\& .BLANK 2 .CENTER Joe H. Gallagher .BLANK .CENTER Research Medical Center .BLANK .CENTER Kansas City, MO .BLANK 2 .CENTER Gary Friedman .BLANK .CENTER Montgomery Engineering .BLANK 2 .CENTER B.#Z.#Lederman .BLANK .CENTER 2572 E.#22nd St. .CENTER Brooklyn, N.Y. 11235-2504 .BLANK 2 .CENTER Transcribed by B.#Z.#Lederman .TITLE DATATRIEVE and RMS .SUBTITLE DT024 .NOTE Abstract .BLANK 2 This is a transcription of a panel presentation on some of the important features of RMS as seen from the perspective of the DATATRIEVE user. It will give some basic definitions, list some of the options available to users, and show how some tools may be used to optimize performance. The usual convention of placing square brackets around material interpreted or supplied by the editor is followed in this paper, as is the use of DTR as an abbreviation for DATATRIEVE. .END NOTE .RIGHT MARGIN 55 .COMMENT B. Z. Lederman .BLANK 2.TEST PAGE 5.CENTER What is RMS? .PARAGRAPH RMS stands for Record Management Services: it is a set of system services which provide a uniform method of accessing data in files. It is built into VMS and comes with the PDP-11 operating systems. It supports several types of file access (sequential, relative, and indexed) and several types of data records (fixed and variable length), arbitrates file sharing and block locking (some of these functions are being moved to other parts of the operating system, particularly within VMS clusters), and controls the transfer of data between the disk (or tape) and your program. In short, it is the way of getting stored data into your program. .BLANK 2.TEST PAGE 5.CENTER Types of files. .PARAGRAPH Sequential files are the simplest: they are the smallest (least amount of storage space) for a given amount of data, they are compatible with programs that don't use RMS, can be stored on magnetic tape, are easier to transmit over communications lines, and generally have the fastest access and least overhead when the data is going to be accessed sequentially. The catch is that most applications don't access data sequentially, and even when they do there are some possible drawbacks to sequential files. For example, new records may be inserted only at the end: if the file is sorted in some order and you have to add a new record, you must then re-sort the file. Records can normally be deleted only from the end of the file as well. As processing must be sequential, if you want to retrieve a record in the middle or at the end of the file, the only way to get to it is to start at the beginning and read every record until you reach the one you want. Sharing the file is limited to read-only for all accessors (this may change in the latest release of VMS). .PARAGRAPH Indexed files can be read sequentially or by one of the keys. Keys can allow duplicate entries or prohibit them; there is automatic sorting, in that data is automatically kept in order by the primary key, so a sequential read is automatically sorted; and files can be shared for read and write. There is higher overhead in accessing the file (than for a sequential file), indexed files can only be stored on disk (the file will automatically be converted by most backup utilities when stored on tape, but you cannot have indexed access directly to a file on tape), and the file is larger, as you are storing both the data and the index information in the file.
Records may be added and deleted at any point in the file, but deleting a record does not recover all of the space used until the file is compressed or reorganized. Indexed files must have one primary key, and the data in that field may not be modified: the record can only be deleted and the entire record replaced. (Secondary keys may be modified, or modification may be prohibited, as you choose when you create the file.)# In nearly all applications, the advantages over sequential files far outweigh the disadvantages, and indexed files will be used in nearly all DTR applications.
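.PARAGRAPH [Editor's illustration: the difference shows up directly in the file definition. The PHONES domain and its fields here are hypothetical; the syntax shown is that of the VAX DATATRIEVE DEFINE FILE command, and DATATRIEVE-11 is similar.]
.BLANK.NO JUSTIFY.NO FILL
DEFINE FILE FOR PHONES
DEFINE FILE FOR PHONES KEY = LAST__NAME (NO DUP),
        KEY = DEPARTMENT (DUP)
.BLANK.FILL.JUSTIFY
[The first definition creates a sequential file (no keys); the second creates an indexed file with a unique primary key and an alternate key which allows duplicate values.]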
.BLANK 2.TEST PAGE 5.CENTER Buckets. .PARAGRAPH A bucket is a logical division of the space on the disk. In order to use a disk, its space must be divided into manageable chunks, and all DEC disks are divided into blocks of 512 bytes. [A few older devices have smaller "blocks", but the software makes them look as if they had 512 byte blocks.]# The data records you are using may be larger or smaller than 512 bytes, so the file is divided into buckets: each bucket holds one or more of your data records, and each bucket is stored on the disk as one or more blocks. One of the options available is to select the bucket size for a file. On the PDP-11, in order to conserve pool space, it should almost always be the smallest bucket size into which your record will fit, and DATATRIEVE-11 will automatically select this size. On the VAX, the option exists to choose a larger bucket size, which may or may not help performance. A larger bucket size means there are more records stored in one place, and retrieving one bucket from the disk gets you several records from the same area of the file. If you are processing the file sequentially, when you read one record you automatically get the next few records at the same time, and so when you are ready to process the next record you already have it. This will save disk accesses, and generally improve the performance of the program. If, however, your accesses are scattered more or less randomly through the file (for example, a telephone directory file where you are not looking up people in alphabetical order, so that retrievals are scattered throughout the whole file in no particular order), then a larger bucket size won't help, and may even hurt a little by forcing extra data to be read from the disk that won't be used. If you have an application where you read a record and will probably need the next few records for related processing, or you have multiple records with the same key and may need to read some or all of them at the same time, then a larger bucket size may help by obtaining more data with each disk access. .BLANK 2.TEST PAGE 5.CENTER File Prolog Type. .PARAGRAPH When you create a file (or display the attributes of an existing file), one of the attributes is the file Prolog, which can be Prolog 1, 2 or 3. Indexed files can be Prolog 2 or 3. Prolog 3 files have a tradeoff between speed and size: the index can be compressed, which will save space on disk but will require more CPU work to compress and expand the data when needed. Also, it was mentioned before that deleting a record from an indexed file leaves a little unusable space in the file: with a Prolog 3 file, this space may be reclaimed with the "CONVERT/RECLAIM" command, which is easier than the full reorganization a Prolog 2 file requires. [See also some discussion at the end of the paper about access speed.]
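.PARAGRAPH [Editor's sketch: on the VAX, the reclaim operation is a single DCL command which works on the file in place. The file name PARTS.DAT is hypothetical.]
.BLANK.NO JUSTIFY.NO FILL
$ ! Reclaim the space left by deleted records in a Prolog 3
$ ! indexed file. A Prolog 2 file cannot be reclaimed this way;
$ ! it must be rebuilt with CONVERT instead.
$ CONVERT/RECLAIM PARTS.DAT
.BLANK.FILL.JUSTIFY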
.BLANK 2.TEST PAGE 5.CENTER Alternate Keys. .PARAGRAPH For each key in an indexed file, some work must be done to store the index information whenever a record is added to the file. The graph in Figure#1 shows the result of a test comparing the time needed to store records in a file with DTR when there was one key and when there were two keys, and shows the increased overhead. However, Figure#2 shows how much work can be saved when retrieving with a key as opposed to retrieving without a key: in the case shown here, retrieving the second file in a CROSS (or a VIEW) without a key requires several orders of magnitude more work than if a key is used (this is the cause of one of the most common complaints heard from DTR users, that a CROSS or a VIEW is slow: the second file was not being retrieved with a key). The result of this is: if you will be retrieving data fairly often by a particular field, it should be keyed, as the time saved in retrieval is much greater than the time taken during storage. If, however, a field won't be used for retrieval, it should not be a key, as the extra work for storing all records won't be recovered. .BLANK 2.TEST PAGE 5.CENTER Creating a file. .PARAGRAPH You can create a file with the "DEFINE#FILE" command in DATATRIEVE. This will give you a file which will always work, and will have all of the keys in the right place with the correct data type. It may not give you a file which is optimum for your particular application and data, however. You can also use one of the RMS utilities (DFN on the PDP-11, CREATE or EDIT/FDL on the VAX): this can be quite a lot of work, as you have to figure out where (in bytes) each key is in the file, and what data type it is. I recommend that you first create the file with DTR, then use the RMS utilities to examine, and if necessary modify, the file to fit your particular needs. .COMMENT (Joe Gallagher). .BLANK 2.TEST PAGE 5.CENTER Loading a file. .PARAGRAPH Loading a file all at once, with one of the RMS utilities, is a different operation from storing single records (as with DTR). The utilities (IFL on the 11s, CONVERT on the VAX) do an optimized file load: they sort the records, pre-allocate disk space (for data and index), and store the information. If you store an individual record, it will be inserted into the middle of the file if possible: if the file was loaded with a fill factor of less than 100_%, there will be empty space in the middle of the file to receive extra records. The resulting file will still be in good order, and performance will be good. If the file is already full, then RMS does what is called a bucket split: a pointer must be put into the file to point to another area of the file, which will then contain the data. Subsequent read operations are slowed a little, as you have to "jump around" in the file to follow the pointers to an alternate area and back again. Generally, you want to avoid this. Similarly, there is the space that the file has allocated on the disk: if there is extra empty space in the file, new records will be added into this space and all of the data will be close together. If the file has run out of space, then the operating system will try to find more on the disk (subject to user quotas, etc.). If space is available next to the existing file, it will be added there, but it is possible that the next free space on the disk may be physically distant. This is known as a "fragmented" file, as it is stored in several separate pieces on the disk. This results in a performance degradation, and should be avoided.
When a record is deleted, there is a small amount of space which cannot be immediately recovered (until the file is reorganized). .PARAGRAPH DATATRIEVE is not the best tool for doing a complete file reorganization: the RMS utilities, which were written specifically for this purpose, will yield better results. If a file is being added to or modified frequently, then it is a good idea to use one of the utilities to re-organize the file at regular intervals.
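.PARAGRAPH [Editor's sketch of such a periodic reorganization on the VAX, assuming a hypothetical indexed file PARTS.DAT; on the PDP-11, IFL fills the same role.]
.BLANK.NO JUSTIFY.NO FILL
$ ! Capture the current design (and usage statistics) in PARTS.FDL.
$ ANALYZE/RMS__FILE/FDL PARTS.DAT
$ ! Rebuild the file from itself: CONVERT re-sorts the records,
$ ! honors the fill factors, and removes bucket splits and
$ ! unreclaimed space. The output is a new version of the file.
$ CONVERT/FDL=PARTS.FDL PARTS.DAT PARTS.DAT
.BLANK.FILL.JUSTIFY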
.COMMENT B. Z. Lederman .BLANK 2.TEST PAGE 5.CENTER Some Tools. .PARAGRAPH There are a number of utilities that give useful information about your files. On the VAX, ANALYZE/RMS/FDL will yield a file full of information about the analyzed file, and you can look at this with ordinary editors or with EDIT/FDL. Much of the information will not be of interest to casual DATATRIEVE users, but a few items are important. .PARAGRAPH ALLOCATION is how many blocks on the disk are reserved for this file. .PARAGRAPH BEST#TRY#CONTIGUOUS means that if the file runs out of disk space and the operating system must get more, it will first try to get contiguous blocks (those immediately next to the existing file), but if it can't, it will get what space it can. If the file was marked as CONTIGUOUS only, then when it runs out of space and there is no more contiguous space available, the attempt to expand the file will fail with an error message. .PARAGRAPH CLUSTER#SIZE is generally set by the system manager for a given disk. .PARAGRAPH EXTENSION is how many blocks of disk space are added to a file by default when it has to be extended. If you know that you will be adding many records to your file, especially if they will be added at one time, then you should specify a larger extension value to get a large piece of disk space each time the file is extended: this will minimize fragmentation, and will make the application run faster, as adding space to a file is a relatively slow process. .PARAGRAPH GLOBAL#BUFFER has to do with sharing files, and will not be set by most users: it needs to be considered on a system-wide basis. .PARAGRAPH BLOCK#SPANNING: if a record spans a disk block, then you have to read both disk blocks to retrieve the record: this may be a little slower than if the record were in one block only. If your records are smaller than one block (512 bytes) in size, and you want every fraction of performance, and are willing to leave a little space at the end of each block empty (wasting a little disk space), then NO#SPAN#BLOCKS may give you a little extra performance, though I suspect that in most cases the improvement will be minimal. .PARAGRAPH Allocations for areas have to do with how much space is reserved for data and keys. Rather than attempt to calculate these, let the RMS utilities set them, or use one of the optimization scripts in EDIT/FDL to set them. .PARAGRAPH Each key has its own section of the description. One of the fields in this section is the name of the key. DTR does not put this information into the file: for documentation and maintenance purposes, I highly recommend that you obtain an FDL description of all of your DTR data files, and that you fill the name of the field into the FDL key name, so that you will know which keys correspond to which fields in your DTR application. .PARAGRAPH You can use this file to select some options that are also available in DTR, such as allowing or preventing duplicates and changes. You can also enable or disable data compression and key compression. As noted before, this may save disk space at the cost of performance. It is difficult to predict how much compression will be done on a file, so you will probably want to load the data, then ANALYZE the file and see what happened. There may be a performance trade-off between compressing to save space and the work needed to compress and decompress the data and keys. [See also the discussion transcribed below.]
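.PARAGRAPH [Editor's illustration of a primary key section from an FDL file, after the key name has been filled in as recommended above. The LAST_NAME field, its position, and its size are hypothetical; the attribute names are those used by the FDL utilities.]
.BLANK.NO JUSTIFY.NO FILL
KEY 0
    NAME                        "LAST__NAME"
    SEG0__POSITION              20
    SEG0__LENGTH                15
    TYPE                        string
    DUPLICATES                  no
    DATA__KEY__COMPRESSION      yes
    DATA__RECORD__COMPRESSION   yes
    INDEX__COMPRESSION          yes
    PROLOG                      3
.BLANK.FILL.JUSTIFY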
.PARAGRAPH FILL can be a very important parameter. If you are working with a file whose data is fairly static (does not change often), you will probably want a fill factor of 100_%, or very close to it, to save disk space and keep the data close together on the disk for fast access. If you have a file which changes often, or to which data will be added, then you want to avoid the bucket splitting problem by using a lower fill factor during the initial load, so there will be empty space scattered throughout the file to receive new records. The RMS utilities honor the fill factors and leave empty space in the file: DTR does not, so it will use the extra space for new records. .PARAGRAPH If the file has data in it, ANALYZE/RMS/FDL will also give you information about how much reclaimable space is in the file which is not being used, and how much compression is being done on the keys and data. This is a good indication of the state the file is in, and can also be used by the EDIT/FDL optimization scripts to design a better file. .PARAGRAPH Built into the FDL editor are some scripts which work quite well in designing better files, especially if you analyze an existing file filled with data. Generally, a "flatter" file is one with better performance: the fewer the index levels, the fewer disk accesses are needed to find a particular record. DIRECTORY/FULL on the VAX (DIR/ATT or DSP on the 11s) gives information about the number of keys, the Prolog type, the number of blocks allocated, etc. .COMMENT Gary Friedman .BLANK 2.TEST PAGE 5.CENTER Example of reorganization. .PARAGRAPH We found that when a number of records had to be added to a large indexed file with many keys, it was better to add the new records to a separate sequential file, convert the large indexed file to a sequential file, append the new records to it, and then use the combined data to re-populate an indexed file (a sketch of this cycle appears after the benchmark figures below). This process can be done in batch mode, to save time and I/O processing. It avoids the problems of adding records to indexed files (such as bucket splitting), especially when the updating is done in "chunks". .PARAGRAPH We did some benchmarks with a file of fixed-length 110-byte records with 5 keys, containing a total of 1778 records. DATATRIEVE was used to write the new data to the end of the sequential file that contains the updates, and again to write the sequential records back into an indexed file. Writing to the sequential file was found to take much less time than storing the records directly into an indexed file. .PARAGRAPH We also tested the use of the CONVERT utility to do the file conversions. This was found to take much less time and far fewer system resources: the improvement is much greater than that which can be obtained from typical "system tuning" efforts, which usually try to adjust system parameters for a 5_% or so improvement. Adjusting the data file parameters can yield a much greater improvement. .PARAGRAPH Tests were performed on a stand-alone 11/780 (no other users), comparing DATATRIEVE with CONVERT to do the batch update.
CPU usage was reduced by about 3 orders of magnitude, I/O operations were reduced from about 35,000 for DTR to 1700 for CONVERT, and elapsed time was reduced from 14 minutes to 2 minutes. The reduction in I/O operations is especially significant for many applications.
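.PARAGRAPH [Editor's sketch of the batch update cycle in DCL. The file names are hypothetical: MASTER.FDL holds the indexed design, SEQ.FDL describes a plain sequential file, and UPDATES.SEQ holds the new records written by DTR.]
.BLANK.NO JUSTIFY.NO FILL
$ ! Unload the indexed file into a sequential work file.
$ CONVERT/FDL=SEQ.FDL MASTER.IDX MASTER.SEQ
$ ! Add the accumulated new records to the end.
$ APPEND UPDATES.SEQ MASTER.SEQ
$ ! Re-populate the indexed file: CONVERT sorts the combined
$ ! records by primary key and loads them in one pass.
$ CONVERT/FDL=MASTER.FDL MASTER.SEQ MASTER.IDX
.BLANK.FILL.JUSTIFY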
.PARAGRAPH These tests were done using the same DATATRIEVE file definition. By using one of the FDL optimize scripts, the file can be further tuned to the particular application. For best results you should load the data into a file first, and then look at the values for compression, etc. If compression shows up as a negative value, turn compression off. The scripts are nearly automatic in operation: simply select the optimize script. (Also look at the ^&VMS Guide to File Applications\& manual: it may take a while to absorb all of the contents, but it's worth the effort.) We did a one-pass optimization of our test file: not a lot of tuning, just some changes to bucket size, key compression, etc. When we ran the load test again, I/O operations were cut in half, elapsed time was cut in half, and CPU time improved a bit. It took no more than 15 minutes to do the optimization with FDL. .BLANK 2.TEST PAGE 5.CENTER General Hints. .PARAGRAPH Big updates will cost you, as you have to work with all of the indices: use CONVERT rather than DTR. Tune your files: a little effort here yields considerable benefits. Update in batch: you can offload operations to times when the system is less heavily used. .BLANK 2.TEST PAGE 5.CENTER Some final examples. .COMMENT (Bart.) .PARAGRAPH I compared the time it takes to populate the YACHTS file, as is done during installation of DTR. These tests were done on a PDP-11, copying the installation verification procedure [reproduced at the end of this paper]. An empty indexed file is created using DEFINE#FILE (I even improved on the procedure by adding the ALLOCATION clause), and DTR then reads the sequential file from the distribution kit and stores the records into the indexed file using a FOR loop. This took 2 minutes and 46 seconds. I then created an identical empty file using DTR again (no optimization work done), and used IFL (the PDP-11 equivalent of CONVERT) to populate the indexed file: it took 17 seconds. The reason for the difference is that DTR is a general query and report language, whereas IFL is written for the sole purpose of populating indexed files. .PARAGRAPH Disk fragmentation: this is not directly an RMS problem, but it is something to know about. A fragmented disk causes performance degradation, and usually implies fragmented files as well. As you create, delete, and extend files, your free space tends to be scattered about the disk in small chunks. All of the normal backup utilities (BACKUP, BRU, DSC) do disk compression as they copy disks. One way to find out if your disk is fragmented is a utility called FRAG, available through the DECUS library or the SIG tapes for VMS and RSX. Because fragmentation happens gradually while the system is used, it can be a rather subtle performance degradation that may not be immediately obvious. Fragmentation is an important reason for making backup copies of your disks fairly often. If you have fixed-media disks (Winchesters) and are backing up your disks to tape, you are not compressing the disks, which are probably getting more and more fragmented. You should either get extra disks (if you can) and copy disk to disk, or else copy from disk to tape and then back from tape to disk to get the benefits of compression.
If you have removable disks, you should copy from disk to disk, put the old disk on the shelf as a backup, and run with the new disk. This not only gives you the advantages of disk compaction, it also tells you very quickly whether the copy procedure worked; and if it didn't, your original disk is safely on the shelf. (There are many people who back up to tape only, or back up to disk and put the new disk on the shelf, and it's only when a disaster occurs that they find out that their backup copies aren't usable.)#
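.PARAGRAPH [Editor's sketch: on VMS, the disk-to-disk copy can be done with an image backup, which lays each file down contiguously on the output volume, so the copy is defragmented as well as backed up. The device names are hypothetical.]
.BLANK.NO JUSTIFY.NO FILL
$ ! Image copy from the running disk to a scratch disk,
$ ! verifying the copy as it is made.
$ BACKUP/IMAGE/VERIFY DUA0: DUA1:
.BLANK.FILL.JUSTIFY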
.BLANK 2.TEST PAGE 5.CENTER Questions and Answers. .BLANK Steve Hicks, Rockwell International. When you create a file with DTR, is it CONTIGUOUS or CONTIGUOUS__BEST__TRY? .BLANK Answer: sometimes. .BLANK Will using FDL to set it CONTIGUOUS gain anything? .BLANK Answer: perhaps, but certainly you should get the FDL file and check. .BLANK Can you change a Prolog-2 to a Prolog-3 with a set command in FDL? .BLANK Answer: you can specify it with the FDL editor. .BLANK I have a file which has fields which are not keys. Will they be compressed? .BLANK Answer: you can specify data compression independently of key compression for this. (Joe:) Note that compression works on adjacent bytes with the same value. If your field is filled with different characters, you don't get compression. If your field is all blanks or zeroes, or some characters followed by trailing blanks, for example, then you do get compression. You need strings longer than 4 or 5 bytes to get an advantage from compression. (Gary:) There is a system parameter to set Prolog-2 or Prolog-3 as the system-wide default, but there is no system-wide parameter to set data compression on or off as a default. (Unidentified comment:) There may be a speedup in looking up keys when compression is on. [Many of the later comments were off the microphone: apparently the user saw a considerable improvement.] (Gary:) We just ran into problems when we tried data key compression, and found the system was deleting records. There were about 60,000 records in the file, and CPU usage greatly increased. (Joe:) It depends on the data: you cannot make a general rule of thumb about files going faster or slower with key compression. The other issue has to do with the nature of the key. If adjacent keys have a lot of commonality (for example, the keys are values such as 12345, 12346, 12347, 12348), then you get a lot of compression; if the keys are very different, you get much less compression. A file which has a lot of compression may well see a performance improvement. [Audience comment not audible: apparently the key in question was a ZIP code.] A ZIP code would be an ideal case where you have a lot of duplicate or similar keys, resulting in considerable compression. Other types of data might not work as well. .PARAGRAPH Warren Alcar, Abilene Christian University. Once you have your FDL file, do you ever need to edit it again (once it's optimized)? .BLANK Answer: once you have the file in a state where you are satisfied with the performance, quit. Sometimes you can keep on tuning; other times FDL may indicate that this is the best you can get. .BLANK I may convert the file, then a week or two later may need to convert it again. Do I use the same FDL file? .BLANK Answer: if the nature of the data has changed significantly, then maybe you want to optimize again. If the nature of the data hasn't changed, and you are just adding records, then you can use the same FDL file. .PARAGRAPH Rick Trane, (?) Engineers, Milwaukee.
You stated that if you delete a record, it is physically still in the bucket, and is not removed until you do a reclaim on the file. Is that correct? .BLANK Answer: some of the space does get reused. What happens is that there is a pointer left in the bucket saying that there used to be a record here. If there is still enough free space in that bucket to hold a new record and a new pointer, then the space may be reused, but there is still an extra pointer from the original record. A lot depends on the size of the record: if the record is, for example, 480 bytes long and you have a bucket size of 1 block, which is 512 bytes, there is only room for one record and pointer. If you delete the record, there is not enough room for a record and two pointers, so the space does not get reused at all until the file is compressed. If, however, the records are only 20 bytes long, there is room for several records and pointers in a single bucket, and some space may be reused. (Joe:) If the file has a single primary key, under version 4 of VMS the space can be deleted and reclaimed immediately; if there are secondary keys, RMS can't reclaim the space, and you have to use CONVERT. .PARAGRAPH (Unidentified) Our problem is with a file that has alternate keys which allow duplicates. Any hints for access time? .BLANK Answer: you must understand that if a key is not unique and there are many duplicates (for example, if sex is a key in a personnel file, then about 51_% of the records have the same key value), you are essentially using half of a sequential file, and there isn't much that can be done for performance. The best performance occurs when keys are unique or nearly unique. [Remainder lost from end of tape.] .BLANK 2.TEST PAGE 8 .CENTER Procedure used to compare loading a file .CENTER with DATATRIEVE and IFL. .BLANK This was placed in an indirect command procedure so no time would be lost typing in the commands. It was run on a PDP-11/70 under RSX with no other users. .BLANK.NO JUSTIFY.NO FILL
>PIP YACHT.DAT;*/DE
>DTR @DY
DEFINE FILE YACHTS KEY=TYPE(NO DUP),
  KEY=MODEL(DUP, NO CHANGE), ALLOCATION=30, SUPERSEDE
>TIM
09:02:04 02-DEC-82
>DTR @T1
READY YACHTS WRITE
READY YACHTS-SEQUENTIAL
FOR YACHTS-SEQUENTIAL STORE YACHTS USING BOAT=BOAT
>TIM
09:05:04 02-DEC-82      [Elapsed time 2:46]
>PIP YACHT.DAT;*/DE
>DTR @DY
DEFINE FILE YACHTS KEY=TYPE(NO DUP),
  KEY=MODEL(DUP, NO CHANGE), ALLOCATION=30, SUPERSEDE
>TIM
09:05:04 02-DEC-82
IFL YACHT.DAT=YACHT.SEQ
PRIMARY KEY: SORT HAS STARTED
SORT MERGE PHASE HAS FINISHED
ALTERNATE KEY(S) SORT MERGE PHASE HAS FINISHED
ALTERNATE KEY(S) 1:
NUMBER OF INPUT RECORDS: 113
NUMBER OF OUTPUT RECORDS: 113
NUMBER OF EXCEPTION RECORDS: 0
>TIM
09:05:21 02-DEC-82      [Elapsed time 0:17]
>@
.BLANK 5.TEST PAGE 27
.CENTER Example of output from FRAG utility
.BLANK
Disk fragmentation statistics for DW1:
.BLANK
28-APR-86 09:18:40
.BLANK
Contiguous free blocks (Holes)
.BLANK
   Hole Range      Frequency   Number of blocks
      1 -     8        23              76
      9 -    16         6              79
     17 -    32         0               0
     33 -    64         0               0
     65 -   125         3             225
    126 -   250         1             212
    251 -   500         1             304
    501 -  1000         0               0
   1001 -  2000         0               0
   2001 -  4000         1            3311
   4001 -  8000         0               0
   8001 - 16000         0               0
  16001 and up          0               0
.BLANK
Largest free block   3311
Total free blocks    4207
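.BLANK 2.FILL.JUSTIFY
[Editor's note: for VMS users, the rough equivalent of the IFL half of this comparison is sketched below. YACHT.SEQ holds the sequential records from the kit and YACHT.FDL describes the indexed design; CONVERT/STATISTICS reports record counts much as IFL does.]
.BLANK.NO JUSTIFY.NO FILL
$ ! Bulk-load the indexed file from the sequential kit file.
$ CONVERT/FDL=YACHT.FDL/STATISTICS YACHT.SEQ YACHT.DAT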