.PAGE SIZE 62, 60 .RIGHT MARGIN 60 .CENTER ^&DATATRIEVE and RMS\& .BLANK 2 .CENTER Joe H. Gallagher .BLANK .CENTER Research Medical Center .BLANK .CENTER Kansas City, MO .BLANK 2 .CENTER Gary Friedman .BLANK .CENTER Montgomery Engineering .BLANK 2 .CENTER B.#Z.#Lederman .BLANK .CENTER 2572 E.#22nd St. .CENTER Brooklyn, N.Y. 11235-2504 .BLANK 2 .CENTER Transcribed by B.#Z.#Lederman .TITLE DATATRIEVE and RMS .SUBTITLE DT024 .NOTE Abstract .BLANK 2 This is a transcription of a panel presentation on some of the important features of RMS as seen from the perspective of the DATATRIEVE user. It will give some basic definitions, list some of the options available to users, and show how some tools may be used to optimize performance. The usual convention of placing square brackets around material interpreted or supplied by the editor is followed in this paper, as is the use of DTR as an abbreviation for DATATRIEVE. .END NOTE .RIGHT MARGIN 55 .COMMENT B. Z. Lederman .BLANK 2.TEST PAGE 5.CENTER What is RMS? .PARAGRAPH RMS stands for Record Management Services: it is a set of system services which provide a uniform method of accessing data in files. It is built into VMS and comes with the PDP-11 operating systems. It supports several types of file access (sequential, relative, and indexed) and several types of data records (fixed and variable length), arbitrates file sharing and block locking (some of these functions are being moved to other parts of the operating system, particularly within VMS clusters), and controls the transfer of data between the disk (or tape) and your program. In short, it is the way of getting stored data into your program. .BLANK 2.TEST PAGE 5.CENTER Types of files. .PARAGRAPH Sequential files are the simplest: they are the smallest (least amount of storage space) for a given amount of data, they are compatible with programs that don't use RMS, can be stored on magnetic tape, are easier to transmit over communications lines, and generally have the fastest access and least overhead when the data is going to be accessed sequentially. The catch is that most applications don't access data sequentially, and even when they do there are some possible drawbacks to sequential files. For example, new records may be inserted only at the end: if the file is sorted in some order and you have to add a new record, you must then re-sort the file. Records can normally be deleted only from the end of the file as well. As processing must be sequential, if you want to retrieve a record in the middle or at the end of the file, the only way to get to it is to start at the beginning and read every record until you reach the one you want. Sharing the file is limited to read-only for all accessors (this may change in the latest release of VMS). .PARAGRAPH Indexed files can be read sequentially or by one of the keys. Keys can allow duplicate entries or prohibit them; there is automatic sorting, in that data is automatically kept in order by the primary key, so a sequential read is automatically sorted; and files can be shared for read and write. There is higher overhead in accessing the file (than for a sequential file), indexed files can only be stored on disk (the file will automatically be converted by most backup utilities when stored on tape, but you cannot have indexed access directly to a file on tape), and the file is larger, as you are storing both the data and the index information in the file.
Records may be added and deleted at any point in the file, but deleting a record does not recover all of the space used until the file is compressed or reorganized. Indexed files must have one primary key, and the data in that field may not be modified: the record can only be deleted and the entire record replaced. (Secondary keys may be modified, or modification may be prohibited, as you choose when you create the file.)# In nearly all applications, the advantages over sequential files far outweigh the disadvantages, and indexed files will be used in nearly all DTR applications.
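.PARAGRAPH [Editor's illustration: the difference shows up directly in the file definition. The PHONES domain and its fields here are hypothetical; the syntax shown is that of the VAX DATATRIEVE DEFINE FILE command, and DATATRIEVE-11 is similar.]
.BLANK.NO JUSTIFY.NO FILL
DEFINE FILE FOR PHONES
DEFINE FILE FOR PHONES KEY = LAST__NAME (NO DUP),
        KEY = DEPARTMENT (DUP)
.BLANK.FILL.JUSTIFY
[The first definition creates a sequential file (no keys); the second creates an indexed file with a unique primary key and an alternate key which allows duplicate values.]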
.BLANK 2.TEST PAGE 5.CENTER Buckets. .PARAGRAPH A bucket is a logical division of the space on the disk. In order to use a disk, its space must be divided into manageable chunks, and all DEC disks are divided into blocks of 512 bytes. [A few older devices have smaller "blocks", but the software makes them look as if they had 512 byte blocks.]# The data records you are using may be larger or smaller than 512 bytes, so the file is divided into buckets: each bucket holds one or more of your data records, and each bucket is stored on the disk as one or more blocks. One of the options available is to select the bucket size for a file. On the PDP-11, in order to conserve pool space, it should almost always be the smallest bucket size into which your record will fit, and DATATRIEVE-11 will automatically select this size. On the VAX, the option exists to choose a larger bucket size, which may or may not help performance. A larger bucket size means there are more records stored in one place, and retrieving one bucket from the disk gets you several records from the same area of the file. If you are processing the file sequentially, when you read one record you automatically get the next few records at the same time, and so when you are ready to process the next record you already have it. This will save disk accesses, and generally improve the performance of the program. If, however, your accesses are scattered more or less randomly through the file (for example, a telephone directory file where you are not looking up people in alphabetical order, so that retrievals are scattered throughout the whole file in no particular order), then a larger bucket size won't help, and may even hurt a little by forcing extra data to be read from the disk that won't be used. If you have an application where you read a record and will probably need the next few records for related processing, or you have multiple records with the same key and may need to read some or all of them at the same time, then a larger bucket size may help by obtaining more data with each disk access. .BLANK 2.TEST PAGE 5.CENTER File Prolog Type. .PARAGRAPH When you create a file (or display the attributes of an existing file), one of the attributes is the file Prolog, which can be Prolog 1, 2 or 3. Indexed files can be Prolog 2 or 3. Prolog 3 files have a tradeoff between speed and size: the index can be compressed, which will save space on disk but will require more CPU work to compress and expand the data when needed. Also, it was mentioned before that deleting a record from an indexed file leaves a little unusable space in the file: with a Prolog 3 file, this space may be reclaimed with the "CONVERT/RECLAIM" command, which is easier than the full reorganization a Prolog 2 file requires. [See also some discussion at the end of the paper about access speed.]
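.PARAGRAPH [Editor's sketch: on the VAX, the reclaim operation is a single DCL command which works on the file in place. The file name PARTS.DAT is hypothetical.]
.BLANK.NO JUSTIFY.NO FILL
$ ! Reclaim the space left by deleted records in a Prolog 3
$ ! indexed file. A Prolog 2 file cannot be reclaimed this way;
$ ! it must be rebuilt with CONVERT instead.
$ CONVERT/RECLAIM PARTS.DAT
.BLANK.FILL.JUSTIFY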
.BLANK 2.TEST PAGE 5.CENTER Alternate Keys. .PARAGRAPH For each key in an indexed file, some work must be done to store the index information whenever a record is added to the file. The graph in Figure#1 shows the result of a test comparing the time needed to store records in a file with DTR when there was one key and when there were two keys, and shows the increased overhead. However, Figure#2 shows how much work can be saved when retrieving with a key as opposed to retrieving without a key: in the case shown here, retrieving the second file in a CROSS (or a VIEW) without a key requires several orders of magnitude more work than if a key is used (this is the cause of one of the most common complaints heard from DTR users, that a CROSS or a VIEW is slow: the second file was not being retrieved with a key). The result of this is: if you will be retrieving data fairly often by a particular field, it should be keyed, as the time saved in retrieval is much greater than the time taken during storage. If, however, a field won't be used for retrieval, it should not be a key, as the extra work for storing all records won't be recovered. .BLANK 2.TEST PAGE 5.CENTER Creating a file. .PARAGRAPH You can create a file with the "DEFINE#FILE" command in DATATRIEVE. This will give you a file which will always work, and will have all of the keys in the right place with the correct data type. It may not give you a file which is optimum for your particular application and data, however. You can also use one of the RMS utilities (DFN on the PDP-11, CREATE or EDIT/FDL on the VAX): this can be quite a lot of work, as you have to figure out where (in bytes) each key is in the file, and what data type it is. I recommend that you first create the file with DTR, then use the RMS utilities to examine, and if necessary modify, the file to fit your particular needs. .COMMENT (Joe Gallagher). .BLANK 2.TEST PAGE 5.CENTER Loading a file. .PARAGRAPH Loading a file all at once, with one of the RMS utilities, is a different operation from storing single records (as with DTR). The utilities (IFL on the 11s, CONVERT on the VAX) do an optimized file load: they sort the records, pre-allocate disk space (for data and index), and store the information. If you store an individual record, it will be inserted into the middle of the file if possible: if the file was loaded with a fill factor of less than 100_%, there will be empty space in the middle of the file to receive extra records. The resulting file will still be in good order, and performance will be good. If the file is already full, then RMS does what is called a bucket split: a pointer must be put into the file to point to another area of the file, which will then contain the data. Subsequent read operations are slowed a little, as you have to "jump around" in the file to follow the pointers to an alternate area and back again. Generally, you want to avoid this. Similarly, there is the space that the file has allocated on the disk: if there is extra empty space in the file, new records will be added into this space and all of the data will be close together. If the file has run out of space, then the operating system will try to find more on the disk (subject to user quotas, etc.). If space is available next to the existing file, it will be added there, but it is possible that the next free space on the disk may be physically distant. This is known as a "fragmented" file, as it is stored in several separate pieces on the disk. This results in a performance degradation, and should be avoided.
When a record is deleted, there is a small amount of space which cannot be immediately recovered (until the file is reorganized). .PARAGRAPH DATATRIEVE is not the best tool for doing a complete file reorganization: the RMS utilities, which were written specifically for this purpose, will yield better results. If a file is being added to or modified frequently, then it is a good idea to use one of the utilities to re-organize the file at regular intervals.
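.PARAGRAPH [Editor's sketch of such a periodic reorganization on the VAX, assuming a hypothetical indexed file PARTS.DAT; on the PDP-11, IFL fills the same role.]
.BLANK.NO JUSTIFY.NO FILL
$ ! Capture the current design (and usage statistics) in PARTS.FDL.
$ ANALYZE/RMS__FILE/FDL PARTS.DAT
$ ! Rebuild the file from itself: CONVERT re-sorts the records,
$ ! honors the fill factors, and removes bucket splits and
$ ! unreclaimed space. The output is a new version of the file.
$ CONVERT/FDL=PARTS.FDL PARTS.DAT PARTS.DAT
.BLANK.FILL.JUSTIFY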
.COMMENT B. Z. Lederman .BLANK 2.TEST PAGE 5.CENTER Some Tools. .PARAGRAPH There are a number of utilities that give useful information about your files. On the VAX, ANALYZE/RMS/FDL will yield a file full of information about the analyzed file, and you can look at this with ordinary editors or with EDIT/FDL. Much of the information will not be of interest to casual DATATRIEVE users, but a few items are important. .PARAGRAPH ALLOCATION is how many blocks on the disk are reserved for this file. .PARAGRAPH BEST#TRY#CONTIGUOUS means that if the file runs out of disk space and the operating system must get more, it will first try to get contiguous blocks (those immediately next to the existing file), but if it can't, it will get what space it can. If the file was marked as CONTIGUOUS only, then when it runs out of space and there is no more contiguous space available, the attempt to expand the file will fail with an error message. .PARAGRAPH CLUSTER#SIZE is generally set by the system manager for a given disk. .PARAGRAPH EXTENSION is how many blocks of disk space are added to a file by default when it has to be extended. If you know that you will be adding many records to your file, especially if they will be added at one time, then you should specify a larger extension value to get a large piece of disk space each time the file is extended: this will minimize fragmentation, and will make the application run faster, as adding space to a file is a relatively slow process. .PARAGRAPH GLOBAL#BUFFER has to do with sharing files, and will not be set by most users: it needs to be considered on a system-wide basis. .PARAGRAPH BLOCK#SPANNING: if a record spans a disk block, then you have to read both disk blocks to retrieve the record: this may be a little slower than if the record were in one block only. If your records are smaller than one block (512 bytes) in size, and you want every fraction of performance, and are willing to leave a little space at the end of each block empty (wasting a little disk space), then NO#SPAN#BLOCKS may give you a little extra performance, though I suspect that in most cases the improvement will be minimal. .PARAGRAPH Allocations for areas have to do with how much space is reserved for data and keys. Rather than attempt to calculate these, let the RMS utilities set them, or use one of the optimization scripts in EDIT/FDL to set them. .PARAGRAPH Each key has its own section of the description. One of the fields in this section is the name of the key. DTR does not put this information into the file: for documentation and maintenance purposes, I highly recommend that you obtain an FDL description of all of your DTR data files, and that you fill the name of the field into the FDL key name, so that you will know which keys correspond to which fields in your DTR application. .PARAGRAPH You can use this file to select some options that are also available in DTR, such as allowing or preventing duplicates and changes. You can also enable or disable data compression and key compression. As noted before, this may save disk space at the cost of performance. It is difficult to predict how much compression will be done on a file, so you will probably want to load the data, then ANALYZE the file and see what happened. There may be a performance trade-off between compressing to save space and the work needed to compress and decompress the data and keys. [See also the discussion transcribed below.]
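.PARAGRAPH [Editor's illustration of a primary key section from an FDL file, after the key name has been filled in as recommended above. The LAST_NAME field, its position, and its size are hypothetical; the attribute names are those used by the FDL utilities.]
.BLANK.NO JUSTIFY.NO FILL
KEY 0
    NAME                        "LAST__NAME"
    SEG0__POSITION              20
    SEG0__LENGTH                15
    TYPE                        string
    DUPLICATES                  no
    DATA__KEY__COMPRESSION      yes
    DATA__RECORD__COMPRESSION   yes
    INDEX__COMPRESSION          yes
    PROLOG                      3
.BLANK.FILL.JUSTIFY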
.PARAGRAPH FILL can be a very important parameter. If you are working with a file whose data is fairly static (does not change often), you will probably want a fill factor of 100_%, or very close to it, to save disk space and keep the data close together on the disk for fast access. If you have a file which changes often, or to which data will be added, then you want to avoid the bucket splitting problem by using a lower fill factor during the initial load, so there will be empty space scattered throughout the file to receive new records. The RMS utilities honor the fill factors and leave empty space in the file: DTR does not, so it will use the extra space for new records. .PARAGRAPH If the file has data in it, ANALYZE/RMS/FDL will also give you information about how much reclaimable space is in the file which is not being used, and how much compression is being done on the keys and data. This is a good indication of the state the file is in, and can also be used by the EDIT/FDL optimization scripts to design a better file. .PARAGRAPH Built into the FDL editor are some scripts which work quite well in designing better files, especially if you analyze an existing file filled with data. Generally, a "flatter" file is one with better performance: the fewer the index levels, the fewer disk accesses are needed to find a particular record. DIRECTORY/FULL on the VAX (DIR/ATT or DSP on the 11s) gives information about the number of keys, the Prolog type, the number of blocks allocated, etc. .COMMENT Gary Friedman .BLANK 2.TEST PAGE 5.CENTER Example of reorganization. .PARAGRAPH We found that when a number of records had to be added to a large indexed file with many keys, it was better to add the new records to a separate sequential file, convert the large indexed file to a sequential file, append the new records to it, and then use the combined data to re-populate an indexed file (a sketch of this cycle appears after the benchmark figures below). This process can be done in batch mode, to save time and I/O processing. It avoids the problems of adding records to indexed files (such as bucket splitting), especially when the updating is done in "chunks". .PARAGRAPH We did some benchmarks with a file of fixed-length 110-byte records with 5 keys, containing a total of 1778 records. DATATRIEVE was used to write the new data to the end of the sequential file that contains the updates, and again to write the sequential records back into an indexed file. Writing to the sequential file was found to take much less time than storing the records directly into an indexed file. .PARAGRAPH We also tested the use of the CONVERT utility to do the file conversions. This was found to take much less time and far fewer system resources: the improvement is much greater than that which can be obtained from typical "system tuning" efforts, which usually try to adjust system parameters for a 5_% or so improvement. Adjusting the data file parameters can yield a much greater improvement. .PARAGRAPH Tests were performed on a stand-alone 11/780 (no other users), comparing DATATRIEVE with CONVERT to do the batch update.
CPU usage was reduced by about 3 orders of magnitude, I/O operations were reduced from about 35,000 for DTR to 1700 for CONVERT, and elapsed time was reduced from 14 minutes to 2 minutes. The reduction in I/O operations is especially significant for many applications.
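.PARAGRAPH [Editor's sketch of the batch update cycle in DCL. The file names are hypothetical: MASTER.FDL holds the indexed design, SEQ.FDL describes a plain sequential file, and UPDATES.SEQ holds the new records written by DTR.]
.BLANK.NO JUSTIFY.NO FILL
$ ! Unload the indexed file into a sequential work file.
$ CONVERT/FDL=SEQ.FDL MASTER.IDX MASTER.SEQ
$ ! Add the accumulated new records to the end.
$ APPEND UPDATES.SEQ MASTER.SEQ
$ ! Re-populate the indexed file: CONVERT sorts the combined
$ ! records by primary key and loads them in one pass.
$ CONVERT/FDL=MASTER.FDL MASTER.SEQ MASTER.IDX
.BLANK.FILL.JUSTIFY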
.PARAGRAPH These tests were done using the same DATATRIEVE file definition. By using one of the FDL optimize scripts, the file can be further tuned to the particular application. For best results you should load the data into a file first, and then look at the values for compression, etc. If compression shows up as a negative value, turn compression off. The scripts are nearly automatic in operation: simply select the optimize script. (Also look at the ^&VMS Guide to File Applications\& manual: it may take a while to absorb all of the contents, but it's worth the effort.) We did a one-pass optimization of our test file: not a lot of tuning, just some changes to bucket size, key compression, etc. When we ran the load test again, I/O operations were cut in half, elapsed time was cut in half, and CPU time improved a bit. It took no more than 15 minutes to do the optimization with FDL. .BLANK 2.TEST PAGE 5.CENTER General Hints. .PARAGRAPH Big updates will cost you, as you have to work with all of the indices: use CONVERT rather than DTR. Tune your files: a little effort here yields considerable benefits. Update in batch: you can offload operations to times when the system is less heavily used. .BLANK 2.TEST PAGE 5.CENTER Some final examples. .COMMENT (Bart.) .PARAGRAPH I compared the time it takes to populate the YACHTS file, as is done during installation of DTR. These tests were done on a PDP-11, copying the installation verification procedure [reproduced at the end of this paper]. An empty indexed file is created using DEFINE#FILE (I even improved on the procedure by adding the ALLOCATION clause), and DTR then reads the sequential file from the distribution kit and stores the records into the indexed file using a FOR loop. This took 2 minutes and 46 seconds. I then created an identical empty file using DTR again (no optimization work done), and used IFL (the PDP-11 equivalent of CONVERT) to populate the indexed file: it took 17 seconds. The reason for the difference is that DTR is a general query and report language, whereas IFL is written for the sole purpose of populating indexed files. .PARAGRAPH Disk fragmentation: this is not directly an RMS problem, but it is something to know about. A fragmented disk causes performance degradation, and usually implies fragmented files as well. As you create, delete, and extend files, your free space tends to be scattered about the disk in small chunks. All of the normal backup utilities (BACKUP, BRU, DSC) do disk compression as they copy disks. One way to find out if your disk is fragmented is a utility called FRAG, available through the DECUS library or the SIG tapes for VMS and RSX. Because fragmentation happens gradually while the system is used, it can be a rather subtle performance degradation that may not be immediately obvious. Fragmentation is an important reason for making backup copies of your disks fairly often. If you have fixed-media disks (Winchesters) and are backing up your disks to tape, you are not compressing the disks, which are probably getting more and more fragmented. You should either get extra disks (if you can) and copy disk to disk, or else copy from disk to tape and then back from tape to disk to get the benefits of compression.
If you have removable disks, you should copy from disk to disk, put the old disk on the shelf as a backup, and run with the new disk. This not only gives you the advantages of disk compaction, it also tells you very quickly whether the copy procedure worked; and if it didn't, your original disk is safely on the shelf. (There are many people who back up to tape only, or back up to disk and put the new disk on the shelf, and it's only when a disaster occurs that they find out that their backup copies aren't usable.)#
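.PARAGRAPH [Editor's sketch: on VMS, the disk-to-disk copy can be done with an image backup, which lays each file down contiguously on the output volume, so the copy is defragmented as well as backed up. The device names are hypothetical.]
.BLANK.NO JUSTIFY.NO FILL
$ ! Image copy from the running disk to a scratch disk,
$ ! verifying the copy as it is made.
$ BACKUP/IMAGE/VERIFY DUA0: DUA1:
.BLANK.FILL.JUSTIFY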
.BLANK 2.TEST PAGE 5.CENTER Questions and Answers. .BLANK Steve Hicks, Rockwell International. When you create a file with DTR, is it CONTIGUOUS or CONTIGUOUS__BEST__TRY? .BLANK Answer: sometimes. .BLANK Will using FDL to set it CONTIGUOUS gain anything? .BLANK Answer: perhaps, but certainly you should get the FDL file and check. .BLANK Can you change a Prolog-2 to a Prolog-3 with a set command in FDL? .BLANK Answer: you can specify it with the FDL editor. .BLANK I have a file which has fields which are not keys. Will they be compressed? .BLANK Answer: you can specify data compression independently of key compression for this. (Joe:) Note that compression works on adjacent bytes with the same value. If your field is filled with different characters, you don't get compression. If your field is all blanks or zeroes, or some characters followed by trailing blanks, for example, then you do get compression. You need strings longer than 4 or 5 bytes to get an advantage from compression. (Gary:) There is a system parameter to set Prolog-2 or Prolog-3 as the system-wide default, but there is no system-wide parameter to set data compression on or off as a default. (Unidentified comment:) There may be a speedup in looking up keys when compression is on. [Many of the later comments were off the microphone: apparently the user saw a considerable improvement.] (Gary:) We just ran into problems when we tried data key compression, and found the system was deleting records. There were about 60,000 records in the file, and CPU usage greatly increased. (Joe:) It depends on the data: you cannot make a general rule of thumb about files going faster or slower with key compression. The other issue has to do with the nature of the key. If adjacent keys have a lot of commonality (for example, the keys are values such as 12345, 12346, 12347, 12348), then you get a lot of compression; if the keys are very different, you get much less compression. A file which has a lot of compression may well see a performance improvement. [Audience comment not audible: apparently the key in question was a ZIP code.] A ZIP code would be an ideal case where you have a lot of duplicate or similar keys, resulting in considerable compression. Other types of data might not work as well. .PARAGRAPH Warren Alcar, Abilene Christian University. Once you have your FDL file, do you ever need to edit it again (once it's optimized)? .BLANK Answer: once you have the file in a state where you are satisfied with the performance, quit. Sometimes you can keep on tuning; other times FDL may indicate that this is the best you can get. .BLANK I may convert the file, then a week or two later may need to convert it again. Do I use the same FDL file? .BLANK Answer: if the nature of the data has changed significantly, then maybe you want to optimize again. If the nature of the data hasn't changed, and you are just adding records, then you can use the same FDL file. .PARAGRAPH Rick Trane, (?) Engineers, Milwaukee.
You stated that if you delete a record, it is physically still in the bucket, and is not removed until you do a reclaim on the file. Is that correct? .BLANK Answer: some of the space does get reused. What happens is that there is a pointer left in the bucket saying that there used to be a record here. If there is still enough free space in that bucket to hold a new record and a new pointer, then the space may be reused, but there is still an extra pointer from the original record. A lot depends on the size of the record: if the record is, for example, 480 bytes long and you have a bucket size of 1 block, which is 512 bytes, there is only room for one record and pointer. If you delete the record, there is not enough room for a record and two pointers, so the space does not get reused at all until the file is compressed. If, however, the records are only 20 bytes long, there is room for several records and pointers in a single bucket, and some space may be reused. (Joe:) If the file has a single primary key, under version 4 of VMS the space can be deleted and reclaimed immediately; if there are secondary keys, RMS can't reclaim the space, and you have to use CONVERT. .PARAGRAPH (Unidentified) Our problem is with a file that has alternate keys which allow duplicates. Any hints for access time? .BLANK Answer: you must understand that if a key is not unique and there are many duplicates (for example, if sex is a key in a personnel file, then about 51_% of the records have the same key value), you are essentially using half of a sequential file, and there isn't much that can be done for performance. The best performance occurs when keys are unique or nearly unique. [Remainder lost from end of tape.] .BLANK 2.TEST PAGE 8 .CENTER Procedure used to compare loading a file .CENTER with DATATRIEVE and IFL. .BLANK This was placed in an indirect command procedure so no time would be lost typing in the commands. It was run on a PDP-11/70 under RSX with no other users. .BLANK.NO JUSTIFY.NO FILL
>PIP YACHT.DAT;*/DE
>DTR @DY
DEFINE FILE YACHTS KEY=TYPE(NO DUP),
  KEY=MODEL(DUP, NO CHANGE), ALLOCATION=30, SUPERSEDE
>TIM
09:02:04 02-DEC-82
>DTR @T1
READY YACHTS WRITE
READY YACHTS-SEQUENTIAL
FOR YACHTS-SEQUENTIAL STORE YACHTS USING BOAT=BOAT
>TIM
09:05:04 02-DEC-82      [Elapsed time 2:46]
>PIP YACHT.DAT;*/DE
>DTR @DY
DEFINE FILE YACHTS KEY=TYPE(NO DUP),
  KEY=MODEL(DUP, NO CHANGE), ALLOCATION=30, SUPERSEDE
>TIM
09:05:04 02-DEC-82
IFL YACHT.DAT=YACHT.SEQ
PRIMARY KEY: SORT HAS STARTED
SORT MERGE PHASE HAS FINISHED
ALTERNATE KEY(S) SORT MERGE PHASE HAS FINISHED
ALTERNATE KEY(S) 1:
NUMBER OF INPUT RECORDS: 113
NUMBER OF OUTPUT RECORDS: 113
NUMBER OF EXCEPTION RECORDS: 0
>TIM
09:05:21 02-DEC-82      [Elapsed time 0:17]
>@
.BLANK 5.TEST PAGE 27
.CENTER Example of output from FRAG utility
.BLANK
Disk fragmentation statistics for DW1:
.BLANK
28-APR-86 09:18:40
.BLANK
Contiguous free blocks (Holes)
.BLANK
   Hole Range      Frequency   Number of blocks
      1 -     8        23              76
      9 -    16         6              79
     17 -    32         0               0
     33 -    64         0               0
     65 -   125         3             225
    126 -   250         1             212
    251 -   500         1             304
    501 -  1000         0               0
   1001 -  2000         0               0
   2001 -  4000         1            3311
   4001 -  8000         0               0
   8001 - 16000         0               0
  16001 and up          0               0
.BLANK
Largest free block   3311
Total free blocks    4207
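.BLANK 2.FILL.JUSTIFY
[Editor's note: for VMS users, the rough equivalent of the IFL half of this comparison is sketched below. YACHT.SEQ holds the sequential records from the kit and YACHT.FDL describes the indexed design; CONVERT/STATISTICS reports record counts much as IFL does.]
.BLANK.NO JUSTIFY.NO FILL
$ ! Bulk-load the indexed file from the sequential kit file.
$ CONVERT/FDL=YACHT.FDL/STATISTICS YACHT.SEQ YACHT.DAT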