Fast Master Dataset Expansion
1999-08-12
Changes: 1999-08-12
- some typos fixed; added URL for 3rd bear;
Changes: 1999-08-10
- tighter wording on definition;
- explanation of second inflation;
- explanation of code needed to do subsequent inflations;
This note proposes an enhancement to IMAGE which would
allow for very fast (linear time) capacity increases of master datasets.
Background:
There are two ways to expand a master dataset today: the
traditional "rehashing" method, and the MDX method.
The rehashing method requires that each old entry be read, and a
new hash value calculated to determine the entry's new location.
Because of the nature of hashing, the locality of the old entries
may not be preserved, resulting in the new entries being distri-
buted in varying manners, potentially increasing the cost (time)
of the rehashing dramatically. Rehashing has the potential of
reducing chain lengths, and also the potential for increasing chain
lengths. Depending upon data layout, the time required for later
DBPUTs to find a free entry can become extremely large (This is
known as the "third bear of IMAGE", and is described by the
originator of the term, Fred White, at
http://www.adager.com/TechnicalPapers.html).
This happens when a new entry hashes to the start of a large run
on contiguous in-use entries.
The MDX method involves using secondary entries above the
original capacity (now also called the "hashing capacity",
and utilizes a free chain / highwater mark for that area,
just like detail datasets. Using MDX (as opposed to rehashing)
guarantees that chain lengths will continue to grow. However,
the time required to find a free entry is short.
The "hashing algorithm" is:
hash_value := (computed 32-bit value) mod original_capacity + 1;
primary_entry_location := hash_value;
(The computed-32 bit value is the direct I/J value in some cases,
and a more complicated value in other cases ... but we're not
interested in the internals of this value here.)
Proposal:
Introduce a new kind of "expansion", called "inflation".
Every master has a scale factor, which is initially 1.
The "hashing algorithm" would be changed to:
hash_value := (computed 32-bit value) mod original_capacity + 1;
(i.e., no change in the above)
primary_entry_location := hash_value * scale_factor;
(round down)
(Note: deflation, where the scale_factor is less than 1,
would not be allowed. See Discussion #2 below for reasons.)
Discussion #1: doubling the capacity (scale_factor of 2)
Let's see how this works for the easy case, where we are
inflating the dataset by a factor of 2 (e.g., capacity of 1000
being inflated to a capacity of 2000). In this case, we say
that the scale_factor is 2.
Let's assume we have five entries in the original dataset:
entry 99 ... a primary, linked to entry 101
entry 100 ... a primary, linked to entry 102
entry 101 ... a secondary (primary is 99)
entry 102 ... a secondary (primary is 100)
entry 103 ... a primary with no synonyms.
Note that there are no free entries between 99 and 103.
When we expand the capacity via the inflation method (with
a scale_factor of 2), those entries are moved to:
entry 198 ... a primary, linked to entry 202 (old 99)
entry 200 ... a primary, linked to entry 204 (old 100)
entry 202 ... a secondary (primary is 198) (old 101)
entry 204 ... a secondary (primary is 200) (old 102)
entry 206 ... a primary with no synonyms. (old 103)
Note that there are now free entries at ever other slot,
thus allowing room for growth (attacking the "third bear
of IMAGE" problem).
How complicated was this expansion? Trivial!
The expansion can be done with a single sequential read of the
old master, with each entry being written to the newly computed
location to the new master. We also update the pointers in
the synonym chain area of each entry, inflating them with the
same algorithm.
Note that each new entry gets written to a higher entry number
than the prior new entry. I.e., the entry ordering is maintained!
We are not, in any way, "rehashing"!
Unlike a "rehash" expansion, we never have to work hard to figure out
where to put each entry.
Discussion #2: increasing capacity by 50% (scale_factor of 1.5)
What if the scale factor was 1.5? Using the same five
entries from above (the ones at 99..103), we would get:
entry 148 ... a primary, linked to entry 151 (old 99)
entry 150 ... a primary, linked to entry 153 (old 100)
entry 151 ... a secondary (primary is 148) (old 101)
entry 153 ... a secondary (primary is 150) (old 102)
entry 154 ... a primary with no synonyms. (old 103)
Ok...some of those were easy (e.g., 100 -> 150). What
happened to 99, 101, and 103?
99 * 1.5 = 148.5, so we round down to 148.
101 * 1.5 = 151.5, so we round down to 151.
103 * 1.5 = 154.5, so we round down to 154.
Scale factors of less than 1 would be disallowed (because they
would result in two different hash values being deflated into the
same entry value).
Discussion #3: second inflation
What happens if we originally inflated by 50% (scale_factor 1.5)
(which resulted in a capacity of 1500),
and *then* we come back and inflate by 100% (scale_factor of 2),
resulting in a capacity of 3000?
The second inflation would do:
entry 297 ... a primary, linked to entry 303 (old 148) (old 99)
entry 300 ... a primary, linked to entry 306 (old 150) (old 100)
entry 303 ... a secondary (primary is 297) (old 151) (old 101)
entry 306 ... a secondary (primary is 300) (old 153) (old 102)
entry 309 ... a primary with no synonyms. (old 154) (old 103)
The dataset's scale factor is now 3!
It's clear how we got 300 and 306 (150 * 2, and 153 * 2), but
how did we get 297, 303, and 309?
We couldn't simply take the current entry number times the new
scale factor (e.g., 148 * 1.5), because the following *MUST*
hold true:
A primary entry must reside at the expected location.
If the cumulative scale factor is 3, and the hash_value
is 98, then the primary *MUST* be at 297 (99 * 3).
When we did the first expansion, 99 went to 148 ... which
was a *TRUNCATED* value. (99 * 1.5 = 148.5)
So, to calculate the new entry location we always have to
do the following:
orig_entry# := some_function (current_entry#)
new_entry# := orig_entry# * new scale_factor;
(round down)
How do we obtain the original entry number? Relatively
simple:
orig_entry# := current_entry# / prior_scale_value
(perhaps plus 1)
(The "perhaps plus 1" is because we don't know whether the
current_entry# was the result of rounding down or not.)
The actual code to compute it is:
test := trunc (current_entry_num / prior_scale_value);
if trunc (test * prior_scale_value) = current_entry_num then
orig_entry_num := test
else if trunc ((test + 1) * prior_scale_value) = current_entry_num then
orig_entry_num := test + 1
else
complain ... shouldn't happen.
I've tested the above code for a large number of cases and it appears
to work.
Other topics
Other changes needed in IMAGE code:
a) Storing (and reporting) original and current primary capacities.
(E.g., DBUTIL, QUERY, DBINFO)
b) Updating the "capability mask" to include inflation.
(This would prevent older IMAGEs from opening databases with
inflated datasets.)
Third party tools should be made aware of the change...both
database modifying tools and programs like SUPRTOOL and QueryCalc.
The average chain length is not affected whatsoever. (This could
be good, or it could be bad.)
Extra Enhancement
In some cases, it should be possible to inflate some secondary
entries into locations one higher than normally would be inflated.
E.g., with a scale_factor of 2, a secondary entry at 100
could be relocated to 201 instead of 200. Why do this? Because
it leaves 200 (the inflated primary 100) free for a new primary.
It's easy to do, because we know when we're adjusting pointers
for the various entries that any pointer we encounter is a pointer
to a secondary, thus when converting a forward pointer to 100, we
automatically know that 100 is a secondary.
Summary
NOTHING IS REHASHED! As a result, the expansion (inflation) is
extremely quick.
The benefits are basically twofold:
a) reduction of third bear problem, because "ventilation holes"
have been added, making DBPUT faster (until they fill up,
of course)
b) it's a method of expansion that requires a single sequential read of
the old master, and a single sequential write of the new master,
with no other (disk) work involved!
Stan & Gavin