The file name databases used by locate contain
lists of files that were in particular directory trees when the
databases were last updated. The file name of the default
database is determined when locate and updatedb
are configured and installed. The frequency with which the
databases are updated and the directories for which they contain
entries depend on how often updatedb is run, and
with which arguments.
There can be multiple file name databases. Users can select
which databases locate searches using an environment
variable or a command line option. The system administrator can
choose the file name of the default database, the frequency with
which the databases are updated, and the directories for which
they contain entries. File name databases are updated by running
the updatedb program, typically nightly.
In networked environments, it often makes sense to build a
database at the root of each filesystem, containing the entries
for that filesystem. updatedb is then run for each
filesystem on the fileserver where that filesystem is on a local
disk, to prevent thrashing the network.
Here are the options to updatedb that select which directories each
database contains entries for, and where the database is written:
--localpaths='path...'
Non-network directories to put in the database. Default
is /.
--netpaths='path...'
Network (NFS, AFS, RFS, etc.) directories to put in the
database. Default is none.
--prunepaths='path...'
Directories to exclude from the database that would
otherwise be included. Default is /tmp /usr/tmp /var/tmp /afs.
--output=dbfile
The database file to build. Default is system-dependent, but
typically /usr/local/var/locatedb.
--netuser=user
The user to search network directories as, using su.
Default is daemon.
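As a rough illustration of the one-database-per-filesystem arrangement, the sketch below builds (but does not run) an updatedb command line for each filesystem root. The mount points /export/home and /export/src, the database file name locatedb, and the helper updatedb_command are all hypothetical; only the option names come from the list above.

```python
import shlex

def updatedb_command(fs_root, db_name="locatedb", prune=()):
    """Build an updatedb argument list for one local filesystem.

    The database is placed at the root of the filesystem it describes.
    db_name is a made-up file name chosen for this example.
    """
    args = [
        "updatedb",
        "--localpaths=" + fs_root,
        "--output=" + fs_root.rstrip("/") + "/" + db_name,
    ]
    if prune:
        # Several directories are passed as one space-separated value.
        args.append("--prunepaths=" + " ".join(prune))
    return args

# Hypothetical filesystem roots on a file server.
for root in ["/export/home", "/export/src"]:
    print(shlex.join(updatedb_command(root, prune=[root + "/tmp"])))
```

Each printed line would be run on the file server that holds that filesystem on a local disk, for example from a nightly job.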
The file name databases contain lists of files that were in
particular directory trees when the databases were last updated.
The file name database format changed starting with GNU locate
version 4.0 to allow machines with different byte orderings to
share the databases. The new GNU locate can read
both the old and new database formats. However, old versions of locate
and find produce incorrect results if given a
new-format database.
updatedb runs a program called frcode
to front-compress the list of file names, which
reduces the database size by a factor of 4 to 5.
Front-compression (also known as incremental encoding) works as
follows.
The database entries are a sorted list (case-insensitively,
for users' convenience). Since the list is sorted, each entry is
likely to share a prefix (initial string) with the previous
entry. Each database entry begins with an offset-differential
count byte: the length of the prefix shared with the preceding
entry, minus the length of the prefix that the preceding entry
shared with its own predecessor. (The counts can be
negative.) Following the count is a null-terminated ASCII
remainder, the part of the name that follows the shared prefix.
If the offset-differential count is larger than can be stored
in a byte (+/-127), the byte has the value 0x80 and the count
follows in a 2-byte word, with the high byte first (network byte
order).
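The count encoding just described can be sketched as follows. This is an illustration rather than the frcode source: the names encode_count and decode_count are made up here, and treating the 2-byte word as a signed two's-complement value is an assumption based on the statement that counts can be negative.

```python
import struct

ESCAPE = 0x80  # marks that a 2-byte count follows

def encode_count(diff):
    """Encode one offset-differential count."""
    if -127 <= diff <= 127:
        return struct.pack("b", diff)              # one signed byte
    # Assumption: the long form is a signed 16-bit word, high byte first.
    return bytes([ESCAPE]) + struct.pack(">h", diff)

def decode_count(data, pos=0):
    """Return (count, position of the byte after the count)."""
    if data[pos] == ESCAPE:
        (diff,) = struct.unpack_from(">h", data, pos + 1)
        return diff, pos + 3
    (diff,) = struct.unpack_from("b", data, pos)
    return diff, pos + 1

print(encode_count(6), encode_count(-9), encode_count(200))
```

Because 0x80 read as a signed byte would be -128, reserving it as the escape value leaves exactly the range -127 through 127 for the one-byte form.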
Every database begins with a dummy entry for a file called LOCATE02,
which locate checks for to ensure that the database
file has the correct format; it ignores the entry in doing the
search.
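A reader for this format might look like the sketch below. It is not the locate source: read_database is a name invented for this example, the dummy entry is checked as a fixed header (count byte 0, the name LOCATE02, a null) and then skipped, and the 2-byte count is assumed to be signed.

```python
import struct

# Dummy entry: count byte 0, the name LOCATE02, terminating null.
MAGIC = b"\x00LOCATE02\x00"

def read_database(data):
    """Expand a front-compressed database into a list of file names."""
    if not data.startswith(MAGIC):
        raise ValueError("not a new-format database")
    names = []
    previous = b""      # last name produced
    prefix_len = 0      # running shared-prefix length
    pos = len(MAGIC)
    while pos < len(data):
        if data[pos] == 0x80:
            # Long form: signed 16-bit count, high byte first (assumed signed).
            (diff,) = struct.unpack_from(">h", data, pos + 1)
            pos += 3
        else:
            (diff,) = struct.unpack_from("b", data, pos)
            pos += 1
        prefix_len += diff
        end = data.index(b"\x00", pos)   # remainder is null-terminated
        previous = previous[:prefix_len] + data[pos:end]
        names.append(previous)
        pos = end + 1
    return names

db = MAGIC + b"\x00/usr/src\x00" + b"\x08/cmd\x00"
print(read_database(db))
```

Here the second entry has a count of 8, so it reuses all eight bytes of /usr/src and appends /cmd.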
Databases cannot be concatenated, even if the first
(dummy) entry is trimmed from all but the first database. This is
because the offset-differential count in the first entry of the
second and following databases will be wrong: it assumes a preceding
shared-prefix length of zero, not the length in effect at the end of
the database before it.
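The arithmetic behind this can be shown with a few running sums. The differential values below are illustrative, and prefix_lengths is a helper invented for this example; it simply accumulates the counts the way a reader of the database would.

```python
from itertools import accumulate

def prefix_lengths(diffs):
    """Running shared-prefix length after each offset-differential count."""
    return list(accumulate(diffs))

db1_diffs = [0, 8, 6, -9]   # first database (dummy entry omitted)
db2_diffs = [0, 4]          # second database (dummy entry trimmed)

# Read on its own, the second database starts from a prefix length of 0.
alone = prefix_lengths(db2_diffs)

# Appended to the first, it inherits the prefix length left by db1's last entry.
joined = prefix_lengths(db1_diffs + db2_diffs)[len(db1_diffs):]

print(alone)    # [0, 4]
print(joined)   # [5, 9]
```

Every prefix length in the appended database is off by the same amount, so each of its names would be rebuilt from the wrong number of leading characters.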
Sample input to frcode:
/usr/src
/usr/src/cmd/aardvark.c
/usr/src/cmd/armadillo.c
/usr/tmp/zoo
Length of the longest prefix of the preceding entry to share:
0 /usr/src
8 /cmd/aardvark.c
14 rmadillo.c
5 tmp/zoo
Output from frcode, with trailing nulls changed
to newlines and count bytes made printable:
0 LOCATE02
0 /usr/src
8 /cmd/aardvark.c
6 rmadillo.c
-9 tmp/zoo
(6 = 14 - 8, and -9 = 5 - 14)
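The sample above can be reproduced with a short encoder. This is a sketch of the technique, not the frcode program: front_compress and common_prefix_len are names chosen here, the input is assumed to be already sorted, and the long count is assumed to be a signed 16-bit word.

```python
import struct

def common_prefix_len(a, b):
    """Number of leading bytes that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def front_compress(names):
    """Front-compress an already sorted list of byte-string file names."""
    out = bytearray(b"\x00LOCATE02\x00")   # dummy first entry
    previous = b""
    prev_count = 0
    for name in names:
        count = common_prefix_len(previous, name)
        diff = count - prev_count           # offset-differential count
        if -127 <= diff <= 127:
            out += struct.pack("b", diff)
        else:
            # Assumption: signed 16-bit count, high byte first.
            out += b"\x80" + struct.pack(">h", diff)
        out += name[count:] + b"\x00"       # null-terminated remainder
        previous, prev_count = name, count
    return bytes(out)

sample = [
    b"/usr/src",
    b"/usr/src/cmd/aardvark.c",
    b"/usr/src/cmd/armadillo.c",
    b"/usr/tmp/zoo",
]
print(front_compress(sample))
```

In the printed bytes, the count -9 appears as 0xf7, its value as a signed byte.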
The old database format is used by Unix locate
and find programs and earlier releases of the GNU
ones. updatedb produces this format if given the --old-format
option.
updatedb runs programs called bigram
and code to produce old-format databases. The old
format differs from the new one in the following ways. Instead of
each entry starting with an offset-differential count byte and
ending with a null, byte values from 0 through 28 indicate
offset-differential counts from -14 through 14. The byte value
indicating that a long offset-differential count follows is 0x1e
(30), not 0x80. The long counts are stored in host byte order,
which is not necessarily network byte order, and host integer
word size, which is usually 4 bytes. They also represent a count
14 less than their value. The database lines have no termination
byte; the start of the next line is indicated by its first byte
having a value <= 30. In addition, instead of starting with a dummy
entry, the old database format starts with a 256 byte table
containing the 128 most common bigrams in the file list. A bigram
is a pair of adjacent bytes. Bytes in the database that have the
high bit set are indexes (with the high bit cleared) into the
bigram table. The bigram and offset-differential count coding
makes these databases 20-25% smaller than the new format, but
makes them not 8-bit clean. Any byte in a file name that is in
the ranges used for the special codes is replaced in the database
by a question mark, which not coincidentally is the shell
wildcard to match a single character.
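The old-format codes described above can be sketched as two small helpers. These are illustrations only: decode_old_count and expand_bigram are invented names, the long count is assumed to be a 4-byte integer in the byte order of the machine running the code, and the bigram table is assumed to hold its 128 pairs back to back.

```python
import struct

OFFSET = 14          # stored count values are biased by 14
LONG_COUNT = 0x1E    # byte value 30: a long count follows

def decode_old_count(data, pos=0):
    """Return (offset-differential count, position after the count)."""
    b = data[pos]
    if b <= 28:
        return b - OFFSET, pos + 1          # 0..28 maps to -14..14
    if b == LONG_COUNT:
        # Assumption: 4-byte host integer in native byte order.
        (value,) = struct.unpack_from("=i", data, pos + 1)
        return value - OFFSET, pos + 5
    raise ValueError("byte %d does not start an entry" % b)

def expand_bigram(byte, table):
    """Look up the two-byte pair for a database byte with the high bit set."""
    if not byte & 0x80:
        raise ValueError("high bit not set")
    index = byte & 0x7F                     # clear the high bit
    return table[2 * index:2 * index + 2]

# Hypothetical 256-byte table whose first two bigrams are "/u" and "sr".
table = b"/usr" + bytes(252)
print(decode_old_count(bytes([0])), decode_old_count(bytes([28])))
print(expand_bigram(0x80, table), expand_bigram(0x81, table))
```

Because the long count depends on the host's byte order and word size, a database written this way is only safely readable on a similar machine, which is the portability problem the new format addresses.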