Since we have small strings, they can benefit from some shared context,
and zstd supports this. plocate-build now reads the mlocate database
twice; the first pass samples 1000 random blocks, which it uses to train
a 1 kB dictionary. (zstd recommends much larger dictionaries, but practical
testing seems to indicate this doesn't help us much, and might actually
be harmful.)
We get ~20% slower builds and ~7% smaller .db files -- but more
interestingly, linear search speed is up ~20% (which indicates that
decompression itself benefits even more). We need to read the 1 kB
dictionary up-front, but that is practically free, since it is so
small and stored right next to the header.
This is a version bump (to version 1), so we're not forward-compatible,
but we're backward-compatible (plocate still reads version 0 files
just fine). Since we're adding more fields to the header anyway,
we can add a new “max_version” field that allows marking
backwards-compatible changes in the future, i.e., cases where
plocate-build adds more information that plocate would like to use,
but that older plocate versions can simply ignore.
Do the access checking asynchronously if possible.
There are many issues involved:
- There's no access() support in io_uring (yet?), so we fake it
by doing statx() on the directory first, which primes the
dentry cache so that synchronous access() becomes very fast.
It is a bit tricky, since multiple access checks could be
going on at the same time, which then all need to wait
for the same statx() call.
- Not all kernels even support statx() in io_uring (support starts
with 5.6).
- Serialization now becomes two-level, and more involved.
We don't have an obvious single counter anymore, so we need
to be able to start a docid without knowing how many candidates
there are (and thus, be able to tell Serializer that we are
at the end).
- The --limit handling becomes trickier, since there can be more
calls on the way back. We solve this by moving the limit into
Serializer, and hard-exiting when we hit it.
- We need to prioritize statx() calls ahead of read(), so that
we don't end up with very delayed output when the new read()
calls generate even more statx() calls and we get a huge
backlog of calls. (We can't prioritize in the kernel, but we
can on the overflow queue we're managing ourselves.) This is
especially important with --limit.
This matches mlocate behavior, even the sort-of-strange behavior
of having the patterns non-anchored. Case-insensitive matching has also
been changed away from regex, since fnmatch() is seemingly slightly
faster.
Without changing the database format, this causes a bunch of extra
lookups. But somehow, it appears to go fairly well in practice.
Of course, case-sensitive will always be faster.
Move TurboPFor compilation to its own compilation unit.
This file takes so long to compile, especially with optimization
and/or ASan on, that it became a real annoyance whenever we were
modifying plocate.cpp for anything else. It takes away some genericity
we don't really use.
We could do the same thing with the encoder if need be.
Allocate 16 bytes extra as slop after every read buffer, so that
we know we never read outside allocated memory. (This is much easier
now that we have a TurboPFor implementation with clearly defined slop.)
This is much slower (plocate-build becomes ~6% slower or so),
but allows us to ditch the external TurboPFor dependency entirely,
and with it, the SSE4.1 demand. This should make us much more palatable
for most distributions.
The benchmark program is extended with some tests that all posting lists
in plocate.db round-trip properly through our encoder, which found a
lot of bugs during development.
By asking GCC to unroll the loop, and specializing for the bit width
using templatizing, we can get rid of a lot of the control overhead.
This takes us up from 60% to 80% of reference performance, still
without requiring anything more than SSE2.
This roughly doubles our speed, to 60% of the reference.
Unfortunately, we require some fairly elaborate gymnastics
to be able to use multiversioning and templates together,
and the new code isn't necessarily as easy to understand.
Support decoding the SIMD interleaved TurboPFor formats.
Our decoder is even slower for these than for the regular formats,
but at least it allows us to switch the format back. We'll see about
some mild optimization next.
std::unordered_map isn't the most performant hash table around;
replace it with a simple array of pointers. (An array of objects
would take >1GB RAM.) This costs ~120 MB fixed RAM overhead
(roughly doubling the RAM usage again for moderate-size corpora),
but also roughly doubles the build speed.
Make the builder write out filenames as they get compressed.
This saves yet more RAM in the builder; in fact, it roughly halves
the peak usage. The layout of the file changes somewhat
(the filenames come first instead of last), but it shouldn't
matter for performance.
Loosen up serialization to be about printing only.
This allows us to decompress, match and check access() for docids
that arrive out-of-order, instead of serializing their processing.
Especially the access() checking is good to have overlapping other
I/O if possible, even though it's synchronous.
If we ever get async access(), we'll need to rework Serializer
again, probably.