Do the access checking asynchronously if possible.
There are many issues involved:
- There's no access() support in io_uring (yet?), so we fake it
by doing statx() on the directory first, which primes the
dentry cache so that synchronous access() becomes very fast.
It is a bit tricky, since multiple access checks could be
going on at the same time, which then all need to wait
for the same statx() call.
- Not even all kernels support statx() in io_uring (support starts
in 5.6).
- Serialization now becomes two-level, and more involved.
We don't have an obvious single counter anymore, so we need
to be able to start a docid without knowing how many candidates
there are (and thus, be able to tell Serializer that we are
at the end).
- Limit becomes trickier, since there can be more calls on
the way back. We solve this by moving the limit into Serializer,
and hard-exiting when we hit it.
- We need to prioritize statx() calls ahead of read(), so that
we don't end up with very delayed output when the new read()
calls generate even more statx() calls and we get a huge
backlog of calls. (We can't prioritize in the kernel, but we
can on the overflow queue we're managing ourselves.) This is
especially important with --limit.
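The priming trick can be sketched like this (synchronously here, for brevity; the point in the real flow is to issue the statx() through io_uring and let access() run fast afterwards). check_access and its layout are illustrative, not plocate's actual code:

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string>

// Sketch: statx() on the parent directory pulls its dentry into the
// cache, so the following synchronous access() on the file is nearly
// free. In the asynchronous version, the statx() call would go through
// io_uring and multiple pending access checks would wait on it.
bool check_access(const std::string &path)
{
	std::string dir = path.substr(0, path.find_last_of('/'));
	if (dir.empty()) dir = "/";

	struct statx buf;
	// Prime the dentry cache; errors are ignored, since the
	// access() call below is authoritative either way.
	statx(AT_FDCWD, dir.c_str(), 0, STATX_MODE, &buf);

	return access(path.c_str(), R_OK) == 0;
}
```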
This matches mlocate behavior, even the sort-of strange behavior
of having patterns be non-anchored. Case-insensitive matching has also
been changed away from regex, since fnmatch() appears to be slightly
faster.
Without changing the database format, this causes a bunch of extra
lookups. But somehow, it appears to go fairly well in practice.
Of course, case-sensitive will always be faster.
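A minimal sketch of the matching approach: FNM_CASEFOLD (a GNU extension) gives case-insensitive fnmatch(), and wrapping the needle in `*` makes the match non-anchored, mirroring mlocate. The helper name is illustrative, not plocate's actual API:

```cpp
#include <fnmatch.h>
#include <string>

// Non-anchored, case-insensitive match: "*needle*" against the full
// path, folding case. FNM_CASEFOLD is a GNU extension.
bool matches(const std::string &needle, const std::string &haystack)
{
	std::string pattern = "*" + needle + "*";
	return fnmatch(pattern.c_str(), haystack.c_str(), FNM_CASEFOLD) == 0;
}
```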
Move TurboPFor compilation to its own compilation unit.
This file takes so long to compile, especially with optimization
and/or ASan on, that it became a real annoyance whenever we were
modifying plocate.cpp for anything else. Takes away some genericness
we don't really use.
We could do the same thing with the encoder if need be.
Allocate 16 bytes extra as slop after every read buffer, so that
we know we never read outside allocated memory. (This is much easier
now that we have a TurboPFor implementation with clearly defined slop.)
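The slop idea in sketch form: over-allocate each read buffer by 16 bytes, so a decoder that reads slightly past the end of the valid data never touches unallocated memory. The constant and helper name here are illustrative, not plocate's actual code:

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// 16 bytes of slop after every read buffer; a decoder with bounded
// overread (like our TurboPFor implementation) then stays inside
// allocated memory even when it reads past the logical end.
constexpr size_t kSlopBytes = 16;

std::unique_ptr<char[]> read_with_slop(const char *src, size_t len)
{
	std::unique_ptr<char[]> buf(new char[len + kSlopBytes]);
	memcpy(buf.get(), src, len);
	memset(buf.get() + len, 0, kSlopBytes);  // deterministic slop contents
	return buf;
}
```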
This is slower (plocate-build becomes ~6% slower or so),
but allows us to ditch the external TurboPFor dependency entirely,
and with it, the SSE4.1 requirement. This should make us much more
palatable for most distributions.
The benchmark program is extended with some tests that all posting lists
in plocate.db round-trip properly through our encoder, which found a
lot of bugs during development.
By asking GCC to unroll the loop, and specializing for the bit width
using templatizing, we can get rid of a lot of the control overhead.
This takes us up from 60% to 80% of reference performance, still
without requiring anything more than SSE2.
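The technique can be sketched like this, assuming TurboPFor-style little-endian bit packing; unpack32 and its exact layout are illustrative, not plocate's actual decoder. With BitWidth as a template parameter, the shifts and masks constant-fold, and the unroll pragma removes the loop control overhead:

```cpp
#include <cstdint>

// Unpack 32 values of BitWidth bits each (1..32), packed little-endian
// into consecutive 32-bit words. Specializing on the bit width lets the
// compiler fully unroll and constant-fold the shifts and masks.
template<unsigned BitWidth>
void unpack32(const uint32_t *in, uint32_t *out)
{
	const uint32_t mask = uint32_t((uint64_t{ 1 } << BitWidth) - 1);
	uint64_t bits = 0;
	unsigned filled = 0;
#pragma GCC unroll 32
	for (unsigned i = 0; i < 32; ++i) {
		if (filled < BitWidth) {
			bits |= uint64_t(*in++) << filled;  // refill from the next word
			filled += 32;
		}
		out[i] = uint32_t(bits) & mask;
		bits >>= BitWidth;
		filled -= BitWidth;
	}
}
```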
This roughly doubles our speed, to 60% of the reference.
Unfortunately, we require some fairly elaborate gymnastics
to be able to use multiversioning and templates together,
and the new code isn't necessarily as easy to understand.
Support decoding the SIMD interleaved TurboPFor formats.
Our decoder is even slower for these than for the regular formats,
but at least it allows us to switch the format back. We'll see about
some mild optimization next.
std::unordered_map isn't the most performant hash table around;
replace it with a simple array of pointers. (An array of objects
would take >1GB RAM.) This costs ~120 MB fixed RAM overhead
(roughly doubling the RAM usage again for moderate-size corpora),
but also roughly doubles the build speed.
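Since a trigram is three bytes, there are only 2^24 possible keys, so the "hash table" can be a flat array indexed directly by the trigram value; 2^24 pointers is 128 MB on 64-bit, which matches the fixed overhead above. A sketch, with PostingList as a stand-in type rather than plocate's actual structure:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

struct PostingList { std::vector<uint32_t> docids; };  // stand-in type

// Direct-indexed replacement for std::unordered_map<uint32_t, PostingList>:
// one pointer slot per possible trigram, allocated lazily. Lookup is a
// single indexed load, at the cost of 2^24 * 8 bytes of fixed overhead.
class TrigramTable {
public:
	TrigramTable() : slots_(1 << 24) {}
	PostingList &get(uint32_t trgm)  // trgm < 2^24
	{
		std::unique_ptr<PostingList> &slot = slots_[trgm];
		if (!slot) slot = std::make_unique<PostingList>();
		return *slot;
	}

private:
	std::vector<std::unique_ptr<PostingList>> slots_;
};
```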
Make the builder write out filenames as they get compressed.
This saves yet more RAM in the builder; roughly halving the peak
usage, in fact. The layout of the file changes somewhat
(the filenames come first instead of last), but it shouldn't
matter for performance.
Loosen up serialization to be about printing only.
This allows us to decompress, match and check access() for docids
that arrive out-of-order, instead of serializing their processing.
Especially the access() checking is good to have overlapping other
I/O if possible, even though it's synchronous.
If we ever get async access(), we'll need to rework Serializer
again, probably.
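The print-only serializer can be sketched as follows: results may arrive in any docid order, and out-of-order ones are buffered until their turn comes, so decompression, matching and access() checks can overlap freely. This is a simplified illustration, not plocate's actual Serializer:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Buffers results that arrive ahead of their turn and flushes them in
// docid order. Only the final printing is serialized; everything before
// this point may run out of order.
class Serializer {
public:
	void print(uint64_t docid, std::string msg)
	{
		pending_[docid] = std::move(msg);
		// Flush every consecutive result starting at next_.
		while (!pending_.empty() && pending_.begin()->first == next_) {
			output_.push_back(std::move(pending_.begin()->second));
			pending_.erase(pending_.begin());
			++next_;
		}
	}
	const std::vector<std::string> &output() const { return output_; }

private:
	uint64_t next_ = 0;
	std::map<uint64_t, std::string> pending_;  // out-of-order arrivals
	std::vector<std::string> output_;
};
```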
Switch trigram lookup from binary search to a hash table.
Binary search was fine when we just wanted simplicity, but for I/O
optimization on rotating media, we want as few seeks as possible.
A hash table with open addressing gives us just that; Robin Hood
hashing makes it possible for us to guarantee maximum probe length,
so we can just read 256 bytes (plus a little slop) for each lookup
and that's it. This kills ~30 ms or so cold-cache.
This breaks the format, so we use the chance to add a magic and
a proper header to provide some more flexibility in case we want
to change the builder.
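The key property is the bounded probe length: if Robin Hood insertion guarantees every key lives within kMaxProbe slots of its home bucket, a cold-cache lookup needs exactly one bounded read. A sketch of the idea, with illustrative names and layout rather than plocate's actual on-disk format:

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

struct HashEntry {
	uint32_t trgm;    // key (24-bit trigram), or kEmpty
	uint32_t offset;  // e.g. posting-list offset in the file
};
constexpr uint32_t kEmpty = 0xffffffff;  // can't collide with a 24-bit key
constexpr size_t kMaxProbe = 16;         // bounds the bytes read per lookup

// Robin Hood insertion: an entry that has probed further than the
// incumbent steals its slot, which keeps probe lengths short and lets
// us enforce the kMaxProbe bound.
bool insert(std::vector<HashEntry> &table, HashEntry e)
{
	const size_t n = table.size();
	size_t pos = e.trgm % n, dist = 0;  // dist = e's distance from home
	while (dist < kMaxProbe) {
		HashEntry &slot = table[pos];
		if (slot.trgm == kEmpty) {
			slot = e;
			return true;
		}
		size_t slot_dist = (pos + n - slot.trgm % n) % n;
		if (slot_dist < dist) {  // steal from the "richer" entry
			std::swap(e, slot);
			dist = slot_dist;
		}
		pos = (pos + 1) % n;
		++dist;
	}
	return false;  // bound exceeded; real code would grow the table
}

// Lookup scans at most kMaxProbe slots, so one read of that window
// (plus a little slop) suffices even on a cold cache.
std::optional<uint32_t> lookup(const std::vector<HashEntry> &table, uint32_t trgm)
{
	const size_t n = table.size();
	size_t pos = trgm % n;
	for (size_t dist = 0; dist < kMaxProbe; ++dist) {
		const HashEntry &slot = table[pos];
		if (slot.trgm == trgm) return slot.offset;
		if (slot.trgm == kEmpty) return std::nullopt;  // valid early exit
		pos = (pos + 1) % n;
	}
	return std::nullopt;
}
```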
This moves to explicit, asynchronous I/O through io_uring (Linux 5.1+),
which speeds up cold-cache behavior on rotating media by 3x or so.
It also removes any issues we might have with not fitting into 32-bit
address spaces.
If io_uring is not available, regular synchronous I/O will be used instead.
For now, there's a dependency on liburing, but it will soon be made
optional, for the benefit of older systems.