This roughly doubles our speed, to 60% of the reference.
Unfortunate, we require some fairly elaborate gymnastics
to be able to use multiversioning and templates together,
and the new code isn't necessarily as easy to understand.
Support decoding the SIMD interleaved TurboPFor formats.
Our decoder is even slower for these than for the regular formats,
but at least it allows us to switch the format back. We'll see about
some mild optimization next.
std::unordered_map isn't the most performant hash table around;
replace it with a simple array of pointers. (An array of objects
would take >1GB RAM.) This costs ~120 MB fixed RAM overhead
(roughly doubling the RAM usage again for moderate-size corpora),
but also roughly doubles the build speed.
Make the builder write out filenames as they get compressed.
This saves yet more RAM in the builder; roughly halving the peak
usage, in fact. The layout of the file changes somewhat
(the filenames come first instead of last), but it shouldn't
matter for performance.
Loosen up serialization to be about printing only.
This allows us to decompress, match and check access() for docids
that arrive out-of-order, instead of serializing their processing.
Especially the access() checking is good to have overlapping other
I/O if possible, even though it's synchronous.
If we ever get async access(), we'll need to rework Serializer
again, probably.
Switch trigram lookup from binary search to a hash table.
Binary search was fine when we just wanted simplicity, but for I/O
optimization on rotating media, we want as few seeks as possible.
A hash table with open addressing gives us just that; Robin Hood
hashing makes it possible for us to guarantee maximum probe length,
so we can just read 256 bytes (plus a little slop) for each lookup
and that's it. This kills ~30 ms or so cold-cache.
This breaks the format, so we use the chance to add a magic and
a proper header to provide some more flexibility in case we want
to change the builder.
This moves to explicit, asynchronous I/O through io_uring (Linux 5.1+),
which speeds up cold-cache behavior on rotating media by 3x or so.
It also removes any issues we might have with not fitting into 32-bit
address spaces.
If io_uring is not available, regular synchronous I/O will be used instead.
For now, there's a dependency on liburing, but it will be optional soon
for older systems.
This was surprisingly undocumented (I had to read preprocessed
TurboPFor to figure it out), but we can now build posting lists
directly compressed in memory 128 elements at a time, which saves
another 30% RAM in plocate-build.
Hold compressed filenames more efficiently in memory.
std::string::resize() will rarely give memory back, so allocate
the string with the right size to begin with. This means we don't
carry around a lot of slop, saving ~15% RAM during build.
This saves ~50% RAM in the build step, now that we have blocking
(there's a lot of deduplication going on), and seemingly also
~15% execution time, possibly because of less memory allocation
(I haven't checked thoroughly).
Make blocks of 32 and 32 filenames, and compress then with zstd -6
(the level is fairly arbitrarily chosen). This compresses the repetitive
path information very well, and also allows us to have shorter posting
lists, as they can point into the blocks (allowing dedup).
32 was chosen after eyeballing some compressed sizes, looking for
diminishing returns and then verifying it didn't cost much in terms
of search performance.