]> git.sesse.net Git - plocate/log
plocate
3 years agoStart some basic command line options.
Steinar H. Gunderson [Wed, 30 Sep 2020 21:50:47 +0000 (23:50 +0200)]
Start some basic command line options.

3 years agoRerun clang-format.
Steinar H. Gunderson [Wed, 30 Sep 2020 19:52:16 +0000 (21:52 +0200)]
Rerun clang-format.

3 years agoSupport building without io_uring.
Steinar H. Gunderson [Wed, 30 Sep 2020 19:20:33 +0000 (21:20 +0200)]
Support building without io_uring.

This is pretty hackish! It would be nice to be able to switch to
meson, but TurboPFor makes that a bit tricky right now.

3 years agoAdd a missing .o to make clean.
Steinar H. Gunderson [Wed, 30 Sep 2020 19:20:17 +0000 (21:20 +0200)]
Add a missing .o to make clean.

3 years agoFix a bug where posting lists intersecting to nothing would not early-abort properly.
Steinar H. Gunderson [Wed, 30 Sep 2020 18:03:54 +0000 (20:03 +0200)]
Fix a bug where posting lists intersecting to nothing would not early-abort properly.

3 years agoSwitch trigram lookup from binary search to a hash table.
Steinar H. Gunderson [Wed, 30 Sep 2020 17:44:28 +0000 (19:44 +0200)]
Switch trigram lookup from binary search to a hash table.

Binary search was fine when we just wanted simplicity, but for I/O
optimization on rotating media, we want as few seeks as possible.
A hash table with open addressing gives us just that; Robin Hood
hashing makes it possible for us to guarantee maximum probe length,
so we can just read 256 bytes (plus a little slop) for each lookup
and that's it. This kills ~30 ms or so cold-cache.

This breaks the format, so we use the chance to add a magic and
a proper header to provide some more flexibility in case we want
to change the builder.

3 years agoRemove unused variable.
Steinar H. Gunderson [Wed, 30 Sep 2020 17:46:44 +0000 (19:46 +0200)]
Remove unused variable.

3 years agoTest for errors from zstd.
Steinar H. Gunderson [Wed, 30 Sep 2020 17:30:34 +0000 (19:30 +0200)]
Test for errors from zstd.

3 years agoSome dprintf fixes for plocate-build.
Steinar H. Gunderson [Wed, 30 Sep 2020 16:17:30 +0000 (18:17 +0200)]
Some dprintf fixes for plocate-build.

3 years agoReplace mmap with io_uring.
Steinar H. Gunderson [Wed, 30 Sep 2020 08:20:10 +0000 (10:20 +0200)]
Replace mmap with io_uring.

This moves to explicit, asynchronous I/O through io_uring (Linux 5.1+),
which speeds up cold-cache behavior on rotating media by 3x or so.
It also removes any issues we might have with not fitting into 32-bit
address spaces.

If io_uring is not available, regular synchronous I/O will be used instead.
For now, there's a dependency on liburing, but it will be optional soon
for older systems.

3 years agoRemove an unused macro.
Steinar H. Gunderson [Wed, 30 Sep 2020 08:20:01 +0000 (10:20 +0200)]
Remove an unused macro.

3 years agoSupport patterns shorter than 3 bytes.
Steinar H. Gunderson [Mon, 28 Sep 2020 22:27:06 +0000 (00:27 +0200)]
Support patterns shorter than 3 bytes.

This isn't very efficient, but it appears to still be slightly better
than mlocate.

3 years agoFormat everything with clang-format.
Steinar H. Gunderson [Mon, 28 Sep 2020 22:13:18 +0000 (00:13 +0200)]
Format everything with clang-format.

clang-format isn't ideal, but it's better than manual formatting
in the long run.

3 years agoRemove some commented-out code.
Steinar H. Gunderson [Mon, 28 Sep 2020 22:10:48 +0000 (00:10 +0200)]
Remove some commented-out code.

3 years agoAbstract out some details of reading the corpus into a class.
Steinar H. Gunderson [Mon, 28 Sep 2020 22:10:02 +0000 (00:10 +0200)]
Abstract out some details of reading the corpus into a class.

3 years agoRefactor scanning through a filename block into its own function.
Steinar H. Gunderson [Mon, 28 Sep 2020 21:52:46 +0000 (23:52 +0200)]
Refactor scanning through a filename block into its own function.

3 years agoRemove a redundant #include.
Steinar H. Gunderson [Mon, 28 Sep 2020 21:12:16 +0000 (23:12 +0200)]
Remove a redundant #include.

3 years agoOptimize pending_docids storage for smaller posting lists.
Steinar H. Gunderson [Mon, 28 Sep 2020 19:55:58 +0000 (21:55 +0200)]
Optimize pending_docids storage for smaller posting lists.

The trigram distribution is long-tail, so allocating 128 docids
up-front was seemingly a waste. Saves ~20% more RAM in plocate-build.

3 years agoEncode posting lists as we go.
Steinar H. Gunderson [Mon, 28 Sep 2020 19:45:04 +0000 (21:45 +0200)]
Encode posting lists as we go.

This was surprisingly undocumented (I had to read preprocessed
TurboPFor to figure it out), but we can now build posting lists
directly compressed in memory 128 elements at a time, which saves
another 30% RAM in plocate-build.

3 years agoCompress filename blocks as we read them.
Steinar H. Gunderson [Mon, 28 Sep 2020 16:57:21 +0000 (18:57 +0200)]
Compress filename blocks as we read them.

This saves ~70% RAM in plocate-build, as we don't have to keep
all the uncompressed filenames at the same time.

3 years agoHold compressed filenames more efficiently in memory.
Steinar H. Gunderson [Mon, 28 Sep 2020 07:54:49 +0000 (09:54 +0200)]
Hold compressed filenames more efficiently in memory.

std::string::resize() will rarely give memory back, so allocate
the string with the right size to begin with. This means we don't
carry around a lot of slop, saving ~15% RAM during build.

3 years agoDeduplicate docids as we go.
Steinar H. Gunderson [Mon, 28 Sep 2020 07:34:31 +0000 (09:34 +0200)]
Deduplicate docids as we go.

This saves ~50% RAM in the build step, now that we have blocking
(there's a lot of deduplication going on), and seemingly also
~15% execution time, possibly because of less memory allocation
(I haven't checked thoroughly).

3 years agoCompress filenames with zstd.
Steinar H. Gunderson [Sun, 27 Sep 2020 22:28:49 +0000 (00:28 +0200)]
Compress filenames with zstd.

Make blocks of 32 and 32 filenames, and compress then with zstd -6
(the level is fairly arbitrarily chosen). This compresses the repetitive
path information very well, and also allows us to have shorter posting
lists, as they can point into the blocks (allowing dedup).

32 was chosen after eyeballing some compressed sizes, looking for
diminishing returns and then verifying it didn't cost much in terms
of search performance.

3 years agoIn build debug output, print the total size.
Steinar H. Gunderson [Sun, 27 Sep 2020 22:28:43 +0000 (00:28 +0200)]
In build debug output, print the total size.

3 years agoInitial checkin.
Steinar H. Gunderson [Sun, 27 Sep 2020 20:53:35 +0000 (22:53 +0200)]
Initial checkin.