git.sesse.net Git - plocate/log
3 years ago  Support patterns shorter than 3 bytes.
Steinar H. Gunderson [Mon, 28 Sep 2020 22:27:06 +0000 (00:27 +0200)]

This isn't very efficient, but it appears to still be slightly better
than mlocate.

3 years ago  Format everything with clang-format.
Steinar H. Gunderson [Mon, 28 Sep 2020 22:13:18 +0000 (00:13 +0200)]

clang-format isn't ideal, but it's better than manual formatting
in the long run.

3 years ago  Remove some commented-out code.
Steinar H. Gunderson [Mon, 28 Sep 2020 22:10:48 +0000 (00:10 +0200)]

3 years ago  Abstract out some details of reading the corpus into a class.
Steinar H. Gunderson [Mon, 28 Sep 2020 22:10:02 +0000 (00:10 +0200)]

3 years ago  Refactor scanning through a filename block into its own function.
Steinar H. Gunderson [Mon, 28 Sep 2020 21:52:46 +0000 (23:52 +0200)]

3 years ago  Remove a redundant #include.
Steinar H. Gunderson [Mon, 28 Sep 2020 21:12:16 +0000 (23:12 +0200)]

3 years ago  Optimize pending_docids storage for smaller posting lists.
Steinar H. Gunderson [Mon, 28 Sep 2020 19:55:58 +0000 (21:55 +0200)]

The trigram distribution is long-tail, so allocating 128 docids
up-front was seemingly a waste. Saves ~20% more RAM in plocate-build.

3 years ago  Encode posting lists as we go.
Steinar H. Gunderson [Mon, 28 Sep 2020 19:45:04 +0000 (21:45 +0200)]

This was surprisingly undocumented (I had to read preprocessed
TurboPFor to figure it out), but we can now build posting lists
directly compressed in memory 128 elements at a time, which saves
another 30% RAM in plocate-build.
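
The idea of encoding in 128-element chunks can be sketched as follows. This is not the TurboPFor-based code from the commit: the codec is swapped for a plain delta + varint encoding, and the class and function names are hypothetical. It only illustrates how at most 128 raw docids need to be held per posting list at any time, with everything earlier already in compressed form. Docids are assumed to arrive in nondecreasing order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical builder: buffers up to 128 docids, then appends them to the
// encoded byte string as varint-coded deltas (a stand-in for TurboPFor).
class StreamingPostingList {
public:
    void add_docid(uint32_t docid)
    {
        assert(docid >= last_added);  // assumption: nondecreasing input
        last_added = docid;
        pending.push_back(docid);
        if (pending.size() == kChunk) {
            flush();
        }
    }

    // Encodes any partial chunk and returns all encoded bytes so far.
    const std::string &finish()
    {
        flush();
        return encoded;
    }

    size_t pending_count() const { return pending.size(); }

private:
    static constexpr size_t kChunk = 128;

    void flush()
    {
        for (uint32_t docid : pending) {
            uint32_t delta = docid - last_encoded;
            last_encoded = docid;
            // Varint: 7 payload bits per byte, high bit set on all but the last.
            while (delta >= 0x80) {
                encoded.push_back(static_cast<char>(0x80 | (delta & 0x7f)));
                delta >>= 7;
            }
            encoded.push_back(static_cast<char>(delta));
        }
        pending.clear();
    }

    std::vector<uint32_t> pending;
    std::string encoded;
    uint32_t last_added = 0;
    uint32_t last_encoded = 0;
};

// Decodes the byte string produced above back into docids.
std::vector<uint32_t> decode_posting_list(const std::string &encoded)
{
    std::vector<uint32_t> out;
    uint32_t last = 0, delta = 0;
    int shift = 0;
    for (unsigned char c : encoded) {
        delta |= static_cast<uint32_t>(c & 0x7f) << shift;
        if (c & 0x80) {
            shift += 7;
        } else {
            last += delta;
            out.push_back(last);
            delta = 0;
            shift = 0;
        }
    }
    return out;
}
```

In this sketch the RAM saving comes from `pending` never growing past 128 entries; the actual ratio depends on the codec and the data, and the 30% figure above applies only to the real TurboPFor-based implementation.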

3 years ago  Compress filename blocks as we read them.
Steinar H. Gunderson [Mon, 28 Sep 2020 16:57:21 +0000 (18:57 +0200)]

This saves ~70% RAM in plocate-build, as we don't have to keep
all the uncompressed filenames at the same time.

3 years ago  Hold compressed filenames more efficiently in memory.
Steinar H. Gunderson [Mon, 28 Sep 2020 07:54:49 +0000 (09:54 +0200)]

std::string::resize() will rarely give memory back, so allocate
the string with the right size to begin with. This means we don't
carry around a lot of slop, saving ~15% RAM during build.
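
A minimal sketch of the allocation issue, not the actual plocate code: the function names are hypothetical and the compressor is replaced by a stand-in that just fills bytes. It contrasts shrinking an oversized buffer with resize(), which in common standard library implementations keeps the original capacity, against constructing the long-lived string at its final size.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a compressor: writes n bytes into dst and
// returns how many bytes were produced.
static size_t fake_compress(char *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = 'x';
    }
    return n;
}

// Shrinks an oversized buffer in place. Shrinking with resize() does not
// reallocate in practice, so capacity() stays at the worst-case bound.
std::string compress_with_slop(size_t worst_case, size_t actual)
{
    assert(actual <= worst_case);
    std::string buf;
    buf.resize(worst_case);
    size_t len = fake_compress(&buf[0], actual);
    buf.resize(len);
    return buf;
}

// Compresses into a scratch buffer, then builds a string of exactly the
// final size, so the copy that is kept carries no worst-case slop.
std::string compress_exact(size_t worst_case, size_t actual)
{
    assert(actual <= worst_case);
    std::string scratch;
    scratch.resize(worst_case);
    size_t len = fake_compress(&scratch[0], actual);
    return std::string(scratch.data(), len);
}
```

Comparing `capacity()` of the two results for the same input shows the slop directly; the second variant pays one extra copy per block to avoid it.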

3 years ago  Deduplicate docids as we go.
Steinar H. Gunderson [Mon, 28 Sep 2020 07:34:31 +0000 (09:34 +0200)]

This saves ~50% RAM in the build step, now that we have blocking
(there's a lot of deduplication going on), and seemingly also
~15% execution time, possibly because of less memory allocation
(I haven't checked thoroughly).
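
One way deduplication as we go could look, as a sketch under assumptions rather than the code from this commit: the class name is hypothetical, and docids are assumed to be added in nondecreasing order. With blocking, every filename in a block shares one docid, so a trigram that occurs in several filenames of the same block would otherwise be recorded several times in a row; comparing against the last stored docid is enough to drop those repeats.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical posting-list builder for a single trigram.
class PostingListBuilder {
public:
    void add_docid(uint32_t docid)
    {
        // Assumption: input is nondecreasing, so a duplicate can only
        // equal the most recently stored docid.
        assert(docids.empty() || docid >= docids.back());
        if (!docids.empty() && docids.back() == docid) {
            return;  // already recorded for this block
        }
        docids.push_back(docid);
    }

    const std::vector<uint32_t> &get() const { return docids; }

private:
    std::vector<uint32_t> docids;
};
```

Dropping duplicates at insertion time means they are never stored at all, which is consistent with both the RAM and the allocation savings mentioned above.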

3 years ago  Compress filenames with zstd.
Steinar H. Gunderson [Sun, 27 Sep 2020 22:28:49 +0000 (00:28 +0200)]

Make blocks of 32 filenames each, and compress them with zstd -6
(the level is fairly arbitrarily chosen). This compresses the repetitive
path information very well, and also allows us to have shorter posting
lists, as they can point into the blocks (allowing dedup).

32 was chosen after eyeballing some compressed sizes, looking for
diminishing returns and then verifying it didn't cost much in terms
of search performance.
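
The blocking scheme can be sketched roughly as below. This is an illustration only: the function names and the NUL-separated block layout are assumptions, and the zstd compression step is left out so the example has no external dependencies. It shows why posting lists get shorter: the docid stored for a filename is the index of its block, so up to 32 filenames share one docid.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr size_t kBlockSize = 32;  // block size named in the commit message

// Groups filenames into blocks of up to block_size names, each stored as one
// NUL-separated string (assumed layout). In the real tool each block would
// then be compressed; that step is omitted here.
std::vector<std::string> make_blocks(const std::vector<std::string> &filenames,
                                     size_t block_size = kBlockSize)
{
    assert(block_size > 0);
    std::vector<std::string> blocks;
    for (size_t i = 0; i < filenames.size(); ++i) {
        if (i % block_size == 0) {
            blocks.emplace_back();
        }
        blocks.back() += filenames[i];
        blocks.back().push_back('\0');
    }
    return blocks;
}

// Splits a block back into its filenames, as a search would after
// decompressing a candidate block.
std::vector<std::string> split_block(const std::string &block)
{
    std::vector<std::string> names;
    size_t start = 0;
    while (start < block.size()) {
        size_t end = block.find('\0', start);
        if (end == std::string::npos) {
            end = block.size();
        }
        names.emplace_back(block, start, end - start);
        start = end + 1;
    }
    return names;
}

// The docid a posting list would store for the i-th filename.
inline size_t docid_for(size_t file_index, size_t block_size = kBlockSize)
{
    assert(block_size > 0);
    return file_index / block_size;
}
```

The trade-off is that a hit on a docid only identifies a block, so the search has to scan every filename in it; a larger block size compresses better but makes each candidate more expensive to check.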

3 years ago  In build debug output, print the total size.
Steinar H. Gunderson [Sun, 27 Sep 2020 22:28:43 +0000 (00:28 +0200)]

3 years ago  Initial checkin.
Steinar H. Gunderson [Sun, 27 Sep 2020 20:53:35 +0000 (22:53 +0200)]