git.sesse.net Git - plocate/log

Steinar H. Gunderson [Tue, 6 Oct 2020 18:39:48 +0000 (20:39 +0200)]

Convert the SSE2 delta decoder state into a class.

Steinar H. Gunderson [Tue, 6 Oct 2020 07:30:39 +0000 (09:30 +0200)]

Add some benchmark calculations to bench.cpp.

Steinar H. Gunderson [Mon, 5 Oct 2020 22:41:22 +0000 (00:41 +0200)]

Add SSE2 versions of the _interleaved codecs.

This roughly doubles our speed, to 60% of the reference.
Unfortunately, we require some fairly elaborate gymnastics
to be able to use multiversioning and templates together,
and the new code isn't necessarily as easy to understand.

Steinar H. Gunderson [Mon, 5 Oct 2020 22:24:24 +0000 (00:24 +0200)]

Small refactoring to reduce code duplication.

Steinar H. Gunderson [Mon, 5 Oct 2020 20:17:52 +0000 (22:17 +0200)]

Speed up delta-decoding, giving us a 50% (!) speed boost and taking us to 30% of reference speed.
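
For readers unfamiliar with the term, delta-decoding turns a list of differences back into absolute values with a running sum. The following is only a minimal scalar sketch of that idea; the function name `delta_decode` and the plain-delta convention are assumptions for illustration, and the actual TurboPFor format and the decoder in this repository may differ (for example in how the base value or a per-element offset is handled).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not taken from plocate): rebuild absolute values
// from deltas by keeping a running sum that starts at `base`.
std::vector<uint32_t> delta_decode(const std::vector<uint32_t> &deltas, uint32_t base)
{
    std::vector<uint32_t> out;
    out.reserve(deltas.size());
    uint32_t prev = base;
    for (uint32_t d : deltas) {
        prev += d;  // Each output depends on the previous one.
        out.push_back(prev);
    }
    return out;
}
```

The dependency of each output on the previous one is what makes this loop a natural target for optimization.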

Steinar H. Gunderson [Mon, 5 Oct 2020 19:49:05 +0000 (21:49 +0200)]

Support decoding the SIMD interleaved TurboPFor formats.

Our decoder is even slower for these than for the regular formats,
but at least it allows us to switch the format back. We'll see about
some mild optimization next.

Steinar H. Gunderson [Mon, 5 Oct 2020 19:39:26 +0000 (21:39 +0200)]

Add bench to clean.

Steinar H. Gunderson [Sun, 4 Oct 2020 22:35:18 +0000 (00:35 +0200)]

Support setting LDFLAGS.

Steinar H. Gunderson [Sun, 4 Oct 2020 22:32:43 +0000 (00:32 +0200)]

When failing the benchmark tests, stop printing out differences after the first ten.

Steinar H. Gunderson [Sun, 4 Oct 2020 21:49:52 +0000 (23:49 +0200)]

Start reimplementing the TurboPFor decoding functions.

Steinar H. Gunderson [Sun, 4 Oct 2020 21:42:59 +0000 (23:42 +0200)]

Turn off the SIMD temporarily.

Steinar H. Gunderson [Sat, 3 Oct 2020 09:55:54 +0000 (11:55 +0200)]

Inline a function, for ~10% faster building.

Steinar H. Gunderson [Sat, 3 Oct 2020 09:47:02 +0000 (11:47 +0200)]

Get rid of the hash table in plocate-build.

std::unordered_map isn't the most performant hash table around;
replace it with a simple array of pointers. (An array of objects
would take >1GB RAM.) This costs ~120 MB fixed RAM overhead
(roughly doubling the RAM usage again for moderate-size corpora),
but also roughly doubles the build speed.
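
A rough sketch of the "array of pointers" idea follows, assuming a trigram is three bytes packed into a 24-bit key so that the table has 2^24 slots. The names `PostingList`, `trigram_key` and `TrigramTable` are hypothetical and the real data structures in plocate-build may look different; the point is only that a lookup becomes a single array index, at the price of a fixed allocation of one pointer per possible trigram.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical per-trigram payload; the real builder stores something else.
struct PostingList {
    std::vector<uint32_t> docids;
};

constexpr size_t kNumTrigrams = size_t(1) << 24;  // Assumes 3-byte trigrams.

// Pack three bytes into a 24-bit array index.
inline uint32_t trigram_key(unsigned char a, unsigned char b, unsigned char c)
{
    return uint32_t(a) | (uint32_t(b) << 8) | (uint32_t(c) << 16);
}

class TrigramTable {
public:
    // One (initially null) pointer per possible trigram: a fixed cost of
    // kNumTrigrams * sizeof(pointer) bytes, whatever the corpus size.
    TrigramTable() : slots_(kNumTrigrams) {}

    // Direct indexing replaces hashing; objects are created on first use.
    PostingList &get(uint32_t key)
    {
        std::unique_ptr<PostingList> &slot = slots_[key];
        if (!slot) {
            slot = std::make_unique<PostingList>();
        }
        return *slot;
    }

private:
    std::vector<std::unique_ptr<PostingList>> slots_;
};
```

Storing pointers rather than objects keeps the fixed cost at one pointer per slot, which is consistent with the remark above that an array of objects would be far larger.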

Steinar H. Gunderson [Sat, 3 Oct 2020 09:25:32 +0000 (11:25 +0200)]

Simplify docid deduplication in plocate-builder.

Doesn't matter much for speed, but feels easier to understand.

Steinar H. Gunderson [Sat, 3 Oct 2020 08:49:10 +0000 (10:49 +0200)]

Fix searching for very short (1 or 2 bytes) queries.

plocate made assumptions about the layout of the file that no longer
held. Use the pad field to simplify things.

This requires a database rebuild, but only for short queries.
Normal queries will continue to work, so there's no version bump.

Steinar H. Gunderson [Fri, 2 Oct 2020 16:38:44 +0000 (18:38 +0200)]

Format the usage slightly differently.

Steinar H. Gunderson [Fri, 2 Oct 2020 16:36:46 +0000 (18:36 +0200)]

Make some padding in the header explicit.

Steinar H. Gunderson [Fri, 2 Oct 2020 16:35:50 +0000 (18:35 +0200)]

Read mlocate.db using stdio.

This removes the last instance of mmap-ing, so should fix issues with
large mlocate databases on 32-bit systems.

Steinar H. Gunderson [Fri, 2 Oct 2020 16:16:58 +0000 (18:16 +0200)]

Make the builder write out filenames as they get compressed.

This saves yet more RAM in the builder; roughly halving the peak
usage, in fact. The layout of the file changes somewhat
(the filenames come first instead of last), but it shouldn't
matter for performance.

Steinar H. Gunderson [Thu, 1 Oct 2020 21:40:28 +0000 (23:40 +0200)]

More microoptimization of the io_uring polling.

Steinar H. Gunderson [Thu, 1 Oct 2020 20:53:17 +0000 (22:53 +0200)]

Support batch io_uring completions.

Should save a fair number of syscalls under heavy load.

Steinar H. Gunderson [Thu, 1 Oct 2020 16:09:47 +0000 (18:09 +0200)]

Do early reject of trigrams we can say up-front will be too large; saves loading their posting lists from disk in extreme cases.

Steinar H. Gunderson [Thu, 1 Oct 2020 16:09:11 +0000 (18:09 +0200)]

clang-format.

Steinar H. Gunderson [Thu, 1 Oct 2020 16:02:32 +0000 (18:02 +0200)]

Fix the early abort for zero-trigrams again.

Steinar H. Gunderson [Thu, 1 Oct 2020 08:16:31 +0000 (10:16 +0200)]

Loosen up serialization to be about printing only.

This allows us to decompress, match and check access() for docids
that arrive out-of-order, instead of serializing their processing.
It is especially good to have the access() checks overlap with other
I/O where possible, even though they are synchronous.

If we ever get async access(), we'll need to rework Serializer
again, probably.
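
The idea of serializing only the printing can be sketched as follows. Only the class name `Serializer` appears in the message above; the interface shown here (`deliver`, `output`, sequence numbers starting at zero) is an assumption made for illustration and is not the actual class from the repository. Work may finish in any order, but results are released strictly in sequence order.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Illustrative sketch only; not plocate's real Serializer interface.
class Serializer {
public:
    // Hand in the finished result for sequence number `seq`. Results may
    // arrive in any order; they are appended to the output in seq order.
    void deliver(uint64_t seq, std::string text)
    {
        pending_[seq] = std::move(text);
        // Flush every result that is now next in line.
        while (!pending_.empty() && pending_.begin()->first == next_seq_) {
            output_ += pending_.begin()->second;
            pending_.erase(pending_.begin());
            ++next_seq_;
        }
    }

    // Collected here instead of being printed, to keep the sketch testable.
    const std::string &output() const { return output_; }

private:
    std::map<uint64_t, std::string> pending_;  // Finished, but not yet due.
    uint64_t next_seq_ = 0;
    std::string output_;
};
```

For example, delivering result 1 before result 0 leaves the output empty until result 0 arrives, at which point both are released together.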

Steinar H. Gunderson [Wed, 30 Sep 2020 22:00:29 +0000 (00:00 +0200)]

Quit on unknown option.

Steinar H. Gunderson [Wed, 30 Sep 2020 22:00:04 +0000 (00:00 +0200)]

Fix typo; plocate, not slocate.

Steinar H. Gunderson [Wed, 30 Sep 2020 21:59:27 +0000 (23:59 +0200)]

Credit the source text in usage().

Steinar H. Gunderson [Wed, 30 Sep 2020 21:59:14 +0000 (23:59 +0200)]

Support the --null option.

Steinar H. Gunderson [Wed, 30 Sep 2020 21:57:02 +0000 (23:57 +0200)]

Support scanning for multiple patterns.

Steinar H. Gunderson [Wed, 30 Sep 2020 21:50:47 +0000 (23:50 +0200)]

Start some basic command line options.

Steinar H. Gunderson [Wed, 30 Sep 2020 19:52:16 +0000 (21:52 +0200)]

Rerun clang-format.

Steinar H. Gunderson [Wed, 30 Sep 2020 19:20:33 +0000 (21:20 +0200)]

Support building without io_uring.

This is pretty hackish! It would be nice to be able to switch to
meson, but TurboPFor makes that a bit tricky right now.

Steinar H. Gunderson [Wed, 30 Sep 2020 19:20:17 +0000 (21:20 +0200)]

Add a missing .o to make clean.

Steinar H. Gunderson [Wed, 30 Sep 2020 18:03:54 +0000 (20:03 +0200)]

Fix a bug where posting lists intersecting to nothing would not early-abort properly.
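
As an illustration of what an early abort on an empty intersection means, here is a small sketch that intersects sorted posting lists one at a time and stops as soon as the running result is empty. The function `intersect_all` and its `lists_read` counter are hypothetical and do not reflect how plocate itself structures this code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical helper: returns the docids present in every (sorted) list.
// *lists_read reports how many lists were actually looked at.
std::vector<uint32_t> intersect_all(const std::vector<std::vector<uint32_t>> &lists,
                                    size_t *lists_read)
{
    *lists_read = 0;
    if (lists.empty()) {
        return {};
    }
    std::vector<uint32_t> cur = lists[0];
    *lists_read = 1;
    // Once `cur` is empty, no later list can add anything, so stop early.
    for (size_t i = 1; i < lists.size() && !cur.empty(); ++i) {
        std::vector<uint32_t> next;
        std::set_intersection(cur.begin(), cur.end(),
                              lists[i].begin(), lists[i].end(),
                              std::back_inserter(next));
        cur = std::move(next);
        ++*lists_read;
    }
    return cur;
}
```

In a setting where each list has to be fetched from disk, skipping the remaining lists is where the saving would come from.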

Steinar H. Gunderson [Wed, 30 Sep 2020 17:44:28 +0000 (19:44 +0200)]

Switch trigram lookup from binary search to a hash table.

Binary search was fine when we just wanted simplicity, but for I/O
optimization on rotating media, we want as few seeks as possible.
A hash table with open addressing gives us just that; Robin Hood
hashing makes it possible for us to guarantee maximum probe length,
so we can just read 256 bytes (plus a little slop) for each lookup
and that's it. This kills ~30 ms or so cold-cache.

This breaks the format, so we use the chance to add a magic and
a proper header to provide some more flexibility in case we want
to change the builder.
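
To make the bounded-probe argument concrete, here is a minimal in-memory sketch of open addressing with Robin Hood insertion. Everything in it is assumed for illustration (the `Slot` layout, the multiplicative hash, tracking `max_probe_` at insert time); it is not the on-disk format or the code from this commit. The property it demonstrates is that, once the maximum displacement is known, a lookup only ever needs to examine a fixed-size window of consecutive slots starting at the key's home position.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Slot {
    uint32_t key = 0;
    uint32_t value = 0;
    bool used = false;
};

// Illustrative only. Assumes unique keys and a table that never fills up.
class RobinHoodTable {
public:
    explicit RobinHoodTable(size_t num_slots) : slots_(num_slots) {}

    void insert(uint32_t key, uint32_t value)
    {
        Slot incoming{key, value, true};
        size_t pos = home(key);
        size_t dist = 0;  // Distance of `incoming` from its home slot.
        for (;;) {
            Slot &s = slots_[pos];
            if (!s.used) {
                s = incoming;
                if (dist > max_probe_) max_probe_ = dist;
                return;
            }
            size_t resident_dist = (pos + slots_.size() - home(s.key)) % slots_.size();
            if (resident_dist < dist) {
                // The resident sits closer to its home than we do to ours:
                // take its slot and continue by re-placing the resident.
                std::swap(s, incoming);
                if (dist > max_probe_) max_probe_ = dist;
                dist = resident_dist;
            }
            pos = (pos + 1) % slots_.size();
            ++dist;
        }
    }

    // Examines at most max_probe() + 1 consecutive slots.
    bool find(uint32_t key, uint32_t *value) const
    {
        size_t pos = home(key);
        for (size_t dist = 0; dist <= max_probe_; ++dist) {
            const Slot &s = slots_[pos];
            if (s.used && s.key == key) {
                *value = s.value;
                return true;
            }
            pos = (pos + 1) % slots_.size();
        }
        return false;
    }

    size_t max_probe() const { return max_probe_; }

private:
    size_t home(uint32_t key) const { return (key * 2654435761u) % slots_.size(); }

    std::vector<Slot> slots_;
    size_t max_probe_ = 0;
};
```

With a known probe bound and a fixed slot size, the window for one lookup can be fetched in a single contiguous read, which is what avoids extra seeks compared with binary search.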

Steinar H. Gunderson [Wed, 30 Sep 2020 17:46:44 +0000 (19:46 +0200)]

Remove unused variable.

Steinar H. Gunderson [Wed, 30 Sep 2020 17:30:34 +0000 (19:30 +0200)]

Test for errors from zstd.

Steinar H. Gunderson [Wed, 30 Sep 2020 16:17:30 +0000 (18:17 +0200)]

Some dprintf fixes for plocate-build.

Steinar H. Gunderson [Wed, 30 Sep 2020 08:20:10 +0000 (10:20 +0200)]

Replace mmap with io_uring.

This moves to explicit, asynchronous I/O through io_uring (Linux 5.1+),
which speeds up cold-cache behavior on rotating media by 3x or so.
It also removes any issues we might have with not fitting into 32-bit
address spaces.

If io_uring is not available, regular synchronous I/O will be used instead.
For now, there's a dependency on liburing, but it will be optional soon
for older systems.

Steinar H. Gunderson [Wed, 30 Sep 2020 08:20:01 +0000 (10:20 +0200)]

Remove an unused macro.

Steinar H. Gunderson [Mon, 28 Sep 2020 22:27:06 +0000 (00:27 +0200)]

Support patterns shorter than 3 bytes.

This isn't very efficient, but it appears to still be slightly better
than mlocate.

Steinar H. Gunderson [Mon, 28 Sep 2020 22:13:18 +0000 (00:13 +0200)]

Format everything with clang-format.

clang-format isn't ideal, but it's better than manual formatting
in the long run.

Steinar H. Gunderson [Mon, 28 Sep 2020 22:10:48 +0000 (00:10 +0200)]

Remove some commented-out code.

Steinar H. Gunderson [Mon, 28 Sep 2020 22:10:02 +0000 (00:10 +0200)]

Abstract out some details of reading the corpus into a class.

Steinar H. Gunderson [Mon, 28 Sep 2020 21:52:46 +0000 (23:52 +0200)]

Refactor scanning through a filename block into its own function.

Steinar H. Gunderson [Mon, 28 Sep 2020 21:12:16 +0000 (23:12 +0200)]

Remove a redundant #include.

Steinar H. Gunderson [Mon, 28 Sep 2020 19:55:58 +0000 (21:55 +0200)]

Optimize pending_docids storage for smaller posting lists.

The trigram distribution is long-tail, so allocating 128 docids
up-front was seemingly a waste. Saves ~20% more RAM in plocate-build.

Steinar H. Gunderson [Mon, 28 Sep 2020 19:45:04 +0000 (21:45 +0200)]

Encode posting lists as we go.

This was surprisingly undocumented (I had to read preprocessed
TurboPFor to figure it out), but we can now build posting lists
directly compressed in memory 128 elements at a time, which saves
another 30% RAM in plocate-build.

Steinar H. Gunderson [Mon, 28 Sep 2020 16:57:21 +0000 (18:57 +0200)]

Compress filename blocks as we read them.

This saves ~70% RAM in plocate-build, as we don't have to keep
all the uncompressed filenames at the same time.

Steinar H. Gunderson [Mon, 28 Sep 2020 07:54:49 +0000 (09:54 +0200)]

Hold compressed filenames more efficiently in memory.

std::string::resize() will rarely give memory back, so allocate
the string with the right size to begin with. This means we don't
carry around a lot of slop, saving ~15% RAM during build.

Steinar H. Gunderson [Mon, 28 Sep 2020 07:34:31 +0000 (09:34 +0200)]

Deduplicate docids as we go.

This saves ~50% RAM in the build step, now that we have blocking
(there's a lot of deduplication going on), and seemingly also
~15% execution time, possibly because of less memory allocation
(I haven't checked thoroughly).
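
A minimal sketch of deduplicating as we go, under the assumption that docids for a given trigram arrive in non-decreasing order (which would hold if filename blocks are processed sequentially). The helper name `add_docid` is hypothetical. With that ordering, a duplicate can only ever be equal to the last element, so a single comparison is enough and no later sort-and-unique pass is required.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: append `docid` unless it repeats the last entry.
// Assumes docids are added in non-decreasing order.
void add_docid(std::vector<uint32_t> &posting_list, uint32_t docid)
{
    if (posting_list.empty() || posting_list.back() != docid) {
        posting_list.push_back(docid);
    }
}
```

When many filenames in the same block share a trigram, all of them collapse into one entry, which is where the memory saving described above would come from.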

Steinar H. Gunderson [Sun, 27 Sep 2020 22:28:49 +0000 (00:28 +0200)]

Compress filenames with zstd.

Make blocks of 32 filenames each, and compress them with zstd -6
(the level is fairly arbitrarily chosen). This compresses the repetitive
path information very well, and also allows us to have shorter posting
lists, as they can point into the blocks (allowing dedup).

32 was chosen after eyeballing some compressed sizes, looking for
diminishing returns and then verifying it didn't cost much in terms
of search performance.
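
The blocking scheme can be sketched without the compression step. In the sketch below, `make_blocks` and `block_for_file` are hypothetical names, and joining filenames with a terminating NUL byte is an assumption rather than the documented file format; each resulting block is what would then be handed to zstd. Because a posting list refers to a block rather than to an individual file, every file in the same block maps to the same docid.

```cpp
#include <cstddef>
#include <string>
#include <vector>

constexpr size_t kBlockSize = 32;  // Filenames per block, as in the message above.

// The block (and thus the docid) that the file with this index belongs to.
inline size_t block_for_file(size_t file_index)
{
    return file_index / kBlockSize;
}

// Hypothetical helper: group filenames into blocks of kBlockSize, with each
// filename followed by a NUL byte. Compression is left out of this sketch.
std::vector<std::string> make_blocks(const std::vector<std::string> &filenames)
{
    std::vector<std::string> blocks;
    for (size_t i = 0; i < filenames.size(); ++i) {
        if (i % kBlockSize == 0) {
            blocks.emplace_back();  // Start a new block.
        }
        blocks.back() += filenames[i];
        blocks.back().push_back('\0');
    }
    return blocks;
}
```

Larger blocks give the compressor more repeated path prefixes to exploit, but a search has to decompress and scan a whole block per hit, which is the trade-off behind picking 32.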

Steinar H. Gunderson [Sun, 27 Sep 2020 22:28:43 +0000 (00:28 +0200)]

In build debug output, print the total size.

Steinar H. Gunderson [Sun, 27 Sep 2020 20:53:35 +0000 (22:53 +0200)]

Initial checkin.
