Since we have small strings, they can benefit from some shared context,
and zstd supports this. plocate-build now reads the mlocate database
twice; the first pass samples 1000 random blocks, which it uses to train
a 1 kB dictionary. (zstd recommends much larger dictionaries, but practical
testing seems to indicate this doesn't help us much, and might actually
be harmful.)
We get ~20% slower builds and ~7% smaller .db files -- but more
interestingly, linear search speed is up ~20% (which indicates that
decompression itself benefits even more). We need to read the 1 kB
dictionary up-front, but that is practically free, since it is so
small and stored right next to the header.
This is a version bump (to version 1), so we're not forward-compatible,
but we're backward-compatible (plocate still reads version 0 files
just fine). Since we're adding more fields to the header anyway,
we can add a new “max_version” field that allows marking
backwards-compatible changes in the future, i.e., cases where
plocate-build adds more information that plocate would like to use,
but that older plocate versions can simply ignore.
Do the access checking asynchronously if possible.
There are many issues involved:
- There's no access() support in io_uring (yet?), so we fake it
by doing statx() on the directory first, which primes the
dentry cache so that synchronous access() becomes very fast.
It is a bit tricky, since multiple access checks could be
going on at the same time, which then all need to wait
for the same statx() call.
- Not all kernels even support statx() in io_uring (support starts
with 5.6).
- Serialization now becomes two-level, and more involved.
We don't have an obvious single counter anymore, so we need
to be able to start a docid without knowing how many candidates
there are (and thus, be able to tell Serializer that we are
at the end).
- The --limit handling becomes trickier, since there can be more
calls on the way back. We solve this by moving the limit into
Serializer, and hard-exiting when we hit it.
- We need to prioritize statx() calls ahead of read(), so that
we don't end up with very delayed output when the new read()
calls generate even more statx() calls and we get a huge
backlog of calls. (We can't prioritize in the kernel, but we
can on the overflow queue we're managing ourselves.) This is
especially important with --limit.
This matches mlocate behavior, even the sort-of-strange behavior
of having the patterns non-anchored. Case-insensitive matching has also
been changed away from regex, since fnmatch() is seemingly slightly
faster.
Without changing the database format, this causes a bunch of extra
lookups. But somehow, it appears to go fairly well in practice.
Of course, case-sensitive will always be faster.
Move TurboPFor compilation to its own compilation unit.
This file takes so long to compile, especially with optimization
and/or ASan on, that it became a real annoyance whenever we were
modifying plocate.cpp for anything else. It takes away some genericity
we don't really use.
We could do the same thing with the encoder if need be.
Allocate 16 bytes extra as slop after every read buffer, so that
we know we never read outside allocated memory. (This is much easier
now that we have a TurboPFor implementation with clearly defined slop.)
This is much slower (plocate-build becomes ~6% slower or so),
but allows us to ditch the external TurboPFor dependency entirely,
and with it, the SSE4.1 demand. This should make us much more palatable
for most distributions.
The benchmark program is extended with some tests that all posting lists
in plocate.db round-trip properly through our encoder, which found a
lot of bugs during development.
By asking GCC to unroll the loop, and specializing for the bit width
using templatizing, we can get rid of a lot of the control overhead.
This takes us up from 60% to 80% of reference performance, still
without requiring anything more than SSE2.
This roughly doubles our speed, to 60% of the reference.
Unfortunately, we require some fairly elaborate gymnastics
to be able to use multiversioning and templates together,
and the new code isn't necessarily as easy to understand.
Support decoding the SIMD interleaved TurboPFor formats.
Our decoder is even slower for these than for the regular formats,
but at least it allows us to switch the format back. We'll see about
some mild optimization next.
std::unordered_map isn't the most performant hash table around;
replace it with a simple array of pointers. (An array of objects
would take >1GB RAM.) This costs ~120 MB fixed RAM overhead
(roughly doubling the RAM usage again for moderate-size corpora),
but also roughly doubles the build speed.
Make the builder write out filenames as they get compressed.
This saves yet more RAM in the builder; in fact, it roughly halves
the peak usage. The layout of the file changes somewhat
(the filenames come first instead of last), but it shouldn't
matter for performance.
Loosen up serialization to be about printing only.
This allows us to decompress, match and check access() for docids
that arrive out-of-order, instead of serializing their processing.
Especially the access() checking is good to have overlapping other
I/O if possible, even though it's synchronous.
If we ever get async access(), we'll need to rework Serializer
again, probably.