git.sesse.net Git - plocate/log

Steinar H. Gunderson [Sat, 21 Nov 2020 14:34:59 +0000 (15:34 +0100)]

Make DatabaseBuilder write the file atomically.

By opening with O_TMPFILE, we guarantee we'll never be leaving
an unfinished file visible on the filesystem. The move across the
old one isn't atomic, but the window of failure is very small now.
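
A minimal sketch of the O_TMPFILE pattern described above, not the actual DatabaseBuilder code. The helper name and its parameters are hypothetical, and the sketch assumes Linux with /proc mounted and a filesystem that supports O_TMPFILE. The old file is unlinked right before linkat(), which is the small non-atomic window the message refers to.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <string>

// Hypothetical helper: writes `data` to `dir`/`name` without ever exposing
// a half-written file. Returns false on any failure.
bool write_file_hidden_until_done(const std::string &dir, const std::string &name,
                                  const std::string &data)
{
    assert(!name.empty());

    // O_TMPFILE creates an unnamed inode in `dir`; nothing is visible yet.
    int fd = open(dir.c_str(), O_TMPFILE | O_WRONLY, 0644);
    if (fd == -1) {
        return false;  // E.g., the kernel or filesystem lacks O_TMPFILE.
    }

    const char *ptr = data.data();
    size_t left = data.size();
    while (left > 0) {
        ssize_t ret = write(fd, ptr, left);
        if (ret == -1 && errno == EINTR) {
            continue;  // Interrupted by a signal; just retry.
        }
        if (ret <= 0) {
            close(fd);  // The unnamed file vanishes when the fd is closed.
            return false;
        }
        ptr += ret;
        left -= static_cast<size_t>(ret);
    }

    // Give the finished file a name. linkat() refuses to overwrite an
    // existing file, so the old one is removed first (the non-atomic step).
    std::string path = dir + "/" + name;
    std::string proc_path = "/proc/self/fd/" + std::to_string(fd);
    unlink(path.c_str());
    bool ok = linkat(AT_FDCWD, proc_path.c_str(), AT_FDCWD, path.c_str(),
                     AT_SYMLINK_FOLLOW) == 0;
    close(fd);
    return ok;
}
```

If the process dies before linkat(), the kernel reclaims the unnamed inode, so no cleanup of stray temporary files is needed.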

Steinar H. Gunderson [Tue, 10 Nov 2020 18:01:48 +0000 (19:01 +0100)]

Remove unfinished debug code.

Steinar H. Gunderson [Tue, 10 Nov 2020 00:09:31 +0000 (01:09 +0100)]

Split DatabaseBuilder into its own compilation unit.

Steinar H. Gunderson [Mon, 9 Nov 2020 23:19:58 +0000 (00:19 +0100)]

When reading mlocate.db, properly skip the configuration block.

This could cause some entries to be skipped until we regained sync,
especially in the root directory.

Steinar H. Gunderson [Mon, 9 Nov 2020 22:50:20 +0000 (23:50 +0100)]

Encapsulate some database-building logic into a class.

Steinar H. Gunderson [Sat, 7 Nov 2020 10:22:52 +0000 (11:22 +0100)]

Escape file names with backticks in them.

Steinar H. Gunderson [Sat, 7 Nov 2020 10:16:46 +0000 (11:16 +0100)]

Bump version number.

Steinar H. Gunderson [Sat, 31 Oct 2020 22:04:39 +0000 (23:04 +0100)]

Release plocate 1.0.7.

Steinar H. Gunderson [Sat, 31 Oct 2020 21:27:41 +0000 (22:27 +0100)]

Fix an infinite loop when encountering invalid UTF-8 in file names.

Bug report and patch by Leah Neukirchen.

Leah Neukirchen [Sat, 31 Oct 2020 18:52:34 +0000 (19:52 +0100)]

Fix two typos in manpages.

Steinar H. Gunderson [Sat, 31 Oct 2020 14:13:38 +0000 (15:13 +0100)]

Run clang-format.

Steinar H. Gunderson [Sat, 31 Oct 2020 11:25:36 +0000 (12:25 +0100)]

Work around brokenness in FreeBSD mbtowc().

The manpage claims the return value should be 0 on a null byte,
just like on Linux, but in practice, it returns -1, so we need to
check for end-of-string manually.
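
A minimal sketch of such a manual end-of-string check (the function name is hypothetical, and this is not the actual plocate code). The loop stops on its own before the terminating null byte, so it never depends on what mbtowc() returns for it.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Counts the multibyte characters in a NUL-terminated string, using the
// current locale. Returns -1 if an invalid sequence is encountered.
int count_characters(const char *str)
{
    assert(str != nullptr);
    std::mbtowc(nullptr, nullptr, 0);  // Reset the conversion state.

    const char *end = str + std::strlen(str);
    int count = 0;
    while (str < end) {  // Manual end-of-string check; the NUL is never passed in.
        wchar_t wc;
        int ret = std::mbtowc(&wc, str, static_cast<size_t>(end - str));
        if (ret <= 0) {
            return -1;  // Invalid or incomplete sequence.
        }
        str += ret;
        ++count;
    }
    return count;
}
```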

Steinar H. Gunderson [Sat, 31 Oct 2020 11:24:34 +0000 (12:24 +0100)]

Check for endian.h before including it. Fixes compilation on FreeBSD.

Steinar H. Gunderson [Sat, 31 Oct 2020 10:57:29 +0000 (11:57 +0100)]

Add missing <endian.h> include.

Seemingly improves musl compatibility. Taken from the Void Linux
packaging repository.

Steinar H. Gunderson [Fri, 30 Oct 2020 23:44:14 +0000 (00:44 +0100)]

Bump version number.

Steinar H. Gunderson [Thu, 29 Oct 2020 23:06:10 +0000 (00:06 +0100)]

Release plocate 1.0.6.

Steinar H. Gunderson [Thu, 29 Oct 2020 22:42:01 +0000 (23:42 +0100)]

Escape unprintable characters when outputting filenames to a terminal.

Filenames are generally untrusted, and can contain any kind of cruft.
In particular, there have been terminals (hopefully not in wide use anymore!)
that will do insanity like running specific commands when seeing a
specific escape sequence. More prosaically, embedded newlines can
make for confusing output.

Thus, escape any nonprintable characters in a shell-parseable way,
much the same way GNU ls does these days. Also escape quotes, backslashes
and the like to make sure nothing unescaped looks like it's escaped.
This doesn't mean it's safe to take whatever and parse it uncritically
(we don't escape $, for instance), but it's generally good enough.

Escaping is disabled when doing zero-terminated output, or when printing
to a pipe or file.
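
A rough sketch of this kind of escaping, assuming a bash-style $'...' output form. The function name and the exact output format are illustrative assumptions; plocate's real output may differ. Bytes above 0x7f are passed through untouched here, which is a simplification.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: returns `name` unchanged if it needs no escaping,
// otherwise a $'...' string with control characters, quotes and
// backslashes escaped.
std::string escape_for_terminal(const std::string &name)
{
    std::string out;
    bool needs_quoting = false;
    for (unsigned char ch : name) {
        if (ch == '\\' || ch == '\'') {
            out += '\\';
            out += static_cast<char>(ch);
            needs_quoting = true;
        } else if (ch == '\n') {
            out += "\\n";
            needs_quoting = true;
        } else if (ch == '\t') {
            out += "\\t";
            needs_quoting = true;
        } else if (ch < 0x20 || ch == 0x7f) {
            char buf[5];  // Backslash, three octal digits, NUL.
            std::snprintf(buf, sizeof(buf), "\\%03o", static_cast<unsigned>(ch));
            assert(buf[4] == '\0');
            out += buf;
            needs_quoting = true;
        } else {
            out += static_cast<char>(ch);
        }
    }
    return needs_quoting ? "$'" + out + "'" : out;
}
```

A caller would presumably apply this only when stdout is a terminal (for instance after checking isatty()) and zero-terminated output is not requested, matching the rule stated above.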

Steinar H. Gunderson [Tue, 20 Oct 2020 16:55:37 +0000 (18:55 +0200)]

Fix a crash when we have too few blocks to train a dictionary.

Steinar H. Gunderson [Tue, 20 Oct 2020 16:53:58 +0000 (18:53 +0200)]

Support building databases from plaintext files.

This was already possible by uncommenting some code, but has now
been given a switch and also been made more robust.

Steinar H. Gunderson [Sat, 17 Oct 2020 13:03:39 +0000 (15:03 +0200)]

Add an alternative for __builtin_clz.

Speed isn't critical here, and this was ostensibly the last GCC-ism.
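
One possible portable replacement is a plain bit loop, sketched here with a hypothetical function name; the fallback actually committed may look different.

```cpp
#include <cassert>
#include <cstdint>

// Counts leading zero bits in a nonzero 32-bit value without compiler builtins.
int count_leading_zeros(uint32_t x)
{
    assert(x != 0);  // Like __builtin_clz, the result is undefined for zero.
    int n = 0;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        ++n;
    }
    return n;
}
```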

Steinar H. Gunderson [Sat, 17 Oct 2020 12:37:25 +0000 (14:37 +0200)]

Remove some unneeded __attribute__((unused)).

Steinar H. Gunderson [Sat, 17 Oct 2020 12:33:12 +0000 (14:33 +0200)]

Fix the function multiversioning Meson test.

The old one was seemingly too lenient, and would have false positives.

Steinar H. Gunderson [Sat, 17 Oct 2020 12:33:05 +0000 (14:33 +0200)]

Bump version number.

Steinar H. Gunderson [Sat, 17 Oct 2020 09:39:54 +0000 (11:39 +0200)]

Release plocate 1.0.5.

Steinar H. Gunderson [Sat, 17 Oct 2020 09:32:17 +0000 (11:32 +0200)]

Fix the -r short option.

Steinar H. Gunderson [Sat, 17 Oct 2020 09:10:41 +0000 (11:10 +0200)]

Support compiling on x86 platforms without working function multiversioning.

Steinar H. Gunderson [Sat, 17 Oct 2020 07:55:45 +0000 (09:55 +0200)]

clang-format.

Steinar H. Gunderson [Sat, 17 Oct 2020 07:55:18 +0000 (09:55 +0200)]

Add the missing end timing if linear scan and --debug are used together.

Steinar H. Gunderson [Sat, 17 Oct 2020 07:47:13 +0000 (09:47 +0200)]

Fix some inconsistencies in the man page.

Steinar H. Gunderson [Sat, 17 Oct 2020 07:46:55 +0000 (09:46 +0200)]

Implement the -b (--basename) option.

Steinar H. Gunderson [Fri, 16 Oct 2020 08:02:28 +0000 (10:02 +0200)]

Fix a wrong IWYU include.

Steinar H. Gunderson [Fri, 16 Oct 2020 08:01:28 +0000 (10:01 +0200)]

Fix detection of -latomic (it doesn't come from pkg-config).

Steinar H. Gunderson [Fri, 16 Oct 2020 07:26:41 +0000 (09:26 +0200)]

Add -latomic if it exists; seems to be required on armel and sh4.

Steinar H. Gunderson [Fri, 16 Oct 2020 07:27:14 +0000 (09:27 +0200)]

Bump the version number.

Steinar H. Gunderson [Thu, 15 Oct 2020 22:50:22 +0000 (00:50 +0200)]

Release plocate 1.0.4.

Steinar H. Gunderson [Thu, 15 Oct 2020 22:48:23 +0000 (00:48 +0200)]

Move the cache-flushing behavior into an undocumented option, so that one does not have to recompile to test it. (Drops setgid.)

Steinar H. Gunderson [Thu, 15 Oct 2020 22:42:25 +0000 (00:42 +0200)]

Move several needle/searching related functions into their own file.

Steinar H. Gunderson [Thu, 15 Oct 2020 22:36:20 +0000 (00:36 +0200)]

Move AccessRXCache into its own file.

Steinar H. Gunderson [Thu, 15 Oct 2020 22:23:11 +0000 (00:23 +0200)]

Run clang-format.

Steinar H. Gunderson [Thu, 15 Oct 2020 22:22:59 +0000 (00:22 +0200)]

Move Serializer into its own file.

Steinar H. Gunderson [Thu, 15 Oct 2020 21:41:16 +0000 (23:41 +0200)]

Merge non-results from worker threads to put less load on Serializer.

Steinar H. Gunderson [Thu, 15 Oct 2020 21:40:42 +0000 (23:40 +0200)]

Give the WorkerThread results a proper struct instead of std::tuple.

Steinar H. Gunderson [Thu, 15 Oct 2020 20:41:42 +0000 (22:41 +0200)]

Multithread linear scans.

When we have a scan that we cannot accelerate with trigrams
(very short patterns, or regexes), we need to go through all of
the file names like mlocate does. This is usually CPU-bound,
so fire up threads. We leave one core/hyperthread for the I/O
and add a thread for each of the rest (this is probably bad
on dualcore, but it's a simple thing that will do for now,
and should be fairly safe).

The bottleneck now is Serializer. I first tried just putting a
mutex on it, which worked fine on eight hyperthreads
(i.e., four real cores, my laptop), but caused huge contention with 40
(20 cores, my old dual-socket Haswell). Sending data back through
per-thread queues seems to work a lot better, but we're still
spending a lot of time in Serializer; witness that --count is
much faster for such a search.
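
A simplified sketch of the idea: each worker scans its own slice and collects matches in private storage, so no lock is shared during the scan. The names are hypothetical, a substring test stands in for the real matcher, and results are merged only after all threads finish, whereas the real code streams them back through queues to Serializer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Scans `filenames` for `needle` on `num_threads` threads and returns the
// matches in their original order.
std::vector<std::string> parallel_scan(const std::vector<std::string> &filenames,
                                       const std::string &needle,
                                       unsigned num_threads)
{
    assert(num_threads > 0);
    // One result vector per thread; a thread only ever touches its own slot.
    std::vector<std::vector<std::string>> per_thread(num_threads);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&filenames, &needle, &per_thread, num_threads, t] {
            size_t begin = filenames.size() * t / num_threads;
            size_t end = filenames.size() * (t + 1) / num_threads;
            for (size_t i = begin; i < end; ++i) {
                if (filenames[i].find(needle) != std::string::npos) {
                    per_thread[t].push_back(filenames[i]);
                }
            }
        });
    }

    // Join in slice order, so the merged output keeps the input order.
    std::vector<std::string> results;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads[t].join();
        results.insert(results.end(), per_thread[t].begin(), per_thread[t].end());
    }
    return results;
}
```

Following the message above, a caller might pass std::thread::hardware_concurrency() minus one as the thread count, leaving one core for I/O.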

Steinar H. Gunderson [Wed, 14 Oct 2020 22:56:37 +0000 (00:56 +0200)]

Don't flush the cache on plocate.db.

This was changed by mistake in an earlier patch.

Steinar H. Gunderson [Wed, 14 Oct 2020 22:56:23 +0000 (00:56 +0200)]

Bump version number.

Steinar H. Gunderson [Wed, 14 Oct 2020 22:13:52 +0000 (00:13 +0200)]

Release plocate 1.0.3.

Steinar H. Gunderson [Wed, 14 Oct 2020 21:35:54 +0000 (23:35 +0200)]

In plocate-build, open the file only once.

Steinar H. Gunderson [Wed, 14 Oct 2020 21:32:02 +0000 (23:32 +0200)]

If plocate-build cannot open the output file, give a proper error instead of crashing.

Steinar H. Gunderson [Wed, 14 Oct 2020 21:31:08 +0000 (23:31 +0200)]

Add some options for controlling installation and processing of the cron.daily script.

Steinar H. Gunderson [Wed, 14 Oct 2020 17:01:38 +0000 (19:01 +0200)]

Unbreak compilation for non-x86.

Steinar H. Gunderson [Wed, 14 Oct 2020 16:54:28 +0000 (18:54 +0200)]

Unbreak compilation of bench.

Steinar H. Gunderson [Tue, 13 Oct 2020 16:08:05 +0000 (18:08 +0200)]

Support --debug for plocate-build, and unbreak some debug printfs there.

Steinar H. Gunderson [Tue, 13 Oct 2020 15:55:53 +0000 (17:55 +0200)]

Fix --version in plocate-build.

Steinar H. Gunderson [Tue, 13 Oct 2020 15:46:20 +0000 (17:46 +0200)]

Use zstd dictionaries.

Since we have small strings, they can benefit from some shared context,
and zstd supports this. plocate-build now reads the mlocate database
twice; the first pass samples 1000 random blocks, which it uses to train
a 1 kB dictionary. (zstd recommends much larger dictionaries, but practical
testing seems to indicate this doesn't help us much, and might actually
be harmful.)

We get ~20% slower builds and ~7% smaller .db files -- but more
interestingly, linear search speed is up ~20% (which indicates that
decompression in itself benefits more). We need to read the 1 kB
dictionary, but it's practically free since it's stored next to the
header and so small.

This is a version bump (to version 1), so we're not forward-compatible,
but we're backward-compatible (plocate still reads version 0 files
just fine). Since we're adding more fields to the header anyway,
we can add a new “max_version” field that allows for marking
backwards-compatible changes in the future, i.e., if plocate-build
adds more information that plocate would like to use but that older
plocate versions can simply ignore.
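
One way to read the version/max_version split, as a sketch only. The struct layout, the enum and the exact comparison rules are assumptions made for illustration; only the field name max_version comes from the message above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of the two version fields in the header.
struct HeaderVersions {
    uint32_t version;      // Oldest format version a reader must understand.
    uint32_t max_version;  // Newest version whose optional extras are present.
};

enum class Compat { UNREADABLE, READABLE_IGNORING_EXTRAS, FULLY_UNDERSTOOD };

// Decides what a reader supporting formats up to `reader_version` can do
// with a file carrying the given header.
Compat check_compat(const HeaderVersions &hdr, uint32_t reader_version)
{
    assert(hdr.version <= hdr.max_version);
    if (reader_version < hdr.version) {
        return Compat::UNREADABLE;  // File is too new for this reader.
    }
    if (reader_version < hdr.max_version) {
        return Compat::READABLE_IGNORING_EXTRAS;  // Skip the unknown additions.
    }
    return Compat::FULLY_UNDERSTOOD;
}
```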

Steinar H. Gunderson [Mon, 12 Oct 2020 18:08:58 +0000 (20:08 +0200)]

Reuse zstd compression contexts, for a tiny speed boost.

Steinar H. Gunderson [Mon, 12 Oct 2020 18:04:26 +0000 (20:04 +0200)]

Bump version number.

Steinar H. Gunderson [Mon, 12 Oct 2020 07:56:32 +0000 (09:56 +0200)]

Release plocate 1.0.2.

Steinar H. Gunderson [Mon, 12 Oct 2020 07:56:21 +0000 (09:56 +0200)]

Add a NEWS file (pretty boring currently).

Steinar H. Gunderson [Mon, 12 Oct 2020 07:52:07 +0000 (09:52 +0200)]

Fix some 32-bit issues.

Steinar H. Gunderson [Sun, 11 Oct 2020 22:57:39 +0000 (00:57 +0200)]

Update the correct (generated) version of update-plocate.sh.

Steinar H. Gunderson [Sun, 11 Oct 2020 22:57:28 +0000 (00:57 +0200)]

Bump the version number.

Steinar H. Gunderson [Sun, 11 Oct 2020 22:22:51 +0000 (00:22 +0200)]

Release plocate 1.0.1.

Steinar H. Gunderson [Sun, 11 Oct 2020 22:22:38 +0000 (00:22 +0200)]

Make update-plocate.sh work properly if installed to /usr.

Steinar H. Gunderson [Sun, 11 Oct 2020 21:58:41 +0000 (23:58 +0200)]

Unbreak non-trigram matches after we changed to asynchronous access().

Non-trigram matches don't use async I/O, so they also can't use
async access(). Fix so that they don't segfault anymore.

Steinar H. Gunderson [Sun, 11 Oct 2020 19:33:23 +0000 (21:33 +0200)]

Correct section of plocate-build manpage.

Steinar H. Gunderson [Sun, 11 Oct 2020 19:33:13 +0000 (21:33 +0200)]

Bump version number.

Steinar H. Gunderson [Sun, 11 Oct 2020 18:45:23 +0000 (20:45 +0200)]

Release plocate 1.0.0.

Steinar H. Gunderson [Sun, 11 Oct 2020 17:59:11 +0000 (19:59 +0200)]

Do the access checking asynchronously if possible.

There are many issues involved:

- There's no access() support in io_uring (yet?), so we fake it
   by doing statx() on the directory first, which primes the
   dentry cache so that synchronous access() becomes very fast.
   It is a bit tricky, since multiple access checks could be
   going on at the same time, which then all need to wait
   for the same statx() call.
- Not even all kernels support statx() in io_uring (support starts
   from 5.6+).
- Serialization now becomes two-level, and more involved.
   We don't have an obvious single counter anymore, so we need
   to be able to start a docid without knowing how many candidates
   there are (and thus, be able to tell Serializer that we are
   at the end).
- Limit becomes more tricky, since there can be more calls on
   the way back. We solve this by moving limit into Serializer,
   and hard-exiting when we hit the limit.
- We need to prioritize statx() calls ahead of read(), so that
   we don't end up with very delayed output when the new read()
   calls generate even more statx() calls and we get a huge
   backlog of calls. (We can't prioritize in the kernel, but we
   can on the overflow queue we're managing ourselves.) This is
   especially important with --limit.
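
The first point above, several access checks waiting on one statx() call, could be sketched as a small bookkeeping class. This is an illustration with hypothetical names and no io_uring calls; the caller is assumed to submit the actual statx() whenever add_waiter() returns true and to call complete() when it finishes.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Tracks which directories have a lookup in flight and who is waiting on it.
class PendingDirectoryChecks {
public:
    // Registers `callback` for `dir`. Returns true if this is the first
    // waiter, meaning the caller must start a lookup for `dir`; returns
    // false if one is already in flight.
    bool add_waiter(const std::string &dir, std::function<void(bool)> callback)
    {
        bool first = (waiters.find(dir) == waiters.end());
        waiters[dir].push_back(std::move(callback));
        return first;
    }

    // Called once the lookup for `dir` has finished; wakes every waiter.
    void complete(const std::string &dir, bool accessible)
    {
        auto it = waiters.find(dir);
        assert(it != waiters.end());
        // Move the list out first, so callbacks may safely add new waiters.
        std::vector<std::function<void(bool)>> callbacks = std::move(it->second);
        waiters.erase(it);
        for (auto &cb : callbacks) {
            cb(accessible);
        }
    }

private:
    std::unordered_map<std::string, std::vector<std::function<void(bool)>>> waiters;
};
```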

Steinar H. Gunderson [Sun, 11 Oct 2020 17:57:20 +0000 (19:57 +0200)]

Use the PRId64 #define for formatting int64.
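
For reference, a small sketch of PRId64 in use (the wrapper function is hypothetical). The macro expands to the right printf length modifier for int64_t on the platform at hand, which avoids hard-coding %ld or %lld.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdio>
#include <string>

// Formats a 64-bit integer portably across 32-bit and 64-bit platforms.
std::string format_int64(int64_t value)
{
    char buf[32];  // Enough for the 20 characters of INT64_MIN plus NUL.
    int len = std::snprintf(buf, sizeof(buf), "%" PRId64, value);
    assert(len > 0 && len < static_cast<int>(sizeof(buf)));
    return std::string(buf, static_cast<size_t>(len));
}
```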

Steinar H. Gunderson [Sun, 11 Oct 2020 17:09:18 +0000 (19:09 +0200)]

Add debug output if io_uring initialization fails.

Steinar H. Gunderson [Sun, 11 Oct 2020 09:49:35 +0000 (11:49 +0200)]

Fix #include order.

Steinar H. Gunderson [Sun, 11 Oct 2020 09:49:00 +0000 (11:49 +0200)]

Remove some unneeded whitespace.

Steinar H. Gunderson [Sun, 11 Oct 2020 09:48:43 +0000 (11:48 +0200)]

Disallow limit <= 0.

Steinar H. Gunderson [Sun, 11 Oct 2020 08:11:43 +0000 (10:11 +0200)]

README updates.

Steinar H. Gunderson [Sun, 11 Oct 2020 08:07:38 +0000 (10:07 +0200)]

Add some man pages.

Steinar H. Gunderson [Sat, 10 Oct 2020 20:24:47 +0000 (22:24 +0200)]

Add support for some basic options in plocate-build; specifically, block size.

This also means it will stop segfaulting if no options are given.

Steinar H. Gunderson [Sat, 10 Oct 2020 20:24:27 +0000 (22:24 +0200)]

Implement support for larger basevals in TurboPFor.

Steinar H. Gunderson [Sat, 10 Oct 2020 19:43:20 +0000 (21:43 +0200)]

Support searching by regexp (brute force only).

Mostly for compatibility completeness.

Steinar H. Gunderson [Sat, 10 Oct 2020 19:26:30 +0000 (21:26 +0200)]

Write new --help text from scratch, so that we have nothing from mlocate except some structs.

Steinar H. Gunderson [Sat, 10 Oct 2020 18:39:44 +0000 (20:39 +0200)]

Add a --version option.

Steinar H. Gunderson [Sat, 10 Oct 2020 17:30:13 +0000 (19:30 +0200)]

Allow giving --debug to enable debugging (but drops setgid).

Steinar H. Gunderson [Sat, 10 Oct 2020 17:29:01 +0000 (19:29 +0200)]

Unbreak the --null long option.

Steinar H. Gunderson [Sat, 10 Oct 2020 17:18:52 +0000 (19:18 +0200)]

Use globs if there are wildcards in the pattern.

This matches mlocate behavior; even the sort-of strange behavior
of having them non-anchored. Case-insensitive matching has also
been changed away from regex, since fnmatch() is seemingly slightly
faster.
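
A sketch of non-anchored glob matching with fnmatch(). Wrapping the pattern in '*' is an assumption about how the non-anchoring could be achieved, and the function names are hypothetical. FNM_CASEFOLD is a GNU extension, so this assumes glibc or a compatible libc.

```cpp
#include <fnmatch.h>

#include <cassert>
#include <string>

// True if the pattern contains any glob metacharacter.
bool has_wildcards(const std::string &pattern)
{
    return pattern.find_first_of("*?[") != std::string::npos;
}

// Matches `pattern` anywhere inside `path`.
bool matches(const std::string &pattern, const std::string &path, bool ignore_case)
{
    assert(!pattern.empty());
    if (!has_wildcards(pattern) && !ignore_case) {
        return path.find(pattern) != std::string::npos;  // Plain substring search.
    }
    // Non-anchored: let the glob match in the middle of the path as well.
    std::string glob = "*" + pattern + "*";
    int flags = ignore_case ? FNM_CASEFOLD : 0;
    return fnmatch(glob.c_str(), path.c_str(), flags) == 0;
}
```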

Steinar H. Gunderson [Sat, 10 Oct 2020 17:09:36 +0000 (19:09 +0200)]

Some clang-formatting.

Steinar H. Gunderson [Sat, 10 Oct 2020 09:36:17 +0000 (11:36 +0200)]

Support case-insensitive searches.

Without changing the database format, this causes a bunch of extra
lookups. But somehow, it appears to go fairly well in practice.
Of course, case-sensitive will always be faster.
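
One plausible source of the extra lookups, sketched under the assumption that each trigram is expanded into all of its upper/lower-case variants and every variant is looked up separately. The function name is hypothetical and the sketch handles ASCII only.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Returns every upper/lower-case spelling of a three-byte ASCII trigram
// (up to 8 variants; fewer if some bytes have no case).
std::vector<std::string> case_variants(const std::string &trigram)
{
    assert(trigram.size() == 3);
    std::vector<std::string> variants = { "" };
    for (unsigned char ch : trigram) {
        char lower = static_cast<char>(std::tolower(ch));
        char upper = static_cast<char>(std::toupper(ch));
        std::vector<std::string> next;
        for (const std::string &prefix : variants) {
            next.push_back(prefix + lower);
            if (upper != lower) {
                next.push_back(prefix + upper);
            }
        }
        variants = std::move(next);
    }
    return variants;
}
```

Under this assumption, a pattern made only of letters costs up to eight index lookups per trigram instead of one, which illustrates why case-sensitive searches stay faster.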

Steinar H. Gunderson [Sat, 10 Oct 2020 08:35:41 +0000 (10:35 +0200)]

Generalize the sort+unique+erase pattern into unique_sort().
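
The pattern in question, as a sketch. The name unique_sort() comes from the message above; the signature shown here is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sorts the vector and removes duplicate elements in place.
template<class T>
void unique_sort(std::vector<T> *c)
{
    assert(c != nullptr);
    std::sort(c->begin(), c->end());
    // std::unique() only shifts duplicates to the tail; erase() drops them.
    c->erase(std::unique(c->begin(), c->end()), c->end());
}
```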

Steinar H. Gunderson [Sat, 10 Oct 2020 08:33:36 +0000 (10:33 +0200)]

Remove the double filtering of too large posting lists; we would not even start I/O for it anyway, so there is less to save than was assumed.

Steinar H. Gunderson [Fri, 9 Oct 2020 22:52:26 +0000 (00:52 +0200)]

Better printing of trigrams in debug messages, especially with non-ASCII characters.

Also ends up setting locale, which we'll be needing soon.

Steinar H. Gunderson [Fri, 9 Oct 2020 21:48:56 +0000 (23:48 +0200)]

Full scans (not trigram-based) would always print counts, even without -c. Fix.

Steinar H. Gunderson [Fri, 9 Oct 2020 08:06:00 +0000 (10:06 +0200)]

clang-format again (IWYU and clang-format seemingly disagree).

Steinar H. Gunderson [Thu, 8 Oct 2020 22:09:33 +0000 (00:09 +0200)]

Run include-what-you-use.

Steinar H. Gunderson [Thu, 8 Oct 2020 21:57:23 +0000 (23:57 +0200)]

Move TurboPFor compilation to its own compilation unit.

This file takes so long to compile, especially with optimization
and/or ASan on, that it became a real annoyance whenever we were
modifying plocate.cpp for anything else. Takes away some genericness
we don't really use.

We could do the same thing with the encoder if need be.

Steinar H. Gunderson [Thu, 8 Oct 2020 21:51:55 +0000 (23:51 +0200)]

clang-format.

Steinar H. Gunderson [Thu, 8 Oct 2020 20:52:47 +0000 (22:52 +0200)]

Fix a harmless memory leak in plocate-build.

Steinar H. Gunderson [Thu, 8 Oct 2020 20:51:04 +0000 (22:51 +0200)]

Fix some Valgrind issues in plocate-build.

Steinar H. Gunderson [Thu, 8 Oct 2020 20:41:41 +0000 (22:41 +0200)]

Make the searcher ASan-clean.

Allocate 16 bytes extra as slop after every read buffer, so that
we know we never read outside allocated memory. (This is much easier
now that we have a TurboPFor implementation with clearly defined slop.)

Steinar H. Gunderson [Thu, 8 Oct 2020 20:35:28 +0000 (22:35 +0200)]

Unbreak runs with no --limit.

Steinar H. Gunderson [Thu, 8 Oct 2020 20:23:40 +0000 (22:23 +0200)]

Document slop requirements for TurboPFor decoding.

Steinar H. Gunderson [Thu, 8 Oct 2020 18:49:30 +0000 (20:49 +0200)]

Implement the --limit option.

Steinar H. Gunderson [Thu, 8 Oct 2020 18:05:41 +0000 (20:05 +0200)]

Implement the --count option.
