Hi,

Here is the second iteration of my Xapian Guix package search
patchset. I have found the reason the earlier patchset did not show
significant speedup. It turns out that most of the time is spent in
printing and texinfo rendering of the search results. So, in this
patchset, I pre-render the search results while building the Xapian
index and stuff them into the Xapian database itself. Therefore, during
`guix search`, I just pull out the pre-rendered search results and print
them on the screen. This is much faster. See comparison below.

--8<---------------cut here---------------start------------->8---
With a warm cache,
$ time guix search inkscape

real	0m1.787s
user	0m1.745s
sys	0m0.111s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
$ time /tmp/test/bin/guix search inkscape

real	0m0.199s
user	0m0.182s
sys	0m0.024s
--8<---------------cut here---------------end--------------->8---

If most of the speedup comes from pre-rendering the results, it might
seem that the Xapian search is not so useful. We might as well have
stuffed the pre-rendered search results into the existing package cache
generated by generate-package-cache, or so it might seem. But, there are
the following arguments in favor of Xapian.

- The package cache would grow in size, and lookup would be slowed down
  because we need to load the entire cache into memory. Xapian, on the
  other hand, need only look up the specific packages that match the
  search query.
- Xapian can provide superior search results due to its stemming and
  language models.
- Xapian can provide spelling correction and query expansion -- that is,
  suggest search terms to improve search results. Note that I haven't
  implemented this yet, and it is out of scope for this patchset.

* Simplify our package search results

Why not use a simpler package search results format like Arch Linux or
Debian do?
We could just display the package name, version and synopsis like so.

inkscape 0.92.4
    Vector graphics editor
inklingreader 0.8
    Wacom Inkling sketch format conversion and manipulation

Why do we need the entire recutils format? If the user is interested,
they can always use `guix package --show` to get the full recutils
formatted info. Having shorter search results will make everything even
faster and much more readable. WDYT?

* How to test this patchset

To get guile-xapian, run a `guix pull`, if you haven't already. Then in
your Guix source directory, drop into an environment with guix
dependencies and guile-xapian.

$ guix environment guix --ad-hoc guile-xapian

Apply the patches in order (the cover letter is not a patch) and build.

$ git am v2-0001-build-self-Add-guile-xapian-to-Guix-dependencies.patch v2-0002-gnu-Generate-Xapian-package-search-index.patch v2-0003-gnu-Use-Xapian-index-for-package-search.patch
$ make

Run a test guix pull.

$ ./pre-inst-env guix pull --url=$PWD --branch=xapian -p /tmp/test

where xapian is the name of the branch you committed the patches
to. Then, run the guix search in /tmp/test.

$ /tmp/test/bin/guix search game

* Comments

Pierre Neidhardt writes:

>> +(define (search-package-index profile querystring)
>
> Maybe `query-string'?

Done in this patchset.

>> +  (define (regexp? str)
>> +    (string-any
>> +     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
>> +     str))
>> +
>> +  (if (and (current-profile)
>> +           (not (any regexp? patterns)))
>
> I would not put characters like ".", "$", or "+" here, lest we mistake a
> Xapian pattern for a regexp.
>
> As you said, I don't think both are compatible without ambiguity
> anyways, so we should probably drop regexp (or at least toggle them with
> a command line argument).

I agree.

zimoun writes:

> In the commit message, I would capitalize Xapian.

Done in this patchset.

>> +(define (generate-package-search-index directory)
>> +  "Generate under DIRECTORY a xapian index of all the available packages."
>
> Xapian with capital.

Done in this patchset.

> Is (make-stem "en") for the locale?

I still have English hard-coded. I haven't yet figured out how to detect
the locale and stem accordingly. But, there is a larger problem. Since
we cannot anticipate what locale the user will run guix search with,
should we build the Xapian index for all locales? That is, should we
index not only the English versions of the packages but also all other
translations as well?

> package-search-index and package-cache-file could be refactored
> because they share all the same code.

Yes, they could be. However, I'll postpone that to the next iteration of
the patchset.

> I do not know what is the convention for the bindings.
> But there is 'fold-packages' so I would be inclined to 'fold-msets' or
> something in this flavour.

Well, everywhere else in Guile we have such things as vhash-fold,
string-fold, hash-fold, stream-fold, etc. That's why I went with
mset-fold. Also, we are folding over a single mset (match-set). So, mset
should be in the singular.

> And more importantly, 'make as-derivations' to avoid a "guix pull" breakage,
> Ah do not forget to adapt some tests.

Will do this once we have consensus about the other features of this
patchset.

> b. The xapian relevance should truncated

Done in this patchset.

> Xapian does not return the package 'emacs' itself as the first. And worse,
> it is not returned at all.

In this patchset, since we're indexing the package name as well, emacs
is returned but it is still far from the beginning.

> I propose the value of 4294967295 for pagesize.

In this patchset, I pass (database-document-count db) as the
#:maximum-items keyword argument to enquire-mset. This is the upstream
recommended way to get all search results. I hadn't done this earlier
since I hadn't yet wrapped database-document-count in guile-xapian.

>> In this patchset, I have only indexed the package descriptions.
>> In the next version of this patchset, I will index all other terms as
>> specified in %package-metrics of guix/ui.scm.
>
> Yes, it appears to me a detail that should be easy to fix. I mean, it
> does not seems blocking.

Done in this patchset.

Ludovic Courtès writes:

> Note that ‘guix search’ time is largely dominated by I/O.

Yes, `guix search` is I/O intensive. That is why I expect Xapian to do
better since it only needs to access matching packages not all
packages. Also, the Xapian index is fast at all times. It is not very
dependent on a warm filesystem cache.

> On my laptop,
> I get (first measurement is cold cache, second one is warm cache):
>
> --8<---------------cut here---------------start------------->8---
> $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
> $ time guix search foo >/dev/null
>
> real	0m2.631s
> user	0m1.134s
> sys	0m0.124s
> $ time guix search foo >/dev/null
>
> real	0m0.836s
> user	0m1.027s
> sys	0m0.053s
> --8<---------------cut here---------------end--------------->8---
>
> It’s hard to do better on the warm cache case because at this level,
> there may be other things to optimize having little to do with searching
> itself.
>
> Note that this is on an SSD; the cold-cache case must be worse on NFS or
> on a spinning disk, and there we could gain a lot.

My laptop is quite old with a particularly slow HDD. Hence my motivation
to improve guix search performance!

> I think we should weigh the pros and cons on all these aspects: speed,
> complexity and maintenance cost, search result quality, search features,
> etc.

I agree.

> PS: I have not yet looked at the whole series as I’m just coming back to
> the keyboard. :-)

Welcome back! :-)

Arun Isaac (3):
  build-self: Add guile-xapian to Guix dependencies.
  gnu: Generate Xapian package search index.
  gnu: Use Xapian index for package search.

 build-aux/build-self.scm | 11 +++++++
 gnu/packages.scm         | 62 +++++++++++++++++++++++++++++++++++++++-
 guix/channels.scm        | 34 +++++++++++++++++++++-
 guix/scripts/package.scm |  7 +++--
 guix/self.scm            |  7 ++++-
 guix/ui.scm              | 37 ++++++++++++++++++++++++
 6 files changed, 153 insertions(+), 5 deletions(-)

-- 
2.25.1