Databases

Sunday, August 11. 2013

A Flask of full-text search in PostgreSQL

Update: More conventional versions of the slides are available from Google Docs or in on Speakerdeck (PDF)

On August 10, 2013, I gave the following talk at the PyCon Canada 2013 conference:

I’m a systems librarian at Laurentian University.

For the past six years, my day job and research have enabled me to contribute pretty heavily to Evergreen, an open source library system written largely in Perl and built on PostgreSQL.

But when I have the opportunity to create a project from scratch, for work or play, Python is my go-to language.

I promised to provide an example of a full-text search engine built with Flask and PostgreSQL written in under 200 lines of code; you can find that at either Gitorious or Github.

Last summer, the Laurentian University Library digitized 50 years of student newspapers - over 1,000 issues.

We posted them all to the Internet Archive and got OCR’d text as a result.

But finding things within a particular collection on the Internet Archive can be difficult for most people, so we felt the need to create a local search solution.

We were already using PostgreSQL to track all of the individual issues, with attributes like newspaper name, volume, edition, publication date.

This gave us the ability to filter through the issues by year and issue number, which was a good start. But we also wanted to be able to search the full text for strings like “Pierre Trudeau” or “Mike Harris”.

A common approach is to feed the data into a dedicated search engine like Solr or Sphinx, and build a search interface on top of that.

But PostgreSQL has featured full-text search support since the 8.3 release in 2008. So I opted to keep the moving parts to a minimum and reuse my existing database as my search platform.

Our example table contains a “doc” column of type TEXT; that’s where we store the text that we want to search and display. We also have a “tsv” column of type TSVECTOR to store the normalized text.

The TSVECTOR column is typically maintained by a trigger that fires whenever the corresponding row is created or updated. So... you just insert TEXT into the doc column, and the trigger maintains the tsv column for you.

PostgreSQL includes the tsvector_update_trigger() for your convenience, as well as a trigger that uses different language-oriented normalizations based on the value of a specified column. Naturally, you can define your own trigger that invokes the to_tsvector() function.

To provide good performance, as with any relational table, you need to define appropriate indexes. In the case of full-text search, you want to define a GIN (or GiST) index on the TSVECTOR column.

Note: GIN indexes take longer to build, but provide faster lookup than GiST.

Finally, we INSERT the text documents into the database. The ID and TSVECTOR columns are automatically generated for us.

Each text document is:

Parsed into tokens (words, white space, URIs, file paths)
Each token is then normalized with one or more dictionaries (word tokens may get stemming and stop words; URIs might have “www.” stripped; everything gets folded to lower case)
Et voila: the text-search vector for our text document has been created!

PostgreSQL bundles a number of language-specific dictionaries to support different stemming algorithms and default sets of stopwords.

In this example, we can see that PostgreSQL has stemmed “sketching” to remove the verb suffix, removed “the” altogether as a stop word, and stemmed “trees” to its singular form.

You can also see that the TSVECTOR stores the position where each token occurs, to support relevancy ranking algorithms that boost the relevancy of terms that are located closely together.

But it's a doozy of a query!

Yes, there is a lot going on here.

First, this is just a regular SQL statement that happens to use the WITH clause to define named subqueries (“q” and “ranked”).

The to_tsquery() function takes an incoming full-text search query and converts that into a parsed, normalized query.

The ts_rank_cd() function compares the TS_VECTOR column against the query to determine its relevancy score.

We need to restrict it to rows that match our query, so we use the @@ operator (PostgreSQL allows data-type specific operators like this) and then take the top ten.

Note: the query, limit, and offset are hardcoded here for illustrative purposes, but in the actual application these are supplied as parameters to the prepared SQL statement.

Finally, we use the ts_headline() function to give us a highlighted snippet of the results.

The harvester is a Python 3 application that uses the ‘postgresql’ module to connect to the database.

The REST and Web applications in the IA Harvester application are Python 2, largely because they use Flask (which iswas Python 2 only). But in the demo application, I’ve converted them to Python 3.

While I could have simply written the Web and REST applications as a single Flask Web app that talks directly to the database, I opted to couple them via a JSON interface for a few reasons:

Decoupling the web front end from the search back end means that I can augment or swap out either piece entirely without affecting the overall experience.
If performance becomes an issue, I could add caching where profiling warrants it.
If the service ends up being incredibly popular and PostgreSQL turns into a bottleneck, even with caching, I could set up a Solr or Sphinx instance instead.

I can change from Flask to another Web application framework on either piece.

I can separate the hosts if I need to throw more hardware at the service, and/or virtualize it on something like Google App Engine.

With so many Python web frameworks available, why Flask?

Routes as decorators was admittedly the biggest draw, simplifying routing enormously!
Solid Unicode support was a must as well, as our newspapers are bilingual (English / French).

At the time I opted to try out Flask, the project’s public stance towards Python 3 was not warm. However, with the 0.10 release in June 2013, all that has (thankfully!) changed; Python 3.3 is supported.

Flexible format:

query: gives us the original query so we can echo that back to the user
results: is a list of the highlighted snippets that we returned from our full-text search
years: a facet that for the years in which hits were found, including the number of hits per year
collections: a facet for the collections in which hits were found, including the number of hits per collection
meta:
- total: the total number of hits
- page: the page we’re on
- limit: the number of hits per page
- results: the number of hits on this page of results

We’ve already seen the decorator-route pattern before, and of course we need to quote, encode, and decode our search URL and results.

The Flask-specific parts are helper methods for getting GET params from the query string, and rendering the template (in this case, via Jinja2), by passing in values for the template placeholders.

At this point, the UI is functional but spartan. I’m a database / code guy, not a designer. Luckily, I have a student working on improving the front end (hi Emily!)

Further information:

Demonstration application:
- Gitorious repository / Github
Internet Archive harvester project: https://gitorious.org/ia-harvester
PostgreSQL full-text search:
- PostgreSQL official documentation
- Seek and ye shall find: PostgreSQL full-text search (presentation at PostgresOpen 2012)
Flask official docs
Jinja2 official docs

Posted by Dan Scott in Coding, PostgreSQL, Python at 11:38 | Comments (5)

Defined tags for this entry: coding, postgresql, python

Friday, July 5. 2013

PyCon Canada 2013 - PostgreSQL full-text search and Flask

On August 10, 2013, I'll be giving a twenty-minute talk at PyCon Canada on A Flask of full-text search with PostgreSQL. I'm very excited to be talking about Python, at a Python conference, and to be giving the Python audience a peek at PostgreSQL's full-text search capabilities. With a twenty minute slot, I'll be leaning on my code4lib experience to compress the right amount of technical information into an entertaining package.

Setting aside my talk, the line-up for PyCon Canada looks fantastic; the keynote speakers are Karen Brennan, Hilary Mason, and Jakob Kaplan-Moss, and there are a ton of great talks. Did I mention that I'm really looking forward to this conference?

Update 2013-07-11: Now that the schedule is official, the presentation URL needed to be updated. Also, the impetus for this proposal came straight from PGCon 2013, where the PostgreSQL community was urged to get the good word out about PostgreSQL to other communities. Et voila!

Posted by Dan Scott in Coding, FSOSS, PostgreSQL, Python at 05:00 | Comments (0)

Defined tags for this entry: coding, fsoss, postgresql, python

Friday, February 15. 2013

Introducing SQL to Evergreen administrators, round two

Three years ago I was asked to create and deliver a two-day course introducing SQL to Evergreen users. Things went well and I was able to share the resulting materials with the Evergreen and PostgreSQL community. Perhaps one of my happiest moments at the Evergreen conference last year was when one of the participants in that course, told me that many of his fellow participants were still successfully writing SQL queries and getting work done. Huzzah!

Time goes by and another group, OHIONET, was running into difficulties getting started with PostgreSQL and Evergreen. They asked me if I would be willing to give the same sort of training I had given a few years back. "Sure", I said, thinking it would be a great opportunity to polish the materials and add some updates to cover new features in PostgreSQL and Evergreen. We also opted to skip the travel and do an entirely virtual training session via Google Hangouts, which worked out rather nicely (but that's a different story).

As it turned out, I probably ended up putting about four days worth of effort (crammed into lots of late nights, weekends, and vacation days) into overhauling the instruction materials. But the results were worth it, in my opinion; I'm rather proud of the content, and while I believe it stands up on its own, the guidance that I was able to provide during the live instruction sessions was well-received by the participants.

Thus, I am pleased to be able to offer to the broader community the latest version of the Introduction to SQL for Evergreen Administrators, under a Creative Commons Attribution-ShareAlike 3.0 (Unported) license.

Reference documentation--30 pages introducing SQL with examples drawn from the Evergreen schema: (HTML) (PDF) (ePub) (AsciiDoc)
Presentation: (LibreOffice Impress) (PDF)
Solutions to exercises: (Day 1) (Day 2)

So, a huge thanks to OHIONET for giving me the impetus to overhaul this material, and for giving me a chance to introduce them to the wonders of SQL with PostgreSQL, and to the inner workings of the Evergreen schema. It was a blast! And thanks for agreeing to let me share these materials with the broader community.

Posted by Dan Scott in Evergreen, PostgreSQL at 21:32 | Comment (1)

Defined tags for this entry: evergreen, postgresql

Thursday, October 18. 2012

Triumph of the tiny brain: Dan vs. Drupal / Panels

A while ago I inherited responsibility for a Drupal 6 instance and a rather out-of-date server. (You know it's not good when your production operating system is so old that it is no longer getting security updates).

I'm not a Drupal person. I dabbled with Drupal years and years ago when I was heavily into PHP, but it never stuck with me. Every time I poked around at the database schema, with serialized objects stuck inside columns, I found something else that I wanted to work on instead. Thus, inheriting a Drupal instance wasn't something I had been looking forward to. As this production server was running a number of different services that were in use by our library, I went through a number of trial runs to ensure that the base packages wouldn't introduce regressions or outages. Fast-forward past a reasonably successful early-morning upgrade from Debian Lenny to Squeeze and I was able to start looking at addressing the Drupal instance that was also approximately 18 months out of date.

Initially, after I worked out the how-to of Drupal upgrades (in short: upgrade just Drupal core, then upgrade the modules), I thought all was well. I even got over the hump of realizing that our instance had had all of the modules dumped into Drupal's core directory, rather than sites/all/modules, and (even more impressively) got over the problem that the core bluemarine them had been hacked directly rather than having been separated out into a new custom theme. After working through those learning pains, I realized that somewhere in all of the Drupal and module upgrades, that something got "more secure" and started truncating IMG links to files with spaces in them at the first space. So "foo%20bar.jpg" was becoming "foo.jpg" and we were getting 404s everywhere.

Did I mention that I didn't notice this until I upgraded our production instance? Oh yes, I went through iteration after iteration of upgrades on the test server, and dutifully fixed up the problems that I found in the subset of content that I was testing against. I discovered and fixed problems like the production server content linking directly to the test server (slight copy-and-paste errors on the part of the content creators, I suppose). But I didn't notice all of the 404s, because who uploads images with spaces in their filename?

Turns out, everyone else in my library does that. Of course! And from what I was able to piece together via Google and browsing drupal.org, there was supposed to be some sanitization of the incoming filenames so that spaces would be normalized, etc. But either that wasn't introduced until well after our content had been created, or my predecessor had lightly hacked one of the modules, or Drupal itself, and hadn't bothered to use a source code repository to track those customizations. So, realizing that I needed to make some bulk changes, I went at it with a two-step plan:

Create symbolic links for both the truncated filename and the spaces-normalized-to-underscores filenames. Creating symlinks for the truncated filenames would fix the 404s immediately, at the cost of some clash in the intended targets; there were plenty of Foo illustration.JPG and Foo info.JPG pairs of files, but like the Highlander, there can be only on Foo.JPG.
Munge the database entries so that all of those now apparently insecure %20-containing filenames would become underscores.

If you're a Drupal user or a Drupal with Panels module user, you might know that the database schema suffers from some fairly horrible tricks being played on it. In this case, the Panels module creates a panels_pane table with a configuration TEXT column. Based on the name alone, it might seem odd that column is used to store the HTML content of the corresponding panel. Even odder to me is that this is not just a TEXT column, it's a column that expects a very particular structure - something like:

a:5:{s:11:"admin_title";s:5:"RACER";s:5:"title";s:0:"";s:4:"body";s:639:"<p><img width="225" height [...]}

Ah, nothing like storing an object within a single database column. Of particular interest was the result that I had when I tested updating the column value with a basic "replace(configuration, '%20', '_')" - the panel showed only n/a, presumably because the size (defined by the s properties in the object) for the "body" text property was no longer a match. That would be an instance of http://drupal.org/node/926448 - so okay, clearly I had to change tactics and update the entire object.

I tried quickly finding the Drupal way to do this: clearly there's an API and there must be some simple way to retrieve an object, change it's values, and update it so that the serialized object gets stored in the database and Drupal is happy. However, I couldn't find a simple tutorial, and trying #drupal on Freenode was unfortunately fruitless as well (although some people did try to suggest running REPLACE() at the database level, that was nice but they didn't recognize that that would actually damage things significantly).

So... out came the Perl, and here's what I hacked together:

#!/bin/perl
use strict;
use warnings;

foreach (<DATA>) {
    chomp();
    my $i = 0;
    my $body = 0;
    my @fixed;
    my @row = split /\t/;
    my $pid = $row[1];
    my $configuration = $row[0];
    my @chunks = split /";s:/, $configuration;
    foreach my $chunk (@chunks) {
        if (!$i++) {
            push @fixed, $chunk;
            next;
        }
        if ($chunk =~ m/"body/) {
            $body = 1;
            push @fixed, $chunk;
            next;
        }
        if ($body) {
            my ($length, $content) = $chunk =~ m/^(\d+):"(.+)$/;
			for (my $j = 0; $j < 50; $j++) {
            $content =~ s{(/pictures/[^\./]+?)%20}{$1_}g;
			}
            $content =~ s{%20}{+}g;
            $length = length($content);
            $chunk = "$length:\"$content";
            $body = 0;
        }
        push @fixed, $chunk;
    }
    print 'UPDATE panels_pane SET configuration = $ashes$' . 
        join('";s:', @fixed) . '$ashes$' . " WHERE pid = $pid;\n";
}
__DATA__

Against the trusty database (I ? PostgreSQL!), I ran COPY (SELECT configuration, pid FROM panels_pane WHERE configuration ~ '%20') TO 'conf_pids.out';, then slapped the Perl code on top and generated a load of UPDATE statements. It's far from my best Perl code, but it worked and once I gave up on doing things the Drupal way I was able to put it together in a handful of minutes. I now have a functional Drupal 6 instance again, updated such that there are no known security vulnerabilities with either core or the modules we're using, and there are no broken image links.

And now I need to begin working towards either grokking Drupal, or finding a content management system that my tiny brain can comprehend, because I don't want to have to go through these kinds of contortions again with future upgrades... Suggestions welcome!

Posted by Dan Scott in Coding, Perl, PostgreSQL at 17:48 | Comments (2)

Defined tags for this entry: coding, perl, postgresql

(Page 1 of 5, totaling 20 entries) » next page

Quicksearch

About Me

I'm Dan Scott: barista, library geek, and free-as-in-freedom software developer. I hack on projects such as the Evergreen open-source ILS project and PEAR's File_MARC package .

By day I'm the Systems Librarian for Laurentian University. You can reach me by email at [email protected].

License

This work is licensed under a Creative Commons Attribution-Share Alike 2.5 Canada License.

Firefox Beta for Android

Coffee|Code : Dan Scott