Tuesday, August 26, 2008

Search|Extending Django’s database API to include full-text search

Extending Django's database API to include full-text search

A Mercurytide white paper: 1st August 2006.

Introducing Django

Django is an open-source web application framework that—to quote its creators—encourages rapid development and clean, pragmatic design. It is written in Python, and provides components to build high-quality web applications: an object-relational mapper, a full-featured templating system, URL dispatching, session management, authentication, and much more besides.

One valuable measure of a framework is the ease with which it can be extended to meet one's needs. Here we will introduce Django's database API, and demonstrate its flexibility by extending it to support MySQL's full-text search capabilities.

The source code included in this article is released into the public domain.

Django's Database API

Django's object-relational mapper provides a rich API for building database queries.

Django's object-relational mapper provides a rich API for building database queries. This API insulates the application developer from the details of the SQL statements, while still allowing queries with complex selection criteria to be made in an efficient manner. Another notable benefit of this approach is that it removes entirely the possibility of SQL injection attacks, as the application developer is no longer inserting values directly into SQL statements.

Objects are retrieved from the database with a QuerySet, which is an abstraction of an SQL SELECT statement. QuerySets provide methods that narrow down the results to objects matching specific criteria, much like the WHERE clause in an SQL statement—in fact, behind the scenes a QuerySet builds an SQL statement as its methods are called. QuerySet instances are obtained from a model class's Manager instance, which is normally called objects. Here are a few examples of the use of QuerySets:

# Retrieve the QuerySet containing all articles
articles = Article.objects.all()

# Include only articles written before this year
articles = articles.filter(posted_date__lt='2006-01-01')

# But leave out those written by me
articles = articles.exclude(author__exact='Andrew')

# And finally, sort them by rating and date
articles = articles.order_by('rating', 'posted_date')

QuerySets can be filtered and sorted very cheaply, as they are lazily evaluated: these actions manipulate the QuerySet's internal SQL statement, and the statement is not executed until you try to access the results of the QuerySet, by iteration or slicing, for example.

# Get the top five articles; only now is the database accessed.
a = articles[:5]

To extend this interface, we'll develop a Manager subclass and a QuerySet subclass, but first we'll briefly describe MySQL's full text search.

Using the Search Component

We'll now demonstrate the use of these subclasses. Here is a simple model that represents articles posted on a web site; so that it can be searched, we create a SearchManager instance and assign it to objects:

from django.db import models
from fulltext.search import SearchManager

class Article(models.Model):
posted_date = models.DateField(db_index=True)
title = models.CharField(maxlength=100)
text = models.TextField()

# Use a SearchManager for retrieving objects,
# and tell it which fields to search.
objects = SearchManager(('title', 'text'))

class Admin:
pass

def __str__(self):
return "%s (%s)" % (self.title, self.posted_date)

Articles have a title, body text, and the date they were posted. We will define a FULLTEXT INDEX on the title and text columns in the database, and we pass a tuple of the corresponding field names to the SearchManager instance to declare that here. Here's the SQL statement that creates the index:

CREATE FULLTEXT INDEX fulltext_article_title_text
ON fulltext_article (title, text);

Given a Django project, an application that defines the Article model, and a database populated with suitable articles, the full text search can be easily demonstrated with Python's interactive interpreter:

>>> # How many articles are there?
>>> len(Article.objects.all())

12

>>> # Find articles about frameworks:
>>> Article.objects.search('framework')

[<Article: Introducing Django (2006-07-17)>,
<Article: Django on Windows HOWTO (2006-04-03)>,
<Article: Django for non-programmers (2006-05-02)>]


>>> # Show the relevance score for these articles:
>>> [(a, a.relevance)
... for a in Article.objects.search('framework')]

[(<Article: Introducing Django (2006-07-17)>, 1.0980),
(<Article: Django on Windows HOWTO (2006-04-03)>, 0.6015),
(<Article: Django for non-programmers (2006-05-02)>, 0.5842)]


>>> # Limit the results to those posted before June:
>>> Article.objects.search('framework').filter(
... posted_date__lt='2006-06-01')

[<Article: Django on Windows HOWTO (2006-04-03)>,
<Article: Django for non-programmers (2006-05-02)>]


>>> # Note that filter() also returns a SearchQuerySet:
>>> Article.objects.filter(
... posted_date__lt='2006-06-01').search('framework')

[<Article: Django on Windows HOWTO (2006-04-03)>,
<Article: Django for non-programmers (2006-05-02)>]

Final Remarks

I'll now let you in on a secret: since the start of June, Django has supported a search operator for queries that uses MySQL's full text search:

# This uses the Boolean search mode, not
# natural language queries
Article.objects.filter(title__search='+Django -Rails')

However, the techniques demonstrated in this article can be built upon extend the database API to support any SQL feature, whether that's supporting full text search, grouping and aggregate queries, or any other SQL feature or database-specific extension.

Technorati tags
;
;
;
;
; and
.

References

Mercurytide is a forward thinking, dynamic, and innovative Internet applications development company. We are well-established with a proven track-record within the industry of attracting blue chip clients from around the world. We produce monthly white papers on a variety of technology-orientated topics. For more details see the Mercurytide web site.

No comments: