Amy Unruh, Oct 2012
Google Developer Relations
Introduction
This lesson covers the basics of using the Search API: indexing content and making queries on an index. In it, you'll learn how to:
- Create a search index
- Add content to it via an index document
- Make simple full-text search queries on that indexed data
Objectives
Learn the basics of using the App Engine Search API.
Prerequisites
- Python 2.7 and the Google App Engine SDK for Python
- Basic understanding of Python
- Familiarity with Google App Engine
Indexes
App Engine's Search API operates through an
Index object. This object lets you
store data via an index document, retrieve documents using search queries,
modify documents, and delete documents.
Each index has an index name and, optionally, a namespace. The name uniquely
identifies the index within a given namespace. It must be a visible, printable
ASCII string not starting with !. Whitespace characters are excluded. You can
create multiple Index objects, but any two such objects that have the same
index name in the same namespace reference the same index.
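The naming rule above can be checked before an index is created. The following is a minimal sketch in plain Python: the helper name is_valid_name is hypothetical and not part of the Search API, and rejecting the empty string is an assumption. It checks only the rule stated here, not any length limit the service may also enforce.

```python
# Visible, printable ASCII runs from '!' (0x21) through '~' (0x7E).
# Whitespace and control characters fall outside this range.
_VISIBLE_ASCII = frozenset(chr(code) for code in range(0x21, 0x7F))


def is_valid_name(name):
    """Return True if `name` follows the naming rule described above.

    Hypothetical helper, not part of the Search API.
    """
    # Assumption: an empty name is treated as invalid.
    if not name or name.startswith('!'):
        return False
    return all(ch in _VISIBLE_ASCII for ch in name)


print(is_valid_name('productsearch1'))  # True
print(is_valid_name('!products'))       # False
print(is_valid_name('product search'))  # False
```

The same rule is stated later for document IDs, so a check like this could be reused there.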
You can use namespaces and indexes to organize your documents. For the example product search application, all the product documents are in one index, with another index containing information about store locations. We can filter a query on the product category if we want to search for, say, only books.
In your code, you create an Index object by specifying the index name:
from google.appengine.api import search
index = search.Index(name='productsearch1')
or
index = search.Index(name='yourindex', namespace='yournamespace')
The underlying document index will be created at first access if it does not already exist; you don't have to create it explicitly.
You can't currently delete indexes, though you can delete documents from them, as will be described in the next class, A Deeper Look at the Python Search API.
Documents
Documents hold an index's searchable content. A document is a container for
structuring indexable data. From a technical point of view, a
Document object represents a
collection of fields and is uniquely identified by a document ID.
Fields are named, typed values. Documents do not have kinds in the
same sense as Datastore entities.
In our example application, for instance, our product categories are books and
HD televisions. The store has a rather limited selection of products. Each
product document in the example application always includes the following core
fields, defined by docs.Product class variables:
- CATEGORY (set to books or hd_televisions)
- PID (product ID)
- PRODUCT_NAME
- DESCRIPTION
- PRICE
- AVG_RATING
- UPDATED (date of last update)
The books and HD televisions categories each have some additional fields of their own. For books, the extra fields are:
- title
- author
- publisher
- pages
- isbn
For HD televisions, they are:
- brand
- tv_type
- size
The application itself enforces semantic consistency for documents of each product type. That is, all product documents will always include the same core fields, all books have the same set of additional fields, and so on. However, a search index doesn't impose any cross-document schema consistency on the fields that are used, so there is no explicit concept of querying for "product" documents specifically.
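Because the index does not enforce a schema, any consistency check has to live in application code. The following is a minimal sketch of such a check. The field name strings in CORE_FIELDS and the helper missing_core_fields are hypothetical; the example application's real names are defined on docs.Product and are not shown here.

```python
# Hypothetical core field names, for illustration only.
CORE_FIELDS = frozenset([
    'category', 'pid', 'product_name', 'description',
    'price', 'avg_rating', 'updated',
])


def missing_core_fields(fields):
    """Return the sorted core field names absent from a list of fields.

    Each element only needs a `name` attribute, which Search API field
    objects provide.
    """
    present = set(field.name for field in fields)
    return sorted(CORE_FIELDS - present)
```

An application could call a helper like this just before constructing a Document and refuse to index a product whose result is not an empty list.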
Field types
Each document field has exactly one field type. The type can be any of the
following, all of which are defined in the Python search module:
- TextField: A plain text string.
- HtmlField: HTML-formatted text. If your string is HTML, use this field type, as the Search API can take the markup into account when creating result snippets and in document scoring.
- AtomField: A string treated as a single token. A query will not match if it includes only a substring rather than the full field value.
- NumberField: A numeric (integer or floating-point) value.
- DateField: A date with no time component.
- GeoField: A geographical location, denoted by a GeoPoint object specifying latitude and longitude coordinates.
For text fields (TextField, HtmlField, and AtomField), the values should
be Unicode strings.
Example: Building product document fields and creating a document
To construct a Document object, you build a list of its fields, define its
document ID if desired, and then pass this information to the Document
constructor.
The example application uses the TextField, AtomField, NumberField, and
DateField field types for product documents.
Defining the product document fields
The core product fields (those which are included in all product documents) look like this, where we assume the value arguments of the constructors below are set to appropriate values:
import datetime

from google.appengine.api import search
...
fields = [
    search.TextField(name=docs.Product.PID, value=pid),  # the product ID
    # The 'updated' field is set to the current date.
    search.DateField(name=docs.Product.UPDATED,
                     value=datetime.datetime.now().date()),
    search.TextField(name=docs.Product.PRODUCT_NAME, value=name),
    search.TextField(name=docs.Product.DESCRIPTION, value=description),
    # The category names are atomic.
    search.AtomField(name=docs.Product.CATEGORY, value=category),
    # The average rating starts at 0 for a new product.
    search.NumberField(name=docs.Product.AVG_RATING, value=0.0),
    search.NumberField(name=docs.Product.PRICE, value=price),
]
Note that the category field is typed as AtomField. Atom fields are useful for
things like categories, where exact matches are desired; Text fields are better
for strings like titles or descriptions. One of our example categories is hd
televisions. If we search for just televisions, we will not get a match
(assuming that that string is not contained in another product field). But, if
we search for the full field string, hd televisions, we will match on the
category field.
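The difference between the two matching behaviors can be illustrated with a deliberately simplified model in plain Python. This is only a sketch of the idea, not how the Search API is implemented: both functions are hypothetical, and the model ignores case handling, stemming, and punctuation.

```python
def text_field_matches(field_value, query):
    """Simplified text-field model: each query word must be a token."""
    tokens = set(field_value.split())
    return all(word in tokens for word in query.split())


def atom_field_matches(field_value, query):
    """Simplified atom-field model: the query must equal the whole value."""
    return field_value == query


category = 'hd televisions'
print(atom_field_matches(category, 'televisions'))     # False
print(atom_field_matches(category, 'hd televisions'))  # True
print(text_field_matches(category, 'televisions'))     # True
```

In this model, typing the category as a text field would let a search for televisions match on the category, which is exactly what the atom typing prevents.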
The example application also includes fields specific to individual product
categories. These are added to the field list as well, depending on the
category. For example, for the television category, there are additional fields
for size (a number field), brand, and tv_type (text fields). Books have a
different set of fields.
Creating Documents
Given the field list, we can create a document object. For each product document, we'll set its document ID to be the predefined unique ID of that product:
d = search.Document(doc_id=product_id, fields=fields)
This design has some advantages for us (as we'll discuss in the follow-on class to this one), but if we didn't specify the document ID, one would be generated for us automatically when the document is added to an index.
Example: Using geopoints in store location documents
The Search API supports Geosearch on documents that include fields of type
GeoField. If your documents contain such fields, you can query an index for
matches based on distance comparisons.
A location is defined by the
GeoPoint class, which stores
latitude and longitude coordinates. The latitude specifies the angular distance,
in degrees, north or south of the equator. The longitude specifies the angular
distance, again in degrees, east or west of the prime meridian. For example, the
location of the Opera House in Sydney is defined by
GeoPoint(-33.857, 151.215). To store a geopoint in a document, you need to add
a GeoField field with a GeoPoint object set as its value.
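To make "distance comparisons" concrete, here is a minimal sketch of a great-circle distance calculation between two latitude/longitude pairs. It is plain Python and does not use the SDK: Point is a stand-in for search.GeoPoint, the helper names are hypothetical, and the haversine formula with a mean Earth radius is an assumption chosen for illustration, not a description of how the Search API computes distances.

```python
import math
from collections import namedtuple

# Stand-in for search.GeoPoint so the sketch runs without the SDK.
Point = namedtuple('Point', ['latitude', 'longitude'])

EARTH_RADIUS_M = 6371000.0  # approximate mean Earth radius, in meters


def is_valid_point(point):
    """Latitude must lie in [-90, 90] and longitude in [-180, 180]."""
    return (-90.0 <= point.latitude <= 90.0 and
            -180.0 <= point.longitude <= 180.0)


def distance_m(a, b):
    """Great-circle distance in meters between two points (haversine)."""
    lat1 = math.radians(a.latitude)
    lat2 = math.radians(b.latitude)
    dlat = lat2 - lat1
    dlon = math.radians(b.longitude - a.longitude)
    h = (math.sin(dlat / 2) ** 2 +
         math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, h)))


opera_house = Point(-33.857, 151.215)
print(is_valid_point(opera_house))  # True
```

A query that asks for stores within some radius of a user's position is, conceptually, comparing a value like this against a threshold for each store location.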
Here is how the fields for the store location documents in the product search application are constructed:
from google.appengine.api import search
...
geopoint = search.GeoPoint(latitude, longitude)
fields = [
    search.TextField(name=docs.Store.STORE_NAME, value=storename),
    search.TextField(name=docs.Store.STORE_ADDRESS, value=store_address),
    search.GeoField(name=docs.Store.STORE_LOCATION, value=geopoint),
]
Indexing documents
Before you can query a document's contents, you must add the document to an
index, using the Index object's
put() method. Indexing
allows the document to be searched with the Search API's query language and
query options.
You can specify your own document ID when constructing a document. The document
ID must be a visible, printable ASCII string not starting with !. Whitespace
characters are excluded. (As we'll see later, if you index a document using the
ID of an existing document, that existing document will be reindexed.) If you
don't specify a document ID, a unique numeric ID will be generated automatically
when the document is added to the index.
You can add documents one at a time, or alternatively you can add a list of documents in batch, which is more efficient. Here's how to construct a document, given a fields list, and add it to an index:
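import logging

from google.appengine.api import search

# Here we do not specify a document ID, so one will be auto-generated on put.
d = search.Document(fields=fields)
try:
    add_result = search.Index(name=INDEX_NAME).put(d)
except search.Error:
    # Handle the failure as appropriate for your application.
    logging.exception('Failed to add the document to the index')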
You should catch and handle any exceptions resulting from the put(), which
will be of type search.Error.
If you want to specify the document ID, pass it to the Document constructor like this:
d = search.Document(doc_id=doc_id, fields=fields)
You can get the ID(s) of the document(s) that were added, via the id
properties of the list of search.PutResult objects returned from the put()
operation:
doc_id = add_result[0].id
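Adding documents in batch and collecting the resulting IDs can be combined in one helper. The following is a minimal sketch: put_in_batches is a hypothetical name, and the default batch size of 200 is an assumption about the per-call limit that you should check against your SDK version. The index argument is duck-typed, so anything with a put() method that returns objects with an id attribute will work.

```python
def put_in_batches(index, documents, batch_size=200):
    """Add documents to `index` in batches and return all resulting IDs.

    `index` needs a put(list_of_documents) method whose results expose an
    `id` attribute. The default batch size is an assumed limit.
    """
    doc_ids = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        results = index.put(batch)
        doc_ids.extend(result.id for result in results)
    return doc_ids
```

Exceptions raised by put() propagate to the caller in this sketch, so the call should still be wrapped in a try block that catches search.Error. Documents from batches that were already added stay indexed if a later batch fails.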
Basic search queries
Adding documents to an index makes the document content searchable. You can then perform full-text search queries over the documents in the index.
There are two ways to submit a search query. Most simply, you can pass a query
string to the Index object's
search() method.
Alternatively, you can create a
Query object and pass that to the
search() method. Constructing a query object allows you to specify query,
sort, and result presentation options for your search.
In this lesson, we'll look at how to construct simple queries using both approaches. Recall that some search queries are not fully supported on the Development Web Server (running locally), so you'll need to run them using a deployed application.
Search using a query string
A query string can be any Unicode string that can be parsed by the Search
API's query language. Once you've
constructed a query string, pass it to the Index.search() method. For example:
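import logging

from google.appengine.api import search

# A query string like this comes from the client.
query = "stories"
try:
    index = search.Index(INDEX_NAME)
    search_results = index.search(query)
    for doc in search_results:
        # Process each returned document here; this example only logs its ID.
        logging.info('Matched document %s', doc.doc_id)
except search.Error:
    logging.exception('Search failed')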
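Earlier we mentioned filtering a query on the product category. In the query language, a term can be restricted to a field with the field:value form, and space-separated terms must all match. The following is a minimal sketch of assembling such a query string. The helper build_query is hypothetical, and the field name 'category' is an assumption, since the example application takes the real name from docs.Product.CATEGORY.

```python
def build_query(terms, category=None, category_field='category'):
    """Combine free-text terms with an optional category restriction.

    Hypothetical helper; the default field name is an assumption.
    """
    parts = []
    if terms and terms.strip():
        parts.append(terms.strip())
    if category:
        # Quote the value so a multi-word category stays a single unit,
        # and drop embedded double quotes so the string stays well formed.
        safe_value = category.replace('"', '')
        parts.append('%s:"%s"' % (category_field, safe_value))
    return ' '.join(parts)


print(build_query('stories', category='books'))  # stories category:"books"
```

The resulting string can be passed to Index.search() like any other query string.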
Search using a query object
A Query object gives you more
control over your query options than does a query string. In this example, we
first construct a
QueryOptions object. Its
arguments specify that the query should return doc_limit number of results.
(If you've looked at the product search application code, you'll see more
complex QueryOptions objects; we'll look at these in the following class, A
Deeper Look at the Python Search API.) Next we construct the
Query object using the query string and the QueryOptions object. We then
pass the Query object to the Index.search() method, just as we did above
with the query string.
import logging

from google.appengine.api import search

# A query string like this comes from the client.
querystring = "stories"
try:
    index = search.Index(INDEX_NAME)
    search_query = search.Query(
        query_string=querystring,
        options=search.QueryOptions(limit=doc_limit))
    search_results = index.search(search_query)
except search.Error:
    logging.exception('Search failed')
Processing the query results
After you've submitted a query, matching search results are returned to the
application in an iterable
SearchResults object. This
object includes the number of results found, the actual results returned, and an
optional query cursor object.
The returned documents can be accessed by iterating on the SearchResults
object, and you can process them as you like. The number of results returned is
the length of the object's results property. The number_found property is set
to the number of hits found, which may exceed the number actually returned:
import logging

try:
    search_results = index.search("stories")
    returned_count = len(search_results.results)
    number_found = search_results.number_found
    for doc in search_results:
        doc_id = doc.doc_id
        fields = doc.fields
        # etc.
except search.Error:
    # Handle the failure as appropriate for your application.
    logging.exception('Search failed')
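When processing results, it is often convenient to flatten each returned document into a plain dictionary. The following is a minimal sketch that relies only on the doc_id and fields properties used above and on each field's name and value. The helper document_to_dict is hypothetical.

```python
def document_to_dict(doc):
    """Flatten a returned document into a dict keyed by field name.

    If several fields share a name, the last one wins, and a field
    literally named 'doc_id' would overwrite the document ID entry.
    """
    values = {'doc_id': doc.doc_id}
    for field in doc.fields:
        values[field.name] = field.value
    return values
```

Inside the loop over search_results, calling document_to_dict(doc) would give a structure that is easy to hand to a template or serialize as JSON, provided the field values themselves are serializable.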
Summary and review
In this lesson, we've learned the basics of creating indexed documents and querying their contents. To check your knowledge, try recreating these steps yourself in your own simple application:
- Create an Index object.
- Build a list of document fields (say, using the TextField type) and construct a Document object with that field list. Add the document to the index.
- Search the index using a search string consisting of a term in one of your field values. Is the document you created returned as a match?
In the next lesson, we'll take a closer look at Search API indexes.