Google Developer Relations
Introduction
This lesson shows you how to retrieve a document via its document ID, how to
delete documents from an index, and how to update an existing document by
re-indexing it. All of the convenience methods shown here are from the example
application's docs.py file.
Objectives
Learn how to retrieve, delete, and re-index documents using the Search API.
Prerequisites
You should have completed the precursor to this class, Getting Started with the Python Search API.
You should also have:
- Python 2.7 and the Google App Engine SDK for Python
- Familiarity with Python and the basics of App Engine applications
Retrieving a document by its document ID
You sometimes might need to retrieve a document by its document ID, instead of using a query. In the example application, this comes up in the context of product review creation, where a new review triggers some Datastore-related bookkeeping, after which the relevant product document is updated with its new average rating.
You can retrieve a document with the Index.get method by providing the document ID as the doc_id parameter:
@classmethod
def getDoc(cls, doc_id):
    """Return the document with the given doc id."""
    index = cls.getIndex()
    return index.get(doc_id=doc_id)
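Callers of a lookup like this usually need to handle the case where no document has the requested ID. The following is a minimal sketch of that check, assuming that the lookup returns None for an unknown ID. So that it runs without the SDK, it uses a hypothetical in-memory FakeIndex in place of a real search.Index; FakeIndex and describe_doc are illustrative names and are not part of docs.py.

```python
class FakeIndex(object):
    """Hypothetical in-memory stand-in for search.Index (illustration only)."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, fields):
        self._docs[doc_id] = fields

    def get(self, doc_id):
        # Assumption: an unknown ID yields None rather than raising.
        return self._docs.get(doc_id)


def describe_doc(index, doc_id):
    """Return a short description of the document, or a not-found message."""
    doc = index.get(doc_id)
    if doc is None:
        return "no document with id %s" % doc_id
    return "found %s: %s" % (doc_id, doc)


index = FakeIndex()
index.put("book-1", {"name": "Example Title"})
print(describe_doc(index, "book-1"))
print(describe_doc(index, "book-2"))
```

Checking for None before touching the result keeps a stale or mistyped ID from surfacing later as an attribute error.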
Deleting a document from an index
To delete a document from an index, pass its document ID to the index's
delete method.
You should also catch any DeleteError exceptions.
@classmethod
def deleteDocById(cls, doc_id):
    """Delete the doc with the given doc id."""
    try:
        cls.getIndex().delete(doc_id)
    except search.DeleteError:
        logging.exception("Error removing doc id %s.", doc_id)
You can delete all documents from a given index by using the index's get_range
method. For efficiency, set the ids_only parameter to True, which means that
the returned document objects only contain their IDs and not the document
fields, which you don't need here. Delete each document based on the returned
ID:
@classmethod
def deleteAllInIndex(cls):
    """Delete all the docs in the given index."""
    docindex = cls.getIndex()
    try:
        while True:
            # until no more documents, get a list of documents,
            # constraining the returned objects to contain only the doc ids,
            # extract the doc ids, and delete the docs.
            document_ids = [document.doc_id for document in docindex.get_range(ids_only=True)]
            if not document_ids:
                break
            docindex.delete(document_ids)
    except search.DeleteError:
        logging.exception("Error removing documents:")
Notice that this method loops until there are no more documents in the index.
This is because get_range returns at most 1000 documents per call, and only
100 by default. The code above uses the default limit, so each pass deletes up
to 100 documents and multiple passes may be needed to clear the whole index.
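To see why the loop is needed, the sketch below runs the same fetch-then-delete pattern against a hypothetical in-memory FakeIndex whose get_range hands back at most `limit` IDs per call. This is a simplification: the stand-in returns plain ID strings, whereas the real method returns document objects. The names FakeIndex and delete_all are illustrative only.

```python
class FakeIndex(object):
    """Hypothetical in-memory stand-in for search.Index (illustration only)."""

    def __init__(self, doc_ids):
        self._doc_ids = sorted(doc_ids)

    def get_range(self, ids_only=True, limit=100):
        # Return at most `limit` IDs per call, like a paged range request.
        return self._doc_ids[:limit]

    def delete(self, doc_ids):
        doomed = set(doc_ids)
        self._doc_ids = [d for d in self._doc_ids if d not in doomed]


def delete_all(index, batch_size=100):
    """Delete every document; return how many non-empty batches were removed."""
    batches = 0
    while True:
        doc_ids = index.get_range(ids_only=True, limit=batch_size)
        if not doc_ids:
            break
        index.delete(doc_ids)
        batches += 1
    return batches


index = FakeIndex("doc-%03d" % n for n in range(250))
batches = delete_all(index)
print(batches)
```

With 250 documents and a batch size of 100, a single call could not empty the index, which is the situation the loop guards against.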
Document reindexing
To update or change an indexed document, simply add a new document object to the index with the same document ID. If the index already contains a document with that ID, the existing document will be updated and reindexed; if no document already exists in the index with that ID, the new document will simply be added with the given ID.
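The replace-or-add behavior described above can be pictured with a small sketch. It models the index as a dictionary keyed by document ID; the FakeIndex class, its count method, and the sample IDs and fields are all hypothetical and stand in for a real index only to show the semantics.

```python
class FakeIndex(object):
    """Hypothetical in-memory stand-in for search.Index (illustration only)."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, fields):
        # The same ID replaces the stored document; a new ID adds one.
        self._docs[doc_id] = dict(fields)

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def count(self):
        return len(self._docs)


index = FakeIndex()
index.put("p100", {"title": "First Edition", "price": 20})
index.put("p100", {"title": "Second Edition", "price": 25})  # replaces p100
index.put("p200", {"title": "Another Book", "price": 15})    # adds p200

print(index.count())
print(index.get("p100")["title"])
```

Note that the second put for "p100" replaces the whole document rather than merging fields, so any field omitted from the new version is gone after the update.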
The example application uses product IDs from the sample product data as the
document IDs. (If you look at the code, you'll notice that it also uses the
product IDs as Product entity IDs in the Datastore.) Because the document IDs
are the same as the product IDs, it's easy to reindex the product data if it
changes. Because you get the product IDs from the data source, you can update the
indexed documents without needing to retrieve them first.
To step through this process, first take a look at the files
data/sample_data_books.csv and data/sample_data_books_update.csv from the
example application. These contain the application's sample product data. When
the user clicks the link "Delete all datastore and index product data, then
load in sample product data", the application first deletes any existing index
contents and then imports all of the data in data/sample_data_books.csv. For the purposes of
this discussion, the pertinent part of the process is that when a new document
is created, its ID is set to the product ID:
d = search.Document(doc_id=product_id, fields=docfields)
The document is then added to the product index.
Next, if the user clicks "Demo loading product update data", the data in
data/sample_data_books_update.csv is added to the index. Some of the entries
in this file update existing book documents, since their product IDs correspond
to existing documents. Other entries in this file define new books, since no
existing documents have their product IDs.
Since you're using the product IDs as document IDs, you can create new documents from this data, setting their document IDs to the product IDs as above, and simply add the documents. You don't need to know whether any documents with these product IDs already exist. If documents with those same product IDs do exist, they will be updated with the new content and reindexed; if not, new documents are indexed.
If you look at the example application code, you'll notice that this is not quite the whole story: in some situations, there is information in an existing document that you need to retain and set in the updated document, so in those cases you do need to access the old document if it exists.
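One way to picture that retain-and-update step is the sketch below, which is not the example application's code. It treats documents as plain dictionaries and copies selected fields from the old version into the new one before it would be re-indexed. The helper name merge_with_existing and the avg_rating field name are hypothetical, chosen to echo the average rating mentioned earlier in this lesson.

```python
def merge_with_existing(old_fields, new_fields, keep=("avg_rating",)):
    """Return new_fields plus any `keep` fields carried over from old_fields."""
    merged = dict(new_fields)
    if old_fields is not None:
        for name in keep:
            # Carry a field over only if the update does not set it itself.
            if name in old_fields and name not in merged:
                merged[name] = old_fields[name]
    return merged


# Hypothetical stored document and incoming update, as plain dicts.
existing = {"title": "First Edition", "price": 20, "avg_rating": 4.5}
update = {"title": "Second Edition", "price": 25}

merged = merge_with_existing(existing, update)
brand_new = merge_with_existing(None, update)
print(merged)
print(brand_new)
```

Passing None for the old document covers the case where nothing with that ID has been indexed yet, so the same path handles both updates and first-time additions.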
Summary and review
In this lesson, you've seen how to retrieve documents by document IDs and how to delete and update them.
This lesson concludes the Deeper Look at the Python Search API class. In the course of this class and its precursor, you've accumulated the basic toolkit for building applications that use the Search API. Try creating a simple application of your own, or making additional modifications to the example app!
You can get help on Stack Overflow using the google-app-engine tag, or in the App Engine Google Group.