Retrieving, Deleting, and Reindexing (Updating) Documents

Amy Unruh, Oct 2012
Google Developer Relations

Introduction

This lesson shows you how to retrieve a document via its document ID, how to delete documents from an index, and how to update an existing document by re-indexing it. All of the convenience methods shown here are from the example application's docs.py file.

Objectives

Learn how to retrieve, delete, and re-index documents using the Search API.

Prerequisites

The precursor to this class, Getting Started with the Python Search API

You should also have:

Python 2.7 and the Google App Engine SDK for Python
Familiarity with Python and the basics of App Engine applications

Retrieving a document by its document ID

You sometimes might need to retrieve a document by its document ID, instead of using a query. In the example application, this comes up in the context of product review creation, where a new review triggers some Datastore-related bookkeeping, after which the relevant product document is updated with its new average rating.

You can retrieve a document with the Index.get method and provide the document ID as the doc_id parameter:

@classmethod
def getDoc(cls, doc_id):
  """Return the document with the given doc id."""
  index = cls.getIndex()
  return index.get(doc_id=doc_id)
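
According to the Search API reference, Index.get returns None when no document has the given ID, so callers of getDoc generally need to check the result before using it. The following is a minimal sketch of that check, not code from the example application. It assumes a hypothetical in-memory stand-in (FakeIndex) in place of the real search.Index so that it can run outside App Engine, and describe_doc is an illustrative helper name.

```python
class FakeIndex(object):
    """Hypothetical in-memory stand-in for search.Index (illustration only)."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, fields):
        self._docs[doc_id] = fields

    def get(self, doc_id):
        # Assumption: like the documented Index.get, return None for an unknown ID.
        return self._docs.get(doc_id)


def describe_doc(index, doc_id):
    """Return a short description, handling the missing-document case."""
    doc = index.get(doc_id)
    if doc is None:
        return "no document with id %s" % doc_id
    return "found %s: %r" % (doc_id, doc)


index = FakeIndex()
index.put("book1", {"title": "A Sample Book"})
print(describe_doc(index, "book1"))
print(describe_doc(index, "missing"))
```

With the real API, the same None check would apply to the value returned by getDoc.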

Deleting a document from an index

To delete a document from an index, pass its document ID to the index's delete method. You should also catch any DeleteError exceptions.

@classmethod
def deleteDocById(cls, doc_id):
  """Delete the doc with the given doc id."""
  try:
    cls.getIndex().delete(doc_id)
  except search.DeleteError:
    logging.exception("Error removing doc id %s.", doc_id)

To delete all documents from a given index, use the index's get_range method to fetch the documents, then delete them by ID. For efficiency, set the ids_only parameter to True, so that the returned document objects contain only their IDs and not the document fields, which you don't need here. Delete each batch of documents based on the returned IDs:

@classmethod
def deleteAllInIndex(cls):
  """Delete all the docs in the given index."""
  docindex = cls.getIndex()

  try:
    while True:
      # until no more documents, get a list of documents,
      # constraining the returned objects to contain only the doc ids,
      # extract the doc ids, and delete the docs.
      document_ids = [document.doc_id for document in docindex.get_range(ids_only=True)]
      if not document_ids:
        break
      docindex.delete(document_ids)
  except search.DeleteError:
    logging.exception("Error removing documents:")

Notice that this method loops until there are no more documents in the index. This is because get_range returns a limited number of documents per call (at most 1000, and 100 by default), so multiple calls may be needed to clear the whole index.
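
To make the batching behavior concrete, here is a small sketch of the same loop shape run against a hypothetical in-memory index. FakeIndex and FakeDoc are stand-ins invented for this illustration (the real classes live in google.appengine.api.search), and the stand-in's get_range only imitates the per-call limit described above.

```python
import collections

# Hypothetical stand-in for a returned document that carries only its ID.
FakeDoc = collections.namedtuple("FakeDoc", ["doc_id"])


class FakeIndex(object):
    """In-memory imitation of an index, for illustration only."""

    def __init__(self, doc_ids):
        self._ids = set(doc_ids)
        self.delete_calls = 0

    def get_range(self, ids_only=False, limit=100):
        # Return at most `limit` documents per call, lowest IDs first.
        return [FakeDoc(i) for i in sorted(self._ids)[:limit]]

    def delete(self, doc_ids):
        self.delete_calls += 1
        self._ids.difference_update(doc_ids)

    def __len__(self):
        return len(self._ids)


def delete_all(index):
    """Fetch a batch of IDs, delete them, and repeat until none remain."""
    while True:
        batch = [doc.doc_id for doc in index.get_range(ids_only=True)]
        if not batch:
            break
        index.delete(batch)


index = FakeIndex("doc%03d" % n for n in range(250))
delete_all(index)
print("%d docs left after %d delete calls" % (len(index), index.delete_calls))
# → 0 docs left after 3 delete calls
```

With 250 documents and a limit of 100 per call, the loop deletes batches of 100, 100, and 50, then stops when the next call returns nothing.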

Document reindexing

To update or change an indexed document, simply add a new document object to the index with the same document ID. If the index already contains a document with that ID, the existing document will be updated and reindexed; if no document already exists in the index with that ID, the new document will simply be added with the given ID.

The example application uses product IDs from the sample product data as the document IDs. (If you look at the code, you'll notice that it also uses the product IDs as Product entity IDs in the Datastore.) Because the document IDs are the same as the product IDs, it's easy to reindex the product data if it changes: you get the product IDs from the data source, so you can update the indexed documents without needing to retrieve them first.

To step through this process, first take a look at the files data/sample_data_books.csv and data/sample_data_books_update.csv from the example application. These contain the application's sample product data. When the user clicks the link "Delete all datastore and index product data, then load in sample product data", any existing index contents are deleted and then all of the data in data/sample_data_books.csv is imported. For the purposes of this discussion, the pertinent part of the process is that when a new document is created, its ID is set to the product ID:

d = search.Document(doc_id=product_id, fields=docfields)

The document is then added to the product index.

Next, if the user clicks "Demo loading product update data", the data in data/sample_data_books_update.csv is added to the index. Some of the entries in this file update existing book documents, since their product IDs correspond to existing documents. Other entries define new books, since no existing documents have their product IDs.

Since you're using the product IDs as document IDs, you can create new documents from this data, setting their document IDs to the product IDs as above, and simply add the documents. You don't need to know whether any documents with these product IDs already exist. If documents with those same product IDs do exist, they will be updated with the new content and reindexed; if not, new documents are indexed.
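
The "add without checking" pattern described above can be sketched with a plain dictionary standing in for the product index. Everything here is hypothetical: the field names, the rows, and the dict itself are illustrative assumptions, not the format of the sample CSV files or the real Search API.

```python
# Stand-in for the product index: maps document ID -> fields (illustration only).
index = {
    "book1": {"title": "Old Title", "price": 10.0},
    "book2": {"title": "Another Book", "price": 8.0},
}

# Hypothetical rows, as they might look after parsing an update file.
update_rows = [
    {"product_id": "book1", "title": "Updated Title", "price": 12.5},
    {"product_id": "book9", "title": "Brand New Book", "price": 30.0},
]


def apply_updates(index, rows):
    """Store one document per row under its product ID, without reading first."""
    for row in rows:
        fields = dict(row)
        doc_id = fields.pop("product_id")
        # Same ID: the existing entry is replaced. Unknown ID: a new entry is added.
        index[doc_id] = fields


apply_updates(index, update_rows)
print(sorted(index))
# → ['book1', 'book2', 'book9']
print(index["book1"]["title"])
# → Updated Title
```

The update code never asks whether book1 or book9 already exists; using the product ID as the key makes both cases the same operation.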

If you look at the example application code, you'll notice that this is not quite the whole story: in some situations, there is information in an existing document that you need to retain and set in the updated document, so in those cases you do need to access the old document if it exists.
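
As a rough illustration of that case, the sketch below carries a value over from the old document, if there is one, before storing the replacement. It is not the example application's code: the dict-based index, the avg_rating field name, and the helper names are all assumptions made for this illustration.

```python
def merge_retained_fields(old_fields, new_fields, retained=("avg_rating",)):
    """Return a copy of new_fields plus any retained values from old_fields."""
    merged = dict(new_fields)
    if old_fields is not None:
        for name in retained:
            # Carry the old value over only when the update does not set it.
            if name in old_fields and name not in merged:
                merged[name] = old_fields[name]
    return merged


def reindex(index, doc_id, new_fields):
    """Replace the document, keeping retained fields from the old version."""
    old_fields = index.get(doc_id)  # None when the document is not yet indexed
    index[doc_id] = merge_retained_fields(old_fields, new_fields)


# Hypothetical stand-in for the product index: document ID -> fields.
index = {"book1": {"title": "Old Title", "avg_rating": 4.5}}

reindex(index, "book1", {"title": "Updated Title"})
reindex(index, "book2", {"title": "Brand New Book"})

print(index["book1"]["avg_rating"])
# → 4.5
print("avg_rating" in index["book2"])
# → False
```

The cost of this approach is one extra lookup per updated document, which is why it is worth doing only when there is something in the old document to preserve.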

Summary and review

In this lesson, you've seen how to retrieve documents by document IDs and how to delete and update them.

This lesson concludes the Deeper Look at the Python Search API class. In the course of this class and its precursor, you've accumulated the basic toolkit for building applications that use the Search API. Try creating a simple application of your own, or making additional modifications to the example app!

You can get help on Stack Overflow using the google-app-engine tag, or in the App Engine Google Group.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated November 7, 2016.