Google Research Blog
The latest news from Research at Google
A Multilingual Corpus of Automatically Extracted Relations from Wikipedia
Tuesday, June 02, 2015
Posted by Shankar Kumar, Google Research Scientist and Manaal Faruqui, Carnegie Mellon University PhD candidate
In Natural Language Processing, relation extraction is the task of assigning a semantic relationship between a pair of arguments. As an example, a relationship between the phrases “Ottawa” and “Canada” is “is the capital of”. These extracted relations could be used in a variety of applications ranging from Question Answering to building databases from unstructured text.
While relation extraction systems work accurately for English and a few other languages, where tools for syntactic analysis such as parsers, part-of-speech taggers and named entity analyzers are readily available, relatively little work has gone into developing such systems for most of the world's languages, where linguistic analysis tools do not yet exist. Fortunately, because we do have translation systems between English and many other languages (such as Google Translate), we can translate text from a non-English language to English, perform relation extraction, and project these relations back to the foreign language.
Relation extraction in a Spanish sentence using the cross-lingual relation extraction pipeline.
In Multilingual Open Relation Extraction Using Cross-lingual Projection, which will appear at the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), we use this idea of cross-lingual projection to develop an algorithm that extracts open-domain relation tuples, i.e. tuples in which an arbitrary phrase can describe the relation between the arguments, in multiple languages from Wikipedia. In this work, we also evaluated the performance of the extracted relations using human annotations in French, Hindi and Russian.
Since there is no such publicly available corpus of multilingual relations, we are releasing a dataset of automatically extracted relations from the Wikipedia corpus in 61 languages, along with manually annotated relations in 3 languages (French, Hindi and Russian). It is our hope that our data will help researchers working on natural language processing and encourage novel applications in a wide variety of languages. More details on the corpus and the file formats can be found in this README file.
We wish to thank Bruno Cartoni, Vitaly Nikolaev, Hidetoshi Shimokawa, Kishore Papineni, John Giannandrea and their teams for making this data release possible. This dataset is licensed by Google Inc. under the Creative Commons Attribution-ShareAlike 3.0 License.
A picture is worth a thousand (coherent) words: building a natural description of images
Monday, November 17, 2014
Posted by Google Research Scientists Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan
“Two pizzas sitting on top of a stove top oven”
“A group of people shopping at an outdoor market”
“Best seats in the house”
People can summarize a complex scene in a few words without thinking twice. It’s much more difficult for computers. But we’ve just gotten a bit closer -- we’ve developed a machine-learning system that can automatically produce captions (like the three above) to accurately describe images the first time it sees them. This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images.
Recent research has greatly improved object detection, classification, and labeling. But accurately describing a complex scene requires a deeper representation of what’s going on in the scene, capturing how the various objects relate to one another and translating it all into natural-sounding language.
Automatically captioned: “Two pizzas sitting on top of a stove top oven”
Many efforts to construct computer-generated natural descriptions of images propose combining current state-of-the-art techniques in both computer vision and natural language processing to form a complete image description approach. But what if we instead merged recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human-readable sequence of words to describe it?
This idea comes from recent advances in machine translation between languages, where a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German.
Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN’s last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN’s rich encoding of the image into an RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that the descriptions it produces best match the training descriptions for each image.
The model combines a vision CNN with a language-generating RNN so it can take in an image and generate a fitting natural-language caption.
Our experiments with this system on several openly published datasets, including Pascal, Flickr8k, Flickr30k and SBU, show how robust the qualitative results are -- the generated sentences are quite reasonable. It also performs well in quantitative evaluations with the Bilingual Evaluation Understudy (BLEU), a metric used in machine translation to evaluate the quality of generated sentences.
A selection of evaluation results, grouped by human rating.
A picture may be worth a thousand words, but sometimes it’s the words that are most useful -- so it’s important we figure out ways to translate from images to words automatically and accurately. As the datasets suited to learning image descriptions grow and mature, so will the performance of end-to-end approaches like this. We look forward to continuing developments in systems that can read images and generate good natural-language descriptions. To get more details about the framework used to generate descriptions from images, as well as the model evaluation, read the full paper here.
Teaching machines to read between the lines (and a new corpus with entity salience annotations)
Monday, August 25, 2014
Posted by Dan Gillick, Research Scientist, and Dave Orr, Product Manager
Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as to encourage further research into these areas.
Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with the people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.
We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people -- we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.
One way to approach the problem is to look for words that appear more often than their ordinary rates. For example, if you see the word “coach” 5 times in a 581-word article, and compare that to the usual frequency of “coach” -- more like 5 in 330,000 words -- you have reason to suspect the article has something to do with coaching. The term “basketball” is even more extreme, appearing 150,000 times more often than usual. This is the idea behind the famous TF-IDF score, long used to index web pages.
Congratulations to Becky Hammon, first female NBA coach! Image via Wikipedia.
Term ratios are a start, but we can do better. Search indexing these days is much more involved, using, for example, the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real world, which we already know quite a bit about.
Background information about entities ought to help us decide which of them are most salient. After all, an article’s author assumes her readers have some general understanding of the world, and probably a bit about sports too. Using background knowledge, we might be able to infer that the WNBA is a salient entity in the Becky Hammon article even though it only appears once.
To encourage research on leveraging background information, we are releasing a large dataset of annotations to accompany the New York Times Annotated Corpus, including resolved Freebase entity IDs and labels indicating which entities are salient. The salience annotations are determined by automatically aligning entities in the document with entities in the accompanying human-written abstracts. Details of the salience annotations and some baseline results are described in our recent paper: A New Entity Salience Task with Millions of Training Examples (Jesse Dunietz and Dan Gillick).
Since our entity resolver works better for named entities like WNBA than for nominals like “coach” (this is the notoriously difficult word sense disambiguation problem, which we’ve previously touched on), the annotations are limited to names.
Below is sample output for a document. The first line contains the NYT document ID and the headline; each subsequent line includes an entity index, an indicator for salience, the mention count for this entity in the document as determined by our coreference system, the text of the first mention of the entity, the byte offsets (start and end) for the first mention of the entity, and the resolved Freebase MID.
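As a rough sketch, a reader for lines in this layout might look like the following; the tab-separated format and exact field order are assumptions based on the description above, so check the released files before relying on them.
# A minimal sketch of reading one document's salience annotations, assuming
# tab-separated fields in the order described above (not a specification of
# the released file format).
def read_salience_doc(lines):
    doc_id, headline = lines[0].split("\t", 1)
    entities = []
    for line in lines[1:]:
        index, salient, mentions, first_mention, start, end, mid = line.split("\t")
        entities.append({
            "index": int(index),
            "salient": salient == "1",          # indicator for salience (assumed encoding)
            "mention_count": int(mentions),      # from the coreference system
            "first_mention": first_mention,
            "offsets": (int(start), int(end)),   # byte offsets of the first mention
            "freebase_mid": mid,
        })
    return doc_id, headline, entities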
Features like mention count and document positioning give reasonable salience predictions. But because they only describe what’s explicitly in the document, we expect a system that uses background information to expose what’s implicit could give better results.
Download the data directly from Google Drive, or visit the project home page with more information at our Google Code site. We look forward to seeing what you come up with!
A Billion Words: Because today's language modeling standard should be higher
Wednesday, April 30, 2014
Posted by Dave Orr, Product Manager, and Ciprian Chelba, Research Scientist
Language is chock full of ambiguity, and it can turn up in surprising places. Many words are hard to tell apart without context: most Americans pronounce “ladder” and “latter” identically, for instance. Keyboard inputs on mobile devices have a similar problem, especially for IME keyboards. For example, the input patterns for “Yankees” and “takes” look very similar:
Photo credit: Kurt Partridge
But in this context -- the previous two words, “New York” -- “Yankees” is much more likely.
One key way computers use context is with language models. These are used for predictive keyboards, but also for speech recognition, machine translation, spelling correction, query suggestions, and so on. Often those models are specialized: word order for queries versus web pages can be very different. Either way, having an accurate language model with wide coverage drives the quality of all these applications.
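To make the “New York … Yankees” example concrete, here is a toy sketch of how a bigram-style model ranks candidate next words; the counts are invented purely for illustration, where a real model would be estimated from a large corpus.
# A toy sketch of using the previous words to rank candidates such as
# "Yankees" vs. "takes". The counts below are made up for illustration.
from collections import Counter

continuations = {
    ("new", "york"): Counter({"yankees": 120, "city": 300, "takes": 1}),
}

def next_word_probability(context, word):
    counts = continuations.get(tuple(w.lower() for w in context), Counter())
    total = sum(counts.values())
    return counts[word.lower()] / total if total else 0.0

print(next_word_probability(("New", "York"), "Yankees"))  # much higher...
print(next_word_probability(("New", "York"), "takes"))    # ...than this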
One thing that can be tricky when evaluating the quality of such complex systems is error attribution, because the components interact. Good engineering practice is to evaluate the quality of each module separately, including the language model. We believe the field could benefit from a large, standard dataset with benchmarks, making it easy to compare results and to experiment with new modeling techniques.
To that end, we are releasing scripts that convert a set of public data into a language modeling benchmark of over a billion words, with standardized training and test splits, described in an arXiv paper. Along with the scripts, we’re releasing the processed data, including the training and test sets, in one convenient location. This will make it much easier for the research community to quickly reproduce results, and we hope it will speed up progress on these tasks.
The benchmark scripts and data are freely available, and can be found here:
http://www.statmt.org/lm-benchmark/
The field needs a new and better standard benchmark. Currently, researchers report results on datasets of their choice, and those results are very hard to reproduce because there is no standard preprocessing. We hope that this release will solve both problems and become the standard benchmark for language modeling experiments. As more researchers use the new benchmark, comparisons will be easier and more accurate, and progress will be faster.
For all the researchers out there: try out this benchmark, run your experiments, and let us know how it goes -- or publish, and we’ll enjoy finding your results at conferences and in journals.
Free Language Lessons for Computers
Tuesday, December 03, 2013
Posted by Dave Orr, Google Research Product Manager
Not everything that can be counted counts.
Not everything that counts can be counted.
- William Bruce Cameron
50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.
These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.
But data by itself doesn’t mean much. Data is only valuable in the right context, and only if it leads to increased knowledge. Labeled data is critical for training and evaluating machine-learned systems in many arenas, improving systems that can increase our ability to understand the world. Advances in natural language understanding, information retrieval, information extraction, computer vision, and so on can help us tell stories, mine for valuable insights, or visualize information in beautiful and compelling ways.
That’s why we are pleased to be able to release sets of labeled data from various domains and with various annotations, some automatic and some manual. Our hope is that the research community will use these datasets in ways both straightforward and surprising, to improve systems for annotation or understanding, and perhaps launch new efforts we haven’t thought of.
Here’s a listing of the major datasets we’ve released in the last year; you can also subscribe to our mailing list to hear about future releases. Please tell us what you’ve managed to accomplish, or send us pointers to papers that use this data. We want to see what the research world can do with what we’ve created.
50,000 Lessons on How to Read: a Relation Extraction Corpus
What is it: A human-judged dataset of two relations involving public figures on Wikipedia: about 10,000 examples of “place of birth” and 40,000 examples of “attended or graduated from an institution.”
Where can I find it: https://code.google.com/p/relation-extraction-corpus/
I want to know more: Here’s a handy blog post with a broader explanation, descriptions and examples of the data, and plenty of links to learn more.
11 Billion Clues in 800 Million Documents
What is it: We took the ClueWeb corpora and automatically labeled concepts and entities with Freebase concept IDs, an example of entity resolution. This dataset is huge: nearly 800 million web pages.
Where can I find it: We released two corpora: ClueWeb09 FACC and ClueWeb12 FACC.
I want to know more: We described the process and results in a recent blog post.
Features Extracted From YouTube Videos for Multiview Learning
What is it: Multiple feature families from a set of public YouTube videos of games. The videos are labeled with one of 30 categories, and each has an associated set of visual, auditory, and textual features.
Where can I find it: The data and more information can be obtained from the UCI machine learning repository (multiview video dataset), or from Google’s repository.
I want to know more: Read more about the data and uses for it here.
40 Million Entities in Context
What is it: A disambiguation set consisting of pointers to 10 million web pages with 40 million entities that have links to Wikipedia. This is another entity resolution corpus, since the links can be used to disambiguate the mentions; but unlike the ClueWeb example above, the links were inserted by the web page authors and can therefore be considered human annotation.
Where can I find it: Here’s the WikiLinks corpus, and tools to help use this data can be found on our partner’s page: UMass Wiki-links.
I want to know more: Other disambiguation sets, data formats, ideas for uses of this data, and more can be found at our blog post announcing the release.
Distributing the Edit History of Wikipedia Infoboxes
What is it: The edit history of 1.8 million infoboxes in Wikipedia pages, in one handy resource. Attributes on Wikipedia change over time, and some of them change more than others. Understanding attribute change is important for extracting accurate and useful information from Wikipedia.
Where can I find it: Download from Google or from Wikimedia Deutschland.
I want to know more: We posted a detailed look at the data, the process for gathering it, and where to find it. You can also read a paper we published on the release.
Note the change in the capital of Palau.
Syntactic Ngrams over Time
What is it: We automatically syntactically analyzed 350 billion words from the 3.5 million English-language books in Google Books, and collated and released the resulting fragments -- billions of unique tree fragments with counts, sorted into types. The underlying corpus is the same one that underlies the recently updated Google Ngram Viewer.
Where can I find it: http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
I want to know more: We discussed the nature of dependency parses and described the data and release in a blog post. We also published a paper about the release.
Dictionaries for linking Text, Entities, and Ideas
What is it: A large database of 175 million strings paired with 7.5 million concepts, annotated with counts, mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor-text spans that link to the concepts in question.
Where can I find it: http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2
I want to know more: A description of the data, several examples, and ideas for uses for it can be found in a blog post or in the associated paper.
Other datasets
Not every release had its own blog post describing it. Here are some other releases:
Automatic Freebase annotations of TREC’s Million Query and Web track queries.
A set of Freebase triples that have been deleted from Freebase over time -- 63 million of them.
New Research Challenges in Language Understanding
Friday, November 22, 2013
Posted by Maggie Johnson, Director of Education and University Relations
We held the first global Language Understanding and Knowledge Discovery Focused Faculty Workshop in Nanjing, China, on November 14-15, 2013. Thirty-four faculty members joined the workshop, arriving from 10 countries and regions across APAC, EMEA and the US. Googlers from Research, Engineering and University Relations/University Programs also attended the event.
The 2-day workshop included keynote talks, panel discussions and break-out sessions [agenda]. It was an engaging and productive workshop, and we saw lots of positive interactions among the attendees. The workshop encouraged communication between Google and faculty around the world working in these areas.
Research in text mining continues to explore open questions relating to entity annotation, relation extraction, and more. The workshop’s goal was to brainstorm and discuss relevant topics to further investigate these areas. Ultimately, this research should help provide users with search results that are much more relevant to them.
At the end of the workshop, participants identified four topics representing challenges and opportunities for further exploration in Language Understanding and Knowledge Discovery:
Knowledge representation, integration, and maintenance
Efficient and scalable infrastructure and algorithms for inferencing
Presentation and explanation of knowledge
Multilingual computation
Going forward, Google will be collaborating with academic researchers on a position paper related to these topics. We also welcome faculty interested in contributing to further research in this area to submit a proposal to the Faculty Research Awards program. Faculty Research Awards are one-year grants to researchers working in areas of mutual interest.
The faculty attendees responded positively to the focused workshop format, as it allowed time to go in depth into important and timely research questions. Encouraged by their feedback, we are considering similar workshops on other topics in the future.
11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts
Wednesday, July 17, 2013
Posted by Dave Orr, Amar Subramanya, Evgeniy Gabrilovich, and Michael Ringgaard, Google Research
“I assume that by knowing the truth you mean knowing things as they really are.”
- Plato
When you type in a search query -- perhaps Plato -- are you interested in the string of letters you typed, or in the concept or entity represented by that string? Knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval -- you have to know what the string actually refers to. The Knowledge Graph and Freebase are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.
We’ve previously released data to help with disambiguation and recently awarded $1.2M in research grants to work on related problems. Today we’re taking another step: releasing data consisting of nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.
These Freebase Annotations of the ClueWeb Corpora (FACC) consist of ClueWeb09 FACC and ClueWeb12 FACC. 11 billion phrases that refer to concepts and entities in Freebase were automatically labeled with their unique identifiers (Freebase MIDs).
Since the annotation process was automatic, it likely made mistakes. We optimized for precision over recall, so the algorithm skipped a phrase if it wasn’t confident enough of the correct MID. If you prefer higher precision, we include confidence levels, so you can filter out lower confidence annotations that we did include.
Based on review of a sample of documents, we believe the precision is about 80-85%, and recall, which is inherently difficult to measure in situations like this, is in the range of 70-85%. Not every ClueWeb document is included in this corpus; documents in which we found no entities were excluded from the set. A document might be excluded because there were no entities to be found, because the entities in question weren’t in Freebase, or because none of the entities were resolved at a confidence level above the threshold.
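If you do want to filter by confidence, a reader along these lines may be a useful starting point; the tab-separated layout and column positions are assumptions for illustration, so consult the corpus documentation for the actual file format.
# A minimal sketch of filtering FACC-style annotations by confidence.
# The column layout (phrase first, confidence and Freebase MID last) is an
# assumption, not the documented format of the released files.
import csv

def high_confidence_annotations(path, threshold=0.9):
    """Yield (phrase, freebase_mid, confidence) rows above the threshold."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            phrase, confidence, mid = row[0], float(row[-2]), row[-1]  # assumed columns
            if confidence >= threshold:
                yield phrase, mid, confidence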
The ClueWeb data is used in multiple TREC tracks. You may also be interested in our annotations of several TREC query sets, including those from the Million Query Track and Web Track.
If you would prefer a human-annotated set, you might want to look at the Wikilinks Corpus we released last year. Entities there were disambiguated by links to Wikipedia, inserted by the authors of the page, which is effectively a form of human annotation.
You can find more detail and download the data on the pages for the two sets: ClueWeb09 FACC and ClueWeb12 FACC. You can also subscribe to our data release mailing list to learn about releases as they happen.
Special thanks to Jamie Callan and Juan Caicedo Carvajal for their help throughout the annotation project.
50,000 Lessons on How to Read: a Relation Extraction Corpus
Thursday, April 11, 2013
Posted by Dave Orr, Product Manager, Google Research
One of the most difficult tasks in NLP is called relation extraction. It’s an example of information extraction, one of the goals of natural language understanding. A relation is a semantic connection between (at least) two entities. For instance, you could say that Jim Henson was in a spouse relation with Jane Henson (and in a creator relation with many beloved characters and shows).
The goal of relation extraction is to learn relations from unstructured natural language text. The relations can be used to answer questions (“Who created Kermit?”), learn which proteins interact in the biomedical literature, or to build a database of hundreds of millions of entities and billions of relations to try and help people explore the world’s information.
To help researchers investigate relation extraction, we’re releasing a human-judged dataset of two relations about public figures on Wikipedia: nearly 10,000 examples of “place of birth”, and over 40,000 examples of “attended or graduated from an institution”. Each of these was judged by at least 5 raters, and can be used to train or evaluate relation extraction systems. We also plan to release more relations of new types in the coming months. (Update: you can find additional relations here.)
Each relation is in the form of a triple: the relation in question, called a predicate; the subject of the relation; and the object of the relation. In the relation “Stephen Hawking graduated from Oxford,” Stephen Hawking is the subject, “graduated from” is the predicate, and Oxford University is the object. Subjects and objects are represented by their Freebase MIDs, and the relation is defined as a Freebase property. So in this case, the triple would be represented as:
"pred":"
/education/education/institution
"
"sub":"
/m/01tdnyh
"
"obj":"
/m/07tgn
"
Just having the triples is interesting enough if you want a database of entities and relations, but it doesn’t make much progress towards training or evaluating a relation extraction system. So we’ve also included the evidence for each relation, in the form of a URL and an excerpt from the web page that our raters judged. We’re also including examples where the evidence does not support the relation, so you have negative examples for use in training better extraction systems. Finally, we included IDs and the actual judgments of individual raters, so that you can filter triples by agreement.
Gory Details
The corpus itself, extracted from Wikipedia, can be found here:
https://code.google.com/p/relation-extraction-corpus/
The files are in JSON format. Each line is a triple with the following fields:
pred: predicate of the triple
sub: subject of the triple
obj: object of the triple
evidences: an array of evidence items for this triple, each with:
  url: the web page from which this evidence was obtained
  snippet: a short piece of text supporting the triple
judgments: an array of judgments from human annotators, each with:
  rater: hash code of the identity of the annotator
  judgment: the annotator’s judgment; it can take the values "yes" or "no"
Here’s an example:
{"pred":"/people/person/place_of_birth","sub":"/m/026_tl9","obj":"/m/02_286","evidences":[{"url":"http://en.wikipedia.org/wiki/Morris_S._Miller","snippet":"Morris Smith Miller (July 31, 1779 -- November 16, 1824) was a United States Representative from New York. Born in New York City, he graduated from Union College in Schenectady in 1798. He studied law and was admitted to the bar. Miller served as private secretary to Governor Jay, and subsequently, in 1806, commenced the practice of his profession in Utica. He was president of the village of Utica in 1808 and judge of the court of common pleas of Oneida County from 1810 until his death."}],"judgments":[{"rater":"11595942516201422884","judgment":"yes"},{"rater":"16169597761094238409","judgment":"yes"},{"rater":"1014448455121957356","judgment":"yes"},{"rater":"16651790297630307764","judgment":"yes"},{"rater":"1855142007844680025","judgment":"yes"}]}
The web is chock full of information, put there to be read and learned from. Our hope is that this corpus is a small step towards computational understanding of the wealth of relations to be found everywhere you look.
This dataset is licensed by Google Inc. under the Creative Commons Attribution-ShareAlike 3.0 license.
Thanks to Shaohua Sun, Ni Lao, and Rahul Gupta for putting this dataset together.
Thanks also to Michael Ringgaard, Fernando Pereira, Amar Subramanya, Evgeniy Gabrilovich, and John Giannandrea for making this data release possible.
Learning from Big Data: 40 Million Entities in Context
Friday, March 08, 2013
Posted by Dave Orr, Amar Subramanya, and Fernando Pereira, Google Research
When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.
To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages -- over 100 times bigger than the next largest corpus (about 100,000 documents; see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.
Dataset                            Number of Mentions    Number of Entities
Bentivogli et al. (data) (2008)    43,704                709
Day et al. (2008)                  less than 55,000      3,660
Artiles et al. (data) (2010)       57,357                300
Wikilinks Corpus                   40,323,863            2,933,659
What might you do with this data? Well, we’ve already written one ACL paper on cross-document co-reference (and received lots of requests for the underlying data, which partly motivates this release). And really, we look forward to seeing what you are going to do with it! But here are a few ideas:
Look into coreference -- when different mentions refer to the same entity -- or entity resolution -- matching a mention to the underlying entity
Work on the bigger problem of cross-document coreference, which is how to find out if different web pages are talking about the same person or other entity
Learn things about entities by aggregating information across all the documents they’re mentioned in
Type tagging tries to assign types (broad, like person or location, or specific, like amusement park ride) to entities. To the extent that the Wikipedia pages contain the type information you’re interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia.
Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn’t possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.
Gory Details
How do you actually get the data? It’s right here: Google’s Wikilinks Corpus. Tools and data with extra context can be found on our partners’ page: UMass Wiki-links. Understanding the corpus, however, is a little bit involved.
For copyright reasons, we cannot distribute actual annotated web pages. Instead, we’re providing an index of URLs and the tools to create the dataset, or whichever slice of it you care about, yourself. Specifically, we’re providing:
The URLs of all the pages that contain labeled mentions, which are links to English Wikipedia
The anchor text of the link (the mention string), the Wikipedia link target, and the byte offset of the link for every page in the set
The byte offset of the 10 least frequent words on the page, to act as a signature to ensure that the underlying text hasn’t changed -- think of this as a version, or fingerprint, of the page
Software tools (on the UMass site) to: download the web pages; extract the mentions, with ways to recover if the byte offsets don’t match; select the text around the mentions as local context; and compute evaluation metrics over predicted entities.
The format looks like this:
URL http://1967mercurycougar.blogspot.com/2009_10_01_archive.html
MENTION Lincoln Continental Mark IV 40110 http://en.wikipedia.org/wiki/Lincoln_Continental_Mark_IV
MENTION 1975 MGB roadster 41481 http://en.wikipedia.org/wiki/MG_MGB
MENTION Buick Riviera 43316 http://en.wikipedia.org/wiki/Buick_Riviera
MENTION Oldsmobile Toronado 43397 http://en.wikipedia.org/wiki/Oldsmobile_Toronado
TOKEN seen 58190
TOKEN crush 63118
TOKEN owners 69290
TOKEN desk 59772
TOKEN relocate 70683
TOKEN promote 35016
TOKEN between 70846
TOKEN re 52821
TOKEN getting 68968
TOKEN felt 41508
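A minimal sketch of parsing records in this format follows; the tab delimiter is an assumption (the multi-word mention strings suggest the fields are not simply space-separated), so adjust it to match the released files.
# A minimal sketch of parsing one URL/MENTION/TOKEN record into mentions and
# fingerprint tokens. The tab-separated layout is assumed for illustration.
def parse_wikilinks_record(lines):
    record = {"url": None, "mentions": [], "tokens": []}
    for line in lines:
        kind, rest = line.split("\t", 1)
        if kind == "URL":
            record["url"] = rest
        elif kind == "MENTION":
            text, offset, target = rest.rsplit("\t", 2)   # mention, byte offset, Wikipedia link
            record["mentions"].append((text, int(offset), target))
        elif kind == "TOKEN":
            word, offset = rest.split("\t")               # rare word and its byte offset
            record["tokens"].append((word, int(offset)))
    return record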
We’d love to hear what you’re working on, and look forward to what you can do with 40 million mentions across over 10 million web pages!
Thanks to our collaborators at UMass Amherst: Sameer Singh and Andrew McCallum.
Large Scale Language Modeling in Automatic Speech Recognition
Wednesday, October 31, 2012
Posted by Ciprian Chelba, Research Scientist
At Google, we’re able to use the large amounts of data made available by the Web’s fast growth. Two such data sources are the anonymized queries on google.com and the web itself. They help improve automatic speech recognition through large language models: Voice Search makes use of the former, whereas YouTube speech transcription benefits significantly from the latter.
The language model is the component of a speech recognizer that assigns a probability to the next word in a sentence given the previous ones. As an example, if the previous words are “new york”, the model would assign a higher probability to “pizza” than say “granola”. The n-gram approach to language modeling (predicting the next word based on the previous n-1 words) is particularly well-suited to such large amounts of data: it scales gracefully, and the non-parametric nature of the model allows it to grow with more data. For example, on Voice Search we were able to train and evaluate 5-gram language models consisting of 12 billion n-grams, built using large vocabularies (1 million words), and trained on as many as 230 billion words.
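In its simplest maximum-likelihood form, sketched below, such a model estimates the probability of the next word from counts of n-grams in the training data; production systems add smoothing on top of this basic estimate.
\[
P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \approx
\frac{\mathrm{count}(w_{i-n+1}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-n+1}, \ldots, w_{i-1})}
\]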
The computational effort pays off, as highlighted by the plot above: both word error rate (a measure of speech recognition accuracy) and search error rate (a metric we use to evaluate the output of the speech recognition system when used in a search engine) decrease significantly with larger language models.
A more detailed summary of results on Voice Search and a few YouTube speech transcription tasks (authors: Ciprian Chelba, Dan Bikel, Maria Shugrina, Patrick Nguyen, Shankar Kumar) presents our results when increasing both the amount of training data and the size of the language model estimated from that data. Depending on the task, the availability and amount of training data used, as well as the language model size and the performance of the underlying speech recognizer, we observe reductions in word error rate between 6% and 10% relative, for systems across a wide range of operating points.
Cross-posted with the Research at Google G+ Page
Ngram Viewer 2.0
Thursday, October 18, 2012
Posted by Jon Orwant, Engineering Manager
Since launching the Google Books Ngram Viewer, we’ve been overjoyed by the public reception. Co-creator Will Brockman and I hoped that the ability to track the usage of phrases across time would be of interest to professional linguists, historians, and bibliophiles. What we didn’t expect was its popularity among casual users. Since the launch in 2010, the Ngram Viewer has been used about 50 times every minute to explore how phrases have been used in books spanning the centuries. That’s over 45 million graphs created, each one a glimpse into the history of the written word. For instance, comparing flapper, hippie, and yuppie, you can see when each word peaked:
Meanwhile, Google Books reached a milestone, having scanned 20 million books. That’s approximately one-seventh of all the books published since Gutenberg invented the printing press. We’ve updated the Ngram Viewer datasets to include a lot of those new books we’ve scanned, as well as improvements our engineers made in OCR and in hammering out inconsistencies between library and publisher metadata. (We’ve kept the old dataset around for scientists pursuing empirical, replicable language experiments such as the ones Jean-Baptiste Michel and Erez Lieberman Aiden conducted for our Science paper.)
At Google, we’re also trying to understand the meaning behind what people write, and to do that it helps to understand grammar. Last summer Slav Petrov of Google’s Natural Language Processing group and his intern Yuri Lin (who’s since joined Google full-time) built a system that identified parts of speech—nouns, adverbs, conjunctions and so forth—for all of the words in the millions of Ngram Viewer books. Now, for instance, you can compare the verb and noun forms of “cheer” to see how the frequencies have converged over time:
Some users requested the ability to combine Ngrams, and Googler Matthew Gray generalized that notion into what we’re calling Ngram compositions: the ability to add, subtract, multiply, and divide Ngram counts. For instance, you can see how “record player” rose at the expense of “Victrola”:
Our info page explains all the details about this curious notion of treating phrases like components of a mathematical expression. We’re guessing these compositions will only be of interest to lexicographers, but then again that’s what we thought about Ngram Viewer 1.0.
Oh, and we added Italian too, supplementing our current languages: English, Chinese, Spanish, French, German, Hebrew, and Russian. Buon divertimento!
Natural Language in Voice Search
Tuesday, July 31, 2012
Posted by Jakob Uszkoreit, Software Engineer
On July 26 and 27, we held our eighth annual Computer Science Faculty Summit on our Mountain View Campus. During the event, we brought you a series of blog posts dedicated to sharing the Summit's talks, panels and sessions, and we continue with this glimpse into natural language in voice search. --Ed
At this year’s Faculty Summit, I had the opportunity to showcase the newest version of Google Voice Search. This version hints at how Google Search, in particular on mobile devices and by voice, will become increasingly capable of responding to natural language queries.
I first outlined the trajectory of Google Voice Search, which was initially released in 2007. Voice actions, launched in 2010 for Android devices, made it possible to control your device by speaking to it. For example, if you wanted to set your device alarm for 10:00 AM, you could say “set alarm for 10:00 AM. Label: meeting on voice actions.” To indicate the subject of the alarm, a meeting about voice actions, you would have to use the keyword “label”! Certainly not everyone would think to frame the requested action this way. What if you could speak to your device in a more natural way and have it understand you?
At last month’s Google I/O 2012, we announced a version of voice actions that supports much more natural commands. For instance, your device will now set an alarm if you say “my meeting is at 10:00 AM, remind me”. This makes even previously existing functionality, such as sending a text message or calling someone, more discoverable on the device -- that is, if you express a voice command in whatever way feels natural to you, whether it be “let David know I’ll be late via text” or “make sure I buy milk by 3 pm”, there is now a good chance that your device will respond the way you anticipated.
I then discussed some of the possibly unexpected decisions we made when designing the system we now use for interpreting natural language queries or requests. For example, as you would expect from Google, our approach to interpreting natural language queries is data-driven and relies heavily on machine learning. In complex machine learning systems, however, it is often difficult to figure out the underlying cause of an error: after supplying them with training and test data, you merely obtain a set of metrics that hopefully give a reasonable indication of the system’s quality, but they fail to explain why a certain input led to a given, possibly wrong, output.
As a result, even understanding why some mistakes were made requires experts in the field and detailed analysis, rendering it nearly impossible to harness non-experts in analyzing and improving such systems. To avoid this, we aim to make every partial decision of the system as interpretable as possible. In many cases, any random speaker of English could look at its possibly erroneous behavior in response to some input and quickly identify the underlying issue - and in some cases even fix it!
We are especially interested in working with our academic colleagues on some of the many fascinating research and engineering challenges in building large-scale, yet interpretable natural language understanding systems and devising the machine learning algorithms this requires.
Announcing Google-hosted workshop videos from NIPS 2011
Thursday, February 23, 2012
Posted by John Blitzer and Douglas Eck, Google Research
At the 25th Neural Information Processing Systems (NIPS) conference in Granada, Spain last December, we engaged in dialogue with a diverse population of neuroscientists, cognitive scientists, statistical learning theorists, and machine learning researchers. More than twenty Googlers participated in an intensive single-track program of talks, nightly poster sessions and a workshop weekend in the Spanish Sierra Nevada mountains. Check out the NIPS 2011 blog post for full information on Google at NIPS.
In conjunction with our technical involvement and gold sponsorship of NIPS, we recorded the five workshops that Googlers helped to organize on various topics from big learning to music. We’re now pleased to provide access to these rich workshop experiences to the wider technical community.
Watch videos of Googler-led workshops on the YouTube Tech Talks Channel:
Big Learning: Algorithms, Systems, and Tools for Learning at Scale, by Joseph Gonzalez, Sameer Singh, Graham Taylor, James Bergstra, Alice Zheng, Misha Bilenko, Yucheng Low, Yoshua Bengio, Michael Franklin, Carlos Guestrin, Andrew McCallum, Alexander Smola, Michael Jordan, Sugato Basu (Googler)
Domain Adaptation Workshop: Theory and Application, by John Blitzer, Corinna Cortes, Afshin Rostamizadeh (all Googlers)
Learning Semantics, by Antoine Bordes, Jason Weston (Googler), Ronan Collobert, Leon Bottou
Sparse Representation and Low-rank Approximation, by Ameet Talwalkar, Lester Mackey, Mehryar Mohri (Googler), Michael Mahoney, Francis Bach, Mike Davies, Remi Gribonval, Guillaume Obozinski
International Workshop on Music and Machine Learning: Learning from Musical Structure, by Rafael Ramirez, Darrell Conklin, Douglas Eck (Googler), Ryan Rifkin (Googler)
To highlight a few workshops:
The Domain Adaptation workshop organized by Google, which fused theoretical and practical domain adaptation, featured invited talks from Shai Ben-David and Googler Mehryar Mohri on the theory side and Dan Roth on the applications side. This was just next door to Googlers Doug Eck and Ryan Rifkin's workshop on Machine Learning and Music, with musical demonstrations loud enough for the next-door neighbors to ask them to “turn it down a bit, please.” In addition to the Googler-run workshops, the Integrating Language and Vision workshop showcased invited talks by Google postdoctoral fellow Percy Liang on the pragmatics of visual scene description and Josh Tenenbaum on physical models as a cognitively plausible mechanism for bridging language and vision. Finally, Google consultant Andrew Ng was one of the organizers of the Deep Learning and Unsupervised Feature Learning workshop, which offered an extended tutorial, several inspiring talks, and two panel discussions (one with Googler Samy Bengio as panelist) exploring the question of “How deep is deep?”
As the workshop weekend drew to a close, an airline strike in Spain left NIPS attendees scrambling to get home for the holidays. We hope the skies look clear for 2012 when NIPS lands in Google’s neck of the woods, Lake Tahoe!
Building resources to syntactically parse the web
Wednesday, March 09, 2011
Posted by Slav Petrov and Ryan McDonald, Research Team
One major hurdle in organizing the world’s information is building computer systems that can understand natural, or human, language. Such understanding would advance if systems could automatically determine syntactic and semantic structures.
This analysis is an extremely complex inferential process. Consider for example the sentence, "A hearing is scheduled on the issue today." A syntactic parser needs to determine that "is scheduled" is a verb phrase, that the "hearing" is its subject, that the prepositional phrase "on the issue" is modifying the "hearing", and that today is an adverb modifying the verb phrase. Of course, humans do this all the time without realizing it. For computers, this is non-trivial as it requires a fair amount of background knowledge, typically encoded in a rich statistical model. Consider, "I saw a man with a jacket" versus "I saw a man with a telescope". In the former, we know that a "jacket" is something that people wear and is not a mechanism for viewing people. So syntactically, the "jacket" must be a property associated with the "man" and not the verb "saw", i.e., I did not see the man by using a jacket to view him. Whereas in the latter, we know that a telescope is something with which we can view people, so it can also be a property of the verb. Of course, it is ambiguous, maybe the man is carrying the telescope.
Linguistically inclined readers will of course notice that this parse tree has been simplified by omitting empty clauses and traces.
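As a small illustration of the analysis described above, the dependency structure of that sentence can be written as a set of head-dependent arcs; the relation names below are informal labels for this example, not a fixed annotation scheme.
# A small sketch of the dependency analysis of
# "A hearing is scheduled on the issue today", as (head, dependent, relation).
dependency_arcs = [
    ("scheduled", "is", "auxiliary"),        # "is scheduled" forms the verb phrase
    ("scheduled", "hearing", "subject"),     # "hearing" is its subject
    ("hearing", "A", "determiner"),
    ("hearing", "on", "preposition"),        # "on the issue" modifies "hearing"
    ("on", "issue", "object of preposition"),
    ("issue", "the", "determiner"),
    ("scheduled", "today", "adverbial"),     # "today" modifies the verb phrase
]

for head, dependent, relation in dependency_arcs:
    print(f"{dependent} --{relation}--> {head}")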
Computer programs with the ability to analyze the syntactic structure of language are fundamental to improving the quality of many tools millions of people use every day, including machine translation, question answering, information extraction, and sentiment analysis. Google itself is already using syntactic parsers in many of its projects. For example, this paper describes a system where a syntactic dependency parser is used to make translations more grammatical between languages with different word orderings. This paper uses the output of a syntactic parser to help determine the scope of negation within sentences, which is then used downstream to improve a sentiment analysis system.
To further this work, Google is pleased to announce a gift to the Linguistic Data Consortium (LDC) to create new annotated resources that can facilitate research progress in the area of syntactic parsing. The primary purpose of the gift is to generate data sets that language technology researchers can use to evaluate the robustness of new parsing methods in several web domains, such as blogs and discussion forums. The goal is to move parsing beyond its current focus on carefully edited text such as print news (for which annotated resources already exist) to domains with larger stylistic and topical variability (where spelling errors and grammatical mistakes are more common).
The Linguistic Data Consortium is a non-profit organization that produces and distributes linguistic data to researchers, technology developers, universities and university libraries. The LDC is hosted by the University of Pennsylvania and directed by Mark Liberman, Christopher H. Browne Distinguished Professor of Linguistics.
The LDC is the leader in building linguistic data resources and will annotate several thousand sentences with syntactic parse trees like the one shown in the figure. The annotation will be done manually by specially trained linguists, who will also have access to machine analysis and can correct errors the systems make. Once the annotation is completed, the corpus will be released to the research community through the LDC catalog. We look forward to seeing what they produce and what the natural language processing research community can do with this rich annotation resource.