Google Research Blog
The latest news from Research at Google
A Multilingual Corpus of Automatically Extracted Relations from Wikipedia
Tuesday, June 02, 2015
Posted by Shankar Kumar, Google Research Scientist and Manaal Faruqui, Carnegie Mellon University PhD candidate
In Natural Language Processing, relation extraction is the task of assigning a semantic relationship between a pair of arguments. As an example, a relationship between the phrases “Ottawa” and “Canada” is “is the capital of”. These extracted relations could be used in a variety of applications ranging from Question Answering to building databases from unstructured text.
While relation extraction systems work accurately for English and a few other languages, where tools for syntactic analysis such as parsers, part-of-speech taggers and named entity analyzers are readily available, relatively little work has gone into developing such systems for most of the world's languages, where linguistic analysis tools do not yet exist. Fortunately, because we do have translation systems between English and many other languages (such as Google Translate), we can translate text from a non-English language to English, perform relation extraction, and project these relations back to the foreign language.
Relation extraction in a Spanish sentence using the cross-lingual relation extraction pipeline.
In Multilingual Open Relation Extraction Using Cross-lingual Projection, which will appear at the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), we use this idea of cross-lingual projection to develop an algorithm that extracts open-domain relation tuples, i.e. tuples in which an arbitrary phrase can describe the relation between the arguments, in multiple languages from Wikipedia. In this work, we also evaluated the performance of the extracted relations using human annotations in French, Hindi and Russian.
Since there is no such publicly available corpus of multilingual relations, we are releasing a dataset of automatically extracted relations from the Wikipedia corpus in 61 languages, along with manually annotated relations in 3 languages (French, Hindi and Russian). It is our hope that our data will help researchers working on natural language processing and encourage novel applications in a wide variety of languages. More details on the corpus and the file formats can be found in this README file.
We wish to thank Bruno Cartoni, Vitaly Nikolaev, Hidetoshi Shimokawa, Kishore Papineni, John Giannandrea and their teams for making this data release possible. This dataset is licensed by Google Inc. under the Creative Commons Attribution-ShareAlike 3.0 License.
A picture is worth a thousand (coherent) words: building a natural description of images
Monday, November 17, 2014
Posted by Google Research Scientists Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan
“Two pizzas sitting on top of a stove top oven”
“A group of people shopping at an outdoor market”
“Best seats in the house”
People can summarize a complex scene in a few words without thinking twice. It’s much more difficult for computers. But we’ve just gotten a bit closer -- we’ve developed a machine-learning system that can automatically produce captions (like the three above) to accurately describe images the first time it sees them. This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images.
Recent research has greatly improved object detection, classification, and labeling. But accurately describing a complex scene requires a deeper representation of what’s going on in the scene, capturing how the various objects relate to one another and translating it all into natural-sounding language.
Automatically captioned: “Two pizzas sitting on top of a stove top oven”
Many efforts to construct computer-generated natural descriptions of images propose combining current state-of-the-art techniques in both computer vision and natural language processing to form a complete image description approach. But what if we instead merged recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human-readable sequence of words to describe it?
This idea comes from recent advances in machine translation between languages, where a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German.
Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN’s last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN’s rich encoding of the image into an RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that the descriptions it produces best match the training descriptions for each image.
The model combines a vision CNN with a language-generating RNN so it can take in an image and generate a fitting natural-language caption.
Our experiments with this system on several openly published datasets, including Pascal, Flickr8k, Flickr30k and SBU, show how robust the qualitative results are -- the generated sentences are quite reasonable. It also performs well in quantitative evaluations with the Bilingual Evaluation Understudy (BLEU), a metric used in machine translation to evaluate the quality of generated sentences.
A selection of evaluation results, grouped by human rating.
A picture may be worth a thousand words, but sometimes it’s the words that are most useful -- so it’s important we figure out ways to translate from images to words automatically and accurately. As the datasets suited to learning image descriptions grow and mature, so will the performance of end-to-end approaches like this. We look forward to continuing developments in systems that can read images and generate good natural-language descriptions. To get more details about the framework used to generate descriptions from images, as well as the model evaluation, read the full paper here.
Teaching machines to read between the lines (and a new corpus with entity salience annotations)
Monday, August 25, 2014
Posted by Dan Gillick, Research Scientist, and Dave Orr, Product Manager
Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as to encourage further research into these areas.
Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with the people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.
We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people -- we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.
One way to approach the problem is to look for words that appear more often than their ordinary rates. For example, if you see the word “coach” 5 times in a 581-word article, and compare that to the usual frequency of “coach” -- more like 5 in 330,000 words -- you have reason to suspect the article has something to do with coaching. The term “basketball” is even more extreme, appearing 150,000 times more often than usual. This is the idea behind the famous TF-IDF score, long used to index web pages.
Congratulations to Becky Hammon, first female NBA coach! Image via Wikipedia.
Term ratios are a start, but we can do better. Search indexing these days is much more involved, using, for example, the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real world, which we already know quite a bit about.
Background information about entities ought to help us decide which of them are most salient. After all, an article’s author assumes her readers have some general understanding of the world, and probably a bit about sports too. Using background knowledge, we might be able to infer that the WNBA is a salient entity in the Becky Hammon article even though it only appears once.
To encourage research on leveraging background information, we are releasing a large dataset of annotations to accompany the New York Times Annotated Corpus, including resolved Freebase entity IDs and labels indicating which entities are salient. The salience annotations are determined by automatically aligning entities in the document with entities in the accompanying human-written abstracts. Details of the salience annotations and some baseline results are described in our recent paper: A New Entity Salience Task with Millions of Training Examples (Jesse Dunietz and Dan Gillick).
Since our entity resolver works better for named entities like WNBA than for nominals like “coach” (this is the notoriously difficult word sense disambiguation problem, which we’ve previously touched on), the annotations are limited to names.
Below is sample output for a document. The first line contains the NYT document ID and the headline; each subsequent line includes an entity index, an indicator for salience, the mention count for this entity in the document as determined by our coreference system, the text of the first mention of the entity, the byte offsets (start and end) for the first mention of the entity, and the resolved Freebase MID.
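As a rough sketch, a reader for lines in this layout might look like the following; the tab-separated format and exact field order are assumptions based on the description above, so check the released files before relying on them.
# A minimal sketch of reading one document's salience annotations, assuming
# tab-separated fields in the order described above (not a specification of
# the released file format).
def read_salience_doc(lines):
    doc_id, headline = lines[0].split("\t", 1)
    entities = []
    for line in lines[1:]:
        index, salient, mentions, first_mention, start, end, mid = line.split("\t")
        entities.append({
            "index": int(index),
            "salient": salient == "1",          # indicator for salience (assumed encoding)
            "mention_count": int(mentions),      # from the coreference system
            "first_mention": first_mention,
            "offsets": (int(start), int(end)),   # byte offsets of the first mention
            "freebase_mid": mid,
        })
    return doc_id, headline, entities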
Features like mention count and document positioning give reasonable salience predictions. But because they only describe what’s explicitly in the document, we expect a system that uses background information to expose what’s implicit could give better results.
Download the data directly from Google Drive, or visit the project home page with more information at our Google Code site. We look forward to seeing what you come up with!
A Billion Words: Because today's language modeling standard should be higher
Wednesday, April 30, 2014
Posted by Dave Orr, Product Manager, and Ciprian Chelba, Research Scientist
Language is chock full of ambiguity, and it can turn up in surprising places. Many words are hard to tell apart without context: most Americans pronounce “ladder” and “latter” identically, for instance. Keyboard inputs on mobile devices have a similar problem, especially for IME keyboards. For example, the input patterns for “Yankees” and “takes” look very similar:
Photo credit: Kurt Partridge
But in this context -- the previous two words, “New York” -- “Yankees” is much more likely.
One key way computers use context is with language models. These are used for predictive keyboards, but also for speech recognition, machine translation, spelling correction, query suggestions, and so on. Often those models are specialized: word order for queries versus web pages can be very different. Either way, having an accurate language model with wide coverage drives the quality of all these applications.
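To make the “New York … Yankees” example concrete, here is a toy sketch of how a bigram-style model ranks candidate next words; the counts are invented purely for illustration, where a real model would be estimated from a large corpus.
# A toy sketch of using the previous words to rank candidates such as
# "Yankees" vs. "takes". The counts below are made up for illustration.
from collections import Counter

continuations = {
    ("new", "york"): Counter({"yankees": 120, "city": 300, "takes": 1}),
}

def next_word_probability(context, word):
    counts = continuations.get(tuple(w.lower() for w in context), Counter())
    total = sum(counts.values())
    return counts[word.lower()] / total if total else 0.0

print(next_word_probability(("New", "York"), "Yankees"))  # much higher...
print(next_word_probability(("New", "York"), "takes"))    # ...than this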
One thing that can be tricky when evaluating the quality of such complex systems is error attribution, because the components interact. Good engineering practice is to evaluate the quality of each module separately, including the language model. We believe the field could benefit from a large, standard dataset with benchmarks, making it easy to compare results and to experiment with new modeling techniques.
To that end, we are releasing scripts that convert a set of public data into a language modeling benchmark of over a billion words, with standardized training and test splits, described in an arXiv paper. Along with the scripts, we’re releasing the processed data, including the training and test sets, in one convenient location. This will make it much easier for the research community to quickly reproduce results, and we hope it will speed up progress on these tasks.
The benchmark scripts and data are freely available, and can be found here:
http://www.statmt.org/lm-benchmark/
The field needs a new and better standard benchmark. Currently, researchers report results on datasets of their choice, and those results are very hard to reproduce because there is no standard preprocessing. We hope that this release will solve both problems and become the standard benchmark for language modeling experiments. As more researchers use the new benchmark, comparisons will be easier and more accurate, and progress will be faster.
For all the researchers out there: try out this benchmark, run your experiments, and let us know how it goes -- or publish, and we’ll enjoy finding your results at conferences and in journals.
Free Language Lessons for Computers
Tuesday, December 03, 2013
Posted by Dave Orr, Google Research Product Manager
Not everything that can be counted counts.
Not everything that counts can be counted.
- William Bruce Cameron
50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.
These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.
But data by itself doesn’t mean much. Data is only valuable in the right context, and only if it leads to increased knowledge. Labeled data is critical for training and evaluating machine-learned systems in many arenas, improving systems that can increase our ability to understand the world. Advances in natural language understanding, information retrieval, information extraction, computer vision, and so on can help us tell stories, mine for valuable insights, or visualize information in beautiful and compelling ways.
That’s why we are pleased to be able to release sets of labeled data from various domains and with various annotations, some automatic and some manual. Our hope is that the research community will use these datasets in ways both straightforward and surprising, to improve systems for annotation or understanding, and perhaps launch new efforts we haven’t thought of.
Here’s a listing of the major datasets we’ve released in the last year; you can also subscribe to our mailing list to hear about future releases. Please tell us what you’ve managed to accomplish, or send us pointers to papers that use this data. We want to see what the research world can do with what we’ve created.
50,000 Lessons on How to Read: a Relation Extraction Corpus
What is it: A human-judged dataset of two relations involving public figures on Wikipedia: about 10,000 examples of “place of birth” and 40,000 examples of “attended or graduated from an institution.”
Where can I find it: https://code.google.com/p/relation-extraction-corpus/
I want to know more: Here’s a handy blog post with a broader explanation, descriptions and examples of the data, and plenty of links to learn more.
11 Billion Clues in 800 Million Documents
What is it: We took the ClueWeb corpora and automatically labeled concepts and entities with Freebase concept IDs, an example of entity resolution. This dataset is huge: nearly 800 million web pages.
Where can I find it: We released two corpora: ClueWeb09 FACC and ClueWeb12 FACC.
I want to know more: We described the process and results in a recent blog post.
Features Extracted From YouTube Videos for Multiview Learning
What is it: Multiple feature families from a set of public YouTube videos of games. The videos are labeled with one of 30 categories, and each has an associated set of visual, auditory, and textual features.
Where can I find it: The data and more information can be obtained from the UCI machine learning repository (multiview video dataset), or from Google’s repository.
I want to know more: Read more about the data and uses for it here.
40 Million Entities in Context
What is it: A disambiguation set consisting of pointers to 10 million web pages with 40 million entities that have links to Wikipedia. This is another entity resolution corpus, since the links can be used to disambiguate the mentions; but unlike the ClueWeb example above, the links were inserted by the web page authors and can therefore be considered human annotation.
Where can I find it: Here’s the WikiLinks corpus, and tools to help use this data can be found on our partner’s page: UMass Wiki-links.
I want to know more: Other disambiguation sets, data formats, ideas for uses of this data, and more can be found at our blog post announcing the release.
Distributing the Edit History of Wikipedia Infoboxes
What is it: The edit history of 1.8 million infoboxes in Wikipedia pages, in one handy resource. Attributes on Wikipedia change over time, and some of them change more than others. Understanding attribute change is important for extracting accurate and useful information from Wikipedia.
Where can I find it: Download from Google or from Wikimedia Deutschland.
I want to know more: We posted a detailed look at the data, the process for gathering it, and where to find it. You can also read a paper we published on the release.
Note the change in the capital of Palau.
Syntactic Ngrams over Time
What is it: We automatically syntactically analyzed 350 billion words from the 3.5 million English-language books in Google Books, and collated and released the resulting fragments -- billions of unique tree fragments with counts, sorted into types. The underlying corpus is the same one that underlies the recently updated Google Ngram Viewer.
Where can I find it: http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
I want to know more: We discussed the nature of dependency parses and described the data and release in a blog post. We also published a paper about the release.
Dictionaries for linking Text, Entities, and Ideas
What is it: A large database of 175 million strings paired with 7.5 million concepts, annotated with counts, mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor-text spans that link to the concepts in question.
Where can I find it: http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2
I want to know more: A description of the data, several examples, and ideas for uses for it can be found in a blog post or in the associated paper.
Other datasets
Not every release had its own blog post describing it. Here are some other releases:
Automatic Freebase annotations of TREC’s Million Query and Web track queries.
A set of Freebase triples that have been deleted from Freebase over time -- 63 million of them.
New Research Challenges in Language Understanding
Friday, November 22, 2013
Posted by Maggie Johnson, Director of Education and University Relations
We held the first global Language Understanding and Knowledge Discovery Focused Faculty Workshop in Nanjing, China, on November 14-15, 2013. Thirty-four faculty members joined the workshop, arriving from 10 countries and regions across APAC, EMEA and the US. Googlers from Research, Engineering and University Relations/University Programs also attended the event.
The 2-day workshop included keynote talks, panel discussions and break-out sessions [agenda]. It was an engaging and productive workshop, and we saw lots of positive interactions among the attendees. The workshop encouraged communication between Google and faculty around the world working in these areas.
Research in text mining continues to explore open questions relating to entity annotation, relation extraction, and more. The workshop’s goal was to brainstorm and discuss relevant topics to further investigate these areas. Ultimately, this research should help provide users with search results that are much more relevant to them.
At the end of the workshop, participants identified four topics representing challenges and opportunities for further exploration in Language Understanding and Knowledge Discovery:
Knowledge representation, integration, and maintenance
Efficient and scalable infrastructure and algorithms for inferencing
Presentation and explanation of knowledge
Multilingual computation
Going forward, Google will be collaborating with academic researchers on a position paper related to these topics. We also welcome faculty interested in contributing to further research in this area to submit a proposal to the Faculty Research Awards program. Faculty Research Awards are one-year grants to researchers working in areas of mutual interest.
The faculty attendees responded positively to the focused workshop format, as it allowed time to go in depth into important and timely research questions. Encouraged by their feedback, we are considering similar workshops on other topics in the future.
11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts
Wednesday, July 17, 2013
Posted by Dave Orr, Amar Subramanya, Evgeniy Gabrilovich, and Michael Ringgaard, Google Research
“I assume that by knowing the truth you mean knowing things as they really are.”
- Plato
When you type in a search query -- perhaps Plato -- are you interested in the string of letters you typed, or in the concept or entity represented by that string? Knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval -- you have to know what the string actually refers to. The Knowledge Graph and Freebase are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.
We’ve previously released data to help with disambiguation and recently awarded $1.2M in research grants to work on related problems. Today we’re taking another step: releasing data consisting of nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.
These Freebase Annotations of the ClueWeb Corpora (FACC) consist of ClueWeb09 FACC and ClueWeb12 FACC. 11 billion phrases that refer to concepts and entities in Freebase were automatically labeled with their unique identifiers (Freebase MIDs).
Since the annotation process was automatic, it likely made mistakes. We optimized for precision over recall, so the algorithm skipped a phrase if it wasn’t confident enough of the correct MID. If you prefer higher precision, we include confidence levels, so you can filter out lower confidence annotations that we did include.
Based on review of a sample of documents, we believe the precision is about 80-85%, and recall, which is inherently difficult to measure in situations like this, is in the range of 70-85%. Not every ClueWeb document is included in this corpus; documents in which we found no entities were excluded from the set. A document might be excluded because there were no entities to be found, because the entities in question weren’t in Freebase, or because none of the entities were resolved at a confidence level above the threshold.
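If you do want to filter by confidence, a reader along these lines may be a useful starting point; the tab-separated layout and column positions are assumptions for illustration, so consult the corpus documentation for the actual file format.
# A minimal sketch of filtering FACC-style annotations by confidence.
# The column layout (phrase first, confidence and Freebase MID last) is an
# assumption, not the documented format of the released files.
import csv

def high_confidence_annotations(path, threshold=0.9):
    """Yield (phrase, freebase_mid, confidence) rows above the threshold."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            phrase, confidence, mid = row[0], float(row[-2]), row[-1]  # assumed columns
            if confidence >= threshold:
                yield phrase, mid, confidence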
The ClueWeb data is used in multiple TREC tracks. You may also be interested in our annotations of several TREC query sets, including those from the Million Query Track and Web Track.
If you would prefer a human-annotated set, you might want to look at the Wikilinks Corpus we released last year. Entities there were disambiguated by links to Wikipedia, inserted by the authors of the page, which is effectively a form of human annotation.
You can find more detail and download the data on the pages for the two sets: ClueWeb09 FACC and ClueWeb12 FACC. You can also subscribe to our data release mailing list to learn about releases as they happen.
Special thanks to Jamie Callan and Juan Caicedo Carvajal for their help throughout the annotation project.
50,000 Lessons on How to Read: a Relation Extraction Corpus
Thursday, April 11, 2013
Posted by Dave Orr, Product Manager, Google Research
One of the most difficult tasks in NLP is called relation extraction. It’s an example of information extraction, one of the goals of natural language understanding. A relation is a semantic connection between (at least) two entities. For instance, you could say that Jim Henson was in a spouse relation with Jane Henson (and in a creator relation with many beloved characters and shows).
The goal of relation extraction is to learn relations from unstructured natural language text. The relations can be used to answer questions (“Who created Kermit?”), learn which proteins interact in the biomedical literature, or to build a database of hundreds of millions of entities and billions of relations to try and help people explore the world’s information.
To help researchers investigate relation extraction, we’re releasing a human-judged dataset of two relations about public figures on Wikipedia: nearly 10,000 examples of “place of birth”, and over 40,000 examples of “attended or graduated from an institution”. Each of these was judged by at least 5 raters, and can be used to train or evaluate relation extraction systems. We also plan to release more relations of new types in the coming months. (Update: you can find additional relations here.)
Each relation is in the form of a triple: the relation in question, called a predicate; the subject of the relation; and the object of the relation. In the relation “Stephen Hawking graduated from Oxford,” Stephen Hawking is the subject, “graduated from” is the predicate, and Oxford University is the object. Subjects and objects are represented by their Freebase MIDs, and the relation is defined as a Freebase property. So in this case, the triple would be represented as:
"pred":"
/education/education/institution
"
"sub":"
/m/01tdnyh
"
"obj":"
/m/07tgn
"
Just having the triples is interesting enough if you want a database of entities and relations, but it doesn’t make much progress towards training or evaluating a relation extraction system. So we’ve also included the evidence for each relation, in the form of a URL and an excerpt from the web page that our raters judged. We’re also including examples where the evidence does not support the relation, so you have negative examples for use in training better extraction systems. Finally, we included IDs and the actual judgments of individual raters, so that you can filter triples by agreement.
Gory Details
The corpus itself, extracted from Wikipedia, can be found here:
https://code.google.com/p/relation-extraction-corpus/
The files are in JSON format. Each line is a triple with the following fields:
pred: predicate of the triple
sub: subject of the triple
obj: object of the triple
evidences: an array of evidence items for this triple, each with:
  url: the web page from which this evidence was obtained
  snippet: a short piece of text supporting the triple
judgments: an array of judgments from human annotators, each with:
  rater: hash code of the identity of the annotator
  judgment: the annotator’s judgment; it can take the values "yes" or "no"
Here’s an example:
{"pred":"/people/person/place_of_birth","sub":"/m/026_tl9","obj":"/m/02_286","evidences":[{"url":"http://en.wikipedia.org/wiki/Morris_S._Miller","snippet":"Morris Smith Miller (July 31, 1779 -- November 16, 1824) was a United States Representative from New York. Born in New York City, he graduated from Union College in Schenectady in 1798. He studied law and was admitted to the bar. Miller served as private secretary to Governor Jay, and subsequently, in 1806, commenced the practice of his profession in Utica. He was president of the village of Utica in 1808 and judge of the court of common pleas of Oneida County from 1810 until his death."}],"judgments":[{"rater":"11595942516201422884","judgment":"yes"},{"rater":"16169597761094238409","judgment":"yes"},{"rater":"1014448455121957356","judgment":"yes"},{"rater":"16651790297630307764","judgment":"yes"},{"rater":"1855142007844680025","judgment":"yes"}]}
The web is chock full of information, put there to be read and learned from. Our hope is that this corpus is a small step towards computational understanding of the wealth of relations to be found everywhere you look.
This dataset is licensed by Google Inc. under the Creative Commons Attribution-ShareAlike 3.0 license.
Thanks to Shaohua Sun, Ni Lao, and Rahul Gupta for putting this dataset together.
Thanks also to Michael Ringgaard, Fernando Pereira, Amar Subramanya, Evgeniy Gabrilovich, and John Giannandrea for making this data release possible.
Learning from Big Data: 40 Million Entities in Context
Friday, March 08, 2013
Posted by Dave Orr, Amar Subramanya, and Fernando Pereira, Google Research
When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.
To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages -- over 100 times bigger than the next largest corpus (about 100,000 documents; see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.
Dataset                            Number of Mentions    Number of Entities
Bentivogli et al. (data) (2008)    43,704                709
Day et al. (2008)                  less than 55,000      3,660
Artiles et al. (data) (2010)       57,357                300
Wikilinks Corpus                   40,323,863            2,933,659
What might you do with this data? Well, we’ve already written one ACL paper on cross-document co-reference (and received lots of requests for the underlying data, which partly motivates this release). And really, we look forward to seeing what you are going to do with it! But here are a few ideas:
Look into coreference -- when different mentions refer to the same entity -- or entity resolution -- matching a mention to the underlying entity
Work on the bigger problem of cross-document coreference, which is how to find out if different web pages are talking about the same person or other entity
Learn things about entities by aggregating information across all the documents they’re mentioned in
Type tagging tries to assign types (broad, like person or location, or specific, like amusement park ride) to entities. To the extent that the Wikipedia pages contain the type information you’re interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia.
Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn’t possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.
Gory Details
How do you actually get the data? It’s right here: Google’s Wikilinks Corpus. Tools and data with extra context can be found on our partners’ page: UMass Wiki-links. Understanding the corpus, however, is a little bit involved.
For copyright reasons, we cannot distribute actual annotated web pages. Instead, we’re providing an index of URLs and the tools to create the dataset, or whichever slice of it you care about, yourself. Specifically, we’re providing:
The URLs of all the pages that contain labeled mentions, which are links to English Wikipedia
The anchor text of the link (the mention string), the Wikipedia link target, and the byte offset of the link for every page in the set
The byte offset of the 10 least frequent words on the page, to act as a signature to ensure that the underlying text hasn’t changed -- think of this as a version, or fingerprint, of the page
Software tools (on the UMass site) to: download the web pages; extract the mentions, with ways to recover if the byte offsets don’t match; select the text around the mentions as local context; and compute evaluation metrics over predicted entities.
The format looks like this:
URL http://1967mercurycougar.blogspot.com/2009_10_01_archive.html
MENTION Lincoln Continental Mark IV 40110 http://en.wikipedia.org/wiki/Lincoln_Continental_Mark_IV
MENTION 1975 MGB roadster 41481 http://en.wikipedia.org/wiki/MG_MGB
MENTION Buick Riviera 43316 http://en.wikipedia.org/wiki/Buick_Riviera
MENTION Oldsmobile Toronado 43397 http://en.wikipedia.org/wiki/Oldsmobile_Toronado
TOKEN seen 58190
TOKEN crush 63118
TOKEN owners 69290
TOKEN desk 59772
TOKEN relocate 70683
TOKEN promote 35016
TOKEN between 70846
TOKEN re 52821
TOKEN getting 68968
TOKEN felt 41508
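A minimal sketch of parsing records in this format follows; the tab delimiter is an assumption (the multi-word mention strings suggest the fields are not simply space-separated), so adjust it to match the released files.
# A minimal sketch of parsing one URL/MENTION/TOKEN record into mentions and
# fingerprint tokens. The tab-separated layout is assumed for illustration.
def parse_wikilinks_record(lines):
    record = {"url": None, "mentions": [], "tokens": []}
    for line in lines:
        kind, rest = line.split("\t", 1)
        if kind == "URL":
            record["url"] = rest
        elif kind == "MENTION":
            text, offset, target = rest.rsplit("\t", 2)   # mention, byte offset, Wikipedia link
            record["mentions"].append((text, int(offset), target))
        elif kind == "TOKEN":
            word, offset = rest.split("\t")               # rare word and its byte offset
            record["tokens"].append((word, int(offset)))
    return record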
We’d love to hear what you’re working on, and look forward to what you can do with 40 million mentions across over 10 million web pages!
Thanks to our collaborators at UMass Amherst: Sameer Singh and Andrew McCallum.
Large Scale Language Modeling in Automatic Speech Recognition
Wednesday, October 31, 2012
Posted by Ciprian Chelba, Research Scientist
At Google, we’re able to use the large amounts of data made available by the Web’s fast growth. Two such data sources are the anonymized queries on google.com and the web itself. They help improve automatic speech recognition through large language models: Voice Search makes use of the former, whereas YouTube speech transcription benefits significantly from the latter.
The language model is the component of a speech recognizer that assigns a probability to the next word in a sentence given the previous ones. As an example, if the previous words are “new york”, the model would assign a higher probability to “pizza” than say “granola”. The n-gram approach to language modeling (predicting the next word based on the previous n-1 words) is particularly well-suited to such large amounts of data: it scales gracefully, and the non-parametric nature of the model allows it to grow with more data. For example, on Voice Search we were able to train and evaluate 5-gram language models consisting of 12 billion n-grams, built using large vocabularies (1 million words), and trained on as many as 230 billion words.
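In its simplest maximum-likelihood form, sketched below, such a model estimates the probability of the next word from counts of n-grams in the training data; production systems add smoothing on top of this basic estimate.
\[
P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \approx
\frac{\mathrm{count}(w_{i-n+1}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-n+1}, \ldots, w_{i-1})}
\]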
The computational effort pays off, as highlighted by the plot above: both word error rate (a measure of speech recognition accuracy) and search error rate (a metric we use to evaluate the output of the speech recognition system when used in a search engine) decrease significantly with larger language models.
A more detailed summary of results on Voice Search and a few YouTube speech transcription tasks (authors: Ciprian Chelba, Dan Bikel, Maria Shugrina, Patrick Nguyen, Shankar Kumar) presents our results when increasing both the amount of training data and the size of the language model estimated from that data. Depending on the task, the availability and amount of training data used, as well as the language model size and the performance of the underlying speech recognizer, we observe reductions in word error rate between 6% and 10% relative, for systems across a wide range of operating points.
Cross-posted with the Research at Google G+ Page
Ngram Viewer 2.0
Thursday, October 18, 2012
Posted by Jon Orwant, Engineering Manager
Since launching the Google Books Ngram Viewer, we’ve been overjoyed by the public reception. Co-creator Will Brockman and I hoped that the ability to track the usage of phrases across time would be of interest to professional linguists, historians, and bibliophiles. What we didn’t expect was its popularity among casual users. Since the launch in 2010, the Ngram Viewer has been used about 50 times every minute to explore how phrases have been used in books spanning the centuries. That’s over 45 million graphs created, each one a glimpse into the history of the written word. For instance, comparing flapper, hippie, and yuppie, you can see when each word peaked:
Meanwhile, Google Books reached a milestone, having scanned 20 million books. That’s approximately one-seventh of all the books published since Gutenberg invented the printing press. We’ve updated the Ngram Viewer datasets to include a lot of those new books we’ve scanned, as well as improvements our engineers made in OCR and in hammering out inconsistencies between library and publisher metadata. (We’ve kept the old dataset around for scientists pursuing empirical, replicable language experiments such as the ones Jean-Baptiste Michel and Erez Lieberman Aiden conducted for our Science paper.)
At Google, we’re also trying to understand the meaning behind what people write, and to do that it helps to understand grammar. Last summer Slav Petrov of Google’s Natural Language Processing group and his intern Yuri Lin (who’s since joined Google full-time) built a system that identified parts of speech—nouns, adverbs, conjunctions and so forth—for all of the words in the millions of Ngram Viewer books. Now, for instance, you can compare the verb and noun forms of “cheer” to see how the frequencies have converged over time:
Some users requested the ability to combine Ngrams, and Googler Matthew Gray generalized that notion into what we’re calling Ngram compositions: the ability to add, subtract, multiply, and divide Ngram counts. For instance, you can see how “record player” rose at the expense of “Victrola”:
Our info page explains all the details about this curious notion of treating phrases like components of a mathematical expression. We’re guessing these compositions will only be of interest to lexicographers, but then again that’s what we thought about Ngram Viewer 1.0.
Oh, and we added Italian too, supplementing our current languages: English, Chinese, Spanish, French, German, Hebrew, and Russian. Buon divertimento!
Natural Language in Voice Search
Tuesday, July 31, 2012
Posted by Jakob Uszkoreit, Software Engineer
On July 26 and 27, we held our eighth annual Computer Science Faculty Summit on our Mountain View Campus. During the event, we brought you a series of blog posts dedicated to sharing the Summit's talks, panels and sessions, and we continue with this glimpse into natural language in voice search. --Ed
At this year’s Faculty Summit, I had the opportunity to showcase the newest version of Google Voice Search. This version hints at how Google Search, in particular on mobile devices and by voice, will become increasingly capable of responding to natural language queries.
I first outlined the trajectory of Google Voice Search, which was initially released in 2007. Voice actions, launched in 2010 for Android devices, made it possible to control your device by speaking to it. For example, if you wanted to set your device alarm for 10:00 AM, you could say “set alarm for 10:00 AM. Label: meeting on voice actions.” To indicate the subject of the alarm, a meeting about voice actions, you would have to use the keyword “label”! Certainly not everyone would think to frame the requested action this way. What if you could speak to your device in a more natural way and have it understand you?
At last month’s Google I/O 2012, we announced a version of voice actions that supports much more natural commands. For instance, your device will now set an alarm if you say “my meeting is at 10:00 AM, remind me”. This makes even previously existing functionality, such as sending a text message or calling someone, more discoverable on the device -- that is, if you express a voice command in whatever way feels natural to you, whether it be “let David know I’ll be late via text” or “make sure I buy milk by 3 pm”, there is now a good chance that your device will respond the way you anticipated.
I then discussed some of the possibly unexpected decisions we made when designing the system we now use for interpreting natural language queries or requests. For example, as you would expect from Google, our approach to interpreting natural language queries is data-driven and relies heavily on machine learning. In complex machine learning systems, however, it is often difficult to figure out the underlying cause of an error: after supplying them with training and test data, you merely obtain a set of metrics that hopefully give a reasonable indication of the system’s quality, but they fail to explain why a certain input led to a given, possibly wrong, output.
As a result, even understanding why some mistakes were made requires experts in the field and detailed analysis, rendering it nearly impossible to harness non-experts in analyzing and improving such systems. To avoid this, we aim to make every partial decision of the system as interpretable as possible. In many cases, any random speaker of English could look at its possibly erroneous behavior in response to some input and quickly identify the underlying issue - and in some cases even fix it!
We are especially interested in working with our academic colleagues on some of the many fascinating research and engineering challenges in building large-scale, yet interpretable natural language understanding systems and devising the machine learning algorithms this requires.
Announcing Google-hosted workshop videos from NIPS 2011
Thursday, February 23, 2012
Posted by John Blitzer and Douglas Eck, Google Research
At the 25th Neural Information Processing Systems (NIPS) conference in Granada, Spain last December, we engaged in dialogue with a diverse population of neuroscientists, cognitive scientists, statistical learning theorists, and machine learning researchers. More than twenty Googlers participated in an intensive single-track program of talks, nightly poster sessions and a workshop weekend in the Spanish Sierra Nevada mountains. Check out the NIPS 2011 blog post for full information on Google at NIPS.
In conjunction with our technical involvement and gold sponsorship of NIPS, we recorded the five workshops that Googlers helped to organize on various topics from big learning to music. We’re now pleased to provide access to these rich workshop experiences to the wider technical community.
Watch videos of Googler-led workshops on the YouTube Tech Talks Channel:
Big Learning: Algorithms, Systems, and Tools for Learning at Scale, by Joseph Gonzalez, Sameer Singh, Graham Taylor, James Bergstra, Alice Zheng, Misha Bilenko, Yucheng Low, Yoshua Bengio, Michael Franklin, Carlos Guestrin, Andrew McCallum, Alexander Smola, Michael Jordan, Sugato Basu (Googler)
Domain Adaptation Workshop: Theory and Application, by John Blitzer, Corinna Cortes, Afshin Rostamizadeh (all Googlers)
Learning Semantics, by Antoine Bordes, Jason Weston (Googler), Ronan Collobert, Leon Bottou
Sparse Representation and Low-rank Approximation, by Ameet Talwalkar, Lester Mackey, Mehryar Mohri (Googler), Michael Mahoney, Francis Bach, Mike Davies, Remi Gribonval, Guillaume Obozinski
International Workshop on Music and Machine Learning: Learning from Musical Structure, by Rafael Ramirez, Darrell Conklin, Douglas Eck (Googler), Ryan Rifkin (Googler)
To highlight a few workshops:
The Domain Adaptation workshop organized by Google, which fused theoretical and practical domain adaptation, featured invited talks from Shai Ben-David and Googler Mehryar Mohri on the theory side and Dan Roth on the applications side. This was just next door to Googlers Doug Eck and Ryan Rifkin's workshop on Machine Learning and Music, with musical demonstrations loud enough for the next-door neighbors to ask them to “turn it down a bit, please.” In addition to the Googler-run workshops, the Integrating Language and Vision workshop showcased invited talks by Google postdoctoral fellow Percy Liang on the pragmatics of visual scene description and Josh Tenenbaum on physical models as a cognitively plausible mechanism for bridging language and vision. Finally, Google consultant Andrew Ng was one of the organizers of the Deep Learning and Unsupervised Feature Learning workshop, which offered an extended tutorial, several inspiring talks, and two panel discussions (one with Googler Samy Bengio as panelist) exploring the question of “How deep is deep?”
As the workshop weekend drew to a close, an airline strike in Spain left NIPS attendees scrambling to get home for the holidays. We hope the skies look clear for 2012 when NIPS lands in Google’s neck of the woods, Lake Tahoe!
Building resources to syntactically parse the web
Wednesday, March 09, 2011
Posted by Slav Petrov and Ryan McDonald, Research Team
One major hurdle in organizing the world’s information is building computer systems that can understand natural, or human, language. Such understanding would advance if systems could automatically determine syntactic and semantic structures.
This analysis is an extremely complex inferential process. Consider for example the sentence, "A hearing is scheduled on the issue today." A syntactic parser needs to determine that "is scheduled" is a verb phrase, that the "hearing" is its subject, that the prepositional phrase "on the issue" is modifying the "hearing", and that today is an adverb modifying the verb phrase. Of course, humans do this all the time without realizing it. For computers, this is non-trivial as it requires a fair amount of background knowledge, typically encoded in a rich statistical model. Consider, "I saw a man with a jacket" versus "I saw a man with a telescope". In the former, we know that a "jacket" is something that people wear and is not a mechanism for viewing people. So syntactically, the "jacket" must be a property associated with the "man" and not the verb "saw", i.e., I did not see the man by using a jacket to view him. Whereas in the latter, we know that a telescope is something with which we can view people, so it can also be a property of the verb. Of course, it is ambiguous, maybe the man is carrying the telescope.
Linguistically inclined readers will of course notice that this parse tree has been simplified by omitting empty clauses and traces.
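As a small illustration of the analysis described above, the dependency structure of that sentence can be written as a set of head-dependent arcs; the relation names below are informal labels for this example, not a fixed annotation scheme.
# A small sketch of the dependency analysis of
# "A hearing is scheduled on the issue today", as (head, dependent, relation).
dependency_arcs = [
    ("scheduled", "is", "auxiliary"),        # "is scheduled" forms the verb phrase
    ("scheduled", "hearing", "subject"),     # "hearing" is its subject
    ("hearing", "A", "determiner"),
    ("hearing", "on", "preposition"),        # "on the issue" modifies "hearing"
    ("on", "issue", "object of preposition"),
    ("issue", "the", "determiner"),
    ("scheduled", "today", "adverbial"),     # "today" modifies the verb phrase
]

for head, dependent, relation in dependency_arcs:
    print(f"{dependent} --{relation}--> {head}")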
Computer programs with the ability to analyze the syntactic structure of language are fundamental to improving the quality of many tools millions of people use every day, including machine translation, question answering, information extraction, and sentiment analysis. Google itself is already using syntactic parsers in many of its projects. For example, this paper describes a system where a syntactic dependency parser is used to make translations more grammatical between languages with different word orderings. This paper uses the output of a syntactic parser to help determine the scope of negation within sentences, which is then used downstream to improve a sentiment analysis system.
To further this work, Google is pleased to announce a gift to the Linguistic Data Consortium (LDC) to create new annotated resources that can facilitate research progress in the area of syntactic parsing. The primary purpose of the gift is to generate data sets that language technology researchers can use to evaluate the robustness of new parsing methods in several web domains, such as blogs and discussion forums. The goal is to move parsing beyond its current focus on carefully edited text such as print news (for which annotated resources already exist) to domains with larger stylistic and topical variability (where spelling errors and grammatical mistakes are more common).
The Linguistic Data Consortium is a non-profit organization that produces and distributes linguistic data to researchers, technology developers, universities and university libraries. The LDC is hosted by the University of Pennsylvania and directed by Mark Liberman, Christopher H. Browne Distinguished Professor of Linguistics.
The LDC is the leader in building linguistic data resources and will annotate several thousand sentences with syntactic parse trees like the one shown in the figure. The annotation will be done manually by specially trained linguists, who will also have access to machine analysis and can correct errors the systems make. Once the annotation is completed, the corpus will be released to the research community through the LDC catalog. We look forward to seeing what they produce and what the natural language processing research community can do with this rich annotation resource.