Google Research Blog
The latest news from Research at Google
See through the clouds with Earth Engine and Sentinel-1 Data
Monday, August 03, 2015
Posted by Luc Vincent, Engineering Director, Geo Imagery
This year the Google Earth Engine team attended the European Geosciences Union General Assembly meeting in Vienna, Austria, to engage with a number of European geoscientific partners. This was just the first of a series of European summits the team has attended over the past few months, including, most recently, the IEEE Geoscience and Remote Sensing Society meeting held last week in Milan, Italy.
Noel Gorelick presenting Google Earth Engine at EGU 2015.
We are very excited to be collaborating with many European scientists from esteemed institutions such as the European Commission Joint Research Centre, Wageningen University, and the University of Pavia. These researchers are using the Earth Engine geospatial analysis platform to address issues of global importance in areas such as food security, deforestation detection, urban settlement detection, and freshwater availability.
Thanks to the enlightened free and open data policy of the European Commission and the European Space Agency, we are pleased to announce the availability of Copernicus Sentinel-1 data through Earth Engine for visualization and analysis. Sentinel-1, a radar imaging satellite that can see through clouds, is the first of at least six Copernicus satellites going up in the next six years.
Sentinel-1 data visualized using Earth Engine, showing Vienna (left) and Milan (right).
Wind farms seen off the Eastern coast of England.
This radar data offers a powerful complement to the optical and thermal data from satellites like Landsat that are already available in the Earth Engine public data catalog. If you are a geoscientist interested in accessing and analyzing the newly available EC/ESA Sentinel-1 data, or anything else in our multi-petabyte data catalog, please sign up for Google Earth Engine.
We look forward to further engagements with the European research community and are excited to see what the world will do with the data from the European Union's Copernicus program satellites.
A Multilingual Corpus of Automatically Extracted Relations from Wikipedia
Tuesday, June 02, 2015
Posted by Shankar Kumar, Google Research Scientist and Manaal Faruqui, Carnegie Mellon University PhD candidate
In Natural Language Processing, relation extraction is the task of assigning a semantic relationship to a pair of arguments. For example, the relationship between the phrases “Ottawa” and “Canada” is “is the capital of”. Extracted relations like these can be used in a variety of applications, ranging from Question Answering to building databases from unstructured text.
While relation extraction systems work accurately for English and a few other languages, where tools for syntactic analysis such as parsers, part-of-speech taggers, and named entity analyzers are readily available, relatively little work has been done for most of the world's languages, where such linguistic analysis tools do not yet exist. Fortunately, because we do have translation systems between English and many other languages (such as Google Translate), we can translate text from a non-English language to English, perform relation extraction there, and project the extracted relations back to the original language.
Relation extraction in a Spanish sentence using the cross-lingual relation extraction pipeline.
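The projection idea above can be sketched in a few lines of Python. This is a toy illustration, not Google's actual system: every function here is a hypothetical stand-in (a real pipeline would use a machine translation service that exposes word alignments and a trained English open relation extractor), and the Spanish sentence is hard-coded for the example.

```python
# Toy sketch of cross-lingual relation projection. All functions are
# illustrative stand-ins, hard-coded for the single example below.

def translate_with_alignment(source_tokens):
    """Hypothetical MT step: English translation plus a word-alignment
    map from English token index -> source-language token index."""
    english = "Ottawa is the capital of Canada".split()
    alignment = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
    return english, alignment

def extract_relation_en(tokens):
    """Hypothetical English open relation extractor: returns
    (arg1, relation, arg2) spans as half-open token-index ranges."""
    return (0, 1), (1, 5), (5, 6)

def project(span, alignment):
    """Map an English token span back to source-language indices."""
    indices = [alignment[i] for i in range(*span) if i in alignment]
    return min(indices), max(indices) + 1

sentence_es = "Ottawa es la capital de Canadá".split()
english, alignment = translate_with_alignment(sentence_es)
arg1, rel, arg2 = extract_relation_en(english)

# Project each English span back to Spanish and read off the tuple.
tuple_es = tuple(
    " ".join(sentence_es[slice(*project(span, alignment))])
    for span in (arg1, rel, arg2)
)
print(tuple_es)  # ('Ottawa', 'es la capital de', 'Canadá')
```

With a real aligner the mapping is many-to-many and noisy, which is exactly why the released dataset includes human annotations to measure projection quality.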
In Multilingual Open Relation Extraction Using Cross-lingual Projection, which will appear at the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), we use this idea of cross-lingual projection to develop an algorithm that extracts open-domain relation tuples, i.e. tuples in which an arbitrary phrase can describe the relation between the arguments, in multiple languages from Wikipedia. In this work, we also evaluated the quality of the extracted relations using human annotations in French, Hindi, and Russian.
Since no such publicly available corpus of multilingual relations exists, we are releasing a dataset of automatically extracted relations from the Wikipedia corpus in 61 languages, along with manually annotated relations in three languages (French, Hindi, and Russian). We hope this data will help researchers working on natural language processing and encourage novel applications in a wide variety of languages. More details on the corpus and the file formats can be found in this README file.
We wish to thank Bruno Cartoni, Vitaly Nikolaev, Hidetoshi Shimokawa, Kishore Papineni, John Giannandrea, and their teams for making this data release possible. This dataset is licensed by Google Inc. under the Creative Commons Attribution-ShareAlike 3.0 License.
Teaching machines to read between the lines (and a new corpus with entity salience annotations)
Monday, August 25, 2014
Posted by Dan Gillick, Research Scientist, and Dave Orr, Product Manager
Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as to encourage further research in these areas.
Now, we’re releasing a new dataset based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with the people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.
We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people -- we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.
One way to approach the problem is to look for words that appear more often than their ordinary rates. For example, if you see the word “coach” 5 times in a 581-word article, and compare that to the usual frequency of “coach” -- more like 5 in 330,000 words -- you have reason to suspect the article has something to do with coaching. The term “basketball” is even more extreme, appearing 150,000 times more often than usual. This is the idea behind the famous TF-IDF score, long used to index web pages.
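The term-ratio intuition works out to a one-line calculation. Here is a minimal sketch using the counts quoted above (this is the intuition only, not Google's indexing code or the full TF-IDF formula):

```python
# Minimal sketch of the term-ratio intuition: how much more frequent
# is a term in this document than in background text overall?
def frequency_ratio(count_in_doc, doc_len, background_count, background_len):
    doc_rate = count_in_doc / doc_len
    background_rate = background_count / background_len
    return doc_rate / background_rate

# "coach": 5 times in a 581-word article vs. roughly 5 per 330,000 words.
ratio = frequency_ratio(5, 581, 5, 330_000)
print(round(ratio))  # about 568x the usual rate: evidence the article is about coaching
```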
Congratulations to Becky Hammon, first female NBA coach! Image via Wikipedia.
Term ratios are a start, but we can do better. Search indexing these days is much more involved, using, for example, the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real world, which we already know quite a bit about.
Background information about entities ought to help us decide which of them are most salient. After all, an article’s author assumes her readers have some general understanding of the world, and probably a bit about sports too. Using background knowledge, we might be able to infer that the WNBA is a salient entity in the Becky Hammon article even though it only appears once.
To encourage research on leveraging background information, we are releasing a large dataset of annotations to accompany the New York Times Annotated Corpus, including resolved Freebase entity IDs and labels indicating which entities are salient. The salience annotations are determined by automatically aligning entities in the document with entities in the accompanying human-written abstracts. Details of the salience annotations and some baseline results are described in our recent paper: A New Entity Salience Task with Millions of Training Examples (Jesse Dunietz and Dan Gillick).
Since our entity resolver works better for named entities like WNBA than for nominals like “coach” (this is the notoriously difficult word sense disambiguation problem, which we’ve previously touched on), the annotations are limited to names.
Below is sample output for a document. The first line contains the NYT document ID and the headline; each subsequent line includes an entity index, an indicator for salience, the mention count for this entity in the document as determined by our coreference system, the text of the first mention of the entity, the byte offsets (start and end) for the first mention of the entity, and the resolved Freebase MID.
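A record in that shape could be parsed along the following lines. This is a speculative sketch based only on the field list above: the tab separator and the sample line are assumptions invented for illustration, not taken from the released data.

```python
# Speculative parser for the entity-annotation record described above:
# entity index, salience indicator, mention count, text of the first
# mention, byte offsets (start, end) of that mention, and Freebase MID.
# Field order matches the post's description; the separator is assumed.
from typing import NamedTuple

class EntityAnnotation(NamedTuple):
    index: int
    salient: bool
    mention_count: int
    first_mention: str
    start: int
    end: int
    freebase_mid: str

def parse_entity_line(line: str) -> EntityAnnotation:
    idx, sal, count, mention, start, end, mid = line.rstrip("\n").split("\t")
    return EntityAnnotation(int(idx), sal == "1", int(count),
                            mention, int(start), int(end), mid)

sample = "0\t1\t4\tWNBA\t112\t116\t/m/0c41n"  # hypothetical record
ann = parse_entity_line(sample)
print(ann.first_mention, ann.salient)  # WNBA True
```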
Features like mention count and document positioning give reasonable salience predictions. But because they only describe what’s explicitly in the document, we expect a system that uses background information to expose what’s implicit could give better results.
Download the data directly from Google Drive, or visit the project home page at our Google Code site for more information. We look forward to seeing what you come up with!
CDC Birth Vital Statistics in BigQuery
Friday, January 13, 2012
Posted by Dan Vanderkam, Software Engineer
Google’s BigQuery Service lets enterprises and developers crunch large-scale data sets quickly. But what if you don’t have a large-scale data set of your own?
To help the data-less masses, BigQuery offers several large, public data sets. One of these is the natality data set, which records information about live births in the United States. The data is derived from the Division of Vital Statistics at the Centers for Disease Control and Prevention, which has collected an electronic record of birth statistics since 1969. It is one of the longest-running electronic records in existence.
Each row in this database represents a live birth. Using simple queries, you can discover fascinating trends from the last forty years.
For example, here’s the average age of women giving birth to their first child:
The average age has increased from 21.3 years in 1969 to 25.1 years in 2008. Using more complex queries, one could analyze the factors that have contributed to this increase, e.g. whether it can be explained by the changing racial/ethnic composition of the population.
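The query behind a chart like this is a simple group-by average. The sketch below simulates it in plain Python on a few invented rows so it runs without a BigQuery account; the column names (year, mother_age, ever_born) follow the public natality sample schema as we understand it, and should be treated as assumptions.

```python
# Roughly the query behind the chart:
#   SELECT year, AVG(mother_age) FROM natality
#   WHERE ever_born = 1 GROUP BY year
# simulated here over a handful of invented rows.
from collections import defaultdict

rows = [  # (year, mother_age, ever_born) -- illustration data only
    (1969, 20, 1), (1969, 22, 1), (1969, 30, 3),
    (2008, 24, 1), (2008, 26, 1), (2008, 35, 2),
]

totals = defaultdict(lambda: [0, 0])  # year -> [sum of ages, count]
for year, mother_age, ever_born in rows:
    if ever_born == 1:  # restrict to first children
        totals[year][0] += mother_age
        totals[year][1] += 1

avg_age = {year: s / n for year, (s, n) in sorted(totals.items())}
print(avg_age)  # {1969: 21.0, 2008: 25.0}
```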
You can see more examples like this one on the BigQuery site.
More Google Cluster Data
Tuesday, November 29, 2011
Posted by John Wilkes, Principal Software Engineer
Google has a strong interest in promoting high quality systems research, and we believe that providing information about real-life workloads to the academic community can help.
In support of this, we published a small (7-hour) sample of resource-usage information from a Google production cluster in 2010 (see our research blog post on Google Cluster Data). Approximately a dozen researchers at UC Berkeley, CMU, Brown, NCSU, and elsewhere have made use of it.
Recently, we released a larger dataset. It covers a longer period of time (29 days) for a larger cell (about 11k machines) and includes significantly more information, including:
the original resource requests, to permit scheduling experiments
request constraints and machine attributes
machine availability and failure events
some of the reasons for task exits
(obfuscated) job and job-submitter names, to help identify repeated or related jobs
more types of usage information
CPI (cycles per instruction) and memory traffic for some of the machines
Note that this trace primarily provides data about resource requests and usage. It contains no information about end users, their data, or access patterns to storage systems and other services.
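As a quick aside on one of the new fields: CPI is simply cycles divided by retired instructions over a sampling interval. The counter values below are invented for illustration.

```python
# CPI (cycles per instruction), one of the per-machine measurements
# included in the larger trace. Counter values here are invented.
def cycles_per_instruction(cycles, instructions):
    """CPI = CPU cycles / retired instructions for an interval."""
    return cycles / instructions

# e.g. 3.0e9 cycles and 2.0e9 instructions in a sampling window:
cpi = cycles_per_instruction(3.0e9, 2.0e9)
print(cpi)  # 1.5 -- a higher CPI often indicates stalls, e.g. cache misses
```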
More information can be found via this link, which will (after a short questionnaire) take you to a site that provides access instructions, a description of the data schema, and information about how the data was derived and what it means.
We hope this data will facilitate a range of research in cluster management. Let us know if you find it useful, are willing to share tools that analyze it, or have suggestions for how to improve it.
Slicing and dicing data for interactive visualization
Monday, February 28, 2011
Posted by Benjamin Yolken, Google Public Data Product Manager
A year ago, we introduced the Google Public Data Explorer, a tool that allows users to interactively explore public-interest datasets from a variety of influential sources like the World Bank, IMF, Eurostat, and the US Census Bureau. Today, users can visualize over 300 metrics across 31 datasets, including everything from labor productivity (OECD) to Internet speed (Ookla) to gender balance in parliaments (UNECE) to government debt levels (IMF) to population density by municipality (Statistics Catalonia), with more data being added every week.
Last week, as part of the launch of our dataset upload interface, we released one of the key pieces of technology behind the product: the Dataset Publishing Language (DSPL). We created this format to address a key problem in the Public Data Explorer and other similar tools: existing data formats don’t provide enough information to support easy yet powerful data exploration by non-technical users.
DSPL addresses this by adding a layer of metadata on top of the raw, tabular data in a dataset. This metadata, expressed in XML, describes the concepts in the dataset, for instance “country”, “gender”, “population”, and “unemployment”, giving descriptions, URLs, formatting properties, etc. for each. These concepts are then referenced in slices, which partition them into dimensions (i.e., categories) and metrics (i.e., quantitative values) and link them to the underlying data tables (provided in CSV format). This structure, along with some additional metadata, is what allows us to provide rich, interactive dataset visualizations in the Public Data Explorer.
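To make the concept/slice structure concrete, here is a heavily abbreviated sketch of what such a definition might look like. It is based on our reading of the format description above, omits required namespaces and attributes, and is not a complete or validated DSPL file.

```xml
<!-- Abbreviated DSPL-style sketch (not a valid, complete file):
     a "country" dimension concept, a "population" metric concept,
     and a slice tying them to an underlying CSV-backed table. -->
<dspl>
  <concepts>
    <concept id="country">
      <info><name><value>Country</value></name></info>
    </concept>
    <concept id="population">
      <info><name><value>Population</value></name></info>
    </concept>
  </concepts>
  <slices>
    <slice id="population_by_country">
      <dimension concept="country"/>
      <metric concept="population"/>
      <table ref="population_table"/>  <!-- rows supplied as CSV -->
    </slice>
  </slices>
</dspl>
```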
With the release of DSPL, we hope to accelerate the process of making the world’s datasets searchable, visualizable, and understandable, without requiring a PhD in statistics. We encourage you to read more about the format and try it out yourself, both in the Public Data Explorer and in your own software. Stay tuned for more DSPL extensions and applications in the future!