Posted:
Logs generated by applications and services can provide an immense amount of information about how your deployment is running and the experiences your users are having as they interact with your products and services. But as deployments grow more complex, gleaning insights from this data becomes more challenging. Logs come from an increasing number of sources, so they can be hard to collate and query for useful information. And building, operating and maintaining your own infrastructure to analyze log data at scale requires extensive expertise in running distributed systems and storage. Today, we’re introducing a new solution paper and reference implementation that shows how you can process logs from multiple sources and extract meaningful information using Google Cloud Platform and Google Cloud Dataflow.

Log processing typically involves some combination of the following activities:

  • Configuring applications and services
  • Collecting and capturing log files
  • Storing and managing log data
  • Processing and extracting data
  • Persisting insights

Each of those activities has its own scaling and management challenges, often requiring different approaches at different stages. These challenges can slow down the generation of meaningful, actionable information from your log data.

Cloud Platform provides a number of services that can help you address these challenges. You can use Cloud Logging to collect logs from applications and services, and then store them in Google Cloud Storage buckets or stream them to Pub/Sub topics. Dataflow can read from Cloud Storage, Pub/Sub and many other sources, process the log data, extract and transform metadata, and compute aggregations. You can persist the output from Dataflow in BigQuery, where it can be analyzed or reviewed at any time. Because these mechanisms are offered as managed services, they scale when needed and you don't need to worry about provisioning resources up front.
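For instance, once the pipeline's output lands in BigQuery, a plain SQL query is all it takes to explore it. This is just a minimal sketch; the table and column names (logs_output.daily_latency with day, endpoint and avg_latency_ms) are hypothetical placeholders rather than part of the reference implementation:

SELECT day, endpoint, avg_latency_ms
FROM [logs_output.daily_latency]   -- hypothetical table written by the Dataflow job
WHERE day >= '2015-01-01'
ORDER BY avg_latency_ms DESC
LIMIT 10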

The solution paper and reference implementation describe how you can use Dataflow to process log data from multiple sources and persist findings directly in BigQuery. You’ll learn how to configure Cloud Logging to collect logs from applications running in Container Engine, how to export those logs to Cloud Storage, and how to execute the Dataflow processing job. In addition, the solution shows you how to reconfigure Cloud Logging to use Pub/Sub to stream data directly to Dataflow, so you can process logs in real-time.


Check out the Processing Logs at Scale using Cloud Dataflow solution to learn how to combine logging, storage, processing and persistence into a scalable log processing approach. Then take a look at the reference implementation tutorial on GitHub to deploy a complete end-to-end working example. Feedback is welcome and appreciated; comment here, submit a pull request, create an issue, or find me on Twitter @crcsmnky and let me know how I can help.

- Posted by Sandeep Parikh, Google Solutions Architect

Posted:
Two months ago, Kalev Leetaru of Georgetown University announced the availability of the entire quarter-billion-record GDELT Event Database in Google BigQuery. This dataset monitors broadcast, print, and web news media from across the world in over 100 languages. It's a database of what's happening throughout the globe: a continuously updated, computable catalog of human society compiled from the world's news media.

With the GDELT database publicly accessible through BigQuery, you can query and dig through a quarter-billion records in real time. To explore what BigQuery can do, the GDELT team used its ability to compute correlations. Computing correlations allows us, for example, to look at a timeline of events in Egypt before the revolution of 2011 and then search 35 years of history for other countries around the world with similar patterns.

With a single SQL query, the GDELT team has been doing exactly that: using BigQuery to run more than 2.5 million correlations in a few minutes to trace the patterns of global society as captured by GDELT's archive. Instead of examining only the small slices of data suggested by theory or domain expertise, this experiment leverages the enormous power of BigQuery over GDELT's raw data to exhaustively sift out every correlation from the entire quarter-billion-record dataset, surfacing highly unexpected patterns and findings.

In their in-depth post, the GDELT team runs a query like this:
SELECT
  STRFTIME_UTC_USEC(a.ending_at, "%Y-%m-%d") ending_at1,
  STRFTIME_UTC_USEC(b.ending_at-60*86400000000, "%Y-%m-%d") starting_at2,
  STRFTIME_UTC_USEC(b.ending_at, "%Y-%m-%d") ending_at2,
  a.country, b.country, CORR(a.c, b.c) corr, COUNT(*) c
FROM (
  SELECT country, date+i*86400000000 ending_at, c, i
  FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a 
  CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 60) b
) b
JOIN (
  SELECT country, date+i*86400000000 ending_at, c, i
  FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a 
  CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 60) b
  WHERE country='Egypt'
  AND date+i*86400000000 = PARSE_UTC_USEC('2011-01-27')
) a
ON a.i=b.i
WHERE a.ending_at != b.ending_at
GROUP EACH BY ending_at1, ending_at2, starting_at2, a.country, b.country
HAVING (c = 60 AND ABS(corr) > 0.254)
ORDER BY corr DESC
This query has two subqueries: the smaller one builds the timeline of the 60 days in Egypt leading up to 2011-01-27, while the left side collects the equivalent 60-day windows for every country across GDELT's ever-growing dataset. With a cross join between the first set and all the sets on the left side, BigQuery sifts through more than a million combinations in real time and calculates the Pearson correlation of each timeline pair. For a visual explanation, see the linked IPython notebook.

After running this query, the GDELT team obtained from BigQuery a list of all the periods worldwide from the last 35 years, as monitored by GDELT, that have been most similar to the two months in Egypt preceding the core of its revolution. Mathematically, these periods show a statistically significant correlation with that specific window, and the GDELT team proceeded to investigate why, and what it means. Read Kalev's post on the official GDELT blog.

You can run your own experiments on the GDELT database or other public datasets with the free monthly terabyte of querying that Google BigQuery gives you.
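If you want a starting point, here's a minimal query that simply counts GDELT events per year; it uses the same public [gdelt-bq:full.events] table that appears in the examples below:

SELECT Year, COUNT(*) events
FROM [gdelt-bq:full.events]
GROUP BY Year
ORDER BY Year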

-Posted by Felipe Hoffa, Developer Advocate

Posted:
Today’s guest blog comes from Kalev H. Leetaru, a fellow and adjunct faculty in the Edmund A. Walsh School of Foreign Service at Georgetown University in Washington DC. His award-winning work centers on the application of high performance computing and "big data" to grand challenge problems.

The entire quarter-billion-record GDELT Event Database is now available as a public dataset in Google BigQuery.

BigQuery is Google’s powerful cloud-based analytical database service, designed for the largest datasets on the planet. It allows users to run fast, SQL-like queries against multi-terabyte datasets in seconds. Scalable and easy to use, BigQuery gives you real-time insights about your data. With the availability of GDELT in BigQuery, you can now access real-time insights about global human society and the planet itself!

You can take it for a spin here. (If it's your first time, you'll have to sign up to create a Google project, but no credit card or commitment is needed.)

The GDELT Project pushes the boundaries of “big data,” weighing in at over a quarter-billion rows with 59 fields for each record, spanning the geography of the entire planet, and covering a time horizon of more than 35 years. The GDELT Project is the largest open-access database on human society in existence. Its archives contain nearly 400M latitude/longitude geographic coordinates spanning over 12,900 days, making it one of the largest open-access spatio-temporal datasets as well.

From the very beginning, one of the greatest challenges in working with GDELT has been how to interact with a dataset of this magnitude. Few traditional relational database servers offer real-time querying or analytics on data of this complexity, and even simple queries would normally require enormous attention to data access patterns and intricate multi-column indexing to make them possible. Traditional database servers require the creation of indexes over the most-accessed columns to speed queries, meaning one has to anticipate a priori how users are going to interact with a dataset.

One of the things we’ve learned from working with GDELT users is just how differently each of you needs to query and analyze GDELT. The sheer variety of access patterns and the number of permutations of fields that are brought together into queries makes the traditional model of creating a small set of indexes impossible. One of the most exciting aspects of having GDELT available in BigQuery is that it doesn't have the concept of creating explicit indexes over specific columns: you can bring together any ad-hoc combination of columns and query complexity and it still returns in just a few seconds. This means that no matter how you access GDELT, what columns you look across, what kinds of operators you use, or how complex your query is, you will still see results in near real time.

For us, the most groundbreaking part of having GDELT in BigQuery is that it opens the door not only to fast, complex querying and extraction of data, but also allows, for the first time, real-world analyses to be run entirely in the database. Imagine computing the most significant conflict interaction in the world by month over the past 35 years, or performing cross-tabbed correlation over different classes of relationships between a set of countries. Such queries can be run entirely inside BigQuery and return in just a handful of seconds. This enables you to try out “what if” hypotheses on global-scale trends in near real time.

On the technical side, BigQuery is completely turnkey: you just hand it your data and start querying that data – that’s all there is to it. While you could spin up a whole cluster of virtual machines somewhere in the cloud to run your own distributed clustered database service, you would end up spending a good deal of your time being a systems administrator to keep the cluster working and it wouldn’t support BigQuery’s unique capabilities. BigQuery eliminates all of this so all you have to do is focus on using your data, not spending your days running computer servers.

We automatically update the public dataset copy of GDELT in BigQuery every morning by 5AM ET, so you don’t even have to worry about updates – the BigQuery copy always has the latest global events. In a few weeks when GDELT unveils its move from daily updates to updating every 15 minutes, we’ll be taking advantage of BigQuery’s new stream updating capability to ensure the data reflects the state of the world moment-by-moment.

Check out the GDELT blog for future posts where we will showcase how to harness some of BigQuery’s power to perform some pretty incredible analyses, all of them running entirely in the database system itself. For example, we’re particularly excited about the ability to use features like BigQuery’s new Pearson correlation support to be able to search for patterns across the entire quarter-billion-record dataset in just seconds. And we can’t wait to see what you do with it.

Give it a try here. (If it's your first time, you'll have to sign up to create a Google project, but no credit card or commitment is needed.)

Then copy and paste the query below into the query box and click the bright red “Run Query” button:

SELECT Year, Actor1Name, Actor2Name, Count FROM (
SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
FROM
  (SELECT Actor1Name, Actor2Name, Year
   FROM [gdelt-bq:full.events]
   WHERE Actor1Name < Actor2Name
     AND Actor1CountryCode != '' AND Actor2CountryCode != ''
     AND Actor1CountryCode != Actor2CountryCode),
  (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
   FROM [gdelt-bq:full.events]
   WHERE Actor1Name > Actor2Name
     AND Actor1CountryCode != '' AND Actor2CountryCode != ''
     AND Actor1CountryCode != Actor2CountryCode)
WHERE Actor1Name IS NOT null
AND Actor2Name IS NOT null
GROUP EACH BY 1, 2, 3
HAVING Count > 100
)
WHERE rank=1
ORDER BY Year

You just processed more than 250 million records detailing worldwide events from the last 35 years and discovered the top defining relationship for each year – all of this in around 6 seconds and for free (BigQuery gives its users a free monthly querying quota). Even more excitingly, you did all of this entirely in BigQuery, showcasing how you can run real analyses entirely in the database itself!

Now try copying and pasting this query and click the “Run Query” button:

SELECT MonthYear MonthYear, INTEGER(norm*100000)/1000 Percent
FROM (
SELECT ActionGeo_CountryCode, EventRootCode, MonthYear, COUNT(1) AS c, RATIO_TO_REPORT(c) OVER(PARTITION BY MonthYear ORDER BY c DESC) norm FROM [gdelt-bq:full.events]
GROUP BY ActionGeo_CountryCode, EventRootCode, MonthYear
)
WHERE ActionGeo_CountryCode='UP' and EventRootCode='14'
ORDER BY ActionGeo_CountryCode, EventRootCode, MonthYear;

Congratulations, you just scanned more than a quarter-billion records to compile every protest in Ukraine that GDELT found in the world’s news media, by month, from 1979 to present. Even more powerfully, BigQuery has converted the raw count of protests per month into a normalized “intensity” measure that accounts for the fact that there is a lot more news media today in 2014 than there was in 1979 and that machine processing of news isn’t perfect.

You can click “Download as CSV” at the top right of the query results table and open the spreadsheet in your favorite spreadsheet program and graph it as a timeline. You can instantly see the “Revolutions of 1989,” the violent “Ukraine without Kuchma” protests of March 2001, the “Orange Revolution” of November 2004, and the Euromaidan protests of November 2013 to present.

Let me know if you come up with a particularly cool query you want to share with the world and we might just add it to the GDELT Blog. We can’t wait to see what you do with GDELT and BigQuery.

Posted:
Today, we are making it easier for you to run Hadoop jobs directly against your data in Google BigQuery and Google Cloud Datastore with the Preview release of Google BigQuery connector and Google Cloud Datastore connector for Hadoop. The Google BigQuery and Google Cloud Datastore connectors implement Hadoop’s InputFormat and OutputFormat interfaces for accessing data. These two connectors complement the existing Google Cloud Storage connector for Hadoop, which implements the Hadoop Distributed File System interface for accessing data in Google Cloud Storage.

The connectors can be automatically installed and configured when deploying your Hadoop cluster using bdutil simply by including the extra “env” files:
  • ./bdutil deploy bigquery_env.sh
  • ./bdutil deploy datastore_env.sh
  • ./bdutil deploy bigquery_env.sh datastore_env.sh
Diagram of Hadoop on Google Cloud Platform


These three connectors allow you to directly access data stored in Google Cloud Platform’s storage services from Hadoop and other Big Data open source software that use Hadoop's IO abstractions. As a result, your valuable data is available simultaneously to multiple Big Data clusters and other services, without duplications. This should dramatically simplify the operational model for your Big Data processing on Google Cloud Platform.

Word-count MapReduce code samples are available to get you started.

As always, we would love to hear your feedback and ideas on improving these connectors and making Hadoop run better on Google Cloud Platform.

-Posted by Pratul Dublish, Product Manager

Posted:
Aggregating numbers by geographic location is a powerful way to analyze your data, but not an easy task when you have millions of IP addresses to analyze. In this post, we'll look at how we can use Google BigQuery to quickly solve this use case using a publicly available dataset.

We take the developer community seriously, and Stack Overflow is a great way for us to see what your use cases are. That's where I found a very interesting question: 'user2881671' had created a way to transform IP addresses into geographical locations in BigQuery, and asked for help optimizing their query. We worked out an optimized solution there, and today I'm happy to present an even better one.

For example, if you want to see which cities contribute the most modifications to Wikipedia, you can run this query:
SELECT COUNT(*) c, city, countryLabel, NTH(1, latitude) lat, NTH(1, longitude) lng
FROM (
 SELECT
   INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum,
   INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB
 FROM
   [publicdata:samples.wikipedia]
 WHERE contributor_ip IS NOT NULL
   ) AS a
JOIN EACH [fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON a.classB = b.classB
WHERE a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
AND city != ''
GROUP BY city, countryLabel
ORDER BY 1 DESC
We can visualize the results on a map:


You can do the same operation with your own tables containing IPv4 addresses. Just take the previous query and replace [publicdata:samples.wikipedia] with your own table, and contributor_ip with the name of the column that contains the IPv4 addresses.
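For instance, assuming a hypothetical table [mydataset.mytable] whose IPv4 addresses live in a column named ip_address, the adapted query would look like this:

SELECT COUNT(*) c, city, countryLabel, NTH(1, latitude) lat, NTH(1, longitude) lng
FROM (
 SELECT
   INTEGER(PARSE_IP(ip_address)) AS clientIpNum,
   INTEGER(PARSE_IP(ip_address)/(256*256)) AS classB
 FROM
   [mydataset.mytable]            -- your table
 WHERE ip_address IS NOT NULL     -- your IPv4 column
   ) AS a
JOIN EACH [fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON a.classB = b.classB
WHERE a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
AND city != ''
GROUP BY city, countryLabel
ORDER BY 1 DESC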

Technical details

First, I downloaded the Creative Commons-licensed GeoLite City IPv4 database made available by MaxMind in .csv format. There is a newer database available too, but I didn't work with it, as it's only available in binary form for now. I uploaded its two tables into BigQuery: blocks and locations.

To get better performance later, some processing was needed: for each rule, I extracted its class B prefix (the 192.168.x.x part) into a new column and generated duplicate rules for segments that spanned more than one class B block. I also joined both original tables, to skip that step when processing data. In the Stack Overflow question, 'user2881671' went even further, generating additional rules for segments without a location mapping (cleverly using the LAG() window function), but I skipped that step here (so addresses without a location will be skipped rather than counted). In total, only 32,702 new rows were needed.
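As a rough sketch of that preprocessing step expressed in BigQuery itself (the table name and exact GeoLite column names here are assumptions, and the real preprocessing also joined in the locations table), you could emit one row per class B prefix that each block overlaps:

SELECT
  INTEGER(startIpNum/(256*256)) + i AS classB,
  startIpNum, endIpNum, locId
FROM [mydataset.geolite_blocks] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255]) b
WHERE INTEGER(startIpNum/(256*256)) + i <= INTEGER(endIpNum/(256*256))

That way the final lookup can JOIN on classB and only compare IP ranges within each prefix, instead of comparing every address against every block.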

The final query JOINs the class B prefix of your IP addresses against the lookup table, which avoids the performance hit of a full cross join.

You can explore the new table in the BigQuery web UI, or use the REST-based API to integrate these queries and datasets with your own software.

To get started with BigQuery, check out our site and the "What is BigQuery" introduction. You can post questions and get quick answers about BigQuery usage and development on Stack Overflow. Follow the latest BigQuery news at www.reddit.com/r/bigquery. We love your feedback and comments. Join the discussion on +Google Cloud Platform using the hashtag #BigQuery.

-Posted by Felipe Hoffa, Developer Programs Engineer

This post includes GeoLite data created by MaxMind, available from http://www.maxmind.com, distributed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

Posted:
We all know the story of David and Goliath. But did you know that King Saul prepared David for battle by fully arming him? He put a coat of armor and a bronze helmet on him, and gave David his sword. David tried walking around in them but they didn't feel right to him. In the end, he decided to carry five small stones and a sling instead. These were the tools that he used to fight off lions as a shepherd boy. We know what the outcome was. David showed us that picking the right tools and using them well is one of the keys to success.

Let's suppose you are tasked to start a Big Data project. You decide to use Google BigQuery because:
  • Its hosting model allows you to quickly run your data analysis without having to set up a costly computing infrastructure.
  • The interactive speed allows your analysts to quickly validate hypotheses about their insights.
To get started, though, your Goliath is loading multiple terabytes of data into BigQuery. The technical article, BigQuery in practice - Loading Data Sets that are Terabytes and Beyond, is intended for IT professionals and data architects who are planning to deploy large data sets to Google BigQuery. When dealing with terabytes to petabytes of data, managing concerns such as uploading, failure recovery, and cost and quota management becomes paramount.

Just as David showed us the importance of using the right tools effectively, the paper presents various options and considerations to help you decide on the optimal solution. It follows the common ingestion workflow depicted in the following diagram and discusses the tools you can use during each stage, from uploading the data to Google Cloud Storage and running your Extract, Transform and Load (ETL) pipelines, to loading the data into BigQuery.
Scenarios for data ingestion into BigQuery
When dealing with large data sets, the correct implementation can mean savings of hours or days, while an improper design may mean weeks of rework. David was so successful that King Saul gave him a high rank in the army. Similarly, we are here to help you use Google Cloud Platform successfully so your Big Data project can achieve the same level of success.

- Posted by Wally Yau, Cloud Solutions Architect

Posted:
Today’s guest blogger is Igor Lautar, senior director of technology at Outfit7 (Ekipa2 subsidiary), one of the fastest-growing media entertainment companies on the planet. Its flagship franchise Talking Tom and Friends has achieved over 1.2 billion downloads since its launch in 2010 and continues to grow with 170 million active users each month. In today’s post, Igor explains how the company has been successful building the backends of its entertainment apps on Google App Engine.
Outfit7 is one of the most downloaded mobile publishers in the world, most famous for creating Talking Tom, an app in which a cat named Tom responds to your touch and repeats what you say. Talking Tom and Friends are unique fully animated 3D animal characters who love to be petted and played with through an array of in-app user functions. Each with their own distinct personality, Talking Tom, Talking Angela, Talking Ginger, Talking Ben, Talking Gina and a host of lovable friends are fully-interactive and can engage in two-way conversations with users. Fans can even create and share videos of interactions with their favourite characters via Facebook and YouTube.

The popular characters started life in the digital world and now Talking Tom and Friends extend from mobile apps to chart-topping YouTube singles, animated web series, innovative merchandise and a soon to be released TV series. The company has published more than 20 apps to-date with users in every country in the world.

In order to run and maintain all of these apps, a robust backend is required to track the state of the app and virtual currency and push new content and promotions across multiple platforms, which is why we turned to Google Cloud Platform.

Why Google App Engine?
Outfit7 was founded by a group of entrepreneurs in 2009 whose mission is to bring fun and entertainment to all. Like most startups of its kind, much of Outfit7’s team were engineers who had first-hand experience with hardware and knew its limitations, specifically the dedicated resources needed to maintain it.

Thus, from the beginning, we knew the types of resources that would be required to grow the company. We started researching cloud solutions that would enable us to scale up as we grew and have a backend that could handle any workload. After researching and interacting with a number of companies, the team decided to move forward with Google App Engine because of its low maintenance and ease of use. To put this in perspective, we looked at virtual solutions in addition to cloud; however, virtual machines would still require a dedicated IT administrator. Our developers could work directly with App Engine without IT support, which is why we went with Google.

From App Engine to Cloud Datastore and BigQuery
With App Engine, we were able to take advantage of Google’s infrastructure. As any mobile app developer knows, a scalable backend is essential from day one. Apps can go from zero to one million downloads faster than you think - which is one of the primary reasons that we went with Google. And we still run a significant amount of original code, some of it four years old, which proves the value of App Engine and allows us to focus on new features instead of maintaining old ones.

We also implemented Google Cloud Datastore alongside App Engine. This directly supported our backend from the outset, including the migration from the Master/Slave Datastore to the High Replication Datastore (HRD). The performance and scaling capabilities proved excellent, and we have even seen a drop in access time as Cloud Datastore has improved over the past few years.

After building this foundation with App Engine and Cloud Datastore, we expanded our use of Cloud Platform to include Google Cloud Messaging, an Android-specific service that allows you to send data from your server to your users’ Android devices. While Talking Tom is our most popular app, we also have similar apps with different characters like Talking Angela and Talking Santa. We leveraged Cloud Messaging to increase user engagement by sending fun messages that attract users’ attention.

More recently, we started using Google BigQuery for data analytics. Some of the tables we have are quite large and growing fast, but the performance of the queries remains very consistent (i.e., seconds, not minutes). BigQuery’s scaling ability is just as impressive as the other platform tools, and we’re excited to expand our usage of the tool.

All of our apps communicate via App Engine. The state is stored in a number of datastores supported by memcache. Most of the processing is done directly by frontend instances, and some operations are delegated to task queues. We also have a few backend instances for long-running jobs or complex operations that require more memory.

Some of our apps use push notifications quite heavily. For Android, we send them directly through Cloud Messaging, whereas iOS push notifications are sent to our own forwarding service, which is still legacy.

We push data to BigQuery from our own servers after gathering and processing it from various sources, including logs, stats from the app stores, and download counts. We then query BigQuery directly from the data visualization tools that we use.

Our experience with Google Cloud Platform has been very positive. The benefits are obvious — consistent performance with great scalability, leaving us more time to focus on app development. We are very happy with the performance and reliability of the platform, and as our vice president of technology, Luka Renko, says, “It’s nice to have a platform that solves more problems than it creates. That’s rare!”

Google Cloud Platform has been the foundation that has enabled us to produce some of the most popular apps in the world. Throughout our relationship with Google the support team has been amazing, helping us unlock all the power of Cloud Platform. We’re excited to continue working with Google.

-Contributed by Igor Lautar, Senior Director of Technology, Outfit7 (Ekipa2 subsidiary)

Posted:
Today we hear from Daniel Hasselberg, co-founder and chief executive officer of mobile game development company, MAG Interactive, based in Stockholm, Sweden. MAG Interactive produces some of the most popular games in the world, including Ruzzle, which has more than 45 million players in 142 different countries.

When we launched our word game Ruzzle in 2012, we had no idea it would become an international sensation almost overnight. We initially promoted the game only to our family and friends, but within two weeks of our launch, Ruzzle was the No.1 game on the Swedish App Store.

I believe if we hadn’t used Google App Engine to build the backend of Ruzzle, we wouldn’t have been able to scale fast enough with our own servers, which would have killed the app in the marketplace. There were about a million downloads of Ruzzle per month in the Nordic region, Holland, Spain and Italy through 2012. As we refined the game’s social integration through channels like Facebook and Twitter, we grew rapidly in Italy and the United States. In 2013, Ruzzle became the No. 1 game download on Google Play and the App Store in Italy, Sweden, the United States and many other countries.

Things were especially crazy at the end of last year. We were seeing about 700,000 new players each day from December 2012 through January 2013. We added 20 million users in a single month! It was incredible to see App Engine scale – and just keep on working – as we grew from about 5 million players to 25 million players in just a few weeks.

Our decision to use App Engine as the platform for Ruzzle and our new game, QuizCross, was strategic. Some of us at MAG Interactive helped develop the server platform for one of the most popular music download services in the Nordic region, so we knew about the challenges of having to scale quickly. While we didn’t anticipate Ruzzle’s popularity, we did recognize even before creating the game that we could face scaling problems if we were successful. So we decided from day one to use a cloud solution.

We looked at Amazon’s platform but preferred Google’s approach to cloud solutions. Google’s scalability was an important factor in our decision, but we also appreciated the company’s transparent pricing. The more efficient we became with App Engine, the less we paid.

The Google Cloud Platform team has been great to work with, as well. They are very supportive and appreciate our feedback. The technical support experts at Google are amazing, too – very hands-on. They know the platform extremely well and can help us work through any challenge.

We’re also using Google BigQuery for business intelligence. We track millions of events in the game every day so we know what users are doing – or not doing – and how we should improve the experience. We really like that we can throw enormous amounts of data at BigQuery, and it still performs. It only takes a few seconds to get results, and there are no scaling issues. It’s also easy to use. We have just one data analyst doing all the work with BigQuery but could probably use more people. If there are a few brilliant data mining experts out there who can imagine a future in Stockholm, please give us a call!

One thing we’ve learned from our BigQuery analysis is that the more users play Ruzzle, the more they improve their skills. New players typically find about 18 words in the two-minute time frame they’re given. After they play 100 games, they can find about 50 words, on average. I think that tracking player improvement is what keeps people playing and has helped to make Ruzzle so popular.

BigQuery offers our company a lot of insight into the use of our games and how we can improve them. We’re looking forward to expanding our relationship with Google as App Engine and Cloud Platform evolve.

-Contributed by Daniel Hasselberg, CEO, MAG Interactive

Posted:
Storing and analyzing billions of records is complicated. That’s why we try to make meeting your analytics needs as simple as possible.

Watch this video to see how Tableau and Leanplum tackle Big Data using Google’s infrastructure. Both of these companies are using BigQuery on Google Cloud Platform. You’ll hear from Francois Ajenstat, Director of Product Management at Tableau, about what he thinks is the “difference between the old way and the Google Cloud Platform way.”

-Posted by Benjamin Bechtolsheim, Product Marketing Manager

Posted:
Once you get your data into Google BigQuery, you don’t have to worry about running out of machine capacity, because you use Google’s machines as if they were your own. But what if you want to transform your source data before putting it into BigQuery and you don’t have the server capacity to handle the transformation? In this case, how about using Google Compute Engine to run your Extract, Transform and Load (ETL) processing? To learn how, read our paper Getting Started With Google BigQuery. To get started, download the sample ETL tool for Google Compute Engine from GitHub.

The sample ETL tool is an application that automates the steps of getting the Google Compute Engine instance up and running, and installing the software you need to rapidly design, create and execute the ETL workflow. The application includes a sample workflow that uses KNIME to help you understand the entire process, as shown here:


If you already have an established process for performing ETL to prepare the data and load it into Google Cloud Storage, but need a reliable way to load the data from there into BigQuery, we have another sample application to help you. The Automated File Loader for BigQuery sample app demonstrates how to automate data loading from Google Cloud Storage to BigQuery.

This application uses the Cloud Storage Object Change Notification API to receive notifications that files have been uploaded to a bucket in Google Cloud Storage, then uses the BigQuery API to load the data from the bucket into BigQuery. Download it now from GitHub.



Both of these sample applications accompany the article Getting Started With Google BigQuery, which provides an overview of the end-to-end process, from loading data into BigQuery to visualization, along with design practices to consider when using BigQuery.

-Posted by Wally Yau, Solutions Architect

Posted:
Google BigQuery is designed to make it easy to analyze large amounts of data quickly. This year we've seen great updates: big-scale JOINs and GROUP BYs, unlimited result sizes, smarter functions, bigger quotas, as well as multiple improvements to the web UI. Today we've gone even further, announcing several updates that give BigQuery the ability to work in real time and query subsets of the latest data, along with more functions and browser tool improvements.

Real-time analytics
BigQuery is now able to load and analyze data in real time through a simple API call, the new tabledata().insertAll() endpoint. This enables you to store data as it comes in, rather than building and maintaining systems just to cache and upload it in batches. The best part? The new data is available for querying instantaneously. This feature is great for time-sensitive use cases like log analysis and alert generation.

Using it is as easy as calling the new endpoint with your data in a JSON object (with a single row or multiple rows of data).

Here's a Python example:
body = {"rows":[
    {"json": {"column_name":7.7,}}
    ]}
response = bigquery.tabledata().insertAll(
    projectId=PROJECT_ID,
    datasetId=DATASET_ID,
    tableId=TABLE_ID,
    body=body).execute()

Streaming data into BigQuery is free for an introductory period until January 1st, 2014. After that it will be billed at a flat rate of 1 cent per 10,000 rows inserted. The traditional jobs().insert() method will continue to be free. When choosing which import method to use, check for the one that best matches your use case. Keep using the jobs().insert() endpoint for bulk and big data loading. Switch to the new tabledata().insertAll() endpoint if your use case calls for a constantly and instantly updating stream of data.

Table decorators for recently imported data
This release introduces a new syntax for "table decorators". Now you can define queries that scan only a time range or snapshot within the previous 24 hours. Traditionally, BigQuery has always done a "full column scan" when querying data; the new syntax allows you to focus only on a specific subset of the latest data. Querying a partial subset of data with these decorators results in lower querying costs, proportional to the size of the subset of data queried.

With the new capabilities you can query only the last hour of inserted data, or query only what was inserted before that hour, or get a snapshot of the table at a specific time. The table decorators also work in all table related operations (list, export, copy, and so on), giving you the ability to do operations like copying a table as it was before a bad import job.
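Here's roughly what the syntax looks like; the table name is a placeholder, and the times are expressed in milliseconds (negative values are relative to now):

-- Scan only the rows added during the last hour (range decorator).
SELECT COUNT(*) FROM [mydataset.mytable@-3600000-]

-- Snapshot of the table as it was one hour ago (snapshot decorator).
SELECT COUNT(*) FROM [mydataset.mytable@-3600000]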

New window and statistical functions
To the previously announced window functions, we've added SUM(), COUNT(), AVG(), MIN(), MAX(), FIRST_VALUE(), and LAST_VALUE(). Refer to the previous release announcement for a run-through of how window functions work.
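For example, a running total with the newly added SUM() window function against the public Shakespeare sample table looks like this:

SELECT corpus, word, word_count,
  SUM(word_count) OVER (PARTITION BY corpus ORDER BY word_count DESC) running_total
FROM [publicdata:samples.shakespeare]
LIMIT 10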

We also recently introduced the new Pearson correlation function, and showed how to find interesting and anomalous patterns in ambient sensors, or how to predict tomorrow's flight delays using a dataset of 70 million flights. Today's release adds new statistical functions for even richer capabilities: COVAR_POP(), COVAR_SAMP(), STDDEV_POP(), STDDEV_SAMP(), VAR_POP() and VAR_SAMP().
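For instance, you could compute the spread of word frequencies per corpus in the same public Shakespeare sample:

SELECT corpus,
  STDDEV_POP(word_count) stddev_wc,
  VAR_POP(word_count) var_wc
FROM [publicdata:samples.shakespeare]
GROUP BY corpus
ORDER BY stddev_wc DESC
LIMIT 5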

BigQuery browser tool: Improved history retrieval
You can now retrieve your query history faster in the browser tool using the Query History panel: more information about each query is surfaced, and convenient buttons have been added for common tasks.

You can try out the new UI features, table decorators, and new functions at https://bigquery.cloud.google.com/. You can post questions and get quick answers about BigQuery usage and development on Stack Overflow. We love your feedback and comments. Join the discussion on +Google Cloud Platform using the hashtag #BigQuery.

- Posted by Felipe Hoffa, Developer Programs Engineer

Posted:
Editor's note: Join Felipe Hoffa and Michael Manoochehri tomorrow, September 6th at 11:00AM PST on Google Developers Live where they'll discuss correlation with BigQuery.

Google BigQuery is designed to make it easy to analyze large amounts of data quickly. We are always looking into how to make BigQuery even more powerful, so today we'll introduce a feature that we couldn't wait to share with you: Pearson correlation.

BigQuery is transforming the ways in which we work with massive amounts of data. Our partners have created amazing tools to make that process even more streamlined: visualizing, slicing, and dicing. Working with your intuition and these tools, you can discover surprising new insights, analyzing terabytes of data in mere seconds. What's still a challenge is feeding this intuition: discovering where to look for insights.

The new CORR() function in BigQuery is a powerful tool for your intuition process: Which variables are similar, or have surprising behaviors? Can you rank these surprising behaviors? What are the best variables to predict the future?

Let's look at the data we collected at the Data Sensing Lab at the Moscone Center for Google I/O. We had multiple sensors in multiple rooms, collecting temperature, humidity, noise and other data during these 3 days.

To start, we can look at which rooms' temperatures behaved in a similar way on the second and third days:
SELECT CORR(a.data, b.data) corr, a.room room, count(*) c
FROM (
  SELECT
    TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(DATA) data, room
  FROM [io_sensor_data.moscone_io13]
  WHERE
    DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-16'
    AND sensortype='temperature'
    GROUP EACH BY time, room) a
JOIN EACH (
  SELECT
    TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(data) data, room
  FROM [io_sensor_data.moscone_io13]
  WHERE
    DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-17'
    AND sensortype='temperature'
    GROUP EACH BY time, room) b
  ON a.time=b.time AND a.room = b.room
  GROUP EACH BY room
HAVING
  corr IS NOT NULL
  AND c > 800
  ORDER EACH BY corr DESC

corr            room                 c
 0.9387693711   sf desk 1st floor    1331
 0.8488553811   chrome east          1418
 0.8423597116   chrome hobbit        1372
 0.8162574011   chrome west          1401
 0.7696065852   chrome north         1374
 ...            ...                  ...
-0.1048712561   Room 1               1390
-0.1508345595   keynote crowd        1358
-0.5467798237   android east         1402

This table says that many rooms behaved in a similar way during both days. For example, a room inside the main exhibition area:
We see that the temperature rose during the day, in a very similar way. A/C helps a lot with this. Meanwhile in the main reception area the temperature dropped considerably when the doors opened:
Even though the base temperature was considerably higher on Day 2, CORR() finds out that the behavior was very similar.

The room 'android east' shows a highly negative correlation. We can visualize why:
It turns out something was wrong with the sensors there for a few hours during Day 2, and the readings went to 0 during Day 3. It's too late now to go and fix them, but we can keep this as a good example of how CORR() can help us set alarms and get notified of unusual events.

We can repeat this experiment with other dimensions, like humidity or noise. But we can go further, and find out what dimensions correlate with each other.
SELECT CORR(a.data, b.data) corr, a.sensortype a_sensortype, b.sensortype b_sensortype, a.room room, count(*) c
FROM (
  SELECT
    TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time,
    AVG(DATA) data, room, sensortype
  FROM [io_sensor_data.moscone_io13]
  WHERE
    DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-16'
    AND sensortype != 'pressure' AND sensortype != 'altitude' AND room != 'None'
    GROUP EACH BY time, room, sensortype) a
JOIN EACH (
  SELECT
    TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time,
    AVG(data) data, room, sensortype
  FROM [io_sensor_data.moscone_io13]
  WHERE
    DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-16'
    AND sensortype != 'pressure' AND sensortype != 'altitude' AND room != 'None'
    GROUP EACH BY time, room, sensortype) b
  ON a.time=b.time AND a.room = b.room
  WHERE a.sensortype > b.sensortype
  GROUP EACH BY room, a_sensortype, b_sensortype
HAVING
  corr IS NOT NULL
  AND c > 800
  ORDER EACH BY corr DESC

corr            a_sensortype   b_sensortype   room                 c
 0.9452553258   temperature    humidity       android east         1436
 0.9230651809   temperature    gas            sf desk 1st floor    1425
 ...            ...            ...            ...                  ...
 0.7949466754   temperature    mic            maps                 1434
 ...            ...            ...            ...                  ...
 0.0013869924   temperature    light          Room 12              1436
 ...            ...            ...            ...                  ...
-0.9396195218   temperature    humidity       accesability         1341
-0.9564087934   temperature    humidity       chrome lounge        1135
-0.9925864599   temperature    humidity       sf desk 1st floor    1437

We can visualize how the air quality in the reception improved, once the doors were opened - as the drop in temperature shows:
Or how the temperature rose at the same time as the volume levels, in the Maps area:
The near-zero correlation between temperature and light in Room 12 calls our attention, as we got some very bad measurements in that room:
Temperatures of over 2000? Something went wrong.

At the end of the table, it's a surprise that while humidity had a highly negative correlation with temperature in many sectors, it had a highly positive correlation in 'android east'. First we can look at the normal behavior in the 'chrome lounge' sector:
And the unusual behavior at 'android east':
Turns out there is a highly unusual positive correlation because both sensors went to 0 at the same time.

That's the nature of real data. It's a hard job to keep it accurate and clean. CORR() is a tool that can also call our attention to what is going wrong.

We can go much further with this: What's the relationship between rooms? What sensors in which rooms can help us predict what will happen an hour later in a different room? For this and much more, tune in to an upcoming Google Developers Live session where we'll discuss how we built these queries, and how to go further. Stay tuned!

-Posted by Felipe Hoffa, Developer Programs Engineer