By Felipe Hoffa, Cloud Platform team

Cross-posted from the Google Cloud Platform Blog
Editor's note: This post is a follow-up to the announcements we made on March 25th at Google Cloud Platform Live.

Last Tuesday we announced an exciting set of changes to Google BigQuery that make it easier, faster and more powerful. In addition to new features and improvements like table wildcard functions, views, and parallel exports, BigQuery now offers increased streaming capacity, lower pricing, and more.


1000x increase in streaming capacity

Last September we announced the ability to stream data into BigQuery for instant analysis, with an ingestion limit of 100 rows per second. Developers have enjoyed and made good use of this capability, but they've asked for more. You can now stream up to 100,000 rows per second, per table, into BigQuery - 1,000 times more than before.

For a great demonstration of the power of streaming data into BigQuery, check out the live demo from the keynote at Cloud Platform Live.

Table wildcard functions

Users often partition their big tables into smaller units for data lifecycle and optimization purposes. For example, instead of a single yearly table, data can be split into monthly or even daily tables. BigQuery now offers table wildcard functions to make it easy to query all the tables whose names match a common pattern.

The downside of partitioning tables is that queries often need to access many of them. This would be easier if there were a way to tell BigQuery "process all the tables between March 3rd and March 25th" or "read every table whose name starts with an 'a'". With this release, you can.

TABLE_DATE_RANGE() queries all tables that overlap with a time range (based on the table names), while TABLE_QUERY() accepts regular expressions to select the tables to analyze.
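
As a sketch of the syntax, assuming hypothetical daily tables named people20140303, people20140304, and so on, in a dataset called mydata:

/* TABLE_DATE_RANGE: all daily tables between March 3rd and March 25th */
SELECT COUNT(*) AS rows_in_range
FROM (TABLE_DATE_RANGE([mydata.people],
                       TIMESTAMP('2014-03-03'),
                       TIMESTAMP('2014-03-25')));

/* TABLE_QUERY: every table in the dataset whose name starts with an 'a' */
SELECT COUNT(*) AS total_rows
FROM (TABLE_QUERY(mydata, 'REGEXP_MATCH(table_id, r"^a")'));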

For more information, see the documentation and syntax for table wildcard functions.

Improved SQL support and table views

BigQuery adopted SQL as its query language because it's one of the best known, simplest and most powerful ways to analyze data. Nevertheless, BigQuery used to impose some restrictions relative to traditional SQL-92, like having to write multiple sub-queries instead of a simpler multi-join. Not anymore: BigQuery now supports multi-join and CROSS JOIN, and improves its SQL capabilities with more flexible alias support, fewer ORDER BY restrictions, more window functions, smarter PARTITION BY, and more.
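
For instance, a single query can now chain several joins directly. This is only an illustrative sketch; the table and column names are hypothetical:

/* Multi-join sketch: three tables joined in one query.
   A CROSS JOIN follows the same pattern, just without an ON clause. */
SELECT o.order_id, c.name, p.title
FROM [shop.orders] AS o
JOIN [shop.customers] AS c ON o.customer_id = c.customer_id
JOIN [shop.products] AS p ON o.product_id = p.product_id
LIMIT 10;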

A notable new feature is the ability to save queries as views, and use them as building blocks for more complex queries. To define a view, you can use the browser tool to save a query, the API, or the newest version of the BigQuery command-line tool (by downloading the Google Cloud SDK).
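
Once saved, a view can be referenced in other queries just like a table. A minimal sketch, assuming a hypothetical view mydataset.top_cities with city and edits columns:

/* Querying a saved view as a building block for a larger query */
SELECT city, edits
FROM [mydataset.top_cities]
WHERE edits > 1000
ORDER BY edits DESC;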

User-defined metadata

Now you can annotate each dataset, table, and field with descriptions that are displayed within BigQuery. This way people you share your datasets with will have an easier time identifying them.

JSON parsing functions

BigQuery is optimized for structured data: before loading data into BigQuery, you should first define a table with the right columns. This is not always easy, as JSON schemas might be flexible and in constant flux. BigQuery now lets you store JSON encoded objects into string fields, and you can use the JSON_EXTRACT and JSON_EXTRACT_SCALAR functions to easily parse them later using JSONPath-like expressions.

For example:
SELECT json_extract_scalar(
   "{'book': { 
       'category':'fiction', 
       'title':'Harry Potter'}}", 
   "$.book.category");


Fast parallel exports

BigQuery is a great place to store all your data and have it ready for instant analysis using SQL queries. But sometimes SQL is not enough, and you might want to analyze your data with external tools. That's why we developed the new fast parallel exports: With this feature, you can define how many workers will be consuming the data, and BigQuery exports the data to multiple files optimized for the available number of workers.

Check the exporting data documentation, or stay tuned for the upcoming Hadoop connector to BigQuery documentation.

Massive price reductions

At Cloud Platform Live, we announced a massive price reduction: storage costs are going down 68%, from 8 cents per gigabyte per month to only 2.6 cents, while querying costs are going down 85%, from 3.5 cents per gigabyte to only 0.5 cents. Previously announced streaming costs are now reduced by 90%. And finally, we announced the ability to purchase reserved processing capacity, for even lower prices and the ability to precisely predict costs. You always keep the option to burst using on-demand capacity.

I also want to take this space to celebrate the latest open source community contributions to the BigQuery ecosystem. R has its own connector to BigQuery (and a tutorial), as does Python pandas (check out the video we made with Pearson). Ruby developers can now use BigQuery with an ActiveRecord connector, and send all their logs with fluentd. Thanks to all, and keep surprising us!

Felipe Hoffa is part of the Cloud Platform Team. He'd love to see the world's data accessible for everyone in BigQuery.

Posted by Louis Gray, Googler

By Felipe Hoffa, Cloud Platform team

Cross-posted from the Google Cloud Platform Blog

Aggregating numbers by geolocation is a powerful way to analyze your data, but it's not an easy task when you have millions of IP addresses to analyze. In this post, we'll look at how we can use Google BigQuery to quickly solve this use case with a publicly available dataset.

We take the developer community seriously; it's a great way for us to see what your use cases are. That's where I found a very interesting question: "user2881671" on Stack Overflow had created a way to transform IP addresses into geographical locations in BigQuery, and asked for help optimizing their query. We worked out an optimized solution there, and today I'm happy to present an even better one.

For example, if you want to see which cities contribute the most modifications to Wikipedia, you can run this query:
SELECT COUNT(*) c, city, countryLabel, NTH(1, latitude) lat, NTH(1, longitude) lng
FROM (
 SELECT
   INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum,
   INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB
 FROM
   [publicdata:samples.wikipedia]
 WHERE contributor_ip IS NOT NULL
   ) AS a
JOIN EACH [fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON a.classB = b.classB
WHERE a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
AND city != ''
GROUP BY city, countryLabel
ORDER BY 1 DESC
We can visualize the results on a map:


You can do the same operation with your own tables containing IPv4 addresses. Just take the previous query and replace [publicdata:samples.wikipedia] with your own table, and contributor_ip with the name of the column that holds the IPv4 addresses.

Technical details

First, I downloaded the Creative Commons licensed GeoLite City IPv4 database made available by MaxMind in .csv format. There is a newer database available too, but I didn't work with it as it's only available in binary form for now. I uploaded its two tables into BigQuery: blocks and locations.

To get better performance later, some processing was needed: for each rule I extracted its class B prefix (192.168.x.x) into a new column and generated duplicate rules for segments that spanned more than one class B block. I also joined both original tables, to skip that step when processing data. In the Stack Overflow question, "user2881671" went even further, generating additional rules for segments without a location mapping (cleverly using the LAG() window function), but I skipped that step here (so addresses without a location will be skipped rather than counted). In total, only 32,702 new rows were needed.

The final query JOINs the class B prefix from your IP addresses with the lookup table, to prevent the performance hit of doing a full cross join.
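
To make the class B idea concrete, here is a small sketch of how a dotted-quad address (the address itself is just an example) maps to its prefix with PARSE_IP:

/* PARSE_IP converts a dotted-quad string to an integer;
   dividing by 256*256 keeps the top 16 bits (the class B prefix) */
SELECT
  INTEGER(PARSE_IP('8.8.8.8')) AS clientIpNum,        /* 134744072 */
  INTEGER(PARSE_IP('8.8.8.8')/(256*256)) AS classB;   /* 2056 = 8*256 + 8 */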

You can explore the new lookup table with the BigQuery web UI, or use the REST-based API to integrate these queries and this dataset with your own software.

To get started with BigQuery, you can check out our site and the "What is BigQuery" introduction. You can post questions and get quick answers about BigQuery usage and development on Stack Overflow. Follow the latest BigQuery news at www.reddit.com/r/bigquery. We love your feedback and comments. Join the discussion on +Google Cloud Platform using the hashtag #BigQuery.


This post includes GeoLite data created by MaxMind, available from http://www.maxmind.com, distributed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.


Felipe Hoffa is part of the Cloud Platform Team. He'd love to see the world's data accessible for everyone in BigQuery.

Posted by Scott Knaster, Editor

By Josh Livni, Maps Developer Relations Team

Cross-posted with the Google Geo Developers Blog

Last week Felipe and Michael from the Google Cloud Platform Developer Relations team released a video demonstrating the new Pearson correlation analysis available in Google BigQuery. Their example used more than 70 million flight records to find correlations on departure times, which could be used to predict whether your plane would be late. Specifically, you can find out which set of airports best predict your airport’s possible flight delays the following day.

As you can see from their code (watch the video for context), they’re using iPython Notebook to work with the BigQuery results, and matplotlib to get a quick visual of results. But one aspect that’s hard to see in these results is the spatial relationship of the airports. Maps to the rescue!

My goal was to create a simple visualization, where selecting any airport would highlight which airports you should look at to predict delays the next day. Using the Google Maps JavaScript API I went ahead and did just that:



To create this map, using the same code noted above, we first created a JSON dump of the correlations on departure time for each airport to each other airport. Then, using an article on visualizing earthquakes as a template, I modified the code so that we have a simple dictionary of all airports, keyed by their airport id. Last but not least, I added a function that gets called when an airport marker is clicked in order to restyle all the other airport markers according to their correlation. This method checks the selected season (winter, for example, has different correlations than spring), and highlights correlated markers by setting the color and scale of the circle symbol.


Thanks to BigQuery, running complex correlations over billions or trillions of attributes is surprisingly easy. But don't forget: if your data has a spatial component, a quick map visualization can add some great context to your results.


Josh Livni works with maps at Google, where he helps developers tell compelling stories using the Google Maps APIs. As you read this, he's probably writing code, thinking about snowboarding, or both.

Posted by Scott Knaster, Editor

By Felipe Hoffa, Developer Programs Engineer

Cross-posted from the Google Cloud Platform Blog

Google BigQuery is designed to make it easy to analyze large amounts of data quickly. This year we've seen great updates: big scale JOINs and GROUP BYs, unlimited result sizes, smarter functions, bigger quotas, as well as multiple improvements to the web UI. Today we've gone even further, announcing several updates that give BigQuery the ability to work in real time and query subsets of the latest data, along with more functions and browser tool improvements.

Real-time analytics
BigQuery is now able to load and analyze data in real time through a simple API call, the new tabledata().insertAll() endpoint. This enables you to store data as it comes in, rather than building and maintaining systems just to cache and upload it in batches. The best part? The new data is available for querying instantaneously. This capability is great for time-sensitive use cases like log analysis and alert generation.

Using it is as easy as calling the new endpoint with your data in a JSON object (with a single row or multiple rows of data).

Here's a Python example:
body = {"rows":[
    {"json": {"column_name":7.7,}}
    ]}
response = bigquery.tabledata().insertAll(
    projectId=PROJECT_ID,
    datasetId=DATASET_ID,
    tableId=TABLE_ID,
    body=body).execute()

Streaming data into BigQuery is free for an introductory period until January 1st, 2014. After that it will be billed at a flat rate of 1 cent per 10,000 rows inserted. The traditional jobs().insert() method will continue to be free. When choosing which import method to use, check for the one that best matches your use case. Keep using the jobs().insert() endpoint for bulk and big data loading. Switch to the new tabledata().insertAll() endpoint if your use case calls for a constantly and instantly updating stream of data.

Table decorators for recently imported data
This release introduces a new syntax for "table decorators". Now you can write queries that scan only a time range or a snapshot of the table within the previous 24 hours. Traditionally, BigQuery has always done a "full column scan" when querying data; the new syntax lets you focus on a specific subset of the latest data. Querying a partial subset of data with these decorators results in lower querying costs, proportional to the size of the subset of data queried.

With the new capabilities you can query only the last hour of inserted data, or query only what was inserted before that hour, or get a snapshot of the table at a specific time. The table decorators also work in all table related operations (list, export, copy, and so on), giving you the ability to do operations like copying a table as it was before a bad import job.
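
As a sketch of the syntax (the table name is hypothetical; times are expressed in milliseconds, and negative values are relative to the current time):

/* Snapshot decorator: the table as it was one hour ago */
SELECT COUNT(*) FROM [mydataset.mytable@-3600000];

/* Range decorator: only the data added during the last hour */
SELECT COUNT(*) FROM [mydataset.mytable@-3600000-];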

New window and statistical functions
To the previously announced window functions, we've added SUM(), COUNT(), AVG(), MIN(), MAX(), FIRST_VALUE(), and LAST_VALUE(). Refer to the previous release announcement for a run-through of how window functions work.

We also recently introduced the new Pearson correlation function, and showed how to find interesting and anomalous patterns in ambient sensor data, and how to predict tomorrow's flight delays using a dataset of 70 million flights. Today's release adds new statistical functions for even richer capabilities: COVAR_POP(), COVAR_SAMP(), STDDEV_POP(), STDDEV_SAMP(), VAR_POP() and VAR_SAMP().
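
As a quick illustrative sketch on the public natality sample (assuming its mother_age and weight_pounds columns), the new functions follow the usual aggregate form:

/* Statistical functions over the natality public sample */
SELECT
  CORR(mother_age, weight_pounds) AS correlation,
  COVAR_POP(mother_age, weight_pounds) AS covariance,
  STDDEV_POP(weight_pounds) AS weight_stddev,
  VAR_POP(weight_pounds) AS weight_variance
FROM [publicdata:samples.natality];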

BigQuery browser tool: Improved history retrieval
You can now retrieve your query history faster in the browser tool using the Query History panel: more information about each query is surfaced, and convenient buttons have been added for common tasks.

You can try out the new UI features, table decorators, and new functions at https://bigquery.cloud.google.com/. You can post questions and get quick answers about BigQuery usage and development on Stack Overflow. We love your feedback and comments. Join the discussion on +Google Cloud Platform using the hashtag #BigQuery.


Felipe Hoffa is part of the Cloud Platform team. He'd love to see the world's data accessible for everyone in BigQuery.

Posted by Scott Knaster, Editor

This guest post was written by Daniel Peter, Senior Programmer Analyst at Safari Books Online.

Cross-posted from the Google Cloud Platform Blog

Safari Books Online is a subscription service for individuals and organizations to access a growing library of over 30,000 technology and business books and videos. Our customers browse and search the library from web browsers and mobile devices, generating powerful usage data which we can use to improve our service and increase profitability. We wanted to quickly and easily build dashboards, improve the effectiveness of our sales teams and enable ad-hoc queries to answer specific business questions. With billions of records, we found it challenging to get the answers to our questions fast enough with our existing MySQL databases.

Looking for alternative solutions to build our dashboards and enable interactive ad-hoc querying, we played with several technologies, including Hadoop. In the end, we decided to use Google BigQuery.

Here’s how we pipe data into BigQuery:



Our data starts in our CDN and server logs, gets packaged up into compressed files, and runs through our ETL server before finishing in BigQuery.

Here’s one of the dashboards we built using the data:



You can see that with the help of BigQuery, we can easily categorize our books. This dashboard shows popular books by desktop and mobile, and with BigQuery, we are able to run quick queries to dive into other usage patterns as well.

BigQuery has been very valuable for our company, and we’re just scratching the surface of what is possible.

Check out the article for more details on how we manage our import jobs, transform our data, build our dashboards, detect abuse and improve our sales team's effectiveness.


Posted by Scott Knaster, Editor

By Felipe Hoffa, Cloud Platform team

Google BigQuery is designed to make it easy to analyze large amounts of data quickly. Today we announced several updates that give BigQuery the ability to handle arbitrarily large result sets, use window functions for advanced analytics, and cache query results. You are also getting new UI features, larger interactive quotas, and a new convenient tiered pricing scheme. In this post we'll dig further into the technical details of these new features.

Large results

BigQuery is able to process terabytes of data, but until today BigQuery could only output up to 128 MB of compressed data per query. Many of you asked for more, and from now on BigQuery can output results as large as the largest tables our customers have ever had.

To get this benefit, enable the new "--allow_large_results" flag when issuing a query job, and specify a destination table. All results will be saved to the specified table (or appended, if the table already exists). In the updated web UI these options can be found under the new "Enable Options" menu.

With this feature, you can run big transformations on your tables, plus get big subsets of data to further analyze from the new table.

Analytic functions

BigQuery's power is in the ability to interactively run aggregate queries over terabytes of data, but sometimes counts and averages are not enough. That's why BigQuery also lets you calculate quantiles, variance and standard deviation, as well as other advanced functions.
To make BigQuery even more powerful, today we are adding support for window functions (also known as "analytical functions") for ranking, percentiles, and relative row navigation. These new functions give you different ways to rank results, explore distributions and percentiles, and traverse results without the need for a self join.
To introduce these functions with an advanced example, let's use the dataset we collected from the Data Sensing Lab at Google I/O. With the percentile_cont() function, it's easy to get the median temperature for each room:
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
In this example, each original data row is annotated with the median temperature of its room. To visualize the results better, it's a good idea to group them by room with an outer query:
SELECT MAX(median) AS median, room FROM (
  SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
  FROM [io_sensor_data.moscone_io13]
  WHERE sensortype='temperature'
)
GROUP BY room
We can add an additional outer query, to rank the rooms according to which one had the coldest median temperature. We'll use one of the new ranking window functions, dense_rank():
SELECT DENSE_RANK() OVER (ORDER BY median) rank, median, room FROM (
  SELECT MAX(median) AS median, room FROM (
    SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
    FROM [io_sensor_data.moscone_io13]
    WHERE sensortype='temperature'
  )
  GROUP BY room
)
We've updated the documentation with descriptions and examples for each of the new window functions. Note that they require the OVER() clause, with an optional PARTITION BY and sometimes required ORDER BY arguments. ORDER BY tells the window function what criteria to use to rank items, while PARTITION BY allows you to define multiple groups to be analyzed independently of each other.

The window functions don't work with the big GROUP EACH BY and JOIN EACH operators, but they do work with the traditional GROUP BY and JOIN. As a reminder, we announced GROUP EACH BY and JOIN EACH last March, to allow large join and group operations.

Query caching

BigQuery now remembers values that you've previously computed, saving you time and the cost of recalculating the query. To maintain privacy, queries are cached on a per-user basis. Cached results are only returned for tables that haven't changed since the last query, and for queries that do not depend on non-deterministic parameters (such as the current time). Reading cached results is free, but each query still counts against the maximum number of queries per day quota. Query results are kept cached for 24 hours, on a best effort basis. You can control query caching with the new flag --use_cache in bq, or "useQueryCache" in the API. This feature is also accessible with the new query options in the BigQuery Web UI.

BigQuery Web UI: Query validator, cost estimator, and abandonment

The BigQuery UI gets even better: you'll get instant feedback while writing a query on whether its syntax is valid. If the syntax is not valid, you'll know where the error is. If the syntax is valid, the UI will tell you how much the query would cost to run. This feature is also available with the bq tool and the API, using the --dry_run flag.

An additional improvement: previously, when running queries in the UI, you had to wait for a query to complete before starting another one. Now you have the option to abandon a running query and start working on the next iteration without waiting for the abandoned one.

Pricing updates

Starting in July, BigQuery pricing becomes more affordable for everyone: Data storage costs are going from $0.12/GB/month to $0.08/GB/month. And if you are a high-volume user, you'll soon be able to opt-in for tiered query pricing, for even better value.

Bigger quota

To support larger workloads we're doubling interactive query quotas for all users, from 200GB + 1 concurrent query, to 400 GB of concurrent queries + 2 additional queries of unlimited size.
These updates make BigQuery a faster, smarter, and even more affordable solution for ad hoc analysis of extremely large datasets. We expect they'll help to scale your projects, and we hope you'll share your use cases with us on Google+.

The BigQuery UI features a collection of public datasets for you to use when trying out these new features. To get started, visit our sign-up page and Quick Start guide. You should take a look at our API docs, and ask questions about BigQuery development on Stack Overflow. Finally, don't forget to give us feedback and join the discussion on our Cloud Platform Developers Google+ page.


Felipe Hoffa has recently joined the Cloud Platform team. He'd love to see the world's data accessible for everyone in BigQuery.

Posted by Ashleigh Rentz, Editor Emerita

By Michael Manoochehri, Developer Programs Engineer, Cloud Platform

Google BigQuery is designed to make it easy to analyze large amounts of data quickly. Overwhelmingly, developers have asked us for features to help simplify their work even further. Today we are launching a collection of updates that gives BigQuery a greater range of query and data types, more flexibility with table structure, and better tools for collaborative analysis.

Big JOIN and Big Group Aggregations

Extracting insights from multiple datasets can be challenging and time-consuming. This is especially true when datasets become too large to query with a traditional database system. With traditional databases, SQL functions like joining and grouping are typically used to bring together data for analysis. What happens when your data is too large to fit into a conventional database? Working with multi-terabyte datasets often requires developing complicated MapReduce workflows, investing in expensive infrastructure, and a great deal of time. Very often, it's a combination of all three.

In response to developer feedback, we're launching new features that enable analysts and developers to run fast SQL-like join and aggregate queries on datasets without the need for batch-based processing. Our new Big JOIN feature gives users the ability to produce a result set by merging data from two large tables by a common key. Big JOIN simplifies data analysis that would otherwise require a data transformation step, by allowing users to specify JOIN operations using SQL.

Popular web applications produce user activity logs that can grow by billions of rows each week. Dividing users into smaller groups is a key step for analysis; however, each group of users can number in the millions. To handle grouping at such large volumes, we've enabled Big Group Aggregations, which significantly increases the number of distinct values that can be grouped in a result set.

To use these new features, simply add the EACH modifier to JOIN or GROUP BY clauses.


/* JOIN EACH example
* Selects the 10 most edited Wikipedia pages whose titles
* are words that appear in the works of Shakespeare.
*/

SELECT
 TOP(wiki.title, 10), COUNT(*)
FROM
 [publicdata:samples.wikipedia] AS wiki
JOIN EACH
 [publicdata:samples.shakespeare] AS shakespeare
ON
 shakespeare.word = wiki.title;
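
GROUP EACH BY works the same way when the grouping key has a very large number of distinct values. Here's a sketch on the public Wikipedia sample, where the title column has millions of distinct values:

/* GROUP EACH BY example
* Counts revisions per Wikipedia article title; the large number
* of distinct titles is exactly the case the EACH modifier addresses.
*/

SELECT
 title, COUNT(*) AS revisions
FROM
 [publicdata:samples.wikipedia]
GROUP EACH BY
 title
ORDER BY
 revisions DESC
LIMIT 10;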


For more information, including best practices for using JOIN EACH and GROUP EACH BY, visit the BigQuery query reference.

Native support for TIMESTAMP data type

We are also adding a new TIMESTAMP data type, in response to one of our most frequent requests from developers. This new data type lets you import date and time values in formats familiar to users of databases such as MySQL, while preserving timezone offset information.

Along with the new data type come new functions for converting TIMESTAMP fields into other formats, calculating intervals, and extracting components such as the hour, day of week, and quarter.


/* TIMESTAMP example
* Which hours in the day are the most popular for GitHub actions?
* This query converts github_timeline "created_at" date time   
* strings to BigQuery TIMESTAMP, and extracts the hour from each.
*/

SELECT
 HOUR(TIMESTAMP(created_at)) AS event_create_hour,
 COUNT(*) AS event_count
FROM
 [publicdata:samples.github_timeline]
GROUP BY
 event_create_hour
ORDER BY
 event_count DESC;


Read more about the available TIMESTAMP functions in our query reference guide.

Add columns to existing BigQuery tables

When working with large amounts of fast moving data, it's not uncommon to find out that you need to add additional fields to your tables. In response to developer feedback, we have added the ability to add new columns to existing BigQuery tables.

To take advantage of this feature, simply provide a new schema with additional columns using either the "Tables: update" or "Tables: patch" BigQuery API methods.

For more information on this feature, visit the BigQuery API reference.

BigQuery Web UI: Dataset links and dataset sharing notifications

BigQuery has always provided project owners with very fine-grained control of how their datasets are shared. To make it easier for teams to work on collaborative data analysis, we've added direct links to individual datasets in the BigQuery Web UI. This provides a convenient way for authorized users to quickly access a dataset, and allows for bookmarking and sharing.

In addition, we've also added email notifications to inform users when they've been given dataset access privileges. When a dataset has been shared with another user via the sharing control panel, BigQuery sends a notification email containing a direct link to the dataset.


The BigQuery UI features a collection of public datasets for you to use when trying out these new features. To get started, visit our sign up page and Quick Start guide. You should take a look at our API docs, and ask questions about BigQuery development on Stack Overflow. Finally, don't forget to give us feedback and join the discussion on our Cloud Platform Developers Google+ page.


Michael Manoochehri is a Developer Programs Engineer supporting the Google Cloud Platform. His goal is to help make cloud computing and data analysis universally accessible and useful.

Posted by Scott Knaster, Editor

By Julia Ferraioli, Developer Advocate, Google Compute Engine

Cross-posted from the Google Open Source Blog

Today, we’re announcing that you can now find Google Cloud Platform on GitHub! The GitHub organization for the Google Cloud Platform is your destination for samples and tools relating to App Engine, BigQuery, Compute Engine, Cloud SQL, and Cloud Storage. Most of Google Cloud Platform’s existing open source tools will be migrated to the organization over time. You can quickly get your app running by forking any of our repositories and diving into the code.

Currently, the GitHub organization for the Google Cloud Platform has 36 public repositories, some of which are currently undergoing their initial code reviews, which you can follow on the repo. The Google Cloud Platform Developer Relations Team will be using GitHub to maintain our starter projects, which show how to get started with our APIs using different stacks. We will continue to add repositories that illustrate solutions, such as the classic guest book app on Google App Engine. For good measure, you will also see some tools that will make your life easier, such as an OAuth 2.0 helper.

From getting started with Python on Google Cloud Storage to monitoring your Google Compute Engine instances with App Engine, our GitHub organization is home to it all.

Trick of the trade: to find samples relating to a specific platform, try filtering on the name in the “Find a Repository” text field.

We set up this organization not only to give you an easy way to find and follow our samples, but also to give you a way to get involved and start hacking alongside us. We’ll be monitoring our repositories for any reported issues as well as for pull requests. If you’re interested in seeing what a code review looks like for Google’s open source code, you can follow along with the discussion happening right on the commits.

Let us know about your suggestions for samples. We look forward to seeing what you create!


Julia Ferraioli is a Developer Advocate for Google Compute Engine, based in Seattle. She helps developers harness the power of Google's infrastructure to tackle their computationally intensive processes and jobs. She comes from an industrial background in software engineering, and an academic background in machine learning and assistive technology.

Posted by Scott Knaster, Editor

By Aleem Mawani, Co-Founder of Streak

Cross-posted with the Google App Engine Blog

This guest post was written by Aleem Mawani, Co-Founder of Streak, a startup alum of Y Combinator, a Silicon Valley incubator. Streak is a CRM tool built into Gmail. In this post, Aleem shares his experience building and scaling their product using Google Cloud Platform.

Everyone relies on email to get work done – yet most people use separate applications from their email to help them with various business processes. Streak fixes this problem by letting you do sales, hiring, fundraising, bug tracking, product development, deal flow, project management and almost any other business process right inside Gmail. In this post, I want to illustrate how we have used Google Cloud Platform to build Streak quickly, scalably and with the ability to deeply analyze our data.



We use several Google technologies on the backend of Streak:

  • App Engine to serve our application and API, and as the source of truth for our data.
  • Google Cloud Storage to mirror our Datastore and log data and move it to other Google cloud services.
  • BigQuery to analyze our logs and power dashboards.
  • The App Engine Search API to give users Gmail-style full text search over their data.
  • The Prediction API to make smart suggestions based on our users' data.

Our core learning is that you should use the best tool for the job. No one technology will be able to solve all your data storage and access needs. Instead, for each type of functionality, you should use a different service. In our case, we aggressively mirror our data in all the services mentioned above. For example, although the source of truth for our user data is in the App Engine Datastore, we mirror that data in the App Engine Search API so that we can provide full text search, Gmail style, to our users. We also mirror that same data in BigQuery so that we can power internal dashboards.

System Architecture




App Engine - We use App Engine for Java primarily to serve our application to the browser and mobile clients in addition to serving our API. App Engine is the source of truth for all our data, so we aggressively cache using Memcache. We also use Objectify to simplify access to the Datastore, which I highly recommend.

Google Cloud Storage - We mirror all of our Datastore data as well as all our log data in Cloud Storage, which acts as a conduit to other Google cloud services. It lets us archive the data as well as push it to BigQuery and the Prediction API.

BigQuery - Pushing the data into BigQuery allows us to run non-realtime queries that can help generate useful business metrics and slice user data to better understand how our product is getting used. Not only can we run complex queries over our Datastore data but also over all of our log data. This is incredibly powerful for analyzing the request patterns to App Engine. We can answer questions like:

  • Which requests cost us the most money?
  • What is the average response time for every URL on our site over the last 3 days?

BigQuery helps us monitor error rates in our application. Our log data includes debug statements, as well as an "error type" for any request that fails. If it's a known error, we log something sensible; if we haven't seen the exception before, we log the exception type. This is valuable because we built a dashboard that queries BigQuery for the most recent errors in the last hour, grouped by error type. Whenever we do a release, we can monitor error rates in the application really easily.
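
A simplified sketch of what such a dashboard query might look like (the table and column names here are hypothetical; they depend on how the logs are mirrored into BigQuery):

/* Hypothetical sketch: errors from the last hour of mirrored logs,
   grouped by error type */
SELECT
  error_type,
  COUNT(*) AS occurrences
FROM [logs.requests_last_hour]
WHERE error_type IS NOT NULL
GROUP BY error_type
ORDER BY occurrences DESC;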



A Streak dashboard powered by BigQuery showing current usage statistics
In order to move the data into Cloud Storage from the Datastore and LogService, we developed an open source library called Mache. It’s a drop-in library that can be configured to automatically push data into BigQuery via Cloud Storage. The data can come from the Datastore or from LogService and is very configurable - feel free to contribute and give us feedback on it!

Google Cloud Platform also makes our application better for our users. We take advantage of the App Engine Search API and again mirror our data there. Users can then query their Streak data using the familiar Gmail full text search syntax, for example, “before:yesterday name:Foo”. Since we also push our data to the Prediction API, we can help users throughout our app by making smart suggestions. In Streak, we train models based on which emails users have categorized into different projects. Then, when users get a new email, we can suggest the most likely box that the email belongs to.

One issue that arises is how to keep all these mirrored data sets in sync. It works differently for each service based on the architecture of the service. Here’s a simple breakdown:




Having these technologies easily available to us has been a huge help for Streak. It makes our products better and helps us understand our users. Streak’s user base grew 30% every week for 4 consecutive months after launch, and we couldn’t have scaled this easily without Google Cloud Platform. To read more details on why Cloud Platform makes sense for our business, check out our case study and our post on the Google Enterprise blog.


Aleem Mawani is the co-founder of Streak.com, a CRM tool built into Gmail. Previously, Aleem worked on Google Drive and various ads products at Google. He has a degree from the University of Waterloo in Software engineering and an MBA from Harvard University.

Posted by Scott Knaster, Editor

By Ryan Boyd, Developer Advocate

JSON is the data format of the web. JSON is used to power most modern websites, is a native format for many NoSQL databases hosting top web applications, and provides the primary data format in many REST APIs. Google BigQuery, our cloud service for ad-hoc analytics on big data, has now added support for JSON and the nested/repeated structure inherent in the data format.

JSON opens the door to a more object-oriented view of your data compared to CSV, the original data format supported by BigQuery. It removes the need for duplication of data required when you flatten records into CSV. Here are some examples of data you might find a JSON format useful for:
  • Log files, with multiple headers and other name-value pairs.
  • User session activities, with information about each activity occurring nested beneath the session record.
  • Sensor data, with variable attributes collected in each measurement.
Nested/repeated data support is one of our most requested features. And while BigQuery's underlying infrastructure supports it, we'd only enabled it in a limited fashion through M-Lab's test data. Today, however, developers can use JSON to get any nested/repeated data into and out of BigQuery.
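
To give a flavor of the querying syntax for nested, repeated data, here is a sketch assuming a hypothetical persons table in which each row has a repeated children record:

/* Count the repeated child records within each row */
SELECT
  fullName,
  COUNT(children.name) WITHIN RECORD AS numberOfChildren
FROM [mydataset.persons];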

For more information on importing JSON and nested/repeated data into BigQuery, check out the new guide in our documentation. You should also see the Dealing with Data section for details on the new querying syntax available for this type of data.

Improvements to Data Loading Pipeline

We’ve made it much easier to ingest data into BigQuery – up to 1TB of data per load job, with each file up to 100GB uncompressed JSON or CSV. We’ve also eliminated the 2 imports per minute rate limit, enabling you to submit all your ingestion jobs and let us handle the queuing as necessary. In a recent project I’ve been working on, import jobs for 3TB of data that previously took me 12 hours to run now take me only 36 minutes – a 20x improvement!

We’ve published a new Ingestion Cookbook that explains how to take advantage of these new limits.

We’re initiating a small trusted tester program aimed at making it easier to move your data from the App Engine Datastore to BigQuery for analysis. If you store a lot of data in Datastore and are also using BigQuery, we’d like to hear from you. Please sign up now to be considered for the trusted tester program.

Learn more this week

Michael Manoochehri, Siddartha Naidu and I are in London this week talking about BigQuery and these new features at the Strata big data conference. Ju-kay Kwek will also be talking about BigQuery at the Interop NYC conference tomorrow. Please stop by, say hi, and let us know what you’re doing with big data.

We’ll also be producing a Google Developers Live session from Campus London on Friday at 16:00 BST (15:00 GMT).


Ryan Boyd is a Developer Advocate, focused on big data. He's been at Google for 6 years and previously helped build out the Google Apps ISV ecosystem. He published his first book, "Getting Started with OAuth 2.0", with O'Reilly.

Posted by Scott Knaster, Editor

By Pradeep Kumar, redBus

This guest post was written by Pradeep Kumar. Pradeep is a technical architect at redBus, an online travel agency in India that provides a unified online bus ticketing service. We recently published a business case study for redBus and wanted to dive into some more technical detail for the readers of the Google Developers Blog.


Our company has been providing Internet bus ticketing for India since 2006. There are more than 10,000 bus routes available for booking, and we have dozens of machines processing booking requests. Each step in the booking process produces a lot of data – on search terms, route availability, server health and more. We needed tools to process this data quickly and easily, to determine whether decreases in customer bookings are the result of server problems or simply less demand.

While we typically use relational databases to store and analyze data, we knew we needed something more powerful if we wanted to analyze 500GB or more, so we started to look at open source frameworks like Hadoop and analysis platforms like Hive and Pig. We found that these frameworks require considerable in-house expertise and infrastructure investments and wouldn’t give us answers to our questions as fast as we wanted. We decided to try out Google BigQuery as a trusted tester, with hopes that it would give us the ability to perform quick iterative analysis without much up-front investment. Our initial tests went very well, so we started building our analysis tools on top of BigQuery.

BigQuery allows us to run SQL-like queries to understand the bus routes in highest demand and what types of searches users are performing. We’ve also used it to build internal dashboards that give us a snapshot of system health.


For more information on how we structured our immutable tables and pipelined our data into BigQuery for analysis using RabbitMQ, and to see example SQL queries we’ve used, check out my article on developers.google.com.


Pradeep Kumar is a technical architect at redBus.

Posted by Scott Knaster, Editor

By Ryan Boyd, Developer Advocate for Cloud Data Services

Businesses and developers are using BigQuery to solve a wide variety of use cases – from optimizing advertising campaigns, to spotting inventory shortfalls, to understanding customer behavior. Accommodating these varied use cases requires BigQuery to be flexible, both for the developers integrating applications with the API and for the analysts running ad-hoc queries. Today we’ve made it more flexible by adding batch queries and a connector for Microsoft Excel.

Batch priority queries

BigQuery was designed for ad-hoc, iterative analytics on millions-to-billions of rows of data. When you’re diving into your data to gain insights, you want your queries to run in seconds rather than waiting minutes or hours. Sometimes our customers don’t need these fast responses when they’re running nightly jobs to update dashboards or reports, but want to use the same BigQuery technology and underlying datasets for these queries. We’ve now added batch pricing to accommodate these developers, allowing them to run their queries at a significantly lower cost.

Here’s how to set the priority to ‘batch’ when submitting a new query via the Google APIs Client Library for Java:
    Job job = new Job();
    JobConfiguration config = new JobConfiguration();
    JobConfigurationQuery queryConfig = new JobConfigurationQuery();
    config.setQuery(queryConfig);
    job.setConfiguration(config);

    queryConfig.setQuery(querySql);
    queryConfig.setPriority("BATCH");

    com.google.api.services.bigquery.Bigquery.Jobs.Insert insert =
        bigquery.jobs().insert(projectId, job);
Batch queries will execute between 30 minutes and 3 hours after they are submitted. See more information in our Developers Guide.

BigQuery Connector for Excel

Spreadsheets are a popular tool for analysts, executives and developers to explore data. Last year we launched the ability for users of Google Spreadsheets to execute BigQuery queries using the Google Apps Script integration. Today, we’re launching the BigQuery Connector for Excel, which allows Microsoft Excel users to do the same with the ‘External Data’ functionality built into the product. Once the BigQuery results are in Excel, you can easily make pivot tables, create charts and integrate them with data from other sources. If you’re interested, you can try it right now!

Let us know what you think of these new features and what else you’d like to see in the roadmap by reaching out on Google+. We’ll also be holding office hours this Friday at 10 AM PDT on Google Developers Live to talk about these new features and answer any questions you have about BigQuery.

Microsoft and Excel are registered trademarks of Microsoft Corporation


Ryan Boyd is a Developer Advocate, focused on cloud data services. He's been at Google for 6 years and previously helped build out the Google Apps ISV ecosystem. He published his first book "Getting Started with OAuth 2.0" with O'Reilly.

Posted by Scott Knaster, Editor

By Ju-kay Kwek and Amanda Bradford, Google BigQuery Team

In May we launched Google BigQuery, a fully managed cloud-based service that enables businesses to analyze enormous amounts of data in the cloud. While we're continually amazed by the range of business problems being solved, we recognize that writing one-off scripts to ingest data, or creating custom front-end integration, requires effort and takes away time from the fun stuff: getting results.

So today we're pleased to highlight a few new members of our Cloud Platform Partner Program that are here to help you be even more productive – by delivering tools integrated with BigQuery that make it much easier to automatically load data from a broad set of sources, as well as to analyze and visualize the data with spectacular dashboards.

BigQuery partner logos

Import data from multiple sources into BigQuery

We’ve partnered with Informatica, Pervasive Software, Talend and SQLstream to make it easier to bring data from a variety of sources into BigQuery. This means you can use their BigQuery connectors to move data very easily from on-premise or cloud IT systems to BigQuery. For example, TribusPoint, a consulting firm, leveraged the Informatica Cloud BigQuery connector to rapidly move large files from their on-premise data centers to BigQuery.

Build rich interactive dashboards on BigQuery

We’ve partnered with data visualization providers QlikTech, Jaspersoft, Bime Analytics and Metric Insights to help you build rich, interactive dashboards for a broad range of customers. You can use their tools to build dashboards and reports very easily. For example, Pixelfish leveraged Metric Insights BigQuery integration to create dashboards that delivered a 300% improvement in customer engagement.

Click the partner links above to see more specific customer examples of each. We have just scratched the surface on Google BigQuery. We can’t wait to see what other cool applications you can build on Google BigQuery using our APIs. Hack away!


Ju-kay Kwek is the Product Management Lead for Google's Cloud Big Data initiative. In this role, he focuses on creating services that enable businesses and developers to harness Google's unparalleled data processing infrastructure and algorithms to tackle Big Data needs.

Amanda Bradford works in Business Development at Google, driving strategic alliances and partnerships for Google and specializing in the Google Cloud Platform.

Posted by Scott Knaster, Editor

By Craig McLuckie, Product Manager, Google Compute Engine

Over the years, Google has built some of the most high performing, scalable and efficient data centers in the world by constantly refining our hardware and software. Since 2008, we've been working to open up our infrastructure to outside developers and businesses so they can take advantage of our cloud as they build applications and websites and store and analyze data. So far this includes products like Google App Engine, Google Cloud Storage, and Google BigQuery.

Today, in response to many requests from developers and businesses, we're going a step further. We're introducing Google Compute Engine, an Infrastructure-as-a-Service product that lets you run Linux Virtual Machines (VMs) on the same infrastructure that powers Google. This goes beyond just giving you greater flexibility and control; access to computing resources at this scale can fundamentally change the way you think about tackling a problem.

Google Compute Engine offers:
  • Scale. At Google we tackle huge computing tasks all the time, like indexing the web, or handling billions of search queries a day. Using Google's data centers, Google Compute Engine reduces the time to scale up for tasks that require large amounts of computing power. You can launch enormous compute clusters - tens of thousands of cores or more.
  • Performance. Many of you have learned to live with erratic performance in the cloud. We have built our systems to offer strong and consistent performance even at massive scale. For example, we have sophisticated network connections that ensure consistency. Even in a shared cloud you don’t see interruptions; you can tune your app and rely on it not degrading.
  • Value. Computing in the cloud is getting even more appealing from a cost perspective. The economy of scale and efficiency of our data centers allows Google Compute Engine to give you 50% more compute for your money than with other leading cloud providers. You can see pricing details here.
The capabilities of Google Compute Engine include:
  • Compute. Launch Linux VMs on-demand. 1, 2, 4 and 8 virtual core VMs are available with 3.75GB RAM per virtual core.
  • Storage. Store data on local disk, on our new persistent block device, or on our Internet-scale object store, Google Cloud Storage.
  • Network. Connect your VMs together using our high-performance network technology to form powerful compute clusters and manage connectivity to the Internet with configurable firewalls.
  • Tooling. Configure and control your VMs via a scriptable command line tool or web UI. Or you can create your own dynamic management system using our API.
At launch, we have worked with a number of partners - such as RightScale, Puppet Labs, OpsCode, Numerate, Cliqr and MapR - to integrate their products with Google Compute Engine. These partners offer management services that make it easy for you to move your applications to the cloud and between different cloud environments.

You can learn more about Google Compute Engine here. We’re going to pace ourselves and start with Google Compute Engine in limited preview (sign up here), but our goal is to give you all the pieces you need to build anything you want in the cloud. Whether you need a platform like Google App Engine, or virtual machines like Google Compute Engine, these days, you define your limits. We’re just at the start of what the cloud can do.


Craig McLuckie is the Product Management Lead for Google Compute Engine. He spends his days working with an amazing engineering team to open Google’s infrastructure to the world.

Posted by Scott Knaster, Editor

By Ju-kay Kwek and Momchil Filev, BigQuery Team


(Cross-posted with the Google Enterprise Blog)

Last month we announced the public launch of Google BigQuery, which enables developers and businesses to gain real-time business insights from massive amounts of data without any hardware or software investments.

Since then, we’ve added new features to Google BigQuery every week. For example, our most recent release includes support for running up to 20 concurrent queries, depending on the volume of data. This enables developers to build visually interactive dashboards on Google BigQuery.

Today, we’re highlighting two data visualization providers, QlikView and Bime, who are using Google BigQuery’s latest features to build dashboards with snappier and richer experiences.

QlikView

QlikView, one of the leaders in the Business Intelligence market, has developed a dashboard that visualizes the birth-record data for all babies born to mothers of different ages and races. With the help of BigQuery, QlikView can crunch millions of rows of data in seconds to answer questions like, “What's the average age of a mother in New York vs. in Texas?"

Bime

Bime, a cloud-based Business Intelligence provider based in France, is another early adopter of Google BigQuery. They’ve built a slick UI on top of the Google BigQuery platform that allows users to slice and dice 432 million rows of business data. For example, you can adjust a few simple parameters to see the sales distribution across products or regions on a map.

This is just a snapshot of how developers can use Google BigQuery to build interactive visual dashboards using a browser and without the hassle of managing SQL. Sign up and share your BigQuery use cases via our developer feedback form or on the Google Enterprise Google+ page.


Ju-kay Kwek is the Product Management Lead for Google's Cloud Big Data initiative. In this role, he focuses on creating services that enable businesses and developers to harness Google's unparalleled data processing infrastructure and algorithms to tackle Big Data needs.

Momchil Filev works in Business Development at Google, driving strategic alliances and partnerships for Google and specializing in the Google Cloud Platform.


Posted by Ashleigh Rentz, Editor Emerita

By Ilya Grigorik, Web Performance Engineer

Open-source developers all over the world contribute to millions of projects every day: writing and reviewing code, filing and discussing bug reports, updating documentation and project wikis, and so forth. The data generated from this activity can reveal interesting trends across many industries, including popularity of programming languages over time, defect rates, contribution metrics, and popularity of specific frameworks and libraries.

The challenge in extracting these trends is gathering the data. Each project has its own distributed workflow, code repositories, and conventions. Having hosted dozens of my own projects on GitHub, I've long wanted to analyze the developer activity from the 2.6M+ public projects hosted on GitHub. Hence, earlier this year GitHub Archive was born!

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. Each day it archives over 120,000 public activities, ranging from new commits and fork events to opening and closing tickets, each with detailed metadata.

Once I collected the data, I needed a tool to analyze it, and that is when I found Google BigQuery. Based on the research behind Dremel, a popular internal tool at Google for analyzing web-scale datasets, BigQuery allowed me to easily import the entire dataset and use a familiar SQL-like syntax to comb through the gigabytes of data in seconds. Plus, the tool will scale to terabyte datasets, so there is plenty of room to grow!

The best news is that thanks to collaboration from the GitHub and BigQuery teams, the GitHub dataset is now public and available for you to slice and dice in any way you like. No need to worry about data gathering or database schemas: BigQuery will do all the heavy lifting, and you can just compose your queries to be executed in realtime.

Here's a real-world example. What are the most popular programming languages on GitHub over the past month?
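
A query along these lines can be sketched against the public GitHub timeline sample (assuming its repository_language and type columns; the GitHub Archive dataset exposes a similar schema). Counting push events per language gives a rough proxy for commit activity:

/* Push events per repository language, as a rough proxy for commits */
SELECT
  repository_language,
  COUNT(*) AS pushes
FROM [publicdata:samples.github_timeline]
WHERE type = 'PushEvent'
  AND repository_language IS NOT NULL
GROUP BY repository_language
ORDER BY pushes DESC
LIMIT 10;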


chart showing number of commits by language

If you are curious for more, sign up for BigQuery and follow the instructions on githubarchive.org to access the GitHub dataset. You can use the free 100GB query quota to run your analysis and perhaps even win some of the prizes from the GitHub Data Challenge!


Ilya Grigorik is a Web Performance Engineer and Advocate at Google, an open-source evangelist, and an analytics geek. You can find him on GitHub under igrigorik, and blogging about web performance at igvita.com.

Posted by Scott Knaster, Editor