In this tutorial, you'll perform real-time analysis of Twitter data using a pipeline built on Google Compute Engine, Kubernetes, Google Cloud Pub/Sub, and BigQuery. This type of analysis has a number of useful applications, including:
- Performing sentiment analysis
- Identifying general patterns and trends in data
- Monitoring the performance and reach of campaigns
- Crafting responses in real time
The following diagram describes the architecture of the example application.
The example application uses Cloud Pub/Sub to buffer incoming tweets. A replicated pod, defined by a Deployment, reads Twitter's public sample stream by using the Twitter streaming API and publishes incoming tweets to a Cloud Pub/Sub topic. Two additional replicated pods subscribe to the topic; when new tweet data is published to the topic, these pods add the data to BigQuery by using the BigQuery API.
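The publishing half of this flow amounts to wrapping each tweet as a base64-encoded Cloud Pub/Sub message. The following minimal sketch, written against the Google API Python client with application-default credentials, shows the idea; it is an illustration, not the tutorial's actual `twitter-to-pubsub.py` code. Replace `[PROJECT_ID]` with your project ID.

```python
import base64
import json

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
pubsub = discovery.build('pubsub', 'v1', credentials=credentials)

# Stands in for one tweet received from the Twitter streaming API.
tweet = {'text': 'Hello, BigQuery!'}

# Pub/Sub message payloads are base64-encoded strings.
body = {'messages': [
    {'data': base64.urlsafe_b64encode(
        json.dumps(tweet).encode('utf-8')).decode('ascii')}
]}
pubsub.projects().topics().publish(
    topic='projects/[PROJECT_ID]/topics/new_tweets', body=body).execute()
```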
Objectives
- Download the source code.
- Configure and start a Kubernetes cluster.
- Create a Twitter application.
- Create a Cloud Pub/Sub topic.
- Start up the Kubernetes app.
- Use BigQuery to query for Twitter data.
Costs
This tutorial uses several billable components of Google Cloud Platform, including:
- 5 Compute Engine `n1-standard-1` virtual machine instances
- 5 Compute Engine 10 GB persistent disks
- BigQuery
The cost of running this tutorial varies depending on run time. Use the pricing calculator to generate a cost estimate based on your projected usage.
New Cloud Platform users might be eligible for a free trial.
Before you begin
- Use Linux or Mac OS X.
- Install Docker.
- Install the Google Cloud SDK.
- Create a Twitter account.
Create and configure a new project
To work through this example, you must have a project with the required APIs enabled. In the Google Cloud Platform Console, create a new project, and then enable the following APIs:
- Google Compute Engine API
- Google Pub/Sub API
You will be prompted to enable billing if you have not previously done so.
This tutorial also uses the following APIs that were enabled by default when you created your new project:
- BigQuery
- Google Cloud Storage JSON API
Set up the Google Cloud SDK
- Authenticate using your Google account:

  ```
  gcloud auth login
  ```

- Set the default project for the Cloud SDK to the project you selected in the previous section of this tutorial. Replace `[PROJECT_ID]` with your project ID:

  ```
  gcloud config set project [PROJECT_ID]
  ```

- Update the components:

  ```
  gcloud components update
  ```

- Install the `kubectl` binary:

  ```
  gcloud components update kubectl
  ```

- Add the directory containing `kubectl` to your path:

  ```
  export PATH=$PATH:/usr/local/share/google/google-cloud-sdk/bin/
  ```
Download the sample code
You can get the sample code in either of two ways:

- Download the zip archive and unzip the code into a directory named `kube-pubsub-bq`.

- Clone the GitHub repository. In a terminal window, run the following command:

  ```
  git clone https://github.com/GoogleCloudPlatform/kubernetes-bigquery-python.git kube-pubsub-bq
  ```
Download Kubernetes
Kubernetes is an open source orchestration system for Docker containers. It schedules containers onto the instances in your Compute Engine cluster, manages workloads to ensure that their state matches your declared intentions, and organizes containers by label and by task type for easy management and discovery.
- Download the latest Kubernetes release archive, `kubernetes.tar.gz`.

- Unpack the file into the same parent directory where you installed the example code. The tar file unpacks into a directory named `kubernetes`, so you don't need to create a new directory. For example, enter:

  ```
  tar -zxvf kubernetes.tar.gz
  ```
Creating your BigQuery table
Create a BigQuery table to store your tweets. BigQuery groups tables into datasets. Use the `bq` command-line tool, which is included with the Cloud SDK, to create a new BigQuery dataset named `rtda`.
- In a terminal window, enter the following command:

  ```
  bq mk rtda
  ```

- Now that you have a dataset in place, create a new table named `tweets` to contain the incoming tweets. Each BigQuery table must be defined by a schema. To save you time, this example provides a predefined schema, `schema.json`, that you can use to define your table. To create the table by using the predefined schema, enter the following command:

  ```
  bq mk -t rtda.tweets kube-pubsub-bq/bigquery-setup/schema.json
  ```

- To verify that your new dataset and table have been created, use the BigQuery web UI. You should see the dataset name in the left sidebar. If it isn't there, make sure you're looking at the correct project. When you click the arrow next to the dataset name, you should see your new table. Alternatively, you can view all datasets in a project or all tables in a given dataset by running the following commands:

  ```
  bq ls [PROJECT_ID]:
  bq ls [DATASET_ID]
  ```

  In this case, the dataset ID is `rtda`.

- Finally, edit `kube-pubsub-bq/pubsub/bigquery-controller.yaml`. Update the `value` for the following fields to reflect your BigQuery configuration:

  - `PROJECT_ID`: your project ID
  - `BQ_DATASET`: the BigQuery dataset containing your table (`rtda`)
  - `BQ_TABLE`: the BigQuery table you just created (`tweets`)
Configuring and starting a Kubernetes cluster
Kubernetes is portable—it uses a set of shell scripts for starting up, shutting down, and managing configuration—so no installation is necessary.
- Edit `kubernetes/cluster/gce/config-common.sh`.

- Add the `bigquery` and `https://www.googleapis.com/auth/pubsub` scopes to the file's `NODE_SCOPES` definition. These scopes allow your nodes to write to your BigQuery table and to publish to your Cloud Pub/Sub topic. Save the file and close it.
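For reference, the edited line might look something like the following. The exact set of preexisting scopes varies by Kubernetes version, so treat this as an illustration rather than the file's literal contents:

```
NODE_SCOPES="storage-ro,compute-rw,bigquery,https://www.googleapis.com/auth/pubsub"
```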
Start up your cluster
Now you are ready to start up your cluster.
- Enter the following command:

  ```
  kubernetes/cluster/kube-up.sh
  ```

  Starting the cluster can take some time. During the startup process, you might be prompted to create a new SSH key for Compute Engine or, if you've already done so previously, to enter your SSH key passphrase.

- After your cluster has started, enter the following commands to see your running instances:

  ```
  kubectl get nodes
  kubectl cluster-info
  ```

  You should see one Kubernetes master instance and four Kubernetes nodes.
Creating a Twitter application
To receive tweets from Twitter, you need to create a Twitter application and add its key/secret and token/secret values to the specification for your Twitter-to-PubSub Kubernetes pod. In all, you will copy four values. Follow these steps:
- Create a new Twitter application.
- In the Twitter Application Management page, navigate to the Keys and Access Tokens tab.
- Click the Create my access token button to create a new access token.
- Edit `kube-pubsub-bq/pubsub/twitter-stream.yaml`.

- Replace the following placeholder values with your consumer key and consumer secret:

  - `CONSUMERKEY`
  - `CONSUMERSECRET`

- Replace the following placeholder values with your access token and access token secret (a credentials sanity check follows this list):

  - `ACCESSTOKEN`
  - `ACCESSTOKENSEC`
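The pipeline image uses `tweepy` (see the appendix) to authenticate with these four values. If you want to sanity-check your credentials before starting the cluster, a minimal sketch, assuming `tweepy` is installed locally, looks like this; the placeholder strings are the same four values you pasted into the YAML file:

```python
import tweepy

# Paste the four values from the Keys and Access Tokens tab.
auth = tweepy.OAuthHandler('CONSUMERKEY', 'CONSUMERSECRET')
auth.set_access_token('ACCESSTOKEN', 'ACCESSTOKENSEC')
api = tweepy.API(auth)

# verify_credentials() returns your own user object if the keys are valid.
print(api.verify_credentials().screen_name)
```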
The example application reads from a random sample of Twitter's public stream by default. To filter on a set of keywords instead:
- Edit `kube-pubsub-bq/pubsub/twitter-stream.yaml` and change the value of `TWSTREAMMODE` to `filter`.
- Edit `twitter-to-pubsub.py` and replace the keywords defined in the `track` variable with the keywords you want to filter on (see the example after this list).
- Rebuild the container image. For instructions about how to build the container image, see the appendix.
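For example, the edited `track` assignment in `twitter-to-pubsub.py` might look like the following; the variable name comes from the tutorial, and the keyword list is just an illustration:

```python
# Keywords to pass to Twitter's filtered stream; replace with your own terms.
track = ['bigquery', 'kubernetes', 'cloud pub/sub']
```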
Creating a Cloud Pub/Sub topic
Next, create a Cloud Pub/Sub topic to which your Kubernetes pods can publish and subscribe.
- Visit the Try It! section of the relevant Cloud Pub/Sub API page.

- Enable the Authorize requests using OAuth 2.0 switch, and then click Authorize.

- Enter the following path in the name field, replacing `[PROJECT_ID]` with your project ID:

  ```
  projects/[PROJECT_ID]/topics/new_tweets
  ```

- Click Execute to create your new topic. You can verify that the topic has been created by checking the response code in the Response section. If the response code is `200 OK`, your topic was created successfully.

- Edit the following two files:

  - `kube-pubsub-bq/pubsub/bigquery-controller.yaml`
  - `kube-pubsub-bq/pubsub/twitter-stream.yaml`

- Update the `PUBSUB_TOPIC` values in the files with your new topic. Look for this line:

  ```
  value: projects/your-project/topics/your-topic
  ```

  Replace `your-project` with your project ID, and replace `your-topic` with `new_tweets`.
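As an alternative to the Try It! page, you can create the topic programmatically. This sketch uses the Pub/Sub v1 API through the Google API Python client; the client setup is an assumption of this sketch. Replace `[PROJECT_ID]` with your project ID:

```python
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
pubsub = discovery.build('pubsub', 'v1', credentials=credentials)

# topics.create takes the fully qualified topic name and an empty Topic body.
topic = pubsub.projects().topics().create(
    name='projects/[PROJECT_ID]/topics/new_tweets', body={}).execute()
print(topic['name'])  # projects/[PROJECT_ID]/topics/new_tweets
```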
Understanding the Docker image for your Kubernetes pods
This example uses two different Kubernetes pod templates to construct the analysis pipeline. The specification files point to a pre-built Docker image; you don't need to do anything special to use it. If you'd like to build the image yourself, follow the steps in the appendix.
The Docker image contains two main scripts that perform the work for this solution:

- `twitter-to-pubsub.py` streams incoming tweets from Twitter to Cloud Pub/Sub.
- `pubsub-to-bigquery.py` streams cached data from Cloud Pub/Sub to BigQuery.
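To illustrate what the consuming side does, here is a minimal sketch of one pull-and-insert cycle. It is not the actual `pubsub-to-bigquery.py` code, and the subscription name is a hypothetical placeholder:

```python
import base64
import json

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
pubsub = discovery.build('pubsub', 'v1', credentials=credentials)
bigquery = discovery.build('bigquery', 'v2', credentials=credentials)

# Hypothetical subscription attached to the new_tweets topic.
subscription = 'projects/[PROJECT_ID]/subscriptions/tweets-to-bq'

# Pull a batch of buffered tweets from Cloud Pub/Sub.
resp = pubsub.projects().subscriptions().pull(
    subscription=subscription,
    body={'returnImmediately': False, 'maxMessages': 100}).execute()
received = resp.get('receivedMessages', [])

if received:
    # Decode each message back into a JSON row and stream it into BigQuery.
    rows = [{'json': json.loads(base64.urlsafe_b64decode(
        msg['message']['data']).decode('utf-8'))} for msg in received]
    bigquery.tabledata().insertAll(
        projectId='[PROJECT_ID]', datasetId='rtda', tableId='tweets',
        body={'rows': rows}).execute()

    # Acknowledge the messages so Cloud Pub/Sub does not redeliver them.
    pubsub.projects().subscriptions().acknowledge(
        subscription=subscription,
        body={'ackIds': [msg['ackId'] for msg in received]}).execute()
```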
Starting up your Kubernetes app
After optionally building and pushing your Docker image, you can start up your Kubernetes app.
In Kubernetes, pods – rather than individual application containers – are the smallest deployable units that can be created, scheduled, and managed.
A replica set ensures that a specified number of pod replicas are running at any one time. If there are too many, it shuts some down; if there are too few, it starts more. Unlike singleton pods or pods created in bulk, a replica set replaces pods that are deleted or terminated for any reason, such as node failure.
A Deployment provides declarative updates for pods and replica sets. You only need to describe the desired state in a Deployment object, and the Deployment controller changes the actual state to the desired state at a controlled rate for you.
You will use Deployments to launch your Kubernetes app.
Start the Cloud-Pub/Sub-to-BigQuery Deployment
Begin by starting the Deployment for the pods that will subscribe to your Cloud Pub/Sub topic and stream tweets to your BigQuery table as they become available.
Before you start the Deployment, you might find it useful to take a closer look at `kube-pubsub-bq/pubsub/bigquery-controller.yaml`. This file defines the Deployment and specifies a replica set with two pod replicas. The characteristics of these replicated pods are defined in the file by using a pod template.
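For orientation, a Deployment specification of this shape looks roughly like the following excerpt. The names, replica count, image, and environment variables come from this tutorial, but this is an illustration, not the exact file contents:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: bigquery-controller
spec:
  replicas: 2               # two Cloud-Pub/Sub-to-BigQuery pods
  template:                 # pod template: what each replica runs
    metadata:
      labels:
        name: bigquery-controller
    spec:
      containers:
      - name: bigquery
        image: gcr.io/google-samples/pubsub-bq-pipe:v3
        env:
        - name: PROJECT_ID
          value: your-project
        - name: PUBSUB_TOPIC
          value: projects/your-project/topics/new_tweets
        - name: BQ_DATASET
          value: rtda
        - name: BQ_TABLE
          value: tweets
```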
- Run the following command to start up the replicated pods:

  ```
  kubectl create -f kube-pubsub-bq/pubsub/bigquery-controller.yaml
  ```

- To verify that the pods are running, run the following:

  ```
  kubectl get pods
  ```

  It can take about 30 seconds for your new pods to move from `ContainerCreating` to `Running`. In the list, you should see two Cloud-Pub/Sub-to-BigQuery pods labeled `bigquery-controller`.

- To see the system's defined Deployments, and how many replicas each is specified to have, run:

  ```
  kubectl get deployments
  ```
Start the Twitter-to-Cloud-Pub/Sub Deployment
After your Cloud-Pub/Sub-to-BigQuery pipeline pods are up and running, start the Deployment for the pod that pulls in tweets and publishes them to your Cloud Pub/Sub topic. Like `kube-pubsub-bq/pubsub/bigquery-controller.yaml`, the Twitter-to-Cloud-Pub/Sub specification file, `kube-pubsub-bq/pubsub/twitter-stream.yaml`, defines a Deployment. This time, however, the Deployment asks for only one replicated pod, because Twitter allows only one streaming API connection at a time.
- To start up the Deployment, run:

  ```
  kubectl create -f kube-pubsub-bq/pubsub/twitter-stream.yaml
  ```

- Verify that all of your pods are now running by running the following command. Again, it can take about 30 seconds for your new pod to move from `ContainerCreating` to `Running`:

  ```
  kubectl get pods
  ```

  In addition to the pods you saw in the previous step, you should see a new pod labeled `twitter-stream`. Congratulations! Your analysis pipeline is now up and running.

- You can again run `kubectl get deployments` to see the system's defined Deployments, and how many replicas each is specified to have. You should now see something like this:

  ```
  NAME                  DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
  bigquery-controller   2         2         2            2           2m
  twitter-stream        1         1         1            1           2m
  ```
Querying your BigQuery table
Open the BigQuery web UI and click Compose Query to begin writing a new query. You can verify that the pipeline is working by running a simple query, as follows:

```
SELECT
  text
FROM
  [rtda.tweets]
LIMIT
  1000 IGNORE CASE;
```
Let your pipeline collect tweets for a while – a few hours should do, but the longer you let it run, the richer your data set will be. After you have some more data in your BigQuery table, you can try running some interesting sample queries.
This example query demonstrates how to find the most retweeted tweets in your table, filtering on a specific term (in this case, "android"):
```
SELECT
  text,
  MAX(retweeted_status.retweet_count) AS max_retweets,
  retweeted_status.user.screen_name
FROM
  [rtda.tweets]
WHERE
  text CONTAINS 'android'
GROUP BY
  text,
  retweeted_status.user.screen_name
ORDER BY
  max_retweets DESC
LIMIT
  1000 IGNORE CASE;
```
You might also find it interesting to filter your collected tweets by a set of terms. The following query filters by the words "Kubernetes," "BigQuery," "Cloud Pub/Sub," or "Twitter":
```
SELECT
  created_at,
  text,
  id,
  retweeted_status.retweet_count,
  user.screen_name
FROM
  [rtda.tweets]
WHERE
  text CONTAINS 'kubernetes'
  OR text CONTAINS 'BigQuery'
  OR text CONTAINS 'cloud pub/sub'
  OR text CONTAINS 'twitter'
ORDER BY
  created_at DESC
LIMIT
  1000 IGNORE CASE;
```
The following query looks for a correlation between the number of favorites and the number of retweets in your set of tweets:
```
SELECT
  CORR(retweeted_status.retweet_count, retweeted_status.favorite_count),
  lang,
  COUNT(*) c
FROM [rtda.tweets]
GROUP BY lang
HAVING c > 2000000
ORDER BY 1;
```
Going even deeper, you could also investigate whether the speakers of a specific language prefer favoriting to retweeting, or vice versa:
```
SELECT
  CORR(retweeted_status.retweet_count, retweeted_status.favorite_count),
  lang,
  COUNT(*) c,
  AVG(retweeted_status.retweet_count) avg_rt,
  AVG(retweeted_status.favorite_count) avg_fv,
  AVG(retweeted_status.retweet_count)/AVG(retweeted_status.favorite_count) ratio_rt_fv
FROM [rtda.tweets]
WHERE retweeted_status.retweet_count > 1 AND retweeted_status.favorite_count > 1
GROUP BY lang
HAVING c > 1000000
ORDER BY 1;
```
Cleaning up
Labels make it easy to select the resources you want to stop or delete. For example:
```
kubectl delete deployment -l "name in (twitter-stream, bigquery-controller)"
```
If you'd like to shut down your cluster instances altogether, run the following command:
```
kubernetes/cluster/kube-down.sh
```
This deletes all of the instances in your cluster.
Appendix: Building and pushing the Docker image
Your Kubernetes pods require a Docker image that includes the app scripts and their supporting libraries. This image is used to start up the Twitter-to-Cloud-Pub/Sub and Cloud-Pub/Sub-to-BigQuery pods that are part of your analysis pipeline. The Dockerfile located in the `kube-pubsub-bq/pubsub/pubsub-pipe-image` directory specifies this image.
The Dockerfile first instructs the Docker daemon to install some required Python libraries into the new image:
- `tweepy`, a Python wrapper for the Twitter API
- The Google API Python libraries
- `python-dateutil`, an extension to Python's standard `datetime` module
Then, four scripts in pubsub-pipe-image are added:
- `controller.py`: provides a common execution point for the other two Python scripts in the pipeline image.
- `utils.py`: contains some helper functions.
- `pubsub-to-bigquery.py`: streams cached data from Cloud Pub/Sub to BigQuery.
- `twitter-to-pubsub.py`: streams incoming tweets from Twitter to Cloud Pub/Sub.

To build and push the image:

- To build the pipeline image using the associated Dockerfile, run the following command. Replace `[PROJECT_ID]` with your project ID:

  ```
  sudo docker build -t gcr.io/[PROJECT_ID]/pubsub_pipeline kube-pubsub-bq/pubsub/pubsub-pipe-image
  ```

- After you build your image, push it to the Google Container Registry so that Kubernetes can access it. Replace `[PROJECT_ID]` with your project ID:

  ```
  sudo gcloud docker push gcr.io/[PROJECT_ID]/pubsub_pipeline
  ```
Update the Kubernetes specification files to use your custom image
Next, update the image values in the specification files to use your new
image. You need to update the following two files:
- `kube-pubsub-bq/pubsub/twitter-stream.yaml`
- `kube-pubsub-bq/pubsub/bigquery-controller.yaml`
- Edit each file and look for this line:

  ```
  gcr.io/google-samples/pubsub-bq-pipe:v3
  ```

- Replace `google-samples` with your project ID, and replace `pubsub-bq-pipe:v3` with `pubsub_pipeline`.

- Save and close the files.
What's next
- Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.