In this tutorial, you'll perform real-time data analysis of Twitter data using a pipeline built on Google Compute Engine, Kubernetes, Redis, and BigQuery. This type of analysis has a number of useful applications, including:
- Performing sentiment analysis
- Identifying general patterns and trends in data
- Monitoring the performance and reach of campaigns
- Crafting responses in real time
The following diagram describes the architecture of the example application.
The example application uses a replication controller with one replicated pod, to support a Redis master that caches incoming tweets. The Redis master is then fronted by a service.
A replicated pod, defined by using a Deployment {: target='k8s' track-type='tutorial' track-name='externalLink' track-metadata-position='body'} reads Twitter's public sample stream using the Twitter streaming API, dumping tweets that match its predefined set of filters into the Redis cache. Two additional replicated pods do blocking reads on the Redis cache. When tweet data is available, these pods add the data to BigQuery in small batches using the BigQuery API.
Objectives
- Download the source code.
- Configure and start a Kubernetes cluster.
- Create a Twitter application.
- Create a Cloud Pub/Sub topic.
- Start up the Kubernetes app.
- Use BigQuery to query for Twitter data.
Costs
This tutorial uses several billable components of Google Cloud Platform, including:
- 5 Compute Engine
n1-standard-1virtual machine instances. - 5 Compute Engine 10GB persistent disks.
- BigQuery.
The cost of running this tutorial varies depending on run time. Use the pricing calculator to generate a cost estimate based on your projected usage.
New Cloud Platform users might be eligible for a free trial.Before you begin
- Use Linux or Mac OS X.
- Install Docker.
- Install the Google Cloud SDK .
- Create a Twitter account.
Create and configure a new project
To work through this example, you must have a project with the required APIs enabled. In the Google Cloud Platform Console, create a new project, and then enable Google Compute Engine API. You will be prompted to enable billing if you have not previously done so.
This tutorial also uses the following APIs that were enabled by default when you created your new project:
- BigQuery
- Google Cloud Storage JSON API
Set up the Google Cloud SDK
-
Authenticate using your Google account:
gcloud auth login -
Set the default project for the Cloud SDK to the project you selected in the previous section of this tutorial. Replace
PROJECT_IDwith your ID:gcloud config set project [PROJECT_ID] -
Update the components:
gcloud components update -
Install the
kubectlbinary:gcloud components update kubectl -
Add the directory to
kubectlto your path:export PATH=$PATH:/usr/local/share/google/google-cloud-sdk/bin/
Download the example code
You can get the example code either of two ways:
-
Download the zip archive. Unzip the code into a directory named
kube-redis-bq. -
Clone the Github repository. In a terminal window, run the following command:
git clone https://github.com/GoogleCloudPlatform/kubernetes-bigquery-python.git kube-redis-bq
Download Kubernetes
Kubernetes is an open source orchestration system for Docker containers. It schedules containers onto the instances in your Compute Engine cluster, manages workloads to ensure that their state matches your declared intentions, and organizes containers by label and by task type for easy management and discovery.
-
Unpack the file into the same parent directory from where you installed the example code. Note that the tar file will unpack into a directory named
kubernetes, so you don't need to create a new directory. For example, enter:tar -zxvf kubernetes.tar.gz
Creating your BigQuery table
Create a BigQuery table to store your tweets. BigQuery groups tables into
abstraction layers called datasets. Use the bq
command-line tool, which is included with the Cloud SDK, to create a new
BigQuery dataset named rtda.
-
In a terminal window, enter the following command:
bq mk rtda -
Now that you have a dataset in place, create a new table named
tweetsto contain the incoming tweets. Each BigQuery table must be defined by a schema. To save you time, this example provides a predefined schema,schema.json, that you can use to define your table. To create the table by using the predefined schema, enter the following command:bq mk -t rtda.tweets kube-redis-bq/bigquery-setup/schema.json -
To verify that your new dataset and table have been created, use the BigQuery web UI. You should see the dataset name in the left sidebar. If it isn't there, make sure you're looking at the correct project. When you click the arrow next to the dataset name, you should see your new table. Alternatively, you can view all datasets in a project or all tables in a given dataset by running the following commands:
bq ls [PROJECT_ID]:bq ls [DATASET_ID]In this case, the dataset ID is
rtda. -
Finally, edit
kube-redis-bq/redis/bigquery-controller.yaml. Update thevaluefor the following fields to reflect your BigQuery configuration:PROJECT_ID: Your project ID.BQ_DATASET: The BigQuery dataset containing your table (rtda).BQ_TABLE: The BigQuery table you just created (tweets).
Configuring and starting a Kubernetes cluster
Kubernetes is portable—it uses a set of shell scripts for starting up, shutting down, and managing configuration—so no installation is necessary.
-
Edit
kubernetes/cluster/gce/config-common.sh. -
Add the
bigqueryscope to the file'sNODE_SCOPESdefinition.
This setting allows your nodes to write to your BigQuery table. Save the configuration and close your file.
(Optional) Update the startup scripts for long-term Redis reliability
You can improve the long-term reliability of Redis on Kubernetes by adding some suggested optimizations to the nodes' startup scripts. This step is helpful if you want your Redis master to run for a long period of time. If you don't intend to have your cluster up for very long, you can skip this step.
-
Edit
kubernetes/cluster/gce/configure-vm.shand scroll to the bottom of the file. Above the followingifstatement, after the row of#symbols:if [[ -z "${is_push}" ]]; then -
Insert the following commands:
/sbin/sysctl vm.overcommit_memory=1 echo never > /sys/kernel/mm/transparent_hugepage/enabled
These commands will set your Linux kernel to always overcommit memory and will disable transparent huge pages in the kernel. Save and close the file.
Start up your cluster
Now you are ready to start up your cluster.
-
Enter the following command:
kubernetes/cluster/kube-up.shStarting the cluster can take some time. During the startup process, you might be prompted to create a new SSH key for Compute Engine or, if you've already done so previously, to enter your SSH key passphrase.
-
After your cluster has started, enter the following commands to see your running instances:
kubectl get nodes
kubectl cluster-info
You should see one Kubernetes master instance and four Kubernetes nodes.
Creating a Twitter application
To receive tweets from Twitter, you need to create a Twitter application and add its key/secret and token/secret values to the specification for your Twitter-to-Redis Kubernetes pod. In all, you will copy four values. Follow these steps:
- Create a new Twitter application.
- In the Twitter Application Management page, navigate to the Keys and Access Tokens tab.
- Click the Create my access token button to create a new access token.
- Edit
kube-redis-bq/redis/twitter-stream.yaml. - Replace the following values with your consumer
key and consumer secret:
CONSUMERKEYCONSUMERSECRET
- Replace the following values with your access token
and access token secret:
ACCESSTOKENACCESSTOKENSEC
The example application reads from a random sample of Twitter's public stream by default. To filter on a set of keywords instead:
- Edit
kube-redis-bq/twitter-stream.yamland change the value ofTWSTREAMMODEtofilter. - Edit
twitter-to-redis.pyand replace the keywords defined in thetrackvariable with the keywords you want to filter on. - Rebuild the container image. For instructions about how to build the container image, see the appendix.
Understanding the Docker image for your Kubernetes pods
This example uses three different Kubernetes pod templates to construct the analysis pipeline. One is a Redis pod, and the other two types of pods run the example application's scripts.
The specification files point to a pre-built Docker image; you don't need to do anything special to use it. If you'd like to build the image yourself, follow the steps in the appendix.
The Docker image contains two main scripts that perform the work for this solution.
The code in twitter-to-redis.py streams incoming tweets from Twitter to Redis.
The code in redis-to-bigquery.py streams cached data from Redis to
BigQuery.
Starting up your Kubernetes app
After optionally building and pushing your Docker image, you can begin starting up your Kubernetes app.
In Kubernetes, pods {: target='k8s' track-type='tutorial' track-name='externalLink' track-metadata-position='body'} – rather than individual application containers – are the smallest deployable units that can be created, scheduled, and managed.
A replica set {: target='k8s' track-type='tutorial' track-name='externalLink' track-metadata-position='body'} ensures that a specified number of pod replicas are running at any one time. If there are too many replicas, it will shut down some. If there are too few, it will start more. As opposed to just creating singleton pods or even creating pods in bulk, a replica set replaces pods that are deleted or terminated for any reason, such as in the case of node failure.
A Deployment {: target='k8s' track-type='tutorial' track-name='externalLink' track-metadata-position='body'} provides declarative updates for pods and replica sets. You only need to describe the desired state in a Deployment object, and the Deployment controller will change the actual state to the desired state at a controlled rate for you.
You will use Deployments for the Redis-to-BigQuery and Twitter-to-Redis pods.
Start the Redis master pod
Start the Redis master pod first. If you're curious, you can see the
specification file for the Redis master pod in
kube-redis-bq/redis/redis-master.yaml:
-
To start the pod, run the following command:
kubectl create -f kube-redis-bq/redis/redis-master.yaml -
Run the following command to verify that the Redis master pod is running:
kubectl get pods
Look at the STATUS field to verify that the master is running. In addition,
you might see some other pods already running. These extra pods are added and
used by the Kubernetes framework.
It can take around 30 seconds for the pod to be placed on an instance. During
this period, the Redis master pod will list its host as <unassigned>. However,
the final value of Host will be the name of the instance that the pod is
running on. When the process completes, the pod's status will change from
ContainerCreating to Running.
Start the Redis master's service
In Kubernetes, a
service
defines a logical set of pods and a policy by which to access them. The
service is specified in kube-redis-bq/redis/redis-master-service.yaml as
follows:
The Redis pod that you created above has the label name: redis-master. The
selector field of the service determines which pods will receive the traffic
sent to the service.
-
Enter the following command to start the service:
kubectl create -f kube-redis-bq/redis/redis-master-service.yaml -
To view the running services, enter the following command:
kubectl get services
Start the Redis-to-BigQuery Deployment
Next, start the Deployment for the pods that pull tweets from the Redis cache and stream them to your BigQuery table.
Before you start the Deployment, you might find it useful to take a closer look at
kube-redis-bq/redis/bigquery-controller.yaml.
This file defines the Deployment, and specifies a replica set with two pod replicas. The
characteristics of these replicated pods are defined in the file by using a pod template.
-
Run the following command to start up the replicated pods:
kubectl create -f kube-redis-bq/redis/bigquery-controller.yaml -
To verify that the pods are running, run the following:
kubectl get podsIt can take about 30 seconds for your new pods to move from
ContainerCreatingtoRunning. You should see one Redis pod labeledredis-master, and two BigQuery-to-Redis pods labeledbigquery-controller. -
You can run:
kubectl get deployments
to see the system's defined Deployments, and how many replicas each is specified to have.
Start the Twitter-to-Redis Deployment
After your Redis master and Redis-to-BigQuery pipeline pods are running, start the Deployment
that pulls in tweets and adds them to your Redis cache.
Like kube-redis-bq/redis/bigquery-controller.yaml, the Twitter-to-Redis
specification file, kube-redis-bq/redis/twitter-stream.yaml, defines a
Deployment. However, this time the Deployment only asks for one replicated pod,
as Twitter only allows one streaming API connection at a time.
-
To start up the Deployment, enter the following command:
kubectl create -f kube-redis-bq/redis/twitter-stream.yaml -
Verify that all of your pods are now running by running the following command. Again, it can take about 30 seconds for your new pod to move from
ContainerCreatingtoRunning.kubectl get pods
In addition to the pods you saw in the previous step, you should see a new pod
labeled twitter-stream. Congratulations! Your analysis pipeline is now up and
running.
-
You can again run:
kubectl get deployments
to see the system's defined Deployments, and how many replicas each is specified to have. You should now see something like this:
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
bigquery-controller 2 2 2 2 2m
twitter-stream 1 1 1 1 2m
Querying your BigQuery table
Open the BigQuery web UI and click Compose Query to begin writing a new query. You can verify that it is working by running a simple query, as follows:
SELECT
text
FROM
[rtda.tweets]
LIMIT
1000 IGNORE CASE;
Let your pipeline collect tweets for a while – a few hours should do, but the longer you let it run, the richer your data set will be. After you have some more data in your BigQuery table, you can try running some interesting sample queries.
This sample query demonstrates how to find the most retweeted tweets in your table, filtering on a specific term (in this case, "android"):
SELECT
text,
MAX(retweeted_status.retweet_count) AS max_retweets,
retweeted_status.user.screen_name
FROM
[rtda.tweets]
WHERE
text CONTAINS 'android'
GROUP BY
text,
retweeted_status.user.screen_name
ORDER BY
max_retweets DESC
LIMIT
1000 IGNORE CASE;
You might also find it interesting to filter your collected tweets by a set of terms. The following query filters by the words "Kubernetes," "BigQuery," "Redis," or "Twitter:"
SELECT
created_at,
text,
id,
retweeted_status.retweet_count,
user.screen_name
FROM
[rtda.tweets]
WHERE
text CONTAINS 'kubernetes'
OR text CONTAINS 'BigQuery'
OR text CONTAINS 'redis'
OR text CONTAINS 'twitter'
ORDER BY
created_at DESC
LIMIT
1000 IGNORE CASE;
The following query looks for a correlation between the number of favorites and the number of retweets in your set of tweets:
SELECT
CORR(retweeted_status.retweet_count, retweeted_status.favorite_count),
lang,
COUNT(*) c
FROM [rtda.tweets]
GROUP BY lang
HAVING c > 2000000
ORDER BY 1
Going even deeper, you could also investigate whether the speakers of a specific language prefer favoriting to retweeting, or vice versa:
SELECT
CORR(retweeted_status.retweet_count, retweeted_status.favorite_count),
lang,
COUNT(*) c,
AVG(retweeted_status.retweet_count) avg_rt,
AVG(retweeted_status.favorite_count) avg_fv,
AVG(retweeted_status.retweet_count)/AVG(retweeted_status.favorite_count) ratio_rt_fv
FROM [rtda.tweets]
WHERE retweeted_status.retweet_count > 1 AND retweeted_status.favorite_count > 1
GROUP BY lang
HAVING c > 1000000
ORDER BY 1;
Cleaning up
Labels make it easy to select the resources you want to stop or delete. For example:
kubectl delete deployment -l "name in (twitter-stream, bigquery-controller)"
You can also delete a resource using its specification filename. For example, delete the Redis service and replication controller like this:
kubectl delete -f redis-master.yaml
kubectl delete -f redis-master-service.yaml
If you'd like to shut down your cluster instances altogether, run the following command:
kubernetes/cluster/kube-down.sh
This deletes all of the instances in your cluster.
Appendix: Building and pushing the Docker image
Your Kubernetes pods require a Docker image that includes the app scripts and
their supporting libraries. This image is used to start up the Twitter-to-Redis
and Redis-to-BigQuery pods that are part of your analysis pipeline. The
Dockerfile located in the kube-redis-bq/redis/redis-pipe-image directory
specifies this image as follows:
This Dockerfile first instructs the Docker daemon to install some required Python libraries to the new image:
tweepy, a Python wrapper for the Twitter API.redis-py, a Python wrapper for Redis.- The Google API Python libraries.
python-dateutil, an extension to Python's standarddatetimemodule.
Then, four scripts in redis-pipe-image are added:
controller.py:Provides a common execution point for the other two Python scripts in the pipeline image.utils.py: Contains some helper functions.redis-to-bigquery.py:Streams cached data from Redis to BigQuery.-
twitter-to-redis.py:Streams incoming tweets from Twitter to Redis. -
To build the pipeline image using the associated Dockerfile, run the following command. Replace
[PROJECT_ID]with your project ID:sudo docker build -t gcr.io/[PROJECT_ID]/pipeline_image kube-redis-bq/redis/redis-pipe-image -
After you build your image, push it to the Google Container Registry so that Kubernetes can access it. Replace
[PROJECT_ID]with the ID of your project when you run the following command:sudo gcloud docker push gcr.io/[PROJECT_ID]/pipeline_image
Update the Kubernetes specification files to use your custom image
Next, update the image values in the specification files to use your new
image. You need to update the following two files:
kube-redis-bq/redis/twitter-stream.yaml
kube-redis-bq/redis/bigquery-controller.yaml
-
Edit each file and look for this line:
gcr.io/google-samples/redis-bq-pipe:v3 -
Replace
your-projectwith your project ID. Replace the name of the image withpipeline_image. -
Save and close the files.
What's next
- Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.