<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Google Cloud</title><link>https://cloud.google.com/blog/products/gcp/</link><description>Google Cloud</description><atom:link href="https://cloudblog.withgoogle.com/products/gcp/rss/" rel="self"></atom:link><language>en</language><lastBuildDate>Thu, 05 Jan 2023 17:00:00 -0000</lastBuildDate><image><url>https://gweb-cloudblog-publish.appspot.com/products/gcp/static/blog/images/google.a51985becaa6.png</url><title>Google Cloud</title><link>https://cloud.google.com/blog/products/gcp/</link></image><item><title>New year, new skills - How to reach your cloud career destination</title><link>https://cloud.google.com/blog/topics/training-certifications/start-your-cloud-career-in-2023-steps-and-skills-training/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Cloud is a great place to grow your career in 2023. Opportunity abounds, with cloud roles offering strong salaries and, in a constantly evolving field, plenty of scope for growth.&lt;sup&gt;1&lt;/sup&gt; Some positions do not require a technical background, like project managers, product owners and business analysts. For others, like solutions architects, developers and administrators, coding and technical expertise are a must. &lt;/p&gt;&lt;p&gt;Either way, cloud knowledge and experience are required to land that dream job. But where do you start? And how do you keep up with the fast pace of ever-changing cloud technology? Check out the tips below, along with suggested training opportunities to support your growth, including no-cost options!&lt;/p&gt;&lt;h3&gt;Start by looking at your experience&lt;/h3&gt;&lt;p&gt;Your experience can be a great way to get into cloud, even if it seems non-traditional. Think creatively about transferable skills and opportunities. Here are a few scenarios where you might find yourself today:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;You already work in IT, but in legacy systems or the data center. Forrest Brazeal, Head of Content Marketing at Google Cloud, talks about that in detail in &lt;a href="https://www.youtube.com/watch?v=vviS_fHnJu4&amp;amp;list=PLIivdWyY5sqKBEZkq4X5tojtTY3OhZfda&amp;amp;index=3&amp;amp;t=186s" target="_blank"&gt;this video&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Use your sales experience to become a sales engineer, or your communications experience to become a developer advocate. Stephanie Wong, Developer Advocate at Google Cloud, discusses that &lt;a href="https://www.youtube.com/watch?v=5ETgd44DkzM&amp;amp;list=PLIivdWyY5sqKBEZkq4X5tojtTY3OhZfda&amp;amp;index=2" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;You don’t have the college degree listed in the job requirements. I’ve talked about that in a recent video &lt;a href="https://www.youtube.com/watch?v=L4hiEVS9TLk&amp;amp;list=PLIivdWyY5sqKBEZkq4X5tojtTY3OhZfda&amp;amp;index=1" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Your company has a cloud segment, but your focus is in another area. Go talk to people! Seek out colleagues who do what you want to do. 
Get their advice for skilling up.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Define where you need to fill in gaps&lt;/h3&gt;&lt;p&gt;If you are looking at a technical position, you will need to show cloud-applicable experience, so learn about the cloud and build a portfolio of work. Here are a few key skills we recommend everyone have to start&lt;sup&gt;1&lt;/sup&gt;:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Code is non-negotiable&lt;/b&gt;. People who come from software development backgrounds typically find it easier to get into and maneuver through the cloud environment because of their coding experience. Automation, basic data manipulation and scaling are daily requirements. If you don’t have a language you already know, learning Python is a great place to begin.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Understand Linux&lt;/b&gt;. You’ll need to know the Linux filesystem, basic Linux commands and the fundamentals of containerization (see the short sketch after this list).&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Learn core networking concepts&lt;/b&gt; like the IP protocol and the protocols that layer on top of it, DNS, and subnets.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Make sure you understand the cloud itself&lt;/b&gt;, and in particular the specifics of Google Cloud for a role at Google.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Get familiar with open source tooling&lt;/b&gt;. Terraform for automation and Kubernetes for containers are portable between clouds and are worth taking the time to learn.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;
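&lt;p&gt;As a quick taste of those Linux, networking and container fundamentals, here are a few everyday commands worth knowing cold (a minimal sketch; the hosts and images are illustrative):&lt;/p&gt;&lt;pre&gt;# Explore the filesystem and running processes
ls -la /var/log
ps aux | head

# Core networking: DNS lookups and interface/subnet details
dig example.com
ip addr show

# Containerization fundamentals: run a shell in a minimal image
docker run --rm -it alpine sh&lt;/pre&gt;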
&lt;h3&gt;Boost your targeted hands-on skills&lt;/h3&gt;&lt;p&gt;Check out &lt;a href="https://www.cloudskillsboost.google/?utm_source=google&amp;amp;utm_medium=blog&amp;amp;utm_campaign=NewYearNewSkillsJan2023" target="_blank"&gt;Google Cloud Skills Boost&lt;/a&gt; for a comprehensive collection of training to help you upskill into a cloud role, including hands-on labs that give you real-world experience in Google Cloud. New users can start off with a 30-day no-cost trial&lt;sup&gt;2&lt;/sup&gt;. Take a look at these recommendations:&lt;/p&gt;&lt;p&gt;&lt;b&gt;&lt;i&gt;No-cost labs and courses&lt;/i&gt;&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://www.cloudskillsboost.google/focuses/2794?catalog_rank=%7B%22rank%22%3A1%2C%22num_filters%22%3A0%2C%22has_search%22%3Atrue%7D&amp;amp;parent=catalog&amp;amp;search_id=20908111&amp;amp;utm_source=google&amp;amp;utm_medium=blog&amp;amp;utm_campaign=NewYearNewSkillsJan2023" target="_blank"&gt;A Tour of Google Cloud Hands-on Labs&lt;/a&gt; - 45 minutes&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://www.cloudskillsboost.google/focuses/32138?catalog_rank=%7B%22rank%22%3A1%2C%22num_filters%22%3A0%2C%22has_search%22%3Atrue%7D&amp;amp;parent=catalog&amp;amp;search_id=20908074&amp;amp;utm_source=google&amp;amp;utm_medium=blog&amp;amp;utm_campaign=NewYearNewSkillsJan2023" target="_blank"&gt;A Tour of Google Cloud Sustainability&lt;/a&gt; - 60 minutes&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://www.cloudskillsboost.google/focuses/2802?catalog_rank=%7B%22rank%22%3A1%2C%22num_filters%22%3A0%2C%22has_search%22%3Atrue%7D&amp;amp;parent=catalog&amp;amp;search_id=20908150&amp;amp;utm_source=google&amp;amp;utm_medium=blog&amp;amp;utm_campaign=NewYearNewSkillsJan2023" target="_blank"&gt;Introduction to SQL for BigQuery and Cloud SQL&lt;/a&gt; - 60 minutes&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://www.cloudskillsboost.google/course_templates/265?catalog_rank=%7B%22rank%22%3A2%2C%22num_filters%22%3A0%2C%22has_search%22%3Atrue%7D&amp;amp;search_id=20651945&amp;amp;utm_source=google&amp;amp;utm_medium=blog&amp;amp;utm_campaign=NewYearNewSkillsJan2023" target="_blank"&gt;Infrastructure and Application Modernization with Google Cloud&lt;/a&gt; - Introductory course with three modules&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://www.cloudskillsboost.google/catalog?keywords=Preparing+for+your++Journey&amp;amp;locale=&amp;amp;solution%5B%5D=any&amp;amp;role%5B%5D=any&amp;amp;skill-badge%5B%5D=any&amp;amp;format%5B%5D=any&amp;amp;level%5B%5D=any&amp;amp;duration%5B%5D=any&amp;amp;language%5B%5D=any&amp;amp;utm_source=google&amp;amp;utm_medium=blog&amp;amp;utm_campaign=NewYearNewSkillsJan2023" target="_blank"&gt;Preparing for Google Cloud certification&lt;/a&gt; - Courses to help you prepare for Google Cloud certification exams&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Build hands-on projects&lt;/h3&gt;&lt;p&gt;This part is critical for the interview stage. Take the cloud skills you have learned and create something tangible that you can use as a story during an interview. Consider building a project on GitHub so others can see it working live, and document it well. Be sure to include your decision-making process. Here is an example (with a minimal deployment sketch after the list):&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Build an API or a web application&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Develop the code for the application&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Pick the infrastructure to deploy the application in the cloud, choose a storage option, and select a database for it to interact with&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;
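&lt;p&gt;For instance, once your code is ready, one low-friction path is deploying it to Cloud Run straight from source (a minimal sketch; the service name and region are illustrative):&lt;/p&gt;&lt;pre&gt;# Build and deploy a containerized API from the current directory
gcloud run deploy my-portfolio-api --source . \
    --region us-central1 --allow-unauthenticated

# Smoke-test the deployed endpoint
curl "$(gcloud run services describe my-portfolio-api \
    --region us-central1 --format 'value(status.url)')"&lt;/pre&gt;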
&lt;h3&gt;Get valuable cloud knowledge for non-technical roles&lt;/h3&gt;&lt;p&gt;For tech-adjacent roles, like those in business, sales or administration, having a solid knowledge of cloud principles is critical. We recommend completing the Cloud Digital Leader training courses, at no cost. Or go the extra mile and consider taking the Google Cloud Digital Leader Certification exam once you complete the training:&lt;/p&gt;&lt;p&gt;&lt;b&gt;&lt;i&gt;No-cost course&lt;/i&gt;&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://www.cloudskillsboost.google/paths/9?utm_source=google&amp;amp;utm_medium=blog&amp;amp;utm_campaign=NewYearNewSkillsJan2023" target="_blank"&gt;Cloud Digital Leader Learning Path&lt;/a&gt; - understand cloud capabilities, products and services and how they benefit organizations&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;b&gt;&lt;i&gt;$99 registration fee&lt;/i&gt;&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/certification/cloud-digital-leader?utm_source=google&amp;amp;utm_medium=blog&amp;amp;utm_campaign=NewYearNewSkillsJan2023"&gt;Google Cloud Digital Leader Certification&lt;/a&gt; - validate your cloud expertise by earning a certification&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Commit to learning in the New Year&lt;/h3&gt;&lt;p&gt;Another resource is the &lt;a href="https://cloud.google.com/innovators?utm_source=google&amp;amp;utm_medium=blog&amp;amp;utm_campaign=NewYearNewSkillsJan2023#innovatorsplusbenefits"&gt;Google Cloud Innovators Program&lt;/a&gt;, which will help you grow on Google Cloud and connect you with other community members. There is no cost to join, and it gives you access to resources to build your skills and help shape the future of cloud! &lt;a href="https://cloud.google.com/innovators?utm_source=google&amp;amp;utm_medium=blog&amp;amp;utm_campaign=NewYearNewSkillsJan2023#innovatorsplusbenefits"&gt;Join today&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;Start your new year strong, whether you are exploring Google Cloud Data, DevOps or Networking certifications, by completing &lt;a href="https://go.qwiklabs.com/arcade?utm_source=googlecloud&amp;amp;utm_medium=blog&amp;amp;utm_campaign=start2023" target="_blank"&gt;Arcade games&lt;/a&gt; each week. This January, play to win in &lt;a href="https://go.qwiklabs.com/arcade?utm_source=googlecloud&amp;amp;utm_medium=blog&amp;amp;utm_campaign=start2023" target="_blank"&gt;The Arcade&lt;/a&gt; while you learn new skills and earn prizes on Google Cloud Skills Boost. Each week we will feature a new game to help you show and grow your cloud skills, while sampling certification-based learning paths.&lt;/p&gt;&lt;p&gt;Make 2023 the year to build your cloud career and commit to learning all year with our $299 &lt;a href="https://www.cloudskillsboost.google/subscriptions?utm_source=google&amp;amp;utm_medium=blog&amp;amp;utm_campaign=NewYearNewSkillsJan2023" target="_blank"&gt;annual subscription&lt;/a&gt; to Google Cloud Skills Boost. The subscription includes $500 of Google Cloud credits (and a bonus $500 of credits after you successfully certify), a $200 certification voucher, access to the entire training catalog, live-learning events and quarterly technical briefings with executives.&lt;/p&gt;&lt;hr/&gt;&lt;p&gt;&lt;i&gt;&lt;sup&gt;1. &lt;a href="https://www.youtube.com/watch?v=vviS_fHnJu4&amp;amp;list=PLIivdWyY5sqKBEZkq4X5tojtTY3OhZfda&amp;amp;index=3&amp;amp;t=14s" target="_blank"&gt;Starting your career in cloud from IT&lt;/a&gt; - Forrest Brazeal, Head of Content Marketing, Google Cloud&lt;br/&gt;2. 
Credit card required to activate a 30-day no-cost trial for new users.&lt;/sup&gt;&lt;/i&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Thu, 05 Jan 2023 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/topics/training-certifications/start-your-cloud-career-in-2023-steps-and-skills-training/</guid><category>Google Cloud</category><category>Developers &amp; Practitioners</category><category>Training and Certifications</category><media:content url="https://storage.googleapis.com/gweb-cloudblog-publish/images/training_2023.max-600x600.jpg" width="540" height="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>New year, new skills - How to reach your cloud career destination</title><description>Find out how to jump start your dream cloud career, no matter what your background. Advice for technical and non-technical roles!</description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/training_2023.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/training-certifications/start-your-cloud-career-in-2023-steps-and-skills-training/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Priyanka Vergadia</name><title>Lead Developer Advocate, Google</title><department></department><company></company></author></item><item><title>Optimize and scale your startup - A look into the Build Series</title><link>https://cloud.google.com/blog/topics/startups/google-cloud-technical-guides-for-startups-build-series-wrap-up/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;At Google Cloud, we want to provide you with access to all the tools you need to grow your business. Through the Google Cloud Technical Guides for Startups, leverage industry-leading solutions with how-to &lt;a href="https://www.youtube.com/playlist?list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC" target="_blank"&gt;video guides&lt;/a&gt; and &lt;a href="https://cloudonair.withgoogle.com/events/technical-guide-for-startups-series/resources#" target="_blank"&gt;resources&lt;/a&gt; curated for startups. &lt;/p&gt;&lt;p&gt;This multi-part series contains three chapters: Start, Build and Grow, which match your startup’s journey:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;The Start Series: Begin by building, deploying and managing new applications on Google Cloud from start to finish.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The Build Series: Optimize and scale existing deployments to reach your target audiences.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The Grow Series: Grow and attain scale with deployments on Google Cloud.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Additionally, at Google we have the &lt;a href="https://cloud.google.com/startup"&gt;Google for Startups Cloud Program&lt;/a&gt;, which is designed to help your business get off the ground and enable a sustainable growth plan for the future. 
The start of the &lt;a href="https://www.youtube.com/watch?v=xQshNK7V2jQ&amp;amp;list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC&amp;amp;index=13" target="_blank"&gt;Build Series&lt;/a&gt; outlines the benefits of the program, the application process, and more to help your business get started on Google Cloud.&lt;/p&gt;&lt;h3&gt;A quick recap of the Build Series&lt;/h3&gt;&lt;p&gt;Once you have applied for the &lt;a href="https://cloud.google.com/startup"&gt;Google for Startups Cloud Program&lt;/a&gt;, there’s so much to explore and try out on Google Cloud. &lt;/p&gt;&lt;p&gt;Figuring out a rapid but solid application development process can be key for many businesses in reducing time to market. Furthermore, deciding which database to use to handle application data can be tricky. Deep dive into our &lt;a href="https://www.youtube.com/watch?v=8DYWeI4Yc8Q&amp;amp;list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC&amp;amp;index=15" target="_blank"&gt;Firestore&lt;/a&gt; video, which walks through how Firestore can help you unlock application innovation with simplicity and speed.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-video"&gt;&lt;div class="article-module article-video "&gt;&lt;figure&gt;&lt;a class="h-c-video h-c-video--marquee" data-glue-modal-disabled-on-mobile="true" data-glue-modal-trigger="uni-modal-8DYWeI4Yc8Q-" href="https://youtube.com/watch?v=8DYWeI4Yc8Q"&gt;&lt;img alt="Here to bring you the latest news in the startup program by Google Cloud is Mirabel Tukiman!" src="//img.youtube.com/vi/8DYWeI4Yc8Q/maxresdefault.jpg"/&gt;&lt;svg class="h-c-video__play h-c-icon h-c-icon--color-white" role="img"&gt;&lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;&lt;/svg&gt;&lt;/a&gt;&lt;/figure&gt;&lt;/div&gt;&lt;div class="h-c-modal--video" data-glue-modal="uni-modal-8DYWeI4Yc8Q-" data-glue-modal-close-label="Close Dialog"&gt;&lt;a class="glue-yt-video" data-glue-yt-video-autoplay="true" data-glue-yt-video-height="99%" data-glue-yt-video-vid="8DYWeI4Yc8Q" data-glue-yt-video-width="100%" href="https://youtube.com/watch?v=8DYWeI4Yc8Q" ng-cloak=""&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;We then move on to a deep dive into &lt;a href="https://www.youtube.com/watch?v=BH_7_zVk5oM&amp;amp;list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC&amp;amp;index=16" target="_blank"&gt;BigQuery&lt;/a&gt; and how it can help businesses. BigQuery is designed to support analysis over petabytes of data, whether structured or unstructured. This video is the go-to resource for getting started on BigQuery!&lt;/p&gt;&lt;p&gt;If you are looking to run your Spark and Hadoop jobs faster and on the cloud, look to &lt;a href="https://www.youtube.com/watch?v=shzKmZ6Yqtk&amp;amp;list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC&amp;amp;index=17" target="_blank"&gt;Dataproc&lt;/a&gt;. To learn more about Dataproc and how it has helped other customers with their Hadoop clusters, click the video below to learn all things Dataproc-related.&lt;/p&gt;
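&lt;p&gt;If you want to get hands-on alongside the videos, here is a minimal command-line sketch of both tools (the dataset, cluster name and region are illustrative):&lt;/p&gt;&lt;pre&gt;# BigQuery: run a standard-SQL query against a public dataset
bq query --use_legacy_sql=false \
    'SELECT name, SUM(number) AS total
     FROM `bigquery-public-data.usa_names.usa_1910_2013`
     GROUP BY name ORDER BY total DESC LIMIT 5'

# Dataproc: create a small cluster and submit the SparkPi example job
gcloud dataproc clusters create demo-cluster --region=us-central1
gcloud dataproc jobs submit spark --cluster=demo-cluster --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000&lt;/pre&gt;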
src="//img.youtube.com/vi/shzKmZ6Yqtk/maxresdefault.jpg"/&gt;&lt;svg class="h-c-video__play h-c-icon h-c-icon--color-white" role="img"&gt;&lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;&lt;/svg&gt;&lt;/a&gt;&lt;/figure&gt;&lt;/div&gt;&lt;div class="h-c-modal--video" data-glue-modal="uni-modal-shzKmZ6Yqtk-" data-glue-modal-close-label="Close Dialog"&gt;&lt;a class="glue-yt-video" data-glue-yt-video-autoplay="true" data-glue-yt-video-height="99%" data-glue-yt-video-vid="shzKmZ6Yqtk" data-glue-yt-video-width="100%" href="https://youtube.com/watch?v=shzKmZ6Yqtk" ng-cloak=""&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Next, we find out what &lt;a href="https://www.youtube.com/watch?v=dXhF3JJg3mE&amp;amp;list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC&amp;amp;index=18&amp;amp;t=88s" target="_blank"&gt;Dataflow&lt;/a&gt; can bring to your business; some advantages, sample architectures, demos on the console, and how other customers are using Dataflow. &lt;/p&gt;&lt;p&gt;We also talked about Machine Learning, starting from selecting the right ML solution to &lt;a href="https://www.youtube.com/watch?v=pM1M4Y4QZ6k&amp;amp;list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC&amp;amp;index=20&amp;amp;t=12s" target="_blank"&gt;Machine Learning APIs&lt;/a&gt; on cloud to exploring &lt;a href="https://www.youtube.com/watch?v=pZKpAai7stE&amp;amp;list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC&amp;amp;index=21&amp;amp;t=2s" target="_blank"&gt;Vertex AI&lt;/a&gt;. Following that we look into &lt;a href="https://www.youtube.com/watch?v=mwcAjZXZrjg&amp;amp;list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC&amp;amp;index=22" target="_blank"&gt;API management in Google Cloud&lt;/a&gt; and how Apigee helps operate your APIs with enhanced scale, security, and automation.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-video"&gt;&lt;div class="article-module article-video "&gt;&lt;figure&gt;&lt;a class="h-c-video h-c-video--marquee" data-glue-modal-disabled-on-mobile="true" data-glue-modal-trigger="uni-modal-pZKpAai7stE-" href="https://youtube.com/watch?v=pZKpAai7stE"&gt;&lt;img alt="Here to bring you the latest news in the startup program by Google Cloud is Jeevana Hegde and Hussein Giva!" 
src="//img.youtube.com/vi/pZKpAai7stE/maxresdefault.jpg"/&gt;&lt;svg class="h-c-video__play h-c-icon h-c-icon--color-white" role="img"&gt;&lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;&lt;/svg&gt;&lt;/a&gt;&lt;/figure&gt;&lt;/div&gt;&lt;div class="h-c-modal--video" data-glue-modal="uni-modal-pZKpAai7stE-" data-glue-modal-close-label="Close Dialog"&gt;&lt;a class="glue-yt-video" data-glue-yt-video-autoplay="true" data-glue-yt-video-height="99%" data-glue-yt-video-vid="pZKpAai7stE" data-glue-yt-video-width="100%" href="https://youtube.com/watch?v=pZKpAai7stE" ng-cloak=""&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;We ended the series with the last two episodes focusing around &lt;a href="https://www.youtube.com/watch?v=0qyUm6UsMkE&amp;amp;list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC&amp;amp;index=23" target="_blank"&gt;security deep-dive&lt;/a&gt; and using &lt;a href="https://www.youtube.com/watch?v=P9MCC9KmM_8&amp;amp;list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC&amp;amp;index=24" target="_blank"&gt;Cloud Tasks and Cloud Scheduler&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-video"&gt;&lt;div class="article-module article-video "&gt;&lt;figure&gt;&lt;a class="h-c-video h-c-video--marquee" data-glue-modal-disabled-on-mobile="true" data-glue-modal-trigger="uni-modal-0qyUm6UsMkE-" href="https://youtube.com/watch?v=0qyUm6UsMkE"&gt;&lt;img alt="Here to bring you the latest news in the startup program by Google Cloud is Eunsun Cho! Welcome to the second season of the Google Cloud Technical Guides for Startups - the Build Series. Build Series - Episode 11: Getting started with Security on Google Cloud Tune into our new series for a new episode each time and let us know what you think in the comments below!" src="//img.youtube.com/vi/0qyUm6UsMkE/maxresdefault.jpg"/&gt;&lt;svg class="h-c-video__play h-c-icon h-c-icon--color-white" role="img"&gt;&lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;&lt;/svg&gt;&lt;/a&gt;&lt;/figure&gt;&lt;/div&gt;&lt;div class="h-c-modal--video" data-glue-modal="uni-modal-0qyUm6UsMkE-" data-glue-modal-close-label="Close Dialog"&gt;&lt;a class="glue-yt-video" data-glue-yt-video-autoplay="true" data-glue-yt-video-height="99%" data-glue-yt-video-vid="0qyUm6UsMkE" data-glue-yt-video-width="100%" href="https://youtube.com/watch?v=0qyUm6UsMkE" ng-cloak=""&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Coming up next - The Grow Series&lt;/h3&gt;&lt;p&gt;Dive into the next chapter of this multi-series, with our upcoming Grow Series, where we will be focusing on growing and attaining scale with deployments on Google Cloud.&lt;/p&gt;&lt;p&gt;&lt;a href="https://cloudonair.withgoogle.com/events/technical-guide-for-startups-series" target="_blank"&gt;Check out our website&lt;/a&gt; and &lt;a href="https://www.youtube.com/playlist?list=PLIivdWyY5sqJOQJCXW_aYEqwfyi6bu1gC" target="_blank"&gt;join us&lt;/a&gt; by checking out the video series on the &lt;a href="https://www.youtube.com/user/googlecloudplatform" target="_blank"&gt;Google Cloud Tech channel&lt;/a&gt;, and subscribe to stay up to date. 
&lt;/p&gt;&lt;p&gt;See you in the cloud!&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Thu, 22 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/topics/startups/google-cloud-technical-guides-for-startups-build-series-wrap-up/</guid><category>Google Cloud</category><category>Startups</category><media:content url="https://storage.googleapis.com/gweb-cloudblog-publish/images/build_series_122222.max-600x600.jpg" width="540" height="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Optimize and scale your startup - A look into the Build Series</title><description>Announcing the recap of the second series (Build Series) of the Google Cloud Technical Guides for Startups, a video series for technical enablement aimed at helping startups to start, build and grow their businesses.</description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/build_series_122222.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/startups/google-cloud-technical-guides-for-startups-build-series-wrap-up/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Vibha Kurpad</name><title>Associate Customer Engineer</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Aditi Jain</name><title>Customer Engineer</title><department></department><company></company></author></item><item><title>Document AI adds three new capabilities to its OCR engine</title><link>https://cloud.google.com/blog/products/ai-machine-learning/top-reasons-to-use-gcp-document-ai-ocr/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Documents are an indispensable part of our professional and personal lives. They give us crucial insights that help us become more efficient, organize and optimize information, and even stay competitive. But as documents grow more complex, and as the variety of document types continues to expand, it has become increasingly challenging for people and businesses to sift through the ocean of bits and bytes to extract actionable insights. &lt;/p&gt;&lt;p&gt;This is where Google Cloud’s Document AI comes in. It is a unified, AI-powered suite for understanding and organizing documents. &lt;a href="https://cloud.google.com/document-ai"&gt;Document AI&lt;/a&gt; consists of Document AI Workbench (a state-of-the-art custom ML platform), Document AI Warehouse (a managed service with document storage and analytics capabilities), and a rich set of pre-trained document processors. 
Underpinning these services is the ability to extract text accurately from various types of documents with a world-class &lt;a href="https://cloud.google.com/document-ai/docs/processors-list#processor_doc-ocr"&gt;Document Optical Character Recognition (OCR)&lt;/a&gt; engine.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="1 OCR engine 122122.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_OCR_engine_122122.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Google Cloud’s Document AI OCR takes an unstructured document as input and extracts its text and layout (paragraphs, lines, and so on). Covering over 200 languages, Document AI OCR is powered by state-of-the-art machine learning models developed by the Google Cloud and Google Research teams. &lt;/p&gt;&lt;p&gt;Today, we are pleased to announce three new OCR features in Public Preview that can further enhance your document processing workflows. &lt;/p&gt;&lt;h3&gt;1. Assess page-level quality of documents with Intelligent Document Quality (IDQ)&lt;/h3&gt;&lt;p&gt;With Document AI OCR, Google Cloud customers and partners can programmatically extract key document characteristics – word frequency distributions, relative positioning of line items, dominant language of the input document, etc. – as critical inputs to their downstream business logic. Today, we are adding another important document assessment signal to this toolbox: Intelligent Document Quality (IDQ) scores. &lt;/p&gt;&lt;p&gt;IDQ provides page-level quality metrics along the following eight dimensions:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Blurriness&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Level of optical noise&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Darkness&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Faintness&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Presence of smaller-than-usual fonts&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cut-off documents&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cut-off text spans&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Glare due to lighting conditions&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Being able to discern the optical quality of documents helps you decide which documents should be processed differently, making the overall document processing pipeline more efficient. For example, Gary Lewis, Managing Director of lending and deposit solutions at Jack Henry, noted, “Google’s Document AI technology, enriched with Intelligent Document Quality (IDQ) signals, will help businesses to automate the data capture of invoices and payments when sending to our factoring customers for purchasing. This creates internal efficiencies, reduces risk for the factor/lender, and gets financing into the hands of cash-constrained businesses quickly.”&lt;/p&gt;&lt;p&gt;Overall, document quality metrics pave the way for more intelligent routing of documents for downstream analytics. 
The reference workflow below uses document quality scores to split and classify documents before sending them to either the pre-built &lt;a href="https://cloud.google.com/document-ai/docs/processors-list#processor_form-parser"&gt;Form Parser&lt;/a&gt; (in the case of high document quality) or a &lt;a href="https://cloud.google.com/document-ai/docs/workbench/build-custom-processor"&gt;Custom Document Extractor&lt;/a&gt; trained specifically on lower-quality datasets.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="2 OCR engine 122122.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_OCR_engine_122122.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;2. Process digital PDF documents with confidence with built-in digital PDF support&lt;/h3&gt;&lt;p&gt;The PDF format is popular in various business applications such as procurement (invoices, purchase orders), lending (W-2 forms, paystubs), and contracts (leasing or mortgage agreements). PDF documents can be image-based (e.g., a scanned driver’s license) or digital, where you can hover over, highlight, and copy/paste embedded text the same way you interact with a text document such as a Google Doc or Microsoft Word file. &lt;/p&gt;&lt;p&gt;We are happy to announce digital PDF support in Document AI OCR. The digital PDF feature extracts text and symbols exactly as they appear in the source documents, making our OCR engine highly performant in complex visual scenarios such as rotated text, extreme font sizes and styles, or partially hidden text.&lt;/p&gt;&lt;p&gt;Discussing the importance and prevalence of PDF documents in banking and finance (e.g., bank statements, mortgage agreements), Ritesh Biswas, Director, Google Cloud Practice at PwC, said, “The Document AI OCR solution from Google Cloud, especially its support for digital PDF input formats, has enabled PwC to bring digital transformation to the global financial services industry.”&lt;/p&gt;&lt;h3&gt;3. “Freeze” model characteristics with OCR versioning&lt;/h3&gt;&lt;p&gt;As a fully managed cloud-based service, Document AI OCR regularly upgrades the underlying AI/ML models to maintain its world-class accuracy across over 200 languages and scripts. These model upgrades, while providing new features and enhancements, may occasionally lead to changes in OCR behavior compared to an earlier version. &lt;/p&gt;&lt;p&gt;Today, we are launching OCR versioning, which enables users to pin to a historical OCR model behavior. The “frozen” model versions, in turn, give our customers and partners peace of mind by ensuring consistent OCR behavior. For industries with rigorous compliance requirements, this update also helps maintain the same model version, minimizing the need and effort to recertify stacks between releases. According to Jagadheeswaran Kathirvel, Senior Principal Architect at Mr. Cooper, “Having consistent OCR behavior is mission-critical to our business workflows. We value Google Cloud’s OCR versioning capability that enables our products to pin to a specific OCR version for an extended period of time.”&lt;/p&gt;&lt;p&gt;With OCR versioning, you have the full flexibility to select the versioning option that best fits your business needs.&lt;/p&gt;
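&lt;p&gt;To make this concrete, here is a minimal sketch of calling the OCR processor over REST, pinning the request to a specific processor version in the request path (the project, processor and version IDs are placeholders; see the documentation for the exact versioning options available):&lt;/p&gt;&lt;pre&gt;# Process a local PDF with a pinned processor version
# (base64 -w0 is the GNU flag; use base64 -i on macOS)
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://documentai.googleapis.com/v1/projects/PROJECT_ID/locations/us/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION_ID:process" \
    -d '{"rawDocument": {"content": "'"$(base64 -w0 sample.pdf)"'", "mimeType": "application/pdf"}}'&lt;/pre&gt;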
&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="3 OCR engine 122122.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_OCR_engine_122122.1000064720000470.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Getting started on Document AI OCR&lt;/h3&gt;&lt;p&gt;Learn more about the new OCR features and find tutorials in the &lt;a href="https://cloud.google.com/document-ai/docs/processors-list#processor_doc-ocr"&gt;Document AI documentation&lt;/a&gt;, or &lt;a href="https://cloud.google.com/document-ai/docs/drag-and-drop"&gt;try it&lt;/a&gt; directly in your browser (no coding required). For more details on what’s new with Document AI, don’t forget to check out our &lt;a href="https://youtu.be/DxFeQok9pus" target="_blank"&gt;breakout session&lt;/a&gt; from Google Cloud Next 2022.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Wed, 21 Dec 2022 19:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/ai-machine-learning/top-reasons-to-use-gcp-document-ai-ocr/</guid><category>Google Cloud</category><category>AI &amp; Machine Learning</category><media:content url="https://storage.googleapis.com/gweb-cloudblog-publish/images/aiml2022_PO1vxqJ.max-600x600.jpg" width="540" height="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Document AI adds three new capabilities to its OCR engine</title><description>Announcing three new features for Document AI OCR, including intelligent document quality metrics, digital PDF support, and OCR model versioning.</description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/aiml2022_PO1vxqJ.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/ai-machine-learning/top-reasons-to-use-gcp-document-ai-ocr/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Steve Z.</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Devaki Kulkarni</name><title>Product Manager</title><department></department><company></company></author></item><item><title>New control plane connectivity and isolation options for your GKE clusters</title><link>https://cloud.google.com/blog/products/containers-kubernetes/understanding-gkes-new-control-plane-connectivity/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Once upon a time, all Google Kubernetes Engine (GKE) clusters used public IP addressing for communication between nodes and the control plane. Subsequently, we heard your security concerns and introduced private clusters enabled by VPC peering. 
&lt;/p&gt;&lt;p&gt;To consolidate these connectivity types, starting in March 2022 we began using Google Cloud’s &lt;a href="https://cloud.google.com/vpc/docs/private-service-connect"&gt;Private Service Connect (PSC)&lt;/a&gt; for communication between the GKE cluster control plane and nodes in new public clusters. This has profound implications for how you can configure your GKE environment. Today, we’re presenting a new, consistent PSC-based framework for GKE control plane connectivity from cluster nodes. Additionally, we’re excited to announce a new feature set which includes cluster isolation at the control plane and node pool levels to enable more scalable, secure — and cheaper! — GKE clusters. &lt;/p&gt;&lt;h3&gt;New architecture&lt;/h3&gt;&lt;p&gt;Starting with GKE version 1.23, all new public clusters created on or after March 15, 2022 use Google Cloud’s PSC infrastructure to communicate between the GKE cluster control plane and nodes. PSC provides a consistent framework that helps connect different networks through a service networking approach, and allows service producers and consumers to communicate using private IP addresses internal to a VPC. &lt;/p&gt;&lt;p&gt;The biggest benefit of this change is to set the stage for using PSC-enabled features for GKE clusters.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="1 control plane connectivity 122122.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_control_plane_connectiv.1000065120000983.max-1000x1000.jpg"/&gt;&lt;figcaption class="article-image__caption "&gt;&lt;div class="rich-text"&gt;&lt;i&gt;Figure 1: Simplified diagram of PSC-based architecture for GKE clusters&lt;/i&gt;&lt;/div&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;The new set of cluster isolation capabilities we’re presenting here is part of the evolution to a more scalable and secure GKE cluster posture. Previously, private GKE clusters were enabled with VPC peering, which imposed specific network architectures. With this feature set, you now have the ability to:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Update the GKE cluster control plane to only allow access to a private endpoint&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Create or update a GKE cluster node pool with public or private nodes&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Enable or disable GKE cluster control plane access from Google-owned IPs&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;In addition, the new PSC infrastructure can provide cost savings. Traditionally, control plane communication is treated as normal egress for public clusters and billed as a normal public IP charge. This is also true if you’re running &lt;code&gt;kubectl&lt;/code&gt; for provisioning or other operational reasons. 
With PSC infrastructure, we have eliminated the cost of communication between the control plane and your cluster nodes, resulting in one less network egress charge to worry about.&lt;/p&gt;&lt;p&gt;Now, let’s take a look at how this feature set enables these new capabilities.&lt;/p&gt;&lt;h3&gt;Allow access to the control plane only via a private endpoint&lt;/h3&gt;&lt;p&gt;Private cluster users have long had the ability to create the control plane with both public and private endpoints. We now extend the same flexibility to public GKE clusters based on PSC. With this, if you want private-only access to your GKE control plane but want all your node pools to be public, you can do so. &lt;/p&gt;&lt;p&gt;This model provides a tighter security posture for the control plane, while leaving you free to choose the kind of cluster nodes you need, based on your deployment. &lt;/p&gt;&lt;p&gt;To enable access only to a private endpoint on the control plane, use the following gcloud command:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;gcloud container clusters update CLUSTER_NAME \
    --enable-private-endpoint&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="2 control plane connectivity 122122.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_control_plane_connectivity_122122.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Allow toggling and mixed-mode clusters with public and private node pools&lt;/h3&gt;&lt;p&gt;All cloud providers with managed Kubernetes offerings support both public and private clusters. Whether a cluster is public or private is enforced at the cluster level, and cannot be changed once the cluster is created. Now you have the ability to toggle a node pool to private or public IP addressing. &lt;/p&gt;&lt;p&gt;You may also want a mix of private and public node pools. For example, you may be running a mix of workloads in your cluster in which some require internet access and some don’t. Instead of setting up NAT rules, you can deploy a workload on a node pool with public IP addressing to ensure that only those node pool deployments are publicly accessible. 
&lt;/p&gt;&lt;p&gt;To enable private-only IP addressing on existing node pools, use the following gcloud command:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;gcloud container node-pools update POOL_NAME \
    --cluster CLUSTER_NAME \
    --enable-private-nodes&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;To enable private-only IP addressing at node pool creation time, use the following gcloud command:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;gcloud container node-pools create POOL_NAME \
    --cluster CLUSTER_NAME \
    --enable-private-nodes&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Configure access from Google Cloud&lt;/h3&gt;&lt;p&gt;In some scenarios, users have found that workloads outside of their GKE cluster (for example, applications running on Cloud Run, or GCP VMs with Google Cloud public IPs) were allowed to reach the cluster control plane. To mitigate potential security concerns, we have introduced a feature that allows you to toggle access to your cluster control plane from such sources. &lt;/p&gt;&lt;p&gt;To remove access to the control plane from Google Cloud public IPs, use the following gcloud command:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;gcloud container clusters update CLUSTER_NAME \
    --no-enable-google-cloud-access&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Similarly, you can use this flag at cluster creation time.&lt;/p&gt;&lt;h3&gt;Choose your private endpoint address&lt;/h3&gt;&lt;p&gt;Many customers like to map IPs to a stack for easier troubleshooting and to track usage: for example, IP block x for infrastructure, IP block y for services, IP block z for the GKE control plane, and so on. By default, the private IP address for the control plane in PSC-based GKE clusters comes from the node subnet. However, some customers treat node subnets as infrastructure and apply security policies against them. To differentiate between infrastructure and the GKE control plane, you can now create a new custom subnet and assign it to your cluster control plane:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;gcloud container clusters create CLUSTER_NAME \
    --private-endpoint-subnetwork=SUBNET_NAME&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;What can you do with this new GKE architecture?&lt;/h3&gt;&lt;p&gt;With this new set of features, you can basically remove all public IP communication for your GKE clusters! This, in essence, means you can make your GKE clusters completely private. 
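&lt;/p&gt;&lt;p&gt;Putting the pieces together, here is a minimal lockdown sketch using the flags described above (the cluster and node pool names are illustrative, and depending on your setup you may also need authorized networks configured; see the docs):&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;# Control plane: private endpoint only, no access from Google Cloud public IPs
gcloud container clusters update my-cluster \
    --enable-private-endpoint \
    --no-enable-google-cloud-access

# Nodes: flip an existing pool to private IP addressing
gcloud container node-pools update default-pool \
    --cluster my-cluster \
    --enable-private-nodes&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Note that the private endpoint and Google Cloud access settings apply at the cluster level, while private addressing is toggled per node pool.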
&lt;/p&gt;&lt;p&gt;You currently need to create the cluster as public to ensure that it uses PSC, but you can then update your cluster using gcloud with the &lt;code&gt;--enable-private-endpoint&lt;/code&gt; flag, or the UI, to configure access via only a private endpoint on the control plane, or create new private node pools. &lt;/p&gt;&lt;p&gt;Alternatively, you can control access at cluster creation time with the &lt;code&gt;--master-authorized-networks&lt;/code&gt; and &lt;code&gt;--no-enable-google-cloud-access&lt;/code&gt; flags to prevent access to the control plane from public addresses.&lt;/p&gt;&lt;p&gt;Furthermore, you can use the REST API or Terraform providers to build a new PSC-based GKE cluster whose default (thus first) node pool has private nodes. This can be done by setting the &lt;a href="https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1/projects.locations.clusters.nodePools#nodenetworkconfig"&gt;&lt;code&gt;enablePrivateNodes&lt;/code&gt;&lt;/a&gt; field to true (instead of leveraging the public GKE cluster defaults and then updating afterwards, as currently required with gcloud and UI operations). &lt;/p&gt;
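&lt;p&gt;For example, here is a minimal REST sketch of creating a cluster whose default node pool starts out private (the project, location and names are placeholders; the request shape follows the clusters.create and NodeNetworkConfig references linked below):&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;# Create a PSC-based cluster whose default node pool uses private nodes
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://container.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/clusters" \
    -d '{
      "cluster": {
        "name": "psc-private-cluster",
        "nodePools": [{
          "name": "default-pool",
          "initialNodeCount": 1,
          "networkConfig": {"enablePrivateNodes": true}
        }]
      }
    }'&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;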
href="https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--private-endpoint-subnetwork"&gt;gcloud reference to create a cluster with a custom private subnet&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/hashicorp/terraform-provider-google/releases/tag/v4.45.0" target="_blank"&gt;Terraform Providers Google: release v4.45.0 page&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/vpc/docs/private-service-connect"&gt;Google Cloud Private Services Connect page&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Here are the more specific features in the latest Terraform Provider, handy to integrate into your automation pipeline:&lt;/p&gt;&lt;p&gt;&lt;a href="https://github.com/hashicorp/terraform-provider-google/releases/tag/v4.45.0" target="_blank"&gt;Terraform Providers Google: release v4.45.0&lt;/a&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#gcp_public_cidrs_access_enabled" target="_blank"&gt;gcp_public_cidrs_access_enabled&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#enable_private_endpoint" target="_blank"&gt;enable_private_endpoint&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#private_endpoint_subnetwork" target="_blank"&gt;private_endpoint_subnetwork&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool#enable_private_nodes" target="_blank"&gt;enable_private_nodes&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Wed, 21 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/containers-kubernetes/understanding-gkes-new-control-plane-connectivity/</guid><category>Networking</category><category>Application Modernization</category><category>Google Cloud</category><category>Containers &amp; Kubernetes</category><media:content url="https://storage.googleapis.com/gweb-cloudblog-publish/images/containers_2022.max-600x600.jpg" width="540" height="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>New control plane connectivity and isolation options for your GKE clusters</title><description>New GKE networking options enable cluster isolation for the control plane and node pools, for more scalable, secure, and cost-effective GKE clusters.</description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/containers_2022.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/containers-kubernetes/understanding-gkes-new-control-plane-connectivity/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Cynthia Thomas</name><title>Product Manager, GKE</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Dmitry Berkovich</name><title>Staff Software Engineer, GKE</title><department></department><company></company></author></item><item><title>Google Cloud wrapped: Top 22 news stories of 2022, according to 
you</title><link>https://cloud.google.com/blog/products/gcp/top-google-cloud-stories-of-2022/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;What a year! Over here at Google Cloud, we’re winding things down, but not before taking some time to reflect on everything that happened over the past twelve months. &lt;/p&gt;&lt;p&gt;Inspired by the custom Spotify Wrapped playlist playing in our earbuds, we pulled the data about the best-read Google Cloud news posts of the year, to better understand which stories resonated most with you. &lt;/p&gt;&lt;p&gt;Many of your favorite stories came as no surprise, as they tracked with major news, product launches, and events. But there were some sleeper hits in there too — stories whose viral success and staying power took us a bit by surprise. We also uncovered some fascinating data about the older posts that you keep coming back to, month after month, year after year (stay tuned for more on that in 2023). So, without further ado, here are the top 22 Google Cloud news stories of 2022, according to you, our readers.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke"&gt;Here's what to know about changes to kubectl authentication coming in GKE v1.26&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/how-google-cloud-blocked-largest-layer-7-ddos-attack-at-46-million-rps"&gt;How Google Cloud blocked the largest Layer 7 DDoS attack at 46 million rps&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/virtual-machine-threat-detection-in-security-command-center"&gt;Protecting customers against cryptomining threats with VM Threat Detection in Security Command Center&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/data-analytics/looker-next-evolution-business-intelligence-data-studio"&gt;Introducing the next evolution of Looker, your unified business intelligence platform&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="http://cloud.google.com/blog/products/compute/calculating-100-trillion-digits-of-pi-on-google-cloud"&gt;Even more pi in the sky: Calculating 100 trillion digits of pi on Google Cloud&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/databases/introducing-alloydb-for-postgresql"&gt;Introducing AlloyDB for PostgreSQL: Free yourself from expensive, legacy databases&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/infrastructure-modernization/introducing-blockchain-node-engine"&gt;Introducing Blockchain Node Engine: fully managed node-hosting for Web3 development&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/topics/public-sector/announcing-google-public-sector"&gt;Introducing Google Public Sector&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/google-completes-acquisition-of-mandiant"&gt;Google + Mandiant: Transforming Security Operations and Incident Response&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/raising-the-bar-in-security-operations"&gt;Raising the bar in Security Operations: Google 
Acquires Siemplify&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/serverless/loreal-combines-google-cloud-serverless-and-data-offerings"&gt;The L’Oréal Beauty Tech Data Platform - A data story of terabytes and serverless&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/data-analytics/build-a-data-mesh-on-google-cloud-with-dataplex-now-generally-available"&gt;Build a data mesh on Google Cloud with Dataplex, now generally available&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/topics/financial-services/google-cloud-launches-dedicated-digital-asset-team"&gt;Google Cloud launches new dedicated Digital Assets Team&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/google-announces-new-cloud-contact-center-ai-platform"&gt;Contact Center AI reimagines the customer experience through full end-to-end platform&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/topics/partners/google-cloud-announces-2021-partner-of-the-year-awards"&gt;Unveiling the 2021 Google Cloud Partner of the Year Award Winners&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="http://cloud.google.com/blog/products/identity-security/automate-public-certificate-lifecycle-management-via--acme-client-api"&gt;Automate Public Certificates Lifecycle Management via RFC 8555 (ACME)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage"&gt;AlloyDB for PostgreSQL under the hood: Intelligent, database-aware storage&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/data-analytics/looker-and-data-studio-integrate-for-best-of-both-worlds"&gt;Bringing together the best of both sides of BI with Looker and Data Studio&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/serverless/introducing-the-next-generation-of-cloud-functions"&gt;Supercharge your event-driven architecture with new Cloud Functions (2nd gen)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/devops-sre/dora-2022-accelerate-state-of-devops-report-now-out"&gt;Announcing the 2022 Accelerate State of DevOps Report: A deep dive into security&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/making-cobalt-strike-harder-for-threat-actors-to-abuse"&gt;Making Cobalt Strike harder for threat actors to abuse&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/why-google-now-uses-post-quantum-cryptography-for-internal-comms"&gt;Securing tomorrow today: Why Google now protects its internal communications from quantum threats&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Recognize any of your favorites? We thought you might. See anything you missed? 
Now’s your chance to catch up.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-aside"&gt;&lt;p&gt;&lt;b&gt;A transformative top 10&lt;/b&gt;: &lt;a href="https://cloud.google.com/blog/transform/top-10-digital-transformation-cloud-stories-trends-2022"&gt;Read the top 10&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Let’s take a deeper look at these top posts as they landed throughout the year. &lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;January&lt;/h3&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/raising-the-bar-in-security-operations"&gt;&lt;b&gt;Raising the bar in Security Operations: Google Acquires Siemplify&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#10)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;We set off some new year’s fireworks by acquiring security operations specialist Siemplify, combining their proven security orchestration, automation and response technology with our &lt;a href="https://chronicle.security/" target="_blank"&gt;Chronicle security analytics&lt;/a&gt; to build a next-generation security operations workflow.&lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/topics/financial-services/google-cloud-launches-dedicated-digital-asset-team"&gt;&lt;b&gt;Google Cloud launches new dedicated Digital Assets Team&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#13)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;News flash: blockchain technology has huge potential. So it was no big surprise that readers responded with gusto to the news of Google Cloud’s new Digital Assets Team, whose charter is to support customers’ needs in building, transacting, storing value, and deploying new products on blockchain-based platforms.&lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;February&lt;/h3&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/virtual-machine-threat-detection-in-security-command-center"&gt;&lt;b&gt;Protecting customers against cryptomining threats with VM Threat Detection in Security Command Center&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#3)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Who wants their VMs to be hijacked by hackers mining crypto? No one. To help, we added a new layer of threat detection to our &lt;a href="https://cloud.google.com/security-command-center"&gt;Security Command Center&lt;/a&gt; that can help detect threats such as cryptomining malware inside virtual machines running on Google Cloud. &lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke"&gt;&lt;b&gt;Here's what to know about changes to kubectl authentication coming in GKE v1.26&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#1)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;The open-source Kubernetes community made a big move when it decided to require that all provider-specific code that currently exists in the OSS code base be removed (starting with v1.26). We responded with a blockbuster post (the #1 post of the year, in terms of readership) that outlines how this move impacts the client side. 
&lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/serverless/introducing-the-next-generation-of-cloud-functions"&gt;&lt;b&gt;Supercharge your event-driven architecture with new Cloud Functions (2nd gen)&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#19)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Developers eyeing serverless platforms responded with enthusiasm to news of our next-generation Functions-as-a-Service product, which offers more powerful infrastructure, advanced control over performance and scalability, more control around the functions runtime, and support for triggers from over 90 event sources. &lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/data-analytics/build-a-data-mesh-on-google-cloud-with-dataplex-now-generally-available"&gt;&lt;b&gt;Build a data mesh on Google Cloud with Dataplex, now generally available&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#12)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Building a data mesh is hard to do. But doing so lets data teams centrally manage, monitor, and govern their data across all manner of data lakes, data warehouses, and data marts, so they can make the data available to various analytics and data science tools. With Dataplex, data teams got a new way to do just that.&lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;March&lt;/h3&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/serverless/loreal-combines-google-cloud-serverless-and-data-offerings"&gt;&lt;b&gt;The L’Oréal Beauty Tech Data Platform - A data story of terabytes and serverless&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#11)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Serverless, event-driven architecture, cross-cloud analytics… This customer story from L’Oréal about how it built its Beauty Tech Data Platform had it all. &lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/google-announces-new-cloud-contact-center-ai-platform"&gt;&lt;b&gt;Contact Center AI reimagines the customer experience through full end-to-end platform&lt;/b&gt;&lt;/a&gt;&lt;b&gt;(#14)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Customers rely on contact centers for help when they encounter urgent problems with a product or service, but contact centers often struggle to provide timely help. To bridge this gap with the power of AI, Google Cloud built Contact Center AI (CCAI) to streamline and shorten this time to value. CCAI Platform, the addition announced here, expanded this effort by introducing end-to-end call center capabilities.&lt;/p&gt;&lt;p&gt;&lt;a href="http://cloud.google.com/blog/products/identity-security/automate-public-certificate-lifecycle-management-via--acme-client-api"&gt;&lt;b&gt;Automate Public Certificates Lifecycle Management via RFC 8555 (ACME)&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#16)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;With this announcement, Google Cloud customers were able to acquire public certificates for their workloads that terminate TLS directly or for their cross-cloud and on-premises workloads using the Automatic Certificate Management Environment (&lt;a href="https://datatracker.ietf.org/doc/html/rfc8555" target="_blank"&gt;ACME&lt;/a&gt;) protocol. 
This is the same standard used by Certificate Authorities to enable automatic lifecycle management of TLS certificates.&lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;April&lt;/h3&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/data-analytics/looker-and-data-studio-integrate-for-best-of-both-worlds"&gt;&lt;b&gt;Bringing together the best of both sides of BI with Looker and Data Studio&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#18)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;When Google Cloud acquired Looker in 2020 for its business intelligence and analytics platform, inquiring minds instantly began asking what would become of Data Studio, Google’s existing self-serve BI solution. This blog began to answer that question.&lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;May &lt;/h3&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/databases/introducing-alloydb-for-postgresql"&gt;&lt;b&gt;Introducing AlloyDB for PostgreSQL: Free yourself from expensive, legacy databases&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#6)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Live from Shoreline at &lt;a href="https://io.google/" target="_blank"&gt;Google I/O&lt;/a&gt;, we made one of our largest product announcements of the year, launching a PostgreSQL database that can handle both transactional and analytical workloads, without sacrificing performance.&lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage"&gt;&lt;b&gt;AlloyDB for PostgreSQL under the hood: Intelligent, database-aware storage&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#17)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Readers couldn’t get enough about AlloyDB, piling on to learn about the inner workings of its database-aware storage (not to mention its &lt;a href="https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-columnar-engine"&gt;columnar engine&lt;/a&gt;). &lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;June/July &lt;/h3&gt;&lt;p&gt;&lt;a href="http://cloud.google.com/blog/products/compute/calculating-100-trillion-digits-of-pi-on-google-cloud"&gt;&lt;b&gt;Even more pi in the sky: Calculating 100 trillion digits of pi on Google Cloud&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#5)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;A follow up to a &lt;a href="https://cloud.google.com/blog/products/compute/calculating-31-4-trillion-digits-of-archimedes-constant-on-google-cloud"&gt;reader favorite&lt;/a&gt; from 2019, we broke the record (again) by calculating the most digits of pi, leaning into significant advancements in Google Cloud compute, networking and storage. &lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/topics/partners/google-cloud-announces-2021-partner-of-the-year-awards"&gt;&lt;b&gt;Unveiling the 2021 Google Cloud Partner of the Year Award Winners&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#15)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Who consistently demonstrates a creative spirit, collaborative drive, and a customer-first approach? Google Cloud partners, of course! With this blog, we were proud to recognize you and to call you our partners!&lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/topics/public-sector/announcing-google-public-sector"&gt;&lt;b&gt;Introducing Google Public Sector&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#8)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;The U.S. government had been asking for more choice in cloud vendors who could support its missions, and protect the health, safety, and security of its citizens. With the announcement of Google Public Sector, a subsidiary of Google LLC that will bring Google Cloud and Google Workspace technologies to U.S. 
public sector customers, we delivered.&lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;August&lt;/h3&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/how-google-cloud-blocked-largest-layer-7-ddos-attack-at-46-million-rps"&gt;&lt;b&gt;How Google Cloud blocked the largest Layer 7 DDoS attack at 46 million rps&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#2)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Distributed denial-of-service (DDoS) attacks have been increasing in frequency and growing in size exponentially. In this post, we described how &lt;a href="https://cloud.google.com/armor"&gt;Cloud Armor&lt;/a&gt; protected one Google Cloud customer from the largest DDoS attack ever recorded — an attack so large that it was like receiving all of the requests that Wikipedia receives in a day in just 10 seconds. &lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;September&lt;/h3&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/google-completes-acquisition-of-mandiant"&gt;&lt;b&gt;Google + Mandiant: Transforming Security Operations and Incident Response&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#9)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Here, we took a moment to reflect on the completion of our acquisition of threat intelligence firm Mandiant. Bringing Mandiant into the Google Cloud fold will allow us to deliver a security operations suite to help enterprises globally stay protected at every stage of the security lifecycle, and focus on eliminating entire classes of threats. &lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/devops-sre/dora-2022-accelerate-state-of-devops-report-now-out"&gt;&lt;b&gt;Announcing the 2022 Accelerate State of DevOps Report: A deep dive into security&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#20)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;For eight years now, DevOps professionals have pored over the results of DORA’s annual Accelerate State of DevOps Report. This year’s installment focused on the relationship between security and DevOps, using the &lt;a href="https://slsa.dev/" target="_blank"&gt;Supply-chain Levels for Software Artifacts (SLSA)&lt;/a&gt; and NIST &lt;a href="https://csrc.nist.gov/publications/detail/sp/800-218/final" target="_blank"&gt;Secure Software Development&lt;/a&gt; frameworks. &lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;October&lt;/h3&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/data-analytics/looker-next-evolution-business-intelligence-data-studio"&gt;&lt;b&gt;Introducing the next evolution of Looker, your unified business intelligence platform&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#4)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;In April, we began to lay out our strategy for Looker and Data Studio. At &lt;a href="https://cloud.withgoogle.com/next" target="_blank"&gt;Google Cloud Next ‘22&lt;/a&gt;, we took the next step, consolidating the two under the Looker brand umbrella, and adding important new capabilities. &lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/infrastructure-modernization/introducing-blockchain-node-engine"&gt;&lt;b&gt;Introducing Blockchain Node Engine: fully managed node-hosting for Web3 development&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#7)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Remember how in January we said that blockchain has a lot of potential? About that. News of the fully managed Blockchain Node Engine node-hosting service took readers by storm, catapulting it to the top ten of 2022, with just over two months left in the year. 
&lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;November/December&lt;/h3&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/making-cobalt-strike-harder-for-threat-actors-to-abuse"&gt;&lt;b&gt;Making Cobalt Strike harder for threat actors to abuse&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#21)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Legitimate versions of Cobalt Strike are a very popular red team software tool, but older, cracked versions are often used by malicious hackers to spread malware. We made available to the security community a set of open-source YARA Rules that can be deployed to help stop the illicit use of Cobalt Strike. &lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/why-google-now-uses-post-quantum-cryptography-for-internal-comms"&gt;&lt;b&gt;Securing tomorrow today: Why Google now protects its internal communications from quantum threats&lt;/b&gt;&lt;/a&gt; &lt;b&gt;(#22)&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Google and Google Cloud have taken steps to harden our cryptographic algorithms used to protect internal communications against quantum computing threats. We explain here why we did it, and what challenges we face to achieve this type of future-proofing.&lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;That’s a wrap!&lt;/h3&gt;&lt;p&gt;Barring any last minute surprises, we’re pretty confident that what we have here is the definitive list of your favorite news stories of 2022 — you’ve got great taste. We can’t wait to see what stories inspire you in the new year. Happy holidays, and thanks for reading!&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Tue, 20 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/gcp/top-google-cloud-stories-of-2022/</guid><category>Google Cloud</category><media:content url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Google_Cloud_wrapped_122022.max-600x600.jpg" width="540" height="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google Cloud wrapped: Top 22 news stories of 2022, according to you</title><description>We ran the numbers to find this year’s top Google Cloud news stories, by readership.</description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Google_Cloud_wrapped_122022.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/gcp/top-google-cloud-stories-of-2022/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Google Cloud Content &amp; Editorial </name><title></title><department></department><company></company></author></item><item><title>What’s new with Google Cloud</title><link>https://cloud.google.com/blog/topics/inside-google-cloud/whats-new-google-cloud/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Want to know the latest from Google Cloud? Find it here in one handy location. Check back regularly for our newest updates, announcements, resources, events, learning opportunities, and more. &lt;br/&gt;&lt;/p&gt;&lt;hr/&gt;&lt;p&gt;&lt;b&gt;Tip&lt;/b&gt;: Not sure where to find what you’re looking for on the Google Cloud blog? 
Start here: &lt;a href="https://cloud.google.com/blog/topics/inside-google-cloud/complete-list-google-cloud-blog-links-2021"&gt;Google Cloud blog 101: Full list of topics, links, and resources&lt;/a&gt;.&lt;/p&gt;&lt;hr/&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Week of Dec 19 - Dec 23, 2022&lt;/h3&gt;&lt;p&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://cloud.google.com/eventarc/docs"&gt;&lt;b&gt;Eventarc&lt;/b&gt;&lt;/a&gt; adds support for &lt;a href="https://cloud.google.com/eventarc/docs/reference/supported-events#directly-from-a-google-cloud-source"&gt;85+ new direct events&lt;/a&gt;  from the following services: API Gateway, Apigee Registry, BeyondCorp, Certificate Manager, Cloud Data Fusion, Cloud Functions, Cloud Memorystore for Memcached, Database Migration, Datastream, Eventarc, and Workflows. Direct events provide strongly typed events with lower latency. This launch brings the total event sources supported by Eventarc to &lt;a href="https://cloud.google.com/eventarc/docs/reference/supported-events"&gt;&lt;b&gt;150+ Google and third-party services with 7000+ direct and Cloud audit log based events.&lt;/b&gt;&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Week of Dec 12 - Dec 16, 2022&lt;/h3&gt;&lt;p&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Storage Transfer Service now offers &lt;a href="https://cloud.google.com/products/#product-launch-stages"&gt;Preview support&lt;/a&gt; for event-driven transfers - serverless, real-time replication from AWS S3 to Cloud Storage, and between Cloud Storage buckets. With this new capability, you can accelerate your event-driven analytics pipeline, enable automatic replication across Cloud Storage buckets, create a backup copy of data in a different region or project, or perform live migration. Read more &lt;a href="https://cloud.google.com/storage-transfer/docs/event-driven-transfers"&gt;here&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;Learn about Memorystore for Redis best practices to achieve the optimal performance and availability with your implementation. Prescriptive guidance around monitoring your Memorystore instance is also provided. Read more about these topics &lt;a href="https://cloud.google.com/blog/products/databases/best-pactices-for-cloud-memorystore-for-redis"&gt;here&lt;/a&gt;.&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;ul&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Week of Dec 5 - Dec 9, 2022&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;A Google Cloud first-party supported open-source &lt;a href="https://github.com/googleapis/java-pubsub-group-kafka-connector" target="_blank"&gt;Kafka Connector for Pub/Sub and Pub/Sub Lite&lt;/a&gt; is now generally available. See how it enables an easy drop-in solution for moving data between Kafka clusters and Pub/Sub and Pub/Sub Lite &lt;a href="https://cloud.google.com/blog/products/data-analytics/pubsub-group-kafka-connector-is-now-ga"&gt;here&lt;/a&gt;.&lt;br/&gt;&lt;/li&gt;&lt;li&gt;Eventarc &lt;a href="https://cloud.google.com/eventarc/docs/use-cmek"&gt;support for&lt;/a&gt; &lt;a href="https://cloud.google.com/eventarc/docs/use-cmek"&gt;customer-managed encryption keys (CMEK)&lt;/a&gt; is generally available (GA).&lt;br/&gt;&lt;/li&gt;&lt;li&gt;Pub/Sub Lite now offers export subscriptions to Pub/Sub. 
This new subscription type writes Lite messages directly to Pub/Sub - no code development or Dataflow jobs needed. It’s great for connecting disparate data pipelines and for migrating from Lite to Pub/Sub. &lt;a href="https://cloud.google.com/blog/products/data-analytics/easier-and-cheaper-with-pubsub-lite-reservations"&gt;Learn more&lt;/a&gt;.&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Week of Nov 28 - Dec 2, 2022&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;Zeotap partnered with Google Cloud to build a next-generation customer data platform with a focus on privacy, security, and compliance. This blog post describes their journey using Google Data Cloud products, including BigQuery, BI Engine, and Vertex AI, to build customized audience segments at scale. Read more &lt;a href="https://cloud.google.com/blog/products/data-analytics/built-bigquery-zeotap-uses-google-bigquery-build-highly-customized-audiences-scale"&gt;here&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Week of Nov 14 - Nov 18, 2022&lt;/h3&gt;&lt;p&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Apigee has been named a &lt;a href="https://cloud.google.com/blog/products/api-management/apigee-is-a-leader-in-the-gartner-mq-for-api-management"&gt;leader in the 2022 Gartner Magic Quadrant for API Management&lt;/a&gt;, marking the &lt;b&gt;&lt;i&gt;seventh time in a row&lt;/i&gt;&lt;/b&gt; we’ve earned this recognition. We remain the top API Management vendor in our Ability to Execute, with a strong product offering, customer experience, and sales execution. Please help us share the good news via &lt;a href="https://twitter.com/googlecloud/status/1593671703904804867" target="_blank"&gt;Twitter&lt;/a&gt;, &lt;a href="https://www.facebook.com/495863664148547/posts/1728087374259497/" target="_blank"&gt;Facebook&lt;/a&gt;, and &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:6999437466961072128/" target="_blank"&gt;LinkedIn&lt;/a&gt;.&lt;br/&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.connected-stories.com/?utm_source=Google+BwBQ+Blog&amp;amp;utm_medium=Blog+Post&amp;amp;utm_campaign=Google+BwBQ+2022" target="_blank"&gt;Connected-Stories&lt;/a&gt; has built an end-to-end creative management platform on Google Cloud, including BigQuery and Vertex AI, to develop, serve, and optimize interactive video and display ads that scale across any channel. Read more &lt;a href="https://cloud.google.com/blog/products/data-analytics/how-connected-stories-is-using-google-data-cloud"&gt;here&lt;/a&gt;.&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Week of Nov 7 - Nov 11, 2022&lt;/h3&gt;&lt;p&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Private Marketplace functionality is now available in preview for Google Cloud Marketplace to help organizations scale compliant product discovery. Learn more &lt;a href="https://cloud.google.com/blog/products/application-modernization/google-cloud-private-marketplace-preview"&gt;here&lt;/a&gt;.&lt;br/&gt;&lt;/li&gt;&lt;li&gt;No-cost access to some of our popular training is available on Coursera until December 31, 2022. Get hands-on experience to enhance your technical skills in the cloud environment for the most in-demand job roles. Training is available for both technical and non-technical professionals and spans foundational to advanced content. You’ll also earn a shareable certificate. 
Learn more about this training offer &lt;a href="https://cloud.google.com/blog/topics/training-certifications/get-cloud-skills--training-needed-for-in-demand-job-roles"&gt;today&lt;/a&gt;. &lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Week of Oct 31 - Nov 4, 2022&lt;/h3&gt;&lt;p&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://cloud.google.com/iam/docs/deny-overview"&gt;IAM Deny&lt;/a&gt;, a security guardrail to help Google Cloud customers harden their security posture at scale, is now Generally Available (GA). IAM Deny policies manage access to Google Cloud resources based on the principal, the resource type, and the permissions being used. They enable administrators to harden their cloud security posture easily and at scale.&lt;br/&gt;&lt;/li&gt;&lt;li&gt;True Fit, a data-driven personalization platform built on Google Data Cloud, describes its data journey to unlock partner growth. True Fit publishes a number of BigQuery datasets for its retail partners using Analytics Hub. Data sharing on Google Cloud has elevated True Fit’s business with real-world data in real time. They achieved this in conjunction with the &lt;a href="https://cloud.google.com/solutions/data-cloud-isvs"&gt;Built with BigQuery&lt;/a&gt; program from Cloud Partner Engineering. &lt;a href="https://cloud.google.com/blog/products/data-analytics/how-google-cloud-bigquery-helps-true-fit-unlock-partner-growth"&gt;Read more&lt;/a&gt;.&lt;br/&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://cloud.google.com/blog/products/application-development/introducing-cloud-workstations"&gt;Google Cloud Workstations&lt;/a&gt; is now in public preview.&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Week of Oct 24 - Oct 28, 2022&lt;/h3&gt;&lt;p&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Google Cloud&lt;/b&gt; and &lt;b&gt;Sibros Technology&lt;/b&gt;, with their award-winning Deep Connected Platform, are enabling vehicle manufacturers and suppliers to reach the next level in their use of data, gaining valuable insights that can mitigate risks, reduce costs, add innovative products, drive sustainability, and introduce value-added services in the automotive industry. &lt;a href="https://cloud.google.com/blog/products/data-analytics/powering-connected-vehicles-on-google-cloud-with-sibros-ota-platform"&gt;Read more&lt;/a&gt;.&lt;br/&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Data Exploration Workbench in Dataplex is now Generally Available&lt;/b&gt; - it offers a Spark-powered serverless data exploration experience with one-click access to Spark SQL scripts and Jupyter notebooks. 
With the workbench, data consumers can spend more time generating insights rather than integrating different tools and platforms. &lt;a href="https://cloud.google.com/blog/products/data-analytics/dataplex-provides-spark-powered-data-exploration-experience"&gt;Learn more&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Mon, 19 Dec 2022 19:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/topics/inside-google-cloud/whats-new-google-cloud/</guid><category>Google Cloud</category><category>Inside Google Cloud</category><media:content url="https://storage.googleapis.com/gweb-cloudblog-publish/images/whats_new_cloud.max-600x600.jpg" width="540" height="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>What’s new with Google Cloud</title><description>Find our newest updates, announcements, resources, events, learning opportunities, and more in one handy location.</description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/whats_new_cloud.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/inside-google-cloud/whats-new-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Google Cloud Content &amp; Editorial </name><title></title><department></department><company></company></author></item><item><title>The Squire’s guide to automated deployments with Cloud Build</title><link>https://cloud.google.com/blog/products/serverless/the-squires-guide-to-automated-deployments-with-cloud-build/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;hr/&gt;&lt;p&gt;&lt;b&gt;Audience (intermediate level)&lt;/b&gt;: This guide targets readers who have not yet worked with Google Cloud but who have experience with continuous integration, package management, and beginner-level containers and messaging. It assumes you have a pre-existing frontend application and a supporting API server in place locally.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Technologies&lt;/b&gt;: &lt;/p&gt;&lt;ul&gt;&lt;li&gt;Cloud Build&lt;/li&gt;&lt;li&gt;Cloud Build Triggers&lt;/li&gt;&lt;li&gt;Artifact Registry&lt;/li&gt;&lt;li&gt;Cloud Run&lt;/li&gt;&lt;li&gt;Pub/Sub&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;b&gt;Requirements before getting started&lt;/b&gt;:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Functional client-side repository&lt;/li&gt;&lt;li&gt;Functional API server repository&lt;/li&gt;&lt;li&gt;Pre-existing GCP project with billing enabled&lt;/li&gt;&lt;li&gt;Unix machine&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;A Hero’s Journey - The Quest Begins&lt;/h3&gt;&lt;p&gt;In the initial stages of development, it’s easy to underestimate the grunt work needed to containerize and deploy your application, especially if you are new to the cloud. Could Google Cloud help you complete your project without adding too much bloat to the work? Let’s find out! 
This blog will take you on a quest to get to the heart of quick automated deployments by leveraging awesome features from the following products:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/build"&gt;&lt;b&gt;Cloud Build&lt;/b&gt;: DevOps automation platform&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/artifact-registry"&gt;&lt;b&gt;Artifact Registry&lt;/b&gt;: Universal package manager&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/run/docs/"&gt;&lt;b&gt;Cloud Run&lt;/b&gt;: Serverless for containerized applications&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/pubsub/docs/overview"&gt;&lt;b&gt;Pub/Sub&lt;/b&gt;: Global real time messaging&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;To help on this learning journey, we’d like to arm you with a realistic example of this flow as you are fashioning your own CI/CD pipeline. This blog will be referencing an open source Github project that models a best practices architecture using Google Cloud serverless patterns, &lt;a href="https://github.com/GoogleCloudPlatform/emblem" target="_blank"&gt;Emblem&lt;/a&gt;.  (Note: References will be tagged with Emblem).&lt;br/&gt;&lt;p&gt;&lt;/p&gt;&lt;hr/&gt;&lt;p&gt;&lt;b&gt;Note&lt;/b&gt;: This blog will showcase the benefits of using Pub/Sub with multiple triggers, as it does in Emblem. If you are looking for a more direct path to building and deploying your containers with one trigger, check out the following quickstarts: &lt;a href="https://cloud.google.com/build/docs/deploying-builds/deploy-cloud-run"&gt;”Deploying to Cloud Run using Cloud Build”&lt;/a&gt; and &lt;a href="https://cloud.google.com/deploy/docs/deploy-app-run"&gt;“Deploying to Cloud Run using Cloud Deploy”&lt;/a&gt;. 
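&lt;/p&gt;&lt;p&gt;For comparison, the most direct path of all builds and deploys in a single command. A minimal sketch, assuming your Dockerfile lives in &lt;code&gt;server-side/&lt;/code&gt; and that the service name here is just an example:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;# build from source and deploy in one step; Cloud Build runs behind the scenes
gcloud run deploy server-side --source server-side/ --region $REGION&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;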
&lt;br/&gt;&lt;/p&gt;&lt;hr/&gt;&lt;h3&gt;Quest goals&lt;/h3&gt;&lt;p&gt;&lt;i&gt;The following goals will lead you to create a lean automated deployment flow for your API service that will be triggered by any change to the main branch of its source GitHub repository.&lt;/i&gt;&lt;/p&gt;&lt;p&gt;&lt;i&gt;&lt;b&gt;Manual deployment with Cloud Build and Cloud Run&lt;/b&gt;&lt;br/&gt;&lt;/i&gt;Before you run off and attempt to automate anything, you will need a solid understanding of what commands you will be adding to your future &lt;code&gt;cloudbuild.yaml&lt;/code&gt; files.&lt;/p&gt;&lt;p&gt;&lt;i&gt;&lt;b&gt;Build an image with a Cloud Build trigger&lt;/b&gt;&lt;br/&gt;&lt;/i&gt;Create the first trigger and &lt;code&gt;cloudbuild.yaml&lt;/code&gt; file in Cloud Build that will react to any new changes to the main branch of your GitHub project.&lt;/p&gt;&lt;p&gt;&lt;i&gt;&lt;b&gt;React to Cloud Build events with Pub/Sub&lt;/b&gt;&lt;br/&gt;&lt;/i&gt;Using a handy built-in feature of Artifact Registry repositories, create a Pub/Sub topic.&lt;/p&gt;&lt;p&gt;&lt;i&gt;&lt;b&gt;Deploy with a Cloud Build trigger&lt;/b&gt;&lt;br/&gt;&lt;/i&gt;Create a new Cloud Build trigger that listens to the above Pub/Sub topic, plus a new &lt;code&gt;cloudbuild.yaml&lt;/code&gt; file that will initiate deployment of newly created container images from Artifact Registry.&lt;/p&gt;&lt;h3&gt;Before getting started&lt;/h3&gt;&lt;p&gt;For the purposes of this blog, the following is required:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/sdk/gcloud"&gt;&lt;code&gt;gcloud cli&lt;/code&gt;&lt;/a&gt; installed on a Unix machine&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;An existing REST API server with associated Dockerfile&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Google Cloud project with billing enabled (&lt;a href="https://cloud.google.com/pricing"&gt;pricing&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;You will create a new GitHub project repository &lt;code&gt;epic-quest-project&lt;/code&gt; and add your existing REST API server code directory (e.g., Emblem: &lt;a href="https://github.com/GoogleCloudPlatform/emblem/tree/main/content-api" target="_blank"&gt;content-api&lt;/a&gt;) to create the following project file structure:&lt;br/&gt;&lt;p&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;&lt;code&gt;epic-quest-project/
├── ops/          # where build triggers will live
└── server-side/  # where your API server code lives
    ├── main.py
    ├── requirements.txt
    ├── Dockerfile
    └── …&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Now onto the quest!&lt;/p&gt;&lt;h3&gt;Goal #1: Manual deployment with Cloud Build and Cloud Run&lt;/h3&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit-1.png" 
src="https://storage.googleapis.com/gweb-cloudblog-publish/images/life-of-commit-1.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;You will be building and deploying your containers using Google Cloud products, &lt;a href="https://cloud.google.com/build"&gt;Cloud Build&lt;/a&gt; and &lt;a href="https://cloud.google.com/run/docs/"&gt;Cloud Run&lt;/a&gt;, via the Google Cloud CLI, also known as &lt;code&gt;gcloud&lt;/code&gt;. &lt;/p&gt;&lt;p&gt;Within an open terminal, you will be setting up the following environment variables that declare the Google Cloud project ID,  and which region you will be basing your project from. You will also need to enable the following product APIs (Cloud Run, Cloud Build &amp;amp; Artifact Registry APIs) within the Google Cloud project.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;dl&gt;&lt;dt&gt;code_block&lt;/dt&gt;&lt;dd&gt;[StructValue([(u'code', u'# setting environment variables\r\nexport PROJECT_ID=&amp;lt;add your project ID&amp;gt;\r\nexport REGION=&amp;lt;add your region/location&amp;gt;\r\n\r\n# enable relevant apis\r\ngcloud services enable run.googleapis.com \\\r\nartifactregistry.googleapis.com compute.googleapis.com cloudbuild.googleapis.com \r\n\r\n# update gcloud with project id and region\r\ngcloud config set project $PROJECT_ID\r\ngcloud config set compute/region $REGION'), (u'language', u''), (u'caption', &amp;lt;wagtail.wagtailcore.rich_text.RichText object at 0x3ed124cf5bd0&amp;gt;)])]&lt;/dd&gt;&lt;/dl&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;The container images you create from the &lt;code&gt;server-side/&lt;/code&gt; directory will be stored in an image repository named “epic-quest”, managed by Artifact Registry.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;dl&gt;&lt;dt&gt;code_block&lt;/dt&gt;&lt;dd&gt;[StructValue([(u'code', u'gcloud artifacts repositories create epic-quest --repository-format="DOCKER" --location=$REGION'), (u'language', u''), (u'caption', &amp;lt;wagtail.wagtailcore.rich_text.RichText object at 0x3ed124cf5b90&amp;gt;)])]&lt;/dd&gt;&lt;/dl&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Now that the “epic quest” Artifact Registry repository has been created you can begin pushing container images to it! Use &lt;code&gt;gcloud builds submit&lt;/code&gt; to build and tag an image from the &lt;code&gt;server-side/&lt;/code&gt; directory with Artifact Registry repository specific format: &lt;code&gt;&amp;lt;region&amp;gt;-docker.pkg.dev/&amp;lt;project-id&amp;gt;/&amp;lt;repository&amp;gt;&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;dl&gt;&lt;dt&gt;code_block&lt;/dt&gt;&lt;dd&gt;[StructValue([(u'code', u'# root: epic-quest-project\r\ncd server-side/\r\n\r\n# root: create "server-side" image\r\ngcloud builds submit . 
--tag $REGION-docker.pkg.dev/$PROJECT_ID/epic-quest/server-side'), (u'language', u''), (u'caption', &amp;lt;wagtail.wagtailcore.rich_text.RichText object at 0x3ed124cf5190&amp;gt;)])]&lt;/dd&gt;&lt;/dl&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;After pushing your &lt;code&gt;server-side&lt;/code&gt; container image to the Artifact Registry repository, you’re all set to deploy it with Cloud Run!&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;dl&gt;&lt;dt&gt;code_block&lt;/dt&gt;&lt;dd&gt;[StructValue([(u'code', u'gcloud run deploy --image=$REGION-docker.pkg.dev/$PROJECT_ID/epic-quest/server-side'), (u'language', u''), (u'caption', &amp;lt;wagtail.wagtailcore.rich_text.RichText object at 0x3ed124cf5050&amp;gt;)])]&lt;/dd&gt;&lt;/dl&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;dl&gt;&lt;dt&gt;code_block&lt;/dt&gt;&lt;dd&gt;[StructValue([(u'code', u'Service name: server-side\r\nAllow unauthenticated invocations to [server-side] (y/N)? y'), (u'language', u''), (u'caption', &amp;lt;wagtail.wagtailcore.rich_text.RichText object at 0x3ed124cf5290&amp;gt;)])]&lt;/dd&gt;&lt;/dl&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Excellent, you’ve created a basic manual CI/CD pipeline! Now, you can explore what it looks like to have this pipeline automated.&lt;/p&gt;&lt;h3&gt;Goal #2: Build an image with a Cloud Build trigger&lt;/h3&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit-2.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image4_MkMoDcK.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;To start automating your small pipeline you will need to create a cloudbuild.yaml file that will configure your first &lt;a href="https://cloud.google.com/build/docs/automating-builds/create-manage-triggers"&gt;Cloud Build trigger&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;In the ops directory of &lt;code&gt;epic-quest-project&lt;/code&gt;, create a new file named &lt;code&gt;api-build.cloudbuild.yaml&lt;/code&gt;. 
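&lt;/p&gt;&lt;p&gt;Before you automate, one quick checkpoint: confirm the Goal #1 deployment is actually serving traffic. A minimal sketch, assuming you kept the default service name &lt;code&gt;server-side&lt;/code&gt;:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;# look up the service URL and send a test request
SERVICE_URL=$(gcloud run services describe server-side --region=$REGION --format='value(status.url)')
curl "$SERVICE_URL"&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;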
This new yaml file will describe the steps Cloud Build uses to build your container image and push it to Artifact Registry.&lt;/p&gt;&lt;p&gt;(Emblem: &lt;a href="https://github.com/GoogleCloudPlatform/emblem/blob/main/ops/api-build.cloudbuild.yaml" target="_blank"&gt;&lt;code&gt;ops/api-build.cloudbuild.yaml&lt;/code&gt;&lt;/a&gt;)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;&lt;code&gt;touch ops/api-build.cloudbuild.yaml

# api-build.cloudbuild.yaml contents

steps:
  # Docker Build
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '-t'
      - '${_REGION}-docker.pkg.dev/${PROJECT_ID}/epic-quest/server-side:${_IMAGE_TAG}'
      - 'server-side/.'

# Store in Artifact Registry
images:
  - '${_REGION}-docker.pkg.dev/${PROJECT_ID}/epic-quest/server-side:${_IMAGE_TAG}'&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;To configure Cloud Build to automatically execute the steps in the above yaml, use the &lt;a href="https://console.cloud.google.com/cloud-build/triggers"&gt;Cloud Console&lt;/a&gt; to create a new Cloud Build trigger:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit-3.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/life-of-commit-3.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Remember to select &lt;code&gt;Push to a branch&lt;/code&gt; as the event that will activate the build trigger and, under &lt;code&gt;Source&lt;/code&gt;, connect your &lt;code&gt;epic-quest-project&lt;/code&gt; GitHub repository.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit-4.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image13_wubg25e.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;You may need to authenticate with your GitHub account credentials to connect a repository to your Google Cloud project.
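&lt;/p&gt;&lt;p&gt;If you prefer the terminal to the console, roughly the same trigger can be created with &lt;code&gt;gcloud&lt;/code&gt;. A sketch only; treat the flag names and values as assumptions to adapt to your own repository:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;gcloud builds triggers create github \
  --name=api-new-build \
  --repo-owner=YOUR_GITHUB_USER \
  --repo-name=epic-quest-project \
  --branch-pattern='^main$' \
  --included-files='server-side/**' \
  --build-config=ops/api-build.cloudbuild.yaml \
  --substitutions=_REGION=$REGION,_IMAGE_TAG=latest&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;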
Once you have a repository connected, specify the location of the cloud build configuration in that repository:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit-5.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image5_Lj9XErS.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/life-of-commit.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Submitting this configuration will create a new trigger named &lt;code&gt;api-new-build&lt;/code&gt; that will be invoked whenever a change is committed and merged into the main branch of the repository with changes to the &lt;code&gt;server-side/&lt;/code&gt; folder.&lt;/p&gt;&lt;p&gt;After committing your changes to server-side/ files locally, you can verify this trigger works by merging a new commit into the main branch of your repository. Once merged, you will be able to observe the build trigger at work in the &lt;a href="https://console.cloud.google.com/cloud-build/builds"&gt;Build History&lt;/a&gt; page of the Cloud Console.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit-7.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/life-of-commit-7.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Excellent, the container build is now automated! How will Cloud Run know when a new build is ready to deploy? Enter Pub/Sub.&lt;/p&gt;&lt;h3&gt;Goal #3: React to Cloud Build events with Pub/Sub&lt;/h3&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit-8.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image6_kXsitZQ.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;By default, Artifact Registry will &lt;a href="https://cloud.google.com/artifact-registry/docs/configure-notifications"&gt;publish&lt;/a&gt; messages about changes in its repositories to a Pub/Sub topic named &lt;code&gt;gcr&lt;/code&gt; if it exists. Let’s take advantage of that feature for your next Cloud Build trigger. 
First, create a Pub/Sub topic named &lt;code&gt;gcr&lt;/code&gt;:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;&lt;code&gt;gcloud pubsub topics create gcr&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Now, every time a new build is pushed to any Artifact Registry repository, a message is published to the &lt;code&gt;gcr&lt;/code&gt; topic with a build digest that identifies that build. Next, it’s time to configure your second trigger to complete the automated deployment pipeline. &lt;/p&gt;&lt;h3&gt;Goal #4: Deploy with a Cloud Build trigger&lt;/h3&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit-9.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image12_JvGQkSU.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Now you’re arriving at the final step: creating the deployment trigger! This Cloud Build trigger is the last link to complete your automated deployment story.&lt;/p&gt;&lt;hr/&gt;&lt;p&gt;&lt;b&gt;Note&lt;/b&gt;: Read more about our opinionated way to perform this step using &lt;a href="https://cloud.google.com/run/docs/continuous-deployment-with-cloud-build"&gt;Cloud Run with checkbox CD&lt;/a&gt;, and check out the new &lt;a href="https://cloud.google.com/deploy/docs/deploy-app-run"&gt;support for Cloud Run in Cloud Deploy&lt;/a&gt;.&lt;/p&gt;&lt;hr/&gt;&lt;p&gt;In the &lt;code&gt;ops&lt;/code&gt; directory of the &lt;code&gt;epic-quest-project&lt;/code&gt;, create a new file named &lt;code&gt;api-deploy.cloudbuild.yaml&lt;/code&gt;. In short, this will perform the deployment of the new container image on your behalf. 
(Emblem: &lt;a href="https://github.com/GoogleCloudPlatform/emblem/blob/main/ops/deploy.cloudbuild.yaml" target="_blank"&gt;&lt;code&gt;ops/deploy.cloudbuild.yaml&lt;/code&gt;&lt;/a&gt;.)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;&lt;code&gt;touch ops/api-deploy.cloudbuild.yaml

# api-deploy.cloudbuild.yaml contents

steps:
  # Print the full Pub/Sub message for debugging
  - id: "Echo Pub/Sub message"
    name: gcr.io/cloud-builders/gcloud
    entrypoint: /bin/bash
    args:
      - '-c'
      - |
        echo ${_BODY}

  # Cloud Run Deploy
  - id: "Deploy to Cloud Run"
    name: gcr.io/cloud-builders/gcloud
    args:
      - run
      - deploy
      - ${_SERVICE}
      - --image=${_IMAGE_NAME}
      - --region=${_REGION}
      - --revision-suffix=${_REVISION}
      - --project=${_PROJECT_ID}
      - --allow-unauthenticated
      - --tag=${_IMAGE_TAG}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;The first step in this Cloud Build configuration prints the body of the message published by Artifact Registry to the build job log, and the second step deploys to Cloud Run.&lt;/p&gt;&lt;p&gt;Open the &lt;a href="https://console.cloud.google.com/cloud-build/triggers"&gt;console&lt;/a&gt; and create another new Cloud Build trigger with the following configuration:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit-10.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/life-of-commit-10.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Instead of choosing a repository event like in the api-build trigger, select Pub/Sub message to create a subscription to the desired Pub/Sub topic along with the trigger:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit-11.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/life-of-commit-11.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Once again, provide the location of the corresponding Cloud Build configuration file in the repository. Additionally, include values for the substitution variables that exist in the configuration file. Those variables are identifiable by the underscore prefix (_).
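&lt;/p&gt;&lt;p&gt;To see what those messages actually contain, you can attach a temporary pull subscription to the &lt;code&gt;gcr&lt;/code&gt; topic and inspect one by hand. A debugging sketch (the subscription name is arbitrary):&lt;/p&gt;&lt;pre&gt;&lt;code&gt;# create a scratch subscription on the gcr topic
gcloud pubsub subscriptions create gcr-debug --topic=gcr

# push a new build, then pull one message to inspect its body
gcloud pubsub subscriptions pull gcr-debug --auto-ack --limit=1&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;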
Note that the &lt;code&gt;_BODY, _IMAGE_NAME&lt;/code&gt; and &lt;code&gt;_REVISION&lt;/code&gt; variables reference data included in the body of the Pub/Sub message:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="life-of-commit-12.png" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/life-of-commit-12.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;The Cloud Build service account by default will initiate the deployment to Cloud Run, so it will need to have the &lt;a href="https://cloud.google.com/run/docs/deploying#permissions_required_to_deploy"&gt;Cloud Run Developer and Service Account User IAM roles&lt;/a&gt; granted to it in the project where the Cloud Run services reside.&lt;/p&gt;&lt;p&gt;After granting those roles, check that the pipeline is working by creating a commit to the &lt;code&gt;server-side/&lt;/code&gt; directory in your &lt;code&gt;epic-quest-project&lt;/code&gt; GitHub repository. It should result in the automatic invocation of the &lt;code&gt;api-new-build&lt;/code&gt; trigger, followed closely by the &lt;code&gt;api-deploy&lt;/code&gt; trigger, and finally a new revision in the corresponding Cloud Run service.&lt;/p&gt;&lt;p&gt;Your final project setup should look similar to the following:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre&gt;&lt;code&gt;epic-quest-project/
├── ops/
│   ├── api-build.cloudbuild.yaml
│   └── api-deploy.cloudbuild.yaml
└── server-side/
    ├── main.py
    ├── requirements.txt
    ├── Dockerfile
    └── …&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Quest complete!&lt;/h3&gt;&lt;p&gt;Excellent, you now have a shiny automated pipeline and leveled up your deployment game!
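&lt;/p&gt;&lt;p&gt;One parting tip: if the deploy trigger fails with a permissions error, the role grants described above can also be made from the CLI. A sketch, assuming the default Cloud Build service account and the default compute runtime service account:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)')
CB_SA="${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com"

# allow Cloud Build to deploy Cloud Run services
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${CB_SA}" \
  --role="roles/run.developer"

# allow Cloud Build to act as the Cloud Run runtime service account
gcloud iam service-accounts add-iam-policy-binding \
  "${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --member="serviceAccount:${CB_SA}" \
  --role="roles/iam.serviceAccountUser"&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;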
&lt;/p&gt;&lt;p&gt;&lt;i&gt;After reading today’s post, we hope you have a better understanding of how to manually build and deploy a container using just Cloud Build and Cloud Run, use Cloud Build triggers to react to GitHub repository actions, write &lt;code&gt;cloudbuild.yaml&lt;/code&gt; files to add additional configuration to your build triggers, and take advantage of the magical benefits of Artifact Registry repositories.&lt;/i&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;If you want to learn even more, check out the open source serverless project &lt;a href="https://github.com/GoogleCloudPlatform/emblem" target="_blank"&gt;Emblem&lt;/a&gt; on GitHub.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Mon, 19 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/serverless/the-squires-guide-to-automated-deployments-with-cloud-build/</guid><category>Google Cloud</category><category>Startups</category><category>Serverless</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>The Squire’s guide to automated deployments with Cloud Build</title><description>Getting started with your first automated deployment pipeline using open source project Emblem featuring Google Cloud Serverless products like Cloud Run, Cloud Build, Artifact Registry, and Pub/Sub.</description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/serverless/the-squires-guide-to-automated-deployments-with-cloud-build/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Patricia Shin</name><title>Cloud Developer Relations Engineer</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Roger Martinez</name><title>Cloud Developer Relations Engineer</title><department></department><company></company></author></item><item><title>Automate data governance, extend your data fabric with Dataplex-BigLake integration</title><link>https://cloud.google.com/blog/products/data-analytics/automate-data-governance-with-google-cloud-dataplex-and-biglake/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Unlocking the full potential of data requires breaking down the silo between open-source data formats and data warehouses. At the same time, it is critical to enable the &lt;a href="https://cloud.google.com/learn/what-is-data-governance"&gt;data governance&lt;/a&gt; team to apply policies regardless of where the data lives, whether in file or columnar storage. &lt;/p&gt;&lt;p&gt;Today, data governance teams have to become subject matter experts on each storage system where corporate data happens to reside. Since February 2022, Dataplex has offered a unified place to apply policies, which are propagated across both lake storage and data warehouses in GCP. Rather than specifying policies in multiple places and bearing the cognitive load of translating them from “what you want the storage system to do” to “how your data should behave,” Dataplex offers a single point for unambiguous policy management. Now, we are making it easier for you to use &lt;a href="https://cloud.google.com/blog/products/data-analytics/unify-data-lakes-and-warehouses-with-biglake-now-generally-available"&gt;BigLake&lt;/a&gt;.
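&lt;/p&gt;&lt;p&gt;For a sense of the manual steps this integration removes, here is a rough sketch of creating a single BigLake table by hand with the &lt;code&gt;bq&lt;/code&gt; CLI. The project, dataset, bucket, and connection names are placeholders, and the exact flags are assumptions to verify against the BigLake docs:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;# create a Cloud Resource connection for BigLake
bq mk --connection --location=US --connection_type=CLOUD_RESOURCE my-connection

# define an external table over GCS data using that connection
bq mkdef --source_format=PARQUET --connection_id=my-project.US.my-connection \
  "gs://my-bucket/sales/*.parquet" &gt; table_def.json

# create the BigLake table from the definition
bq mk --external_table_definition=table_def.json my_dataset.sales&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;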
&lt;/p&gt;&lt;p&gt;Earlier this year, we launched BigLake into general availability. BigLake unifies the data fabric between data lakes and data warehouses by extending &lt;a href="https://cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt; storage to open file formats. Today, we announce BigLake integration with &lt;a href="https://cloud.google.com/dataplex"&gt;Dataplex&lt;/a&gt; (available in preview). This integration eliminates configuration steps for admins, who can take advantage of BigLake and manage policies across GCS and BigQuery from a unified console. &lt;/p&gt;&lt;p&gt;Previously, you could point Dataplex at a &lt;a href="https://cloud.google.com/storage"&gt;Google Cloud Storage (GCS)&lt;/a&gt; bucket, and Dataplex would &lt;a href="https://cloud.google.com/dataplex/docs/discover-data"&gt;discover&lt;/a&gt; and extract all metadata from the data lake and register it in BigQuery (and Dataproc Metastore and Data Catalog) for analysis and search. With the BigLake integration, we are building on this capability by allowing an “upgrade” of a bucket asset: instead of just creating external tables in BigQuery for analysis, Dataplex will create policy-capable BigLake tables! &lt;/p&gt;&lt;p&gt;The immediate implication is that admins can now assign column, row, and table policies to the BigLake tables auto-created by Dataplex, since with BigLake the infrastructure layer (GCS) is separate from the analysis layer (BigQuery). Dataplex will handle the creation of a BigQuery connection and a BigQuery publishing dataset, and will ensure the BigQuery service account has the correct permissions on the bucket.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Dataplex-BigLake 121622.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Dataplex-BigLake_121622.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;But wait - there’s more.&lt;/p&gt;&lt;p&gt;With this release of Dataplex, we are also introducing advanced logging called governance logs. Governance logs track the exact state of policy propagation to tables and columns, adding a level of detail that goes beyond the high-level “status” for the bucket and into fine-grained status and logs for individual tables and columns. &lt;/p&gt;&lt;h3&gt;What’s next? 
&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;We have updated our documentation for &lt;a href="https://cloud.google.com/dataplex/docs/manage-assets"&gt;managing buckets&lt;/a&gt; and have additional detail regarding policy propagation and the upgrade process.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Stay tuned for an exciting  roadmap ahead, with more automation around policy management.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;For more information, please visit:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/dataplex"&gt;Google Cloud Dataplex&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Fri, 16 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/data-analytics/automate-data-governance-with-google-cloud-dataplex-and-biglake/</guid><category>Google Cloud</category><category>Infrastructure Modernization</category><category>Data Analytics</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Automate data governance, extend your data fabric with Dataplex-BigLake integration</title><description>Learn how to automate data governance and your data fabric with Dataplex &amp; BigLake integration. Allow centralizing policies in data lakes &amp; warehouses.</description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/data-analytics/automate-data-governance-with-google-cloud-dataplex-and-biglake/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Uri Gilad</name><title>Group Product Manager - Data Governance, Google Cloud</title><department></department><company></company></author></item><item><title>How HSBC is upskilling at scale with Google Cloud</title><link>https://cloud.google.com/blog/topics/training-certifications/hsbc-upskilled-at-scale-with-google-cloud/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;i&gt;&lt;b&gt;Editor’s note&lt;/b&gt;: Founded in 1865, HSBC is one of the world’s largest banking and financial services organizations. In today’s post, Adrian Phelan, Global Head of Google Cloud, HSBC, explains how the organization is working with Google Cloud to drive cloud adoption at scale. &lt;/i&gt;&lt;/p&gt;&lt;hr/&gt;&lt;p&gt;Close to &lt;a href="https://www.mckinsey.com/capabilities/people-and-organizational-performance/our-insights/five-fifty-the-skillful-corporation" target="_blank"&gt;90% of corporations&lt;/a&gt; say they’re affected by digital skills gaps, or expect to be within the next five years. Technologies and business models are evolving rapidly, and companies are deploying a multi-pronged approach to ensure they have the right skills in the right places.&lt;/p&gt;&lt;p&gt;Here at HSBC, one of the bank’s strategic priorities is digitizing at scale. As people operate in a more digital world, we want to supply them with services quickly and in ways they want to use them. We initially worked with &lt;a href="https://cloud.google.com/"&gt;Google Cloud&lt;/a&gt; to implement more than 1,700 data analytics, customer experience, cybersecurity and emission reduction projects. A big part of rolling these out has been getting our teams skilled in the right way.&lt;/p&gt;&lt;h3&gt;Empowering employees with a culture of learning&lt;/h3&gt;&lt;p&gt;This approach has evolved over time, but central to it has been proactively instilling a culture of learning. 
We started out in 2018 with a few small-scale training projects, and it quickly became clear that the teams who had participated in them delivered better and faster than those who hadn’t. They were also more independent and less dependent on central expertise.&lt;/p&gt;&lt;p&gt;This inspired us to scale up our learning programs across the organization, which was a challenge because of the sheer size of our technical staff: tens of thousands of employees.  &lt;/p&gt;&lt;p&gt;After some really positive feedback for our early training programs with Google Cloud, we set up our Google Accelerated Certification Program (GACP). It’s a 10-week blended learning model including self-learning, case studies, and hands-on practice followed by an examination preparation boot camp.&lt;/p&gt;&lt;p&gt;This combination of theory and practice in a safe environment helped build employees’ confidence. Two thousand people have gone through this training so far, and it’s really helped accelerate their journey towards achieving Google Cloud certification. The learning programs also offer other digital credentials, such as completion badges and skill badges, which provide encouragement and help participants measure their progress.  &lt;/p&gt;&lt;h3&gt;Company-wide knowledge building&lt;/h3&gt;&lt;p&gt;When we started our learning journey, we focused on IT roles for obvious reasons, but we are increasingly moving towards training people in business functions.&lt;/p&gt;&lt;p&gt;One of our aims is to educate our less technical employees about the broad capabilities that exist within the cloud. IT teams are often the ones to say, "Hey, we could do this in a better, more efficient, different way by using the cloud", and to make that happen we need to work in close collaboration with our business colleagues, so it’s equally important that they understand the technology.&lt;/p&gt;&lt;p&gt;To enable this kind of innovation, you have to educate the whole organization in the ‘art of the possible’. One of the ways we did that was by organizing a month-long Cloud Festival that reached 10,000 employees, which included three Google Cloud sessions. This really helped us build a foundational level of knowledge with business and technology colleagues across the organization.&lt;/p&gt;&lt;p&gt;As we continue along our training path, interest in the cloud within the organization continues to increase. Our channel for communicating any changes related to cloud technology, processes or ways of working now has an audience of close to 8,000 employees. &lt;/p&gt;&lt;h3&gt;Looking to the future with targeted training &lt;/h3&gt;&lt;p&gt;The Google Cloud team has provided a lot of support in helping us get our training off the ground. It has always been a true process of co-creation, of listening, testing things, and seeing what works best. We meet weekly in order to keep our learning journey moving forward, listen to the demands of the business, understand what the pipeline of work is, and what the up and coming Google Cloud product launches are, so that we can stay one step ahead.&lt;/p&gt;&lt;p&gt;One example of this is the bespoke training we introduced for business leaders. So far, 250 senior business leaders have completed it with great feedback. They have told us that the program improved their understanding of how the cloud can help to more quickly meet customer expectations, increase speed to market, reduce overheads and grow revenue through new product streams and continuous innovation. 
It also covered potential business activities suitable for migration to the cloud.&lt;/p&gt;&lt;p&gt;When it comes to learning and training, you can either let it happen organically, or you can drive it. Our choice was to drive it and invest in it, and I’d highly recommend that anybody trying to adopt cloud at scale do the same: they will see the return on that investment many times over. &lt;/p&gt;&lt;p&gt;Learn more about &lt;a href="https://cloud.google.com/training?hl=en"&gt;Google Cloud training and certification&lt;/a&gt; and the impact it can have on your team.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Fri, 16 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/topics/training-certifications/hsbc-upskilled-at-scale-with-google-cloud/</guid><category>Google Cloud</category><category>Training and Certifications</category><media:content url="https://storage.googleapis.com/gweb-cloudblog-publish/images/180502-hsbc-logo-london-3-1600x900_1.max-600x600.JPG" width="540" height="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>How HSBC is upskilling at scale with Google Cloud</title><description>HSBC upskilled with Google Cloud: a culture of digitizing, targeted training and gaining certifications.</description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/180502-hsbc-logo-london-3-1600x900_1.max-600x600.JPG</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/training-certifications/hsbc-upskilled-at-scale-with-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Adrian Phelan</name><title>Global Head of Google Cloud, HSBC</title><department></department><company></company></author></item><item><title>BigQuery Omni: solving cross-cloud challenges by bringing analytics to your data</title><link>https://cloud.google.com/blog/products/data-analytics/cross-cloud-analytics-with-bigquery-omni-and-biglake/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;a href="https://www.crn.com/slide-shows/cloud/4-key-cloud-trends-that-will-influence-2021/2#:~:text=About%2090%20percent%20of%20enterprises,a%20single%20private%20cloud%20strategy." target="_blank"&gt;Research&lt;/a&gt; shows that over 90% of large organizations already deploy multicloud architectures, and their data is distributed across several public cloud providers. Additionally, data is also increasingly split across various storage systems such as warehouses, operational and relational databases, object stores, etc. With the proliferation of new applications, data is serving many more use cases such as data science, business intelligence, analytics, streaming, and more. With these data trends, customers are increasingly gravitating towards an open multicloud data lake. However, multicloud data lakes present several challenges such as data silos, data duplication, fragmented governance, complexity of tools, and increased costs.&lt;/p&gt;&lt;p&gt;With Google's data cloud technologies, customers can leverage the unique combination of distributed cloud services. They can create an agile cross-cloud semantic business layer with Looker and manage data lakes and data warehouses across cloud environments at scale with BigQuery and capabilities like BigLake and BigQuery Omni.
&lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/biglake"&gt;BigLake&lt;/a&gt; is a storage engine that unifies data warehouses and lakehouses by standardizing across different storage formats, including &lt;a href="https://cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt; managed tables and open file formats such as Parquet and Apache Iceberg on object storage. &lt;a href="https://cloud.google.com/bigquery/docs/omni-introduction"&gt;BigQuery Omni&lt;/a&gt; provides the compute engine that runs local to the storage on AWS or Azure, which customers can use to seamlessly query data in those clouds. This provides several key benefits:&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;A single pane of glass to query your multicloud data lakes (across Google Cloud Platform, Amazon Web Services, and Microsoft Azure)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cross-cloud analytics by combining data across different platforms with little to no egress costs&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Unified governance and secure management of your data wherever it resides&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="1 BigQuery Omni 121522.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_BigQuery_Omni_121522.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;In this blog, we will share cross-cloud analytics use cases customers are solving with Google’s Data Cloud and the benefits they are realizing.&lt;br/&gt;&lt;/p&gt;&lt;h3&gt;Unified marketing analytics for 360-degree insights&lt;/h3&gt;&lt;p&gt;Organizations want to perform marketing analytics - ads optimization, inventory management, churn prediction, buyer propensity trends, and more. Before BigQuery Omni, doing this meant pulling data from several different sources such as Google Analytics, public datasets, and other proprietary information stored across cloud environments. This required moving large amounts of data, managing duplicate copies, and incurring incremental costs to perform any cross-cloud analytics and derive actionable insights. With BigQuery Omni, organizations can greatly simplify this workflow. Using the familiar BigQuery interface, users can access data residing in AWS or Azure, then discover and select just the relevant data that needs to be combined for further analysis. This subset of data can be moved to Google Cloud using Omni’s new &lt;a href="https://cloud.google.com/blog/products/data-analytics/bq-omnis-cross-cloud-transfer-now-generally-available/"&gt;Cross-Cloud Transfer capabilities&lt;/a&gt;. Customers can combine this data with other Google Cloud datasets, and these consolidated tables can be made available to key business stakeholders through advanced analytics tools such as Looker and Looker Studio. Customers can also now tie this data into world-class AI models via &lt;a href="https://cloud.google.com/vertex-ai"&gt;Vertex AI&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;As an illustrative example, consider a retailer who has sales &amp;amp; inventory, user, and search data spread across multiple data silos. Using BigQuery Omni, they can seamlessly bring these datasets together and power marketing analytics scenarios like customer segmentation, campaign management, and demand forecasting.&lt;/p&gt;
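&lt;p&gt;As a minimal sketch of what that looks like in practice (the dataset and table names are hypothetical, and the Omni region string follows the AWS region noted later in this post), the familiar BigQuery Python client simply points the query job at the region where the data lives:&lt;/p&gt;&lt;pre class="lang-py"&gt;from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Aggregate clickstream data that physically stays in AWS S3, exposed
# through a (hypothetical) BigLake table in an Omni-enabled dataset.
query = """
SELECT user_id, COUNT(*) AS sessions
FROM `my-project.aws_dataset.clickstream`
GROUP BY user_id
"""
rows = client.query(query, location="aws-us-east-1").result()
for row in rows:
    print(row.user_id, row.sessions)&lt;/pre&gt;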
&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="2 BigQuery Omni 121522.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_BigQuery_Omni_121522.1000067520000438.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;i&gt;"Interested in performing cross-cloud analytics, we tested BigQuery Omni and really liked the SQL support to easily get data from AWS S3. We have seen great potential and value in BigQuery Omni for adopting a multi-cloud data strategy." — &lt;b&gt;Florian Valeye, Staff Data Engineer,&lt;/b&gt; &lt;a href="https://www.backmarket.com/en-us" target="_blank"&gt;&lt;b&gt;Back Market&lt;/b&gt;&lt;/a&gt;, a leading online marketplace for renewed technology based out of France&lt;/i&gt;&lt;br/&gt;&lt;/p&gt;&lt;h3&gt;Data platform with consistent and unified cross-cloud governance&lt;/h3&gt;&lt;p&gt;Another pattern is customers looking to analyze operational, transactional and business data across data silos in different clouds through a unified data platform. These data silos are a result of various factors such as mergers and acquisitions, standardization of analytical tools, leveraging best-of-breed solutions in different clouds, and diversification of the data footprint across clouds. In addition to a single pane of glass for data access across silos, customers deeply desire consistent and uniform governance of their data across clouds. &lt;/p&gt;&lt;p&gt;&lt;i&gt;“Achieve is looking to deliver a consistent analytics experience to all our customers and stakeholders. With our financial and credit report data distributed across clouds, accessing and getting insights holistically is difficult. Through our exploration with Omni, we are able to access datasets in different clouds using a single familiar BigQuery interface; we see its promise as one of the primary tools in our multi-cloud platform." — &lt;b&gt;James Simonson, Senior Data Engineer,&lt;/b&gt; &lt;a href="https://www.achieve.com/" target="_blank"&gt;&lt;b&gt;Achieve&lt;/b&gt;&lt;/a&gt;&lt;/i&gt;&lt;br/&gt;&lt;/p&gt;&lt;p&gt;With BigLake and BigQuery Omni abstracting the storage and compute layers respectively, organizations can access and query their data in Google Cloud irrespective of where it resides. They can also set fine-grained row-level and column-level access policies in BigQuery and consistently govern the data across clouds. These building blocks enable data engineering teams to build a unified and governed data platform for their data users without having to deal with the complexity of building and managing complex data pipelines. Furthermore, with BigQuery Omni’s integration with Dataplex and Data Catalog, you can discover and search your data across clouds, and enrich it by adding relevant business context with business glossaries and rich text.&lt;/p&gt;&lt;p&gt;&lt;i&gt;"Several SADA customers use GCP to build and manage their data analytics platform. During many explorations and proofs of concept, our customers have seen the great potential and value in BigQuery Omni.
Enabling seamless cross-cloud data analytics has allowed them to realize the value of their data more quickly while lowering the barrier to entry for BigQuery adoption in a low-risk fashion." — &lt;b&gt;Brian Suk, Associate Chief Technology Officer,&lt;/b&gt; &lt;a href="https://sada.com/" target="_blank"&gt;&lt;b&gt;SADA&lt;/b&gt;&lt;/a&gt;&lt;b&gt;, one of the strategic partners of Google Cloud.&lt;/b&gt;&lt;/i&gt;&lt;br/&gt;&lt;/p&gt;&lt;h3&gt;Simplified data sharing between data providers and their customers&lt;/h3&gt;&lt;p&gt;A third emerging pattern in cross-cloud analytics is data sharing. Several services need to share information, such as inventory or subscriber data, with their customers or users, who in turn analyze or aggregate it with their own proprietary data and oftentimes share the results back with the service provider. In several cases, the two parties are on different cloud environments, requiring them to move data back and forth. &lt;/p&gt;&lt;p&gt;Consider a company operating in the &lt;a href="https://www.actioniq.com/what-is-cdp/" target="_blank"&gt;customer data platform&lt;/a&gt; (CDP) space. CDPs were designed to help activate customer data, and a critical first step of that was unifying and managing that customer data. To enable this, many CDP vendors built their solution choosing one of the available cloud infrastructure technologies and copied data from the client’s systems. &lt;i&gt;“Copying data from client applications and infrastructure has always been a requirement to deploy a CDP, but it doesn’t have to be anymore" — &lt;b&gt;Justin DeBrabant, Senior Vice President of Product, &lt;a href="https://www.actioniq.com/" target="_blank"&gt;ActionIQ&lt;/a&gt;.&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;&lt;p&gt;While a small percentage of customers are fine with moving data across cloud environments, the majority are hesitant to onboard new services and would prefer to provide governed access to their datasets. &lt;/p&gt;&lt;p&gt;&lt;i&gt;“A new architectural pattern is emerging, allowing organizations to keep their data at one location and make it accessible, with the proper guardrails, to applications used by the rest of the organization’s stack”&lt;/i&gt; adds &lt;i&gt;&lt;b&gt;Justin at &lt;a href="https://www.actioniq.com/" target="_blank"&gt;ActionIQ&lt;/a&gt;.&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;&lt;p&gt;With BigQuery Omni, services in Google Cloud Platform can more easily access and share data with their customers and users in other cloud environments with limited data movement. One of the UK's largest statistics providers has explored Omni for their data sharing needs.&lt;/p&gt;&lt;p&gt;&lt;i&gt;"We tested BigQuery Omni and really like the ability to get data from AWS directly into BQ.
We're excited about managing data sharing with different organizations without onboarding new clouds" – &lt;b&gt;Simon Sandford-Taylor, Chief Information and Digital Officer, &lt;a href="https://www.ons.gov.uk/" target="_blank"&gt;UK's Office for National Statistics&lt;/a&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;&lt;p&gt;With BigQuery Omni, customers are able to:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Access and query data across clouds through a single user interface&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Reduce the need for data engineering before analyzing data&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Lower operational overhead and risks by deploying an application that runs across multiple clouds while leveraging the same consistent security controls&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Accelerate access to insights by significantly reducing the time for data processing and analysis &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Create consistent and predictable budgeting across multiple cloud footprints &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Enable long-term agility and maximize the benefits of every cloud investment&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Over the last year, we’ve seen great momentum in customer adoption and added significant innovations to BigQuery Omni, including improved performance and scalability for querying your data in AWS S3 or Azure Blob Storage, &lt;a href="https://cloud.google.com/blog/products/data-analytics/announcing-apache-iceberg-support-for-biglake"&gt;Iceberg support for Omni&lt;/a&gt;, &lt;a href="https://cloud.google.com/bigquery/docs/omni-introduction"&gt;larger query result set sizes up to 20GB&lt;/a&gt;, and &lt;a href="https://cloud.google.com/blog/products/data-analytics/bq-omnis-cross-cloud-transfer-now-generally-available/"&gt;Cross-cloud transfer&lt;/a&gt;, which helps customers easily, securely, and cost-effectively move just enough data across cloud environments for advanced analytics. &lt;/p&gt;&lt;p&gt;BigQuery Omni has launched several features to support unified governance of your data across multiple clouds - you can apply fine-grained access controls to your multi-cloud data with &lt;a href="https://cloud.google.com/bigquery/docs/row-level-security-intro"&gt;row-level&lt;/a&gt; and &lt;a href="https://cloud.google.com/bigquery/docs/column-level-security-intro"&gt;column-level security&lt;/a&gt;. Building on this, we are excited to announce that BigQuery Omni now supports &lt;a href="https://cloud.google.com/bigquery/docs/column-data-masking-intro"&gt;data masking&lt;/a&gt;. We’ve also made it easy for customers to try and see the benefits of BigQuery Omni through the &lt;a href="https://cloud.google.com/bigquery/pricing#bqomni"&gt;limited-time free trial&lt;/a&gt; available until March 30, 2023. &lt;/p&gt;&lt;p&gt;BigQuery Omni running on other public clouds outside of Google Cloud is available in the AWS US East1 (N.Virginia) and Azure US East2 (US East) regions.
We are also excited to share that we will be bringing BigQuery Omni to more regions in the future, starting with Asia Pacific (AWS Korea) coming soon.&lt;/p&gt;&lt;h3&gt;Getting Started&lt;/h3&gt;&lt;p&gt;Get started with a &lt;a href="https://console.cloud.google.com/freetrial?facet_utm_source=%28direct%29&amp;amp;facet_utm_campaign=%28direct%29&amp;amp;facet_utm_medium=%28none%29&amp;amp;facet_url=https%3A%2F%2Fcloud.google.com%2Fbigquery&amp;amp;facet_id_list=%5B39300012%2C+39300020%2C+39300118%2C+39300196%2C+39300241%2C+39300319%2C+39300322%2C+39300324%2C+39300333%2C+39300345%2C+39300354%2C+39300364%2C+39300373%2C+39300412%2C+39300422%2C+39300436%5D&amp;amp;_ga=2.103918912.1581262585.1669960592-594615328.1669960592"&gt;free trial&lt;/a&gt; to learn about Omni. Check out the &lt;a href="https://cloud.google.com/bigquery/docs/omni-introduction?utm_source=forbes&amp;amp;utm_medium=display&amp;amp;utm_campaign=2022-forbes-brand-voice&amp;amp;utm_content=cross_cloud_analytics_documentation&amp;amp;utm_term=-"&gt;documentation&lt;/a&gt; to learn more about BigQuery Omni. You can also leverage the &lt;a href="https://www.cloudskillsboost.google/focuses/49746?parent=catalog" target="_blank"&gt;self-paced labs&lt;/a&gt; to learn how to set up BigQuery Omni easily.&lt;br/&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Thu, 15 Dec 2022 18:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/data-analytics/cross-cloud-analytics-with-bigquery-omni-and-biglake/</guid><category>Google Cloud</category><category>Data Analytics</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>BigQuery Omni: solving cross-cloud challenges by bringing analytics to your data</title><description>Customers can solve marketing analytics, data governance and data sharing challenges with cross-cloud analytics.</description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/data-analytics/cross-cloud-analytics-with-bigquery-omni-and-biglake/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Vidya Shanmugam</name><title>Product Manager, BigQuery</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Manoj Gunti</name><title>Product Marketing Manager, BigQuery</title><department></department><company></company></author></item><item><title>Efficient PyTorch training with Vertex AI</title><link>https://cloud.google.com/blog/products/ai-machine-learning/efficient-pytorch-training-with-vertex-ai/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/vertex-ai"&gt;Vertex AI&lt;/a&gt; provides flexible and scalable hardware and secured infrastructure to train PyTorch-based deep learning models with pre-built containers and custom containers. For model training with large amounts of data, using the distributed training paradigm and reading data from &lt;a href="https://cloud.google.com/storage"&gt;Cloud Storage&lt;/a&gt; is the best practice. However, training with data in the cloud, such as remote storage on Cloud Storage, introduces a new set of challenges. For example, when a dataset consists of many small individual files, randomly accessing them can introduce network overhead.
Another challenge is data throughput: the speed at which data is fed to the hardware accelerators (GPUs) to keep them fully utilized.&lt;/p&gt;&lt;p&gt;In this post, we walk through methods to improve training performance step by step, starting without distributed training and then moving to distributed training paradigms using data in the cloud. Ultimately, we make training with data on Cloud Storage 6x faster, approaching the same speed as data on a local disk. We will show how the &lt;a href="https://cloud.google.com/vertex-ai/docs/training/custom-training"&gt;Vertex AI Training&lt;/a&gt; service, together with &lt;a href="https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments"&gt;Vertex AI Experiments&lt;/a&gt; and &lt;a href="https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview"&gt;Vertex AI TensorBoard&lt;/a&gt;, can be used to keep track of experiments and results.&lt;/p&gt;&lt;p&gt;You can find the accompanying code for this blog post on the &lt;a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/main/community-content/pytorch_efficient_training" target="_blank"&gt;GitHub Repo&lt;/a&gt;.&lt;/p&gt;&lt;h2&gt;PyTorch distributed training&lt;/h2&gt;&lt;p&gt;PyTorch natively supports &lt;a href="https://pytorch.org/tutorials/beginner/dist_overview.html" target="_blank"&gt;distributed training strategies&lt;/a&gt;. &lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;DataParallel (DP)&lt;/b&gt; is a simple strategy often used for single-machine multi-GPU training, but the single process it relies on can become a performance bottleneck. This approach loads an entire mini-batch on the main thread and then scatters the sub mini-batches across the GPUs. The model parameters are only updated on the main GPU and then broadcast to the other GPUs at the beginning of the next iteration.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;DistributedDataParallel (DDP)&lt;/b&gt; fits multi-node multi-GPU scenarios: the model is replicated on each device, and each device is controlled by an individual process (see the launch sketch after this list). Each process loads its own mini-batch and passes it to its GPU. Each process also has its own optimizer, and the absence of parameter broadcasts reduces communication overhead. Finally, unlike DP, an all-reduce operation is performed across the GPUs. This multi-process design benefits training performance.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;FullyShardedDataParallel (FSDP)&lt;/b&gt; is another data parallel paradigm similar to DDP that enables fitting more data and larger models by sharding the optimizer states, gradients, and parameters into multiple FSDP units, unlike DDP where model parameters are replicated on each GPU.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Different distributed training strategies fit different training scenarios, and it is not always easy to pick the best one for a specific environment configuration. For example, the effectiveness of the data loading pipeline to the GPUs, the batch size, and the network bandwidth in a multi-node setup can all affect the performance of a distributed training strategy.&lt;/p&gt;
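&lt;p&gt;For reference, here is a minimal sketch (not the exact script from the accompanying repo) of how the multi-process strategies above are typically launched, with one worker process spawned per GPU:&lt;/p&gt;&lt;pre class="lang-py"&gt;import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # One process per GPU: join the process group, then pin this
    # process to its own device before building the model and data.
    dist.init_process_group(
        backend="nccl", init_method="env://",
        world_size=world_size, rank=rank)
    torch.cuda.set_device(rank)
    # ... build the model, wrap it in DistributedDataParallel or FSDP,
    # and run the training loop here ...
    dist.destroy_process_group()


if __name__ == "__main__":
    # The env:// rendezvous reads these variables in every process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    world_size = torch.cuda.device_count()  # e.g. 4 on our test machine
    mp.spawn(worker, args=(world_size,), nprocs=world_size)&lt;/pre&gt;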
&lt;p&gt;In this post, we will use PyTorch &lt;a href="https://pytorch.org/vision/main/models/generated/torchvision.models.resnet50.html" target="_blank"&gt;ResNet-50&lt;/a&gt; as the example model and train it on &lt;a href="https://www.image-net.org/" target="_blank"&gt;ImageNet validation data&lt;/a&gt; (50K images) to measure the training performance of the different training strategies.&lt;/p&gt;&lt;h2&gt;Demonstration&lt;/h2&gt;&lt;h3&gt;Environment configurations&lt;/h3&gt;&lt;p&gt;For the test environment, we create custom jobs on Vertex AI Training with the following setup:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="1 PyTorch training 121522.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_PyTorch_training_121522.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Here are the training hyperparameters used for all of the following experiments:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="2 PyTorch training 121522.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_PyTorch_training_121522.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;For each of the following experiments, we train the model for 10 epochs and use the averaged epoch time as the training performance. Please note that we focused on improving the training time and not on the model performance itself.&lt;/p&gt;&lt;h3&gt;Read data from Cloud Storage with &lt;code&gt;gcsfuse&lt;/code&gt; and WebDataset&lt;/h3&gt;&lt;p&gt;We use &lt;a href="https://github.com/GoogleCloudPlatform/gcsfuse" target="_blank"&gt;gcsfuse&lt;/a&gt; to access data on &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/cloud-storage-file-system-ai-training"&gt;Cloud Storage from Vertex AI Training&lt;/a&gt; jobs. Vertex AI training jobs already have Cloud Storage buckets mounted via gcsfuse, so no additional work is required to use it. With &lt;a href="https://cloud.google.com/vertex-ai/docs/training/code-requirements#fuse"&gt;gcsfuse&lt;/a&gt;, training jobs on Vertex AI can access data on Cloud Storage as simply as files in the local file system. This also provides high throughput for large sequential file reads.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre class="lang-py"&gt;open('/gcs/test-bucket/path/to/object', 'r')&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;The data loading pipeline can become a bottleneck in distributed training when it reads individual data files from the cloud.
&lt;a href="https://github.com/webdataset/webdataset" target="_blank"&gt;WebDataset&lt;/a&gt; is a PyTorch dataset implementation designed to improve streaming data access, especially in remote storage settings. The idea behind WebDataset is similar to &lt;a href="https://www.tensorflow.org/tutorials/load_data/tfrecord" target="_blank"&gt;TFRecord&lt;/a&gt;: it collects multiple raw data files and compiles them into one &lt;a href="https://ftp.gnu.org/old-gnu/Manuals/tar-1.12/html_node/tar_117.html" target="_blank"&gt;POSIX tar&lt;/a&gt; file. But unlike TFRecord, it doesn’t do any format conversion or assign object semantics to the data; the data format is the same in the tar file as it is on disk. Refer to &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/scaling-deep-learning-workloads-pytorch-xla-and-cloud-tpu-vm"&gt;this blog post&lt;/a&gt; for key pipeline performance enhancements we can achieve with WebDataset.&lt;/p&gt;&lt;p&gt;WebDataset shards a large number of individual images into a small number of tar files. During training, each single network request can fetch multiple images and cache them locally for the next couple of batches, so sequential I/O greatly lowers the overhead of network communication. In the demonstration below, we will see the difference between training on Cloud Storage data (via gcsfuse) with and without WebDataset.&lt;/p&gt;&lt;p&gt;&lt;b&gt;NOTE&lt;/b&gt;: WebDataset has been incorporated into the official &lt;a href="https://github.com/pytorch/data" target="_blank"&gt;TorchData&lt;/a&gt; library as &lt;a href="https://pytorch.org/data/beta/generated/torchdata.datapipes.iter.WebDataset.html#torchdata.datapipes.iter.WebDataset" target="_blank"&gt;torchdata.datapipes.iter.WebDataset&lt;/a&gt;. But the TorchData library is currently in the &lt;b&gt;Beta&lt;/b&gt; stage and doesn’t have a stable release, so we stick with the original WebDataset as the dependency.&lt;/p&gt;&lt;h3&gt;Without distributed training&lt;/h3&gt;&lt;p&gt;We first train ResNet-50 on a single GPU to get a baseline performance:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="3 PyTorch training 121522.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_PyTorch_training_121522.1000064120000310.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;From the result we can see that, when training on a single GPU, using data on Cloud Storage takes about 2x the time of using a local disk. Keep this baseline in mind; we will apply multiple methods to improve performance step by step.&lt;/p&gt;&lt;h3&gt;DataParallel (DP)&lt;/h3&gt;&lt;p&gt;The &lt;a href="https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html" target="_blank"&gt;DataParallel&lt;/a&gt; strategy is the simplest method PyTorch offers for single-machine multi-GPU training, requiring the smallest code change.
In fact, it is as small as a one-line code change:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre class="lang-py"&gt;model = torch.nn.DataParallel(model)&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;We train the ResNet-50 on a single node with 4 GPUs using the DP strategy:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="4 PyTorch training 121522.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_PyTorch_training_121522.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;After applying DP on 4 GPUs, we can see that:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Training with data on the local disk gets 3x faster (from 489s to 157s).&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Training with data on Cloud Storage gets a little faster (from 804s to 738s).&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;It’s apparent that distributed training with data on Cloud Storage becomes input bound, waiting for data to be read because of the network bottleneck.&lt;/p&gt;&lt;h3&gt;DistributedDataParallel (DDP)&lt;/h3&gt;&lt;p&gt;&lt;a href="https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html" target="_blank"&gt;DistributedDataParallel&lt;/a&gt; is more sophisticated and powerful than DataParallel. It’s recommended to use DDP over DP, despite the added complexity, because DP is single-process multi-threaded and suffers from Python GIL contention, while DDP fits more scenarios such as multi-node and model-parallel training.
Here we experimented with DDP on a single node with 4 GPUs, where each GPU is handled by an individual process.&lt;/p&gt;&lt;p&gt;We use the &lt;code&gt;nccl&lt;/code&gt; backend to initialize the process group for DDP and construct the model:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre class="lang-py"&gt;import torch
import torch.distributed as dist

dist.init_process_group(
    backend='nccl', init_method='env://',
    world_size=4, rank=rank)

model = torch.nn.parallel.DistributedDataParallel(model)&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;We train the ResNet-50 on 4 GPUs using the DDP strategy and WebDataset:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="5 PyTorch training 121522.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_PyTorch_training_121522.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;After enabling DDP on 4 GPUs, we can see that:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Training with data on the local disk gets even faster than with DP (from 157s to 134s).&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Training with data on Cloud Storage improves considerably (from 738s to 432s), but is still about 3x slower than using a local disk.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Training with data on Cloud Storage gets dramatically faster (from 432s to 133s) when the source files are in WebDataset format, which is nearly as fast as training with data on the local disk (see the loader sketch below).&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The input-bound problem is largely relieved with DDP, which is expected because there is no longer Python GIL contention when reading data. And despite the added data preparation work, sharding data with WebDataset benefits performance by removing network communication overhead. Finally, DDP and WebDataset together improve training performance by 6x (from 804s to 133s) compared to non-distributed training on individual smaller files.&lt;/p&gt;
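&lt;p&gt;For reference, here is a minimal sketch of the kind of WebDataset input pipeline used above (the shard pattern and transforms are illustrative, not the exact code from the accompanying repo):&lt;/p&gt;&lt;pre class="lang-py"&gt;import torch
import webdataset as wds
from torchvision import transforms

# Hypothetical shard layout; gcsfuse exposes the bucket under /gcs and
# the {..} brace range expands to the individual tar shards.
shards = "/gcs/test-bucket/shards/imagenet-val-{000000..000049}.tar"

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# Each worker streams whole tar shards sequentially, so reads become a
# few large sequential requests instead of many small random ones.
dataset = (
    wds.WebDataset(shards)
    .shuffle(1000)              # shuffle within an in-memory buffer
    .decode("pil")              # decode image bytes to PIL images
    .to_tuple("jpg", "cls")     # (image, label) from the tar members
    .map_tuple(preprocess, lambda label: label)
)

loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)&lt;/pre&gt;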
&lt;h3&gt;FullyShardedDataParallel (FSDP)&lt;/h3&gt;&lt;p&gt;&lt;a href="https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/" target="_blank"&gt;FullyShardedDataParallel&lt;/a&gt; wraps model layers into FSDP units. It gathers full parameters before the forward and backward operations and runs reduce-scatter to synchronize gradients. It achieves lower peak memory usage than DDP in some configurations.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre class="lang-py"&gt;import functools

from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# policy to recursively wrap layers with FSDP
fsdp_auto_wrap_policy = functools.partial(
    size_based_auto_wrap_policy,
    min_num_params=100)

# construct the model to shard model parameters
# across data parallel workers
model = torch.distributed.fsdp.FullyShardedDataParallel(
    model,
    auto_wrap_policy=fsdp_auto_wrap_policy)&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;We train the ResNet-50 on 4 GPUs using the FSDP strategy and WebDataset:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="6 PyTorch training 121522.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/6_PyTorch_training_121522.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;We can see that using FSDP achieves similar training performance to DDP in this configuration on a single node with 4 GPUs.&lt;/p&gt;&lt;p&gt;Comparing performance across these different training strategies, with and without the WebDataset format, we see an overall 6x performance improvement with data on Cloud Storage when using WebDataset together with the DistributedDataParallel or FullyShardedDataParallel strategies. The training performance with data on Cloud Storage is then similar to training with data on a local disk.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="7 PyTorch training 121522.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/7_PyTorch_training_121522.0427027308410533.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h2&gt;Tracking with Vertex AI TensorBoard and Experiments&lt;/h2&gt;&lt;p&gt;As you have seen so far, we carried out performance improvement trials step by step, which required running experiments with several configurations and tracking their settings and outcomes. &lt;a href="https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments"&gt;Vertex AI Experiments&lt;/a&gt; enables seamless experimentation along with tracking. You can track parameters, and visualize and compare the performance metrics of your model and pipeline experiments.&lt;/p&gt;&lt;p&gt;You use the &lt;a href="https://cloud.google.com/vertex-ai/docs/start/client-libraries#python"&gt;Vertex AI Python SDK&lt;/a&gt; to create an experiment and log the parameters, metrics, and artifacts associated with experiment runs. The SDK provides a handy initialization method to create a TensorBoard instance using &lt;a href="https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview"&gt;Vertex AI TensorBoard&lt;/a&gt; for logging model time series metrics.
For example, we tracked training loss, validation accuracy and training run times for each epoch.&lt;/p&gt;&lt;p&gt;Below is the snippet to start an experiment, log model parameters, run the training job and track metrics at the end of the training session:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-code"&gt;&lt;pre class="lang-py"&gt;import pandas as pd
from google.cloud import aiplatform

# Create TensorBoard instance and initialize Vertex AI client
TENSORBOARD_RESOURCE_NAME = aiplatform.Tensorboard.create()
aiplatform.init(project=PROJECT_ID,
                location=REGION,
                experiment=EXPERIMENT_NAME,
                experiment_tensorboard=TENSORBOARD_RESOURCE_NAME,
                staging_bucket=BUCKET_URI)

# start the experiment run
aiplatform.start_run(EXPERIMENT_RUN_NAME)

# log parameters to the experiment
aiplatform.log_params(exp_params)

# create the custom training job
job = aiplatform.CustomJob(
    display_name=DISPLAY_NAME,
    worker_pool_specs=WORKER_SPEC,
    staging_bucket=BUCKET_URI,
    base_output_dir=BASE_OUTPUT_DIR
)

# run the job, streaming time series metrics to the TensorBoard instance
job.run(
    service_account=SERVICE_ACCOUNT,
    tensorboard=TENSORBOARD_RESOURCE_NAME
)

# log the final metrics to the experiment
metrics_df = pd.read_json(metrics_path, typ='series')
aiplatform.log_metrics(metrics_df[metrics_cols].to_dict())

# stop the run
aiplatform.end_run()&lt;/pre&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;The SDK supports a handy &lt;a href="https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform#google_cloud_aiplatform_get_experiment_df"&gt;&lt;code&gt;get_experiment_df&lt;/code&gt;&lt;/a&gt; method to return experiment run information as a Pandas dataframe. Using this dataframe, we can now effectively compare performance between different experiment configurations:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="8 PyTorch training 121522.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/8_PyTorch_training_121522.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Since the experiment is backed by Vertex AI TensorBoard, you can open TensorBoard from the console for deeper analysis. For the experiment, we modified the training code to add TensorBoard scalars for the metrics we were interested in.&lt;/p&gt;
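&lt;p&gt;A minimal sketch of that instrumentation (the AIP_TENSORBOARD_LOG_DIR environment variable is our assumption for where Vertex AI expects TensorBoard logs when a TensorBoard instance is attached; the metric values here are placeholders):&lt;/p&gt;&lt;pre class="lang-py"&gt;import os

from torch.utils.tensorboard import SummaryWriter

# Fall back to a local directory when running outside Vertex AI.
log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR", "./logs")
writer = SummaryWriter(log_dir)

for epoch in range(10):
    # Placeholder values standing in for real per-epoch statistics.
    train_loss = 1.0 / (epoch + 1)
    val_accuracy = 0.5 + 0.04 * epoch
    epoch_seconds = 133.0
    writer.add_scalar("loss/train", train_loss, epoch)
    writer.add_scalar("accuracy/val", val_accuracy, epoch)
    writer.add_scalar("time/epoch_seconds", epoch_seconds, epoch)

writer.close()&lt;/pre&gt;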
&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="9 PyTorch training 121522.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/9_PyTorch_training_121522.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h2&gt;Conclusion&lt;/h2&gt;&lt;p&gt;In this post, we demonstrated how PyTorch training can become input bound when data is read from Google Cloud Storage, and showed approaches to improve performance by comparing distributed training strategies and introducing the WebDataset format.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Use WebDataset to shard individual files, which improves sequential I/O performance by reducing network bottlenecks. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;When training on multiple GPUs, choose the &lt;code&gt;DistributedDataParallel&lt;/code&gt; or &lt;code&gt;FullyShardedDataParallel&lt;/code&gt; distributed training strategies for better performance. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;For large-scale datasets that you cannot download to the local disk, use &lt;code&gt;gcsfuse&lt;/code&gt; to simplify data access to Cloud Storage from Vertex AI, and use WebDataset to shard individual files and reduce network overhead. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Vertex AI improves productivity when carrying out experiments while offering flexibility, security and control. Vertex AI Training custom jobs make it easy to run experiments with several training configurations, GPU shapes and machine specs.
Combined with Vertex AI Experiments and Vertex AI TensorBoard, you can track parameters, visualize and compare the performance metrics of your model and pipeline experiments.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;You can find the accompanying code for this blog post on this &lt;a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/main/community-content/pytorch_efficient_training" target="_blank"&gt;GitHub Repo&lt;/a&gt;.&lt;br/&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Thu, 15 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/ai-machine-learning/efficient-pytorch-training-with-vertex-ai/</guid><category>Google Cloud</category><category>AI &amp; Machine Learning</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Efficient PyTorch training with Vertex AI</title><description>Introducing methods to improve the performance of PyTorch training with cloud data and integrates to these methods Vertex AI.</description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/ai-machine-learning/efficient-pytorch-training-with-vertex-ai/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Xiang Xu</name><title>Software Engineer</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Rajesh Thallam</name><title>Machine Learning Solutions Architect</title><department></department><company></company></author></item><item><title>Using Vertex AI to build an industry leading Peer Group Benchmarking solution</title><link>https://cloud.google.com/blog/products/ai-machine-learning/using-vertex-ai-for-peer-group-benchmarking-in-capital-markets/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;The modern world of financial markets is fraught with volatility and uncertainty. Market participants and members are rethinking the way they approach problems and rapidly changing the way they do business. Access to models, usage patterns, and data has become key to keeping up with ever evolving markets. &lt;/p&gt;&lt;p&gt;One of the biggest challenges firms face in futures and options trading is determining how they benchmark against their competitors. Market participants are continually looking for ways to improve performance, identifying what happened, why it happened, and any associated risks. Leveraging the latest technologies in automation and artificial intelligence, many organizations are using Vertex AI to build a solution around &lt;a href="https://www.investopedia.com/terms/p/peer-group.asp" target="_blank"&gt;peer group&lt;/a&gt; benchmarking and explainability. &lt;/p&gt;&lt;h2&gt;Introduction&lt;/h2&gt;&lt;p&gt;Using the speed and efficiency of Vertex AI, we have developed a solution that will allow market participants to identify similar trading group patterns and assess performance relative to their competition. Machine learning (ML) models for dimensionality reduction, clustering, and explainability are trained to detect patterns and transform data into valuable insights. 
This blog post goes over these models in detail, as well as the ML operations (MLOps) pipeline used to train and deploy these models at scale.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="1 - Introduction Image.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_-_Introduction_Image.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;A series of successive models are used that feed predictive results as training data into the next model (e.g. dimensionality reduction -&amp;gt; clustering -&amp;gt; explainability). This requires a robust automated system for training and maintaining models and data, and provides an ideal use case for the MLOps capabilities of Vertex AI. &lt;/p&gt;&lt;h2&gt;The Solution&lt;/h2&gt;&lt;h3&gt;Data&lt;/h3&gt;&lt;p&gt;A market analytics dataset was used which contains market participant trading metrics aggregated and averaged across a 3-month period. This dataset contains a high number of dimensions. Specific features include buying and selling counts, trade and order quantities, types, first and last fill times, aggressive vs. passive trading indicators, and a number of other features related to trading behavior.&lt;/p&gt;&lt;h3&gt;Modeling&lt;/h3&gt;&lt;p&gt;&lt;b&gt;Dimensionality Reduction&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Clustering in high dimensional space presents a challenge, particularly for distance-based clustering algorithms. As the number of dimensions grows, the distances between points in the dataset converge and become more similar. This distance concentration problem makes it difficult to perform typical cluster analysis on highly dimensional data. &lt;/p&gt;&lt;p&gt;For the task of dimensionality reduction, an Artificial Neural Network (ANN) Autoencoder was used to learn a supervised similarity metric for each market participant in the dataset. This autoencoder takes in each market participant and their associated features. It pushes the information through a hidden layer that is constrained in size, forcing the network to learn how to condense information down into a small encoded representation.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--medium h-c-grid__col h-c-grid__col--4 h-c-grid__col--offset-4 "&gt;&lt;img alt="2 - Dimensionality Reduction.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_-_Dimensionality_Reduction.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;The constrained layer is a vector (z) in latent space, where each element in the vector is a learned reduction of the original market participant features (X), thus allowing dimensionality reduction by simply applying X * z. This results in a new distribution of customer data q(X’ | X) where the distribution is constrained in size to the shape of z. By minimizing the reconstruction error between the initial input X and the autoencoder’s reconstructed output X’, we can balance the overall size of the similarity space (the number of latent dimensions) and the amount of information lost.&lt;/p&gt;
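&lt;p&gt;A minimal sketch of the described bottleneck architecture (the layer sizes and feature count are illustrative; the actual model is not published in this post):&lt;/p&gt;&lt;pre class="lang-py"&gt;import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, n_features, n_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, n_latent))    # the constrained layer z
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 32), nn.ReLU(),
            nn.Linear(32, n_features))  # the reconstruction X'

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder(n_features=40)
loss_fn = nn.MSELoss()  # reconstruction error between X and X'&lt;/pre&gt;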
&lt;p&gt;The resulting output of the autoencoder is a 2-dimensional learned representation of the highly dimensional data.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Clustering&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Experiments were conducted to determine the optimal clustering algorithm, number of clusters, and hyperparameters. A number of models were compared, including density-based spatial clustering of applications with noise (DBSCAN), agglomerative clustering, Gaussian mixture model (GMM), and k-means. Using &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html" target="_blank"&gt;silhouette score&lt;/a&gt; as an evaluation criterion, it was ultimately determined that k-means performed best for clustering on the dimensionally reduced data. &lt;/p&gt;&lt;p&gt;The k-means algorithm is an iterative refinement technique that aims to separate data points into n groups of equal variance. Each of these groups is defined by a cluster centroid, which is the mean of the data points in the cluster. Cluster centroids are initially randomly generated, and iteratively reassigned until the within-cluster sum-of-squares is minimized. The within-cluster sum-of-squares criterion is shown below.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="3 - Clustering.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_-_Clustering.1000064520000368.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;b&gt;Explainability&lt;/b&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/explainable-ai"&gt;Explainable AI&lt;/a&gt; (XAI) aims to provide insights into why a model predicts in a certain way. For this use case, XAI models are used to explain why a market participant was placed into a particular peer group. This is achieved through feature importance: e.g., for each market participant, the top contributing factors toward a peer group cluster assignment. &lt;/p&gt;&lt;p&gt;Deriving explainability from clustering models is somewhat difficult. Clustering is an unsupervised learning problem, which means there are no labels or “ground truth” for the model to analyze. Distance-based clustering algorithms instead rely on creating labels for the data points based on their relative positioning to each other. These labels are assigned as part of the prediction by the k-means algorithm - each point in the dataset is given a peer group assignment that associates it with a particular cluster. &lt;/p&gt;&lt;p&gt;XAI models can be trained on top of k-means by fitting a classifier to these peer group cluster assignments. Using the cluster assignments as labels turns the problem into supervised learning, whereby the end goal is to determine feature importance for the classifier.
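&lt;/p&gt;&lt;p&gt;A minimal sketch of this classifier-on-clusters pattern with the scikit-learn and SHAP libraries (synthetic stand-in data; the post does not name the classifier, so a random forest is assumed here):&lt;/p&gt;&lt;pre class="lang-py"&gt;import numpy as np
import shap
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the dimensionally reduced participant features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

# Unsupervised step: k-means peer-group assignments become the labels.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Supervised step: fit a classifier to the cluster assignments, then
# explain it with Shapley values (marginal per-feature contributions).
clf = RandomForestClassifier(random_state=0).fit(X, labels)
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)&lt;/pre&gt;&lt;p&gt;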
Shapley values, which capture the marginal contribution of each feature to the final classification prediction, are used for feature importance.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="4 - Explainability.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_-_Explainability.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Shapley values are ranked to provide market participants with a powerful tool to analyze what features are contributing the most to their peer group assignments.&lt;/p&gt;&lt;h3&gt;MLOps&lt;/h3&gt;&lt;p&gt;&lt;a href="https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf" target="_blank"&gt;MLOps&lt;/a&gt; is an ML engineering culture and practice that aims to unify ML system development (Dev) and ML system operation (Ops). Using Vertex AI, a fully functioning MLOps pipeline has been constructed that trains and explains peer group benchmarking models. This pipeline is complete with automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management. It also includes a comprehensive approach for continuous integration / continuous delivery (CI/CD). Vertex AI’s end-to-end platform was used to meet these MLOps needs, including:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Distributed training jobs to construct ML models at scale using Vertex AI Pipelines&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Hyperparameter tuning jobs to quickly tune complex models using Vertex AI Vizier&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Model versioning using Vertex AI Model Registry&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Batch prediction jobs using Vertex AI Prediction&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Tracking metadata related to training jobs using Vertex ML Metadata&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Tracking model experimentation using Vertex AI Experiments&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Storing and versioning training data from prediction jobs using Vertex AI Feature Store&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Data validation and monitoring using TensorFlow Data Validation (TFDV)&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The MLOps pipeline is broken down into five core areas:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;CI/CD &amp;amp; Orchestration&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Data Ingestion &amp;amp; Preprocessing&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Dimensionality Reduction&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Clustering&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Explainability&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="5 - MLOps.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_-_MLOps.1980162738603005.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;The CI/CD and orchestration layer was implemented using Vertex AI Pipelines, Cloud Source Repositories (CSR), Artifact Registry, and Cloud Build. 
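&lt;/p&gt;&lt;p&gt;To give a feel for how the successive stages can be chained, here is a minimal sketch using the Kubeflow Pipelines (KFP) v2 SDK that Vertex AI Pipelines runs; the component names, base images, and elided bodies are placeholders, not the production pipeline:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;
# Sketch: successive model stages chained as a Vertex AI (KFP v2)
# pipeline. Component bodies are elided; names are placeholders.
from kfp import compiler, dsl

@dsl.component(base_image="python:3.10")
def reduce_dimensions(source_table: str, embeddings: dsl.Output[dsl.Dataset]):
    ...  # train the autoencoder, write 2-D embeddings to embeddings.path

@dsl.component(base_image="python:3.10")
def cluster_participants(embeddings: dsl.Input[dsl.Dataset],
                         assignments: dsl.Output[dsl.Dataset]):
    ...  # fit k-means on the embeddings, write peer-group assignments

@dsl.component(base_image="python:3.10")
def explain_assignments(assignments: dsl.Input[dsl.Dataset]):
    ...  # fit the classifier and compute Shapley values

@dsl.pipeline(name="peer-group-benchmarking")
def peer_group_pipeline(source_table: str):
    reduced = reduce_dimensions(source_table=source_table)
    clustered = cluster_participants(embeddings=reduced.outputs["embeddings"])
    explain_assignments(assignments=clustered.outputs["assignments"])

compiler.Compiler().compile(peer_group_pipeline, "pipeline.json")
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In the real pipeline the intermediate outputs land in Vertex AI Feature Store, as described below; the sketch only shows the data dependencies that make the stages run in order.&lt;/p&gt;&lt;p&gt;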
When changes are made to the code base, Cloud Build triggers automatically run unit tests, build containers, push them to Artifact Registry, and compile and run the Vertex AI pipeline. &lt;/p&gt;&lt;br/&gt;&lt;p&gt;The pipeline is a sequence of connected components that run successive training and prediction jobs; the outputs from one model are stored in Vertex AI Feature Store and used as inputs into the next model. The end result of this pipeline is a series of trained models for dimensionality reduction, clustering, and explainability, all stored in Vertex AI Model Registry. Peer groups and explainable results are written to Feature Store and BigQuery, respectively.&lt;/p&gt;&lt;h2&gt;Working with AI Services in Google Cloud’s Professional Services Organization (PSO)&lt;/h2&gt;&lt;p&gt;AI Services leads the transformation of enterprise customers and industries with cloud solutions. We are seeing widespread application of AI across Financial Services and Capital Markets. Vertex AI provides a unified platform for training and deploying models and helps enterprises more effectively make data-driven decisions. You can learn more about our work at: &lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/"&gt;Google Cloud&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/vertex-ai"&gt;Vertex AI&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/consulting"&gt;Google Cloud consulting services&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://services.google.com/fh/files/misc/artificial_intelligence_sheet.pdf" target="_blank"&gt;Custom AI-as-a-Service&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;hr/&gt;&lt;p&gt;&lt;i&gt;&lt;sup&gt;This post was edited with help from Mike Bernico, Eugenia Inzaugarat, Ashwin Mishra, and the rest of the delivery team. I would also like to thank core team members Rochak Lamba, Anna Labedz, and Ravinder Lota.&lt;/sup&gt;&lt;/i&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-related_article_tout"&gt;&lt;div class="uni-related-article-tout h-c-page"&gt;&lt;section class="h-c-grid"&gt;&lt;a class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6 h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker" data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                      }' href="https://gweb-cloudblog-publish.appspot.com/products/ai-machine-learning/google-cloud-vertex-ai-accelerates-machine-learning/"&gt;&lt;div class="uni-related-article-tout__inner-wrapper"&gt;&lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;&lt;div class="uni-related-article-tout__content-wrapper"&gt;&lt;div class="uni-related-article-tout__image-wrapper"&gt;&lt;div class="uni-related-article-tout__image" style="background-image: url('https://storage.googleapis.com/gweb-cloudblog-publish/images/applied_ml_summit.max-500x500.jpg')"&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="uni-related-article-tout__content"&gt;&lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Accelerate the deployment of ML in production with Vertex AI&lt;/h4&gt;&lt;p class="uni-related-article-tout__body"&gt;Google Cloud expands Vertex AI to help customers accelerate deployment of ML models into production.&lt;/p&gt;&lt;div class="cta module-cta h-c-copy uni-related-article-tout__cta muted"&gt;&lt;span class="nowrap"&gt;Read Article&lt;svg class="icon h-c-icon" role="presentation"&gt;&lt;use xlink:href="#mi-arrow-forward" xmlns:xlink="http://www.w3.org/1999/xlink"&gt;&lt;/use&gt;&lt;/svg&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/a&gt;&lt;/section&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Thu, 15 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/ai-machine-learning/using-vertex-ai-for-peer-group-benchmarking-in-capital-markets/</guid><category>Financial Services</category><category>Google Cloud</category><category>AI &amp; Machine Learning</category><media:content url="https://storage.googleapis.com/gweb-cloudblog-publish/images/aiml2022_PO1vxqJ.max-600x600.jpg" width="540" height="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Using Vertex AI to build an industry leading Peer Group Benchmarking solution</title><description>Leveraging the latest technologies in artificial intelligence, Vertex AI is being used to build an industry leading Peer Group Benchmarking solution.</description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/aiml2022_PO1vxqJ.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/ai-machine-learning/using-vertex-ai-for-peer-group-benchmarking-in-capital-markets/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Sean Rastatter</name><title>AI Engineer</title><department></department><company></company></author></item><item><title>How Vodafone Hungary migrated their data platform to Google Cloud</title><link>https://cloud.google.com/blog/products/data-analytics/vodafone-hungary-data-platform-migration/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Vodafone is currently the second-largest telecommunications company in Hungary, and recently acquired UPC Hungary to extend its previous mobile services with a fixed-line portfolio. Following the acquisition, Vodafone Hungary serves approximately 3.8 million residential and business subscribers. This story is about how Vodafone Hungary benefited from moving its data and analytics platform to Google Cloud. &lt;/p&gt;&lt;p&gt;To support this acquisition, Vodafone Hungary went through a large business transformation that required changes in many IT systems to create a future-ready IT architecture. 
The goal of the transformation was to provide future-proof services for customers in all segments of the Hungarian mobile market. During this transformation, Vodafone’s core IT systems changed, which created the challenge of building a new data and analytics environment in a fast and effective way. During the project, data had to be moved from the previous on-premises analytics service to the cloud. This was achieved by migrating existing data and merging it with data coming from the new systems in a very short timeframe of around six months. During the project there were several changes in the source system data structure that needed to be adapted quickly on the analytics side to reach the Go Live date.&lt;/p&gt;&lt;h3&gt;Data and analytics in Google Cloud&lt;/h3&gt;&lt;p&gt;To answer this challenge, Vodafone Hungary decided to partner with Google Cloud. The partnership was based on implementing a full metadata-driven analytics environment in a multi-vendor project using cutting-edge Google Cloud solutions such as &lt;a href="https://cloud.google.com/data-fusion"&gt;Data Fusion&lt;/a&gt; and &lt;a href="https://cloud.google.com/bigquery?utm_source=google&amp;amp;utm_medium=cpc&amp;amp;utm_campaign=na-US-all-en-dr-bkws-all-all-trial-p-dr-1011347&amp;amp;utm_content=text-ad-none-any-DEV_c-CRE_622025513236-ADGP_Desk%20%7C%20BKWS%20-%20PHR%20%7C%20Txt%20~%20Data%20Analytics%20~%20BigQuery_Big%20Query-KWID_43700073023088462-kwd-333270004738&amp;amp;utm_term=KW_bigquery%20google-ST_bigquery%20google&amp;amp;gclid=CjwKCAjw2OiaBhBSEiwAh2ZSP3RnJNh0CSfGk_RxZUOYbjDSVfpf2VpJm7BkRqX3qsu4HD_yYtQ0qxoC8isQAvD_BwE&amp;amp;gclsrc=aw.ds"&gt;BigQuery&lt;/a&gt;. The Vodafone Hungary Data Engineering team gained significant knowledge of the new Google Cloud solutions, which meant the team was able to support the company’s long-term initiatives.&lt;/p&gt;&lt;p&gt;Based on data loaded by this metadata-driven framework, Vodafone Hungary built up a sophisticated data and analytics service on Google Cloud that helped it become a data-driven company.&lt;/p&gt;&lt;p&gt;By analyzing data from throughout the company with the help of Google Cloud, Vodafone was able to gain insights that provided a clearer picture of the business. They now have a holistic view of customers across all segments. &lt;/p&gt;&lt;p&gt;Along with these core KPIs, the advanced analytics and Big Data models built on top of this data and analytics service ensure that customers get more personalized offers than was previously possible. It used to be the case that a business requestor needed to define a project to send new data to the data warehouse. The new metadata-driven framework allows the internal data engineering team to onboard new systems and new data in a very short time (within days), thus speeding up the BI development and decision-making process.&lt;/p&gt;&lt;h3&gt;Technical solution&lt;/h3&gt;&lt;p&gt;The solution uses several technical innovations to meet the requirements of the business. The local data extraction solution is built on top of CDAP and Hadoop technologies, written as CDAP pipelines, PySpark jobs, and Unix shell scripts. In this layer, the system gets data from several sources in several formats, including database extracts and different file types. The system needs to manage around 1,900 loads on a daily basis, with most data arriving in a five-hour time frame. 
Therefore, the framework needs to be a highly scalable system that can handle the high loading peaks without generating unexpected cost during the low peaks.&lt;/p&gt;&lt;p&gt;Once collected, the data from the extraction layer goes to the cloud in an encrypted and anonymized format. In the cloud, the extracted data lands in a &lt;a href="https://cloud.google.com/storage"&gt;Google Cloud Storage&lt;/a&gt; bucket. When a file arrives, it triggers the Data Fusion pipelines in an event-based way by using the Log Sink, &lt;a href="https://cloud.google.com/pubsub"&gt;Pub/Sub&lt;/a&gt;, &lt;a href="https://cloud.google.com/functions"&gt;Cloud Function&lt;/a&gt;, and REST API. After triggering the data load, &lt;a href="https://cloud.google.com/composer"&gt;Cloud Composer&lt;/a&gt; controls the execution of the metadata-driven, template-based, auto-generated DAGs. Data Fusion ephemeral clusters were chosen as they adapt to the size of each data pipeline while also controlling costs during low peaks. &lt;/p&gt;&lt;p&gt;The principle of limited responsibility is important. Each component has a relatively limited range of responsibilities, which means that Cloud Function, DAGs, and Pipelines contain the minimum responsibilities and logic that is necessary to finish their own tasks.&lt;/p&gt;&lt;p&gt;After loading this data into a raw layer, several tasks are triggered in Data Fusion to build up a historical aggregated layer. The Vodafone Hungary data team can use this to create their own reports in a Qlik environment (which also runs on the Google Cloud environment) and build up Big Data and advanced analytical models using the Vodafone standard Big Data framework. &lt;/p&gt;&lt;p&gt;The most critical point of the architecture is the custom triggering function, which handles scheduling and execution of processes. The process triggers more than 1,900 DAGs per day, while also moving and processing around 1 TB of anonymized data per day.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Vodafone Hungary 121422.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Vodafone_Hungary_121422.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;The way forward&lt;/h3&gt;&lt;p&gt;After stabilization, optimization of the processes began, taking into account cost and efficiency levels. The architecture was upgraded to use Airflow 2 and Composer 2 as these systems became available. Moving the architecture to these versions increased performance and manageability. Going forward, Vodafone Hungary will continue searching for even more ways to improve processes with the help of the Google Support team. &lt;/p&gt;&lt;p&gt;To support fast and effective processing, Vodafone Hungary recently decided to move the control tables to Google &lt;a href="https://cloud.google.com/spanner"&gt;Cloud Spanner&lt;/a&gt; and keep only the business data in BigQuery. 
This delivered a great improvement in processing.&lt;/p&gt;&lt;p&gt;In the analytics area, Vodafone Hungary plans to move to more advanced and cutting-edge technologies, which will allow the Big Data team to improve their performance by using Google Cloud native machine learning tools such as &lt;a href="https://cloud.google.com/automl"&gt;AutoML&lt;/a&gt; and Vertex AI. These will further improve the effectiveness of the targeted campaigns and offer the benefit of advanced data analysis.&lt;/p&gt;&lt;p&gt;To get started, we recommend you check out &lt;a href="https://console.cloud.google.com/bigquery?_ga=2.979249.1030392842.1669588001-42349575.1669587954"&gt;BigQuery's free trial&lt;/a&gt; and &lt;a href="https://cloud.google.com/bigquery/docs/migration-assessment"&gt;BigQuery's Migration Assessment&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Wed, 14 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/data-analytics/vodafone-hungary-data-platform-migration/</guid><category>Google Cloud</category><category>Data Analytics</category><media:content url="https://storage.googleapis.com/gweb-cloudblog-publish/images/teleco_2022_B5LTQfV.max-600x600.jpg" width="540" height="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>How Vodafone Hungary migrated their data platform to Google Cloud</title><description>How Vodafone Hungary migrated their data platform to Google Cloud.</description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/teleco_2022_B5LTQfV.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/data-analytics/vodafone-hungary-data-platform-migration/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Gergely Szalai</name><title>Senior Manager, Data Engineering, Vodafone</title><department></department><company></company></author></item><item><title>Carbon Health transforms operating outcomes with Connected Sheets for Looker</title><link>https://cloud.google.com/blog/products/data-analytics/connected-sheets-for-looker-powers-carbon-health-data-analytics/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Everyone wants affordable, quality healthcare but not everyone has it. A 2021 report by the Commonwealth Fund ranked the U.S. in last place among 11 high-income countries in healthcare access.&lt;sup&gt;1&lt;/sup&gt; Carbon Health is working to change that. We are doing so by combining the best of virtual care, in-person visits, and technology to support patients with their everyday physical and mental health needs.&lt;br/&gt;&lt;/p&gt;&lt;h3&gt;Rethinking how data and analytics are accessed at Carbon Health &lt;/h3&gt;&lt;p&gt;Delivering premium healthcare for the masses that's accessible and affordable is an ambitious undertaking. It requires a commitment to operating the business in an efficient and disciplined way. To meet our goals, our teams across the company require detailed, daily insights into operating results.&lt;/p&gt;&lt;p&gt;In the last year, we realized our existing BI platform was inaccessible to most of our employees outside of R&amp;amp;D. Creating the analytics, dashboards, and reports needed by our clinic leaders and executives required direct help from our data scientists. &lt;/p&gt;&lt;p&gt;However, this has all changed since deploying Looker as our new BI platform. 
We initially used Looker to build tables, charts, and graphs that improved how people could access and analyze data about our operating efficiency. As we continued to evaluate how our data and analytics should be experienced by our in-clinic staff, we learned about Connected Sheets for Looker, which has unlocked an entirely new way of sharing insights across the company.&lt;br/&gt;&lt;/p&gt;&lt;h3&gt;A new way to deliver performance reporting and drive results&lt;/h3&gt;&lt;p&gt;Connected Sheets for Looker gives Carbon Health employees who work in Google Sheets—practically everyone—a familiar tool for working with Looker data. For instance, one of our first outputs using the Connected Sheets integration has been a daily and weekly performance push-report for the clinic’s operating leaders, including providers. &lt;/p&gt;&lt;p&gt;Essentially a scorecard, the report tracks the most important KPIs for measuring clinics' successes, including appointment volume, patient satisfaction such as net promoter score (NPS), reviews, phone call answer rates, and even metrics about billing and collections. To provide easy access, we built a workflow through Google Apps Script that takes our daily performance report and automatically emails a PDF to key clinic leaders each morning. &lt;/p&gt;&lt;p&gt;Within the first 30 days of the report's creation, clinic leaders were able to drive noticeable improvements in operating results. For instance, actively tracking clinic volume has enabled us to manage our schedules more effectively, which in turn drives more visits and enables us to better communicate expectations with our patients. Other clinics have dramatically improved their call answer rates by tracking inbound call volume, which has also led to better patient satisfaction. &lt;/p&gt;&lt;h3&gt;Greater accountability, greater collaboration&lt;/h3&gt;&lt;p&gt;As you can imagine, a report that holds people accountable for outcomes in such a visible way can create some anxiety. We've eased those concerns by using the information constructively, with the goal of using reporting as a positive feedback mechanism to bolster open collaboration and identify operational processes that need improvement. For example, data about our call answer rates initiated an investigation that led to an operational redesign of how phones are deployed and managed at more than 120 clinics across the U.S.&lt;/p&gt;&lt;h3&gt;Looker as a scalable solution with endless applications&lt;/h3&gt;&lt;p&gt;We're now rolling out Connected Sheets for Looker to deliver performance push-reporting across all teams at Carbon Health. Additionally, we continue to find new ways to leverage Connected Sheets for Looker to meet other needs of the business. &lt;/p&gt;&lt;p&gt;For instance, we've recently been able to better understand our software costs by analyzing vendor spend from our accounting systems directly in Google Sheets. Going forward, this will allow us to build a basic workflow to monitor subscription spend and employee application usage, which will lead to us saving money on unnecessary licenses and underutilized software. &lt;/p&gt;&lt;p&gt;We've come a long way in the last year. Between Looker and its integration with Google Sheets, we can meet the data needs of all our stakeholders at Carbon Health. Connected Sheets for Looker has been an impactful solution that's going to help us drive measurable results in how we deliver premium healthcare to the masses.&lt;/p&gt;&lt;hr/&gt;&lt;p&gt;&lt;i&gt;&lt;sup&gt;1. 
&lt;a href="https://www.commonwealthfund.org/publications/fund-reports/2021/aug/mirror-mirror-2021-reflecting-poorly" target="_blank"&gt;Mirror, Mirror 2021: Reflecting Poorly&lt;/a&gt;&lt;br/&gt;2. &lt;a href="https://www.forbes.com/sites/katiejennings/2021/07/21/meet-the-immigrant-entrepreneurs-who-raised-350-million-to-rethink-us-primary-care/?sh=444c72572b2c" target="_blank"&gt;Meet The Immigrant Entrepreneurs Who Raised $350 Million To Rethink U.S. Primary Care&lt;/a&gt;&lt;/sup&gt;&lt;/i&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-related_article_tout"&gt;&lt;div class="uni-related-article-tout h-c-page"&gt;&lt;section class="h-c-grid"&gt;&lt;a class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6 h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker" data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }' href="https://gweb-cloudblog-publish.appspot.com/products/infrastructure-modernization/analyze-your-looker-data-through-google-sheets/"&gt;&lt;div class="uni-related-article-tout__inner-wrapper"&gt;&lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;&lt;div class="uni-related-article-tout__content-wrapper"&gt;&lt;div class="uni-related-article-tout__image-wrapper"&gt;&lt;div class="uni-related-article-tout__image" style="background-image: url('https://storage.googleapis.com/gweb-cloudblog-publish/images/Blog-Banner_2880x1200_v12x-1.max-500x500.jpg')"&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="uni-related-article-tout__content"&gt;&lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Analyze Looker-modeled data through Google Sheets&lt;/h4&gt;&lt;p class="uni-related-article-tout__body"&gt;Connected Sheets for Looker brings modeled, trusted data into Google Sheets, enabling users to work in a way that is comfortable and conv...&lt;/p&gt;&lt;div class="cta module-cta h-c-copy uni-related-article-tout__cta muted"&gt;&lt;span class="nowrap"&gt;Read Article&lt;svg class="icon h-c-icon" role="presentation"&gt;&lt;use xlink:href="#mi-arrow-forward" xmlns:xlink="http://www.w3.org/1999/xlink"&gt;&lt;/use&gt;&lt;/svg&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/a&gt;&lt;/section&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Wed, 14 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/data-analytics/connected-sheets-for-looker-powers-carbon-health-data-analytics/</guid><category>Google Cloud</category><category>Application Modernization</category><category>Data Analytics</category><media:content url="https://storage.googleapis.com/gweb-cloudblog-publish/images/healthcare_2022_CWzvzU5.max-600x600.jpg" width="540" height="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Carbon Health transforms operating outcomes with Connected Sheets for Looker</title><description>Carbon Health, a hybrid healthcare provider, transforms operating outcomes with performance management reporting through Connected Sheets for Looker.</description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/healthcare_2022_CWzvzU5.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/data-analytics/connected-sheets-for-looker-powers-carbon-health-data-analytics/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Christoffer Prompovitch</name><title>Product Lead, Carbon Health</title><department></department><company></company></author></item><item><title>Minimal Downtime Migrations to Cloud Spanner with HarbourBridge 2.0</title><link>https://cloud.google.com/blog/topics/developers-practitioners/minimal-downtime-migrations-cloud-spanner-harbourbridge-20/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Spanner is a fully managed, strongly consistent and highly available database providing up to 99.999% availability. It is also very easy to create your Spanner instance and point your application to it. But what if you want to migrate your schema and data from another database to Cloud Spanner? 
The common challenges with database migrations are ensuring high throughput of data transfer and high availability of your application with minimal downtime, all enabled by a user-friendly migration solution. &lt;/p&gt;&lt;p&gt;Today, we are excited to announce the launch of &lt;a href="https://github.com/cloudspannerecosystem/harbourbridge" target="_blank"&gt;HarbourBridge 2.0&lt;/a&gt; (Preview), an easy-to-use open source migration tool, now with enhanced capabilities for schema and data migrations with minimal downtime.&lt;/p&gt;&lt;p&gt;This blog demonstrates the migration of schema and data for an application from MySQL to Spanner using HarbourBridge.&lt;/p&gt;&lt;h3&gt;About HarbourBridge&lt;/h3&gt;&lt;p&gt;&lt;a href="https://github.com/cloudspannerecosystem/harbourbridge" target="_blank"&gt;HarbourBridge&lt;/a&gt; is an easy-to-use open source tool, which gives you highly detailed schema assessments and recommendations and allows you to perform migrations with minimal downtime. It just lets you point, click, and trigger your schema and data migrations. It provides a unified interface for the migration, giving users the flexibility to modify the generated Spanner schema and run an end-to-end migration from a single interface. It provides the ability to edit table details like columns, primary keys, foreign keys, and indexes, and it provides insights into schema conversion performance while highlighting important issues and suggestions.&lt;/p&gt;&lt;h3&gt;What's new in HarbourBridge 2.0?&lt;/h3&gt;&lt;p&gt;With this recent launch, you can now do the following:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Perform end-to-end, minimal-downtime, terabyte-scale data migrations &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Get improved schema assessment and recommendations&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Experience ease of access with gcloud integration &lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;We’ll experience the power of some of these cool new add-ons as we walk through the various application migration scenarios in this blog.&lt;/p&gt;&lt;h3&gt;Types of Migration&lt;/h3&gt;&lt;p&gt;Data migration with HarbourBridge is of two types:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Minimal Downtime &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Migration with downtime&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Minimal downtime migration is for real-time transactions and incremental updates in business-critical applications, ensuring business continuity with minimal interruption. Migration with downtime is recommended only for POCs/test environment setups or applications that can tolerate a few hours of downtime.&lt;/p&gt;&lt;h3&gt;Connecting HarbourBridge to source&lt;/h3&gt;&lt;p&gt;There are three ways to connect HarbourBridge to your source database:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Direct connection to Database - for minimal downtime and continuous data migration for a certain time period&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Data dump - for a one-time migration of the source database dump into Spanner &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Session file - to load from a previous HarbourBridge session&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h3&gt;Migration components of HarbourBridge&lt;/h3&gt;&lt;p&gt;With HarbourBridge you can choose to migrate:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Schema-only &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Data-only &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Both Schema and Data &lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;The image below 
shows, at a high level, the various components involved behind the scenes in a data migration:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Components of HarbourBridge" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_LXZ5vAo.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;To manage a low-downtime migration, HarbourBridge orchestrates the following processes for you. You only have to set up connection profiles from the HarbourBridge UI on the migration page; everything else is handled by HarbourBridge under the hood:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;HarbourBridge sets up a &lt;a href="https://cloud.google.com/storage"&gt;Cloud Storage&lt;/a&gt; bucket to store incoming change events on the source database while the snapshot migration progresses&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;HarbourBridge sets up a Datastream job to bulk load a snapshot of the data and stream incremental writes. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;HarbourBridge sets up the &lt;a href="https://cloud.google.com/dataflow"&gt;Dataflow&lt;/a&gt; job to migrate the change events into Spanner, which empties the Cloud Storage bucket over time&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Validate that most of the data has been copied over to Spanner, and then stop writing to the source database so that the remaining change events can be applied. This results in a short downtime while Spanner catches up to the source database. Afterward, the application can be cut over to use Spanner as the main database.&lt;/p&gt;&lt;h3&gt;The application&lt;/h3&gt;&lt;p&gt;The use case we created to demonstrate this migration is an application that streams in live (near real-time) T20 cricket match data ball-by-ball and calculates the &lt;a href="https://en.wikipedia.org/wiki/Duckworth%E2%80%93Lewis%E2%80%93Stern_method" target="_blank"&gt;Duckworth Lewis&lt;/a&gt; Target Score (also known as the Par Score) for Team 2, second innings, in case the match is disrupted mid-innings due to rain or other circumstances. This is calculated using the famous Duckworth Lewis Stern (DLS) algorithm and gets updated for every ball in the second innings; that way we will always know what the winning target is, in case the match gets interrupted and is not continued thereafter. There are several scenarios in cricket that use the DLS algorithm for determining the target or winning score. &lt;/p&gt;&lt;p&gt;&lt;b&gt;MySQL Database&lt;/b&gt;&lt;/p&gt;&lt;p&gt;In this use case, we are using Cloud SQL for MySQL to house the ball-by-ball data being streamed in. The DLS Target client application streams data into MySQL database tables, which will be migrated to Spanner. &lt;/p&gt;&lt;p&gt;&lt;b&gt;Application Migration Architecture&lt;/b&gt;&lt;/p&gt;&lt;p&gt;In this migration, our source data is being sent in bulk and in streaming modes to the MySQL table, which is the source of the migration. A Java Cloud Function simulates the ball-by-ball streaming, calculates the Duckworth Lewis Target Score, and updates it in the baseline table. HarbourBridge reads from MySQL and writes (schema and data) into Cloud Spanner. 
&lt;/p&gt;&lt;p&gt;The diagram below represents a high-level architectural overview of the migration process:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Architecture" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Untitled_design_16.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;b&gt;Note:&lt;/b&gt; In our case, the streaming process is simulated with data coming from a CSV into a landing table in MySQL, which then streams match data by pushing it row by row to the baseline MySQL table. This is the table used for further updates and DLS Target calculations.&lt;/p&gt;&lt;h3&gt;Migrating MySQL to Spanner with HarbourBridge&lt;/h3&gt;&lt;p&gt;&lt;b&gt;Set up HarbourBridge &lt;/b&gt;&lt;/p&gt;&lt;p&gt;Run the following two gcloud commands in Cloud Shell:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Install the HarbourBridge component of gcloud by running:&lt;br/&gt;&lt;code&gt;gcloud components install HarbourBridge&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Start the HarbourBridge UI by running:&lt;br/&gt;&lt;code&gt;gcloud alpha spanner migration web&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Your HarbourBridge application should be up and running:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of HarbourBridge page to set up source connection" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image14_xleaN19.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;b&gt;Note&lt;/b&gt;: &lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Before proceeding with the migration, remember to enable the &lt;a href="https://cloud.google.com/datastream/docs/use-the-datastream-api#enable_the_api"&gt;Datastream&lt;/a&gt; and &lt;a href="https://cloud.google.com/dataflow"&gt;Dataflow&lt;/a&gt; &lt;a href="https://cloud.google.com/endpoints/docs/openapi/enable-api#:~:text=Click%20the%20API%20you%20want,about%20the%20API%2C%20click%20Enable."&gt;APIs&lt;/a&gt; from the Google Cloud Console&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Ensure you have &lt;a href="https://cloud.google.com/sql/docs/mysql/create-manage-databases"&gt;Cloud SQL for MySQL&lt;/a&gt; or your own MySQL server created for the source and a Spanner &lt;a href="https://cloud.google.com/spanner/docs/create-manage-instances"&gt;instance&lt;/a&gt; created for the target&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Ensure all source database instance objects are created. 
For access to the DB DDLs, DMLs, and the data CSV file, refer to this git repo &lt;a href="https://github.com/AbiramiSukumaran/harbourbridge-dls-spanner/tree/main/MySQL" target="_blank"&gt;folder&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;For data validation (a post-migration step), for SELECT queries for both source and Spanner, refer to this git repo &lt;a href="https://github.com/AbiramiSukumaran/harbourbridge-dls-spanner/tree/main/Data%20Validation" target="_blank"&gt;folder&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Ensure the Cloud Function is created and deployed (for streaming simulation and DLS Target score calculation). For the source code, refer to the git repo &lt;a href="https://github.com/AbiramiSukumaran/harbourbridge-dls-spanner/tree/main/Cloud%20Functions%20Project" target="_blank"&gt;folder&lt;/a&gt;. You can learn how to deploy a Java function to Cloud Functions &lt;a href="https://cloud.google.com/functions/docs/create-deploy-gcloud#deploying_the_function"&gt;here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Also ensure that your proxy is set up and running when trying to connect to the source from HarbourBridge. If you are using Cloud SQL for MySQL, you can ensure that the proxy is running by executing the following command in Cloud Shell:&lt;br/&gt;&lt;code&gt;./cloud_sql_proxy -instances=&amp;lt;&amp;lt;Project-id:Region:instance-name&amp;gt;&amp;gt;=tcp:&amp;lt;&amp;lt;3306&amp;gt;&amp;gt;&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br/&gt;&lt;p&gt;&lt;b&gt;Connect to the source&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Of the three modes of connecting to the source, we will use the “Connect to database” method to establish the connection:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot selecting “Connect to database” option" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image6_KM8MQbt.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;i&gt;Provide the connection credentials and hit connect:&lt;/i&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot with connection details entered" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image12_w5fmNSB.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;You are now connected to the source and HarbourBridge will land you on the next step of migration.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Schema Assessment and Configuration&lt;/b&gt;&lt;/p&gt;&lt;p&gt;At this point, you get to see both the source (MySQL) version of the schema and the target draft version on the “Configure Schema” page. 
The Target draft version is the workspace for all edits you can perform on the schema on your destination database, that is, Cloud Spanner.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of HarbourBridge “Configure Schema” page" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image13_dgUrUpK.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;HarbourBridge provides you with comprehensive assessment results and recommendations for improving the schema structure and performance. &lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;As you can see in the image above, the icons to the left of each table represent the complexity of table conversion changes as part of the schema migration&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;In this case, the STD_DLS_RESOURCE table requires high-complexity conversion changes, whereas the others require minimal-complexity changes&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The recommendation on the right provides information about the storage requirements of specific columns, and there are other warnings indicated in the columns list as well&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;You have the ability to make changes to the column types at this point &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Changes and suggestions related to primary keys, foreign keys, interleaved tables, indexes, and other dependencies are also available&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Once changes are made to the schema, HarbourBridge gives you the ability to review the DDL and confirm changes&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Once you confirm, the schema changes take effect before the migration is triggered&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of HarbourBridge “Review the DDL changes” popup" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image11_NFuD1CY.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Schema changes are saved successfully.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Prepare Migration&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Click the “Prepare Migration” button on the top right corner of the HarbourBridge page.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of HarbourBridge “Prepare Migration” page" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image9_YbKpyCI.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;1. Select Migration Mode as “Schema and Data”&lt;br/&gt;2. Select Migration Type as “Minimal Downtime Migration”&lt;br/&gt;3. 
Set up Target Cloud Spanner Instance&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of HarbourBridge, Prepare Migration, “Target Details” page" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image4_uvQZkzS.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;b&gt;NOTE&lt;/b&gt;: HarbourBridge UI supports only Google SQL dialect as a Spanner destination today. Support for the PostgreSQL dialect will be added soon.&lt;/p&gt;&lt;p&gt;4. Set up Source Connection profile&lt;/p&gt;&lt;p&gt;This is your connection to the MySQL data source. Ensure the IP addresses displayed on the screen are allow-listed by your source.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of Prepare Migration “Source Connection Profile” popup" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image2_nkIqyj5.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;5. Set up Target Connection profile&lt;/p&gt;&lt;p&gt;This is the connection to your Datastream job destination, which is Cloud Storage. Please select the instance and make sure you have allow-listed the necessary access.&lt;/p&gt;&lt;p&gt;Once done, hit Migrate at the bottom of the page and wait for the migration to start. HarbourBridge takes care of everything else, including setting up the Datastream and Dataflow jobs and executing them under the hood. You have the option to set these up on your own, but with the latest launch of HarbourBridge that is no longer necessary.&lt;br/&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of HarbourBridge “Schema migration completed successfully” message" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image17_WhGr5SS.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Wait until you see the message “Schema migration completed successfully” on the same page. Once you see that, head over to your Spanner database to validate the newly created (migrated) schema.&lt;/p&gt;&lt;br/&gt;&lt;p&gt;&lt;b&gt;Validate Schema and Initial Data&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Connect to the Spanner instance, and head over to the database “cricket_db”. 
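&lt;/p&gt;&lt;p&gt;If you prefer to validate from a script rather than the console, here is a quick sketch using the google-cloud-spanner client to list the migrated tables; the instance ID is an illustrative assumption:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;
# Sketch: list the migrated tables via INFORMATION_SCHEMA.
# The instance ID is illustrative; the database follows the example above.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("spanner-demo").database("cricket_db")
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_catalog = '' AND table_schema = ''")
    for (table_name,) in rows:
        print(table_name)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;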
You should see the tables and the rest of the schema migrated over to the Spanner database:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of Cloud Spanner “Overview” page to validate the schema migration" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image10_aEDRUH4.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;b&gt;Set up Streaming Data&lt;/b&gt;&lt;/p&gt;&lt;p&gt;As part of the setup, after the initial data is migrated, trigger the Cloud Function to kickstart data streaming into MySQL.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Validate Streaming Data&lt;/b&gt;&lt;br/&gt;&lt;/p&gt;&lt;p&gt;Verify that the streaming data migrates into Spanner as the streaming happens.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of Cloud Functions Trigger page with HTTPS URL for the function" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image5_OkwUvPX.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;The Cloud Function (Java) can be triggered by hitting the HTTPS URL in the Trigger section of the function’s detail page. Once the streaming starts, you should see data flowing into MySQL and the Target DLS score for Innings 2 getting updated in the DLS table.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of MySQL query result to see source data and DLS target score" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image15_nXvbDIh.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;In the above image, you can see the record count go from 1705 to 1805 with the streaming. Also, the DLS Target field has a calculated value of 112 for the most recent ball.&lt;/p&gt;&lt;p&gt;Now let’s check if the Spanner database table got the updates during the migration. Go to the Spanner table and query:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of Cloud Spanner “Query” page to validate data migration" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image3_NOTt2N3.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;As you can see, Spanner has records increasing as part of migration as well. 
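&lt;/p&gt;&lt;p&gt;A lightweight way to script the same spot check is to compare row counts on both sides; in this sketch the connection details and table name are illustrative assumptions (PyMySQL and the google-cloud-spanner client):&lt;/p&gt;&lt;pre&gt;&lt;code&gt;
# Sketch: compare row counts between the MySQL source and the migrated
# Spanner table. Credentials, instance ID, and table name are illustrative.
import pymysql
from google.cloud import spanner

TABLE = "dls_match_data"  # hypothetical table name

source = pymysql.connect(host="127.0.0.1", user="root",
                         password="secret", database="cricket_db")
with source.cursor() as cursor:
    cursor.execute(f"SELECT COUNT(*) FROM {TABLE}")
    mysql_count = cursor.fetchone()[0]
source.close()

client = spanner.Client()
database = client.instance("spanner-demo").database("cricket_db")
with database.snapshot() as snapshot:
    spanner_count = list(snapshot.execute_sql(f"SELECT COUNT(*) FROM {TABLE}"))[0][0]

print(f"MySQL rows: {mysql_count}, Spanner rows: {spanner_count}")
&lt;/code&gt;&lt;/pre&gt;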
&lt;p&gt;Also note the change in the Target Score field value ball after ball:&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of Cloud Spanner “Query” page to validate data migration for Target Score" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image16_TfWFscA.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Wait until you see all the changes migrated over.&lt;/p&gt;&lt;p&gt;For data validation, you can use &lt;a href="https://github.com/GoogleCloudPlatform/professional-services-data-validator" target="_blank"&gt;DVT&lt;/a&gt; (Data Validation Tool), which is a standardized data validation method built by Google, and can be incorporated into existing GCP tools and technologies. In our use case, I validated the migration of the initial set of records from the MySQL source to the Spanner table using Cloud Spanner queries, much like the count check sketched above. &lt;/p&gt;&lt;p&gt;&lt;b&gt;End the Migration&lt;/b&gt;&lt;/p&gt;&lt;p&gt;When you complete all these validation steps, click End Migration. Follow the steps below to update your application to point to the Spanner database:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Stop writes to the source database - &lt;b&gt;This will initiate a period of downtime&lt;/b&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Wait for any other incremental writes to Spanner to catch up with the source&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Once you are sure the source and Spanner are in sync, update the application to point to Spanner&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Start your application with Spanner as the database&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Perform smoke tests to ensure all scenarios are working&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cut over the traffic to your application with Spanner as the database&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;This marks the end of the downtime period&lt;/b&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Screenshot of HarbourBridge “End Migration” pop up with “Clean Up” button" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image7_OoXoTc1.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;b&gt;Clean Up &lt;/b&gt;&lt;/p&gt;&lt;p&gt;Finally, hit the “Clean Up” button on the End Migration popup screen. 
This will remove the migration jobs and dependencies that were created in the process.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Watch the migration in action&lt;/b&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-video"&gt;&lt;div class="article-module article-video "&gt;&lt;figure&gt;&lt;a class="h-c-video h-c-video--marquee" data-glue-modal-disabled-on-mobile="true" data-glue-modal-trigger="uni-modal-vBTlF2I2NwM-" href="https://youtube.com/watch?v=vBTlF2I2NwM"&gt;&lt;img alt="Minimal Downtime Migrations to Spanner with HarbourBridge 2.0" src="//img.youtube.com/vi/vBTlF2I2NwM/maxresdefault.jpg"/&gt;&lt;svg class="h-c-video__play h-c-icon h-c-icon--color-white" role="img"&gt;&lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;&lt;/svg&gt;&lt;/a&gt;&lt;figcaption class="article-video__caption h-c-page"&gt;&lt;h4 class="h-c-headline h-c-headline--four h-u-font-weight-medium h-u-mt-std"&gt;Minimal Downtime Migrations to Spanner with HarbourBridge 2.0&lt;/h4&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;&lt;div class="h-c-modal--video" data-glue-modal="uni-modal-vBTlF2I2NwM-" data-glue-modal-close-label="Close Dialog"&gt;&lt;a class="glue-yt-video" data-glue-yt-video-autoplay="true" data-glue-yt-video-height="99%" data-glue-yt-video-vid="vBTlF2I2NwM" data-glue-yt-video-width="100%" href="https://youtube.com/watch?v=vBTlF2I2NwM" ng-cloak=""&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Next Steps&lt;/h3&gt;&lt;p&gt;As you walked through this migration with us, you will have noticed how easy it is to point to your database, assess and modify your schema based on recommendations, and migrate your schema, your data, or both to Spanner with minimal downtime.&lt;/p&gt;&lt;p&gt;You can learn more about HarbourBridge on the &lt;a href="https://github.com/cloudspannerecosystem/harbourbridge/blob/master/README.md" target="_blank"&gt;README&lt;/a&gt;, and learn how to install gcloud &lt;a href="https://cloud.google.com/spanner/docs/getting-started/set-up"&gt;here&lt;/a&gt;. &lt;/p&gt;&lt;h3&gt;Get started today&lt;/h3&gt;&lt;p&gt;Spanner’s unique architecture allows it to scale horizontally without compromising on the consistency guarantees that developers rely on in modern relational databases. 
Try out Spanner today for &lt;a href="https://youtu.be/mTzmAa9L7Oc" target="_blank"&gt;free for 90 days&lt;/a&gt; or for as low as &lt;a href="https://youtu.be/m3mbOgjqQ7k" target="_blank"&gt;$65 USD per month&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Wed, 14 Dec 2022 13:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/minimal-downtime-migrations-cloud-spanner-harbourbridge-20/</guid><category>Cloud Migration</category><category>Google Cloud</category><category>Developers &amp; Practitioners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Minimal Downtime Migrations to Cloud Spanner with HarbourBridge 2.0</title><description>We're demonstrating migration of schema and data for an application from MySQL to Cloud Spanner using HarbourBridge.</description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/minimal-downtime-migrations-cloud-spanner-harbourbridge-20/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abirami Sukumaran</name><title>Developer Advocate, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mohit Gulati</name><title>Product Manager, Google</title><department></department><company></company></author></item><item><title>Using budgets to automate cost controls</title><link>https://cloud.google.com/blog/topics/developers-practitioners/using-budgets-automate-cost-controls/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;b&gt;TL;DR - Budgets can do more than just track costs! You can set up automated cost controls using programmatic budget notifications, and we have an interactive walkthrough with sample architecture to help get you started.&lt;/b&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="budget controls" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_m3O8DmY.max-1000x1000.png"/&gt;&lt;figcaption class="article-image__caption "&gt;&lt;div class="rich-text"&gt;Budgets can help you answer cost questions, and so much more!&lt;/div&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;There are a few blog posts on &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/protect-your-google-cloud-spending-budgets"&gt;what Google Cloud Budgets are&lt;/a&gt; and how to use them for more than just sending emails by &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/costs-meet-code-programmatic-budget-notifications"&gt;using programmatic budget notifications&lt;/a&gt;. These are important steps to take when using Google Cloud, so you can ask questions about your costs and get meaningful answers in the systems you already use. As your cloud usage grows and matures, you may also need to be more proactive in dealing with your costs.&lt;/p&gt;&lt;h3&gt;More than just a budget&lt;/h3&gt;&lt;p&gt;To recap: budgets let you create a dynamic way of being alerted about your costs, such as getting emails when you've spent or are forecasted to spend a certain amount. 
When creating a budget, you can provide a fixed amount or have the amount based on the previous period, so you could set up a budget that alerts you if your spending changes significantly month over month. In addition, you can have budgets send data to Pub/Sub on a regular basis (programmatic budget notifications) that can be used however you'd like, such as &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/have-budget-notifications-come-your-favorite-comms-channels"&gt;sending messages to Slack&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;Budgets that send out notifications are flexible enough to do just about anything, but that's also where things can become a bit tricky to set up. If you're monitoring the costs for a large company with a lot of cloud usage, that could involve multiple environments with lots of products being used in different ways. Being informed about the costs is a good starting point, but you'll likely want to set up automated cost controls to protect yourself and your cloud spending.&lt;/p&gt;&lt;p&gt;In essence, setting up automated cost controls is the same as using programmatic budget notifications: the budget occasionally sends out a Pub/Sub message, and you create a Cloud Function (or similar) subscriber that receives that message and runs some code. Of course, the specifics of that code will depend heavily on your business needs, ranging from sending a text message all the way to shutting down cloud resources. While the specifics are up to you, we've made a few things to make getting started easier!&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Show me the way&lt;/h3&gt;&lt;p&gt;&lt;a href="https://console.cloud.google.com/?walkthrough_id=billing--budget--cost_enforcement"&gt;We've created an interactive walkthrough&lt;/a&gt; to help you with all of the steps needed to get programmatic budget notifications up and running.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--medium h-c-grid__col h-c-grid__col--4 h-c-grid__col--offset-4 "&gt;&lt;img alt="Pub/Sub" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image3_50qWYHA.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Following the walkthrough, you'll set up a budget, Pub/Sub topic, and Cloud Function that work together to respond to programmatic notifications. Not only will you get a sense of all the pieces involved, but you can also easily modify the code from the function for your specific purposes, so it serves as a great starting point. That also leads to a question I've heard often: "This is great, but what code am I supposed to use?"
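&lt;/p&gt;&lt;p&gt;As a taste, here's a minimal sketch of that subscriber: a Pub/Sub-triggered Python Cloud Function that decodes the budget notification and compares spend against the budget amount. The function name and the threshold logic are our own; see the budget notification format documentation for the full payload.&lt;/p&gt;&lt;pre&gt;&lt;code&gt;import base64
import json

def handle_budget_alert(event, context):
    """Entry point: 'event' carries the base64-encoded budget notification."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    cost = payload["costAmount"]      # spend so far in the budget period
    budget = payload["budgetAmount"]  # the configured budget amount
    if cost &gt;= budget:
        # Swap this print for your business logic: notify a channel,
        # disable billing, stop labeled resources, and so on.
        print(f"Over budget: {cost} spent of {budget}")
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;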
And that is why we've expanded our walkthrough to include a full, one-click architecture deployment!&lt;/p&gt;&lt;h3&gt;It's like a sentry, but for your cloud costs&lt;/h3&gt;&lt;p&gt;&lt;a href="https://github.com/googlecloudplatform/deploystack-cost-sentry" target="_blank"&gt;Cost Sentry&lt;/a&gt;, powered by DeployStack, takes the next step in programmatic budget notifications and sets up all the pieces needed to create basic automated cost enforcement, as well as some example architecture to test it on! In fact, the overall architecture isn't much more than the programmatic budget notification setup alone, but it gives a good example of how that could work in a full environment. &lt;/p&gt;&lt;p&gt;This architecture will get deployed for you, along with the working code to handle a programmatic budget notification and interact with Compute Engine and Cloud Run.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--medium h-c-grid__col h-c-grid__col--4 h-c-grid__col--offset-4 "&gt;&lt;img alt="Pub/Sub architecture" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image4_Rdyrl3X.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Both the walkthrough and deploying the Cost Sentry stack can be used as the starting point for a full automated cost-enforcement solution. With these samples, you'll want to take a look at the Cloud Function code that receives data from your budget, and how it interacts with the Google Cloud APIs to shut down resources. In this example, any Compute Engine instances or Cloud Run deployments that have been labeled with 'costsentry' will be shut down or disabled when your budget exceeds the configured amount.&lt;/p&gt;&lt;p&gt;While this is a great way to get an automated cost-enforcement solution started, the hard part is probably in the next questions you'll need to answer for your use case. Questions like "What do I actually want to have happen when I hit my budget?" and "Will stopping all of these instances automatically have ramifications?" (spoiler alert: probably) are important ones to figure out when looking at the full scope of a cost-enforcement solution.&lt;/p&gt;&lt;p&gt;Setting up a full automated cost-enforcement solution gives you the flexibility to customize your response to budget updates, such as sending higher-priority messaging as you get closer to your budget total, and taking action by shutting down services when you greatly exceed your budget. Any way that you want to build a solution, this is a great starting point!&lt;/p&gt;&lt;h3&gt;Go forth, and do&lt;/h3&gt;&lt;p&gt;This may seem like a lot, so I'm a big fan of the "crawl, walk, run" philosophy. If you're new to Google Cloud, get started by just setting up a budget for all of your costs. From there, you can work with programmatic budget notifications to start expanding how you use budgets.
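&lt;/p&gt;&lt;p&gt;When you're ready to run, the enforcement code itself can stay small. As a rough sketch (our own simplification, not the actual Cost Sentry source), stopping every running Compute Engine instance that carries a 'costsentry' label might look like this with the google-cloud-compute client library:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;from google.cloud import compute_v1

def stop_labeled_instances(project_id: str, zone: str, label_key: str = "costsentry"):
    """Stop running instances in one zone that carry the given label key."""
    client = compute_v1.InstancesClient()
    for instance in client.list(project=project_id, zone=zone):
        # instance.labels is a mapping of label keys to values
        if label_key in instance.labels and instance.status == "RUNNING":
            client.stop(project=project_id, zone=zone, instance=instance.name)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;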
As you get more familiar with Google Cloud, you'll likely need to customize your cost controls, and Cost Sentry gives you a starting point for setting up your automated cost-enforcement solution.&lt;/p&gt;&lt;p&gt;Check out the &lt;a href="https://console.cloud.google.com/?walkthrough_id=billing--budget--cost_enforcement"&gt;interactive walkthrough&lt;/a&gt; and &lt;a href="https://github.com/googlecloudplatform/deploystack-cost-sentry" target="_blank"&gt;Cost Sentry architecture&lt;/a&gt; to get started!&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }' href="https://gweb-cloudblog-publish.appspot.com/topics/developers-practitioners/costs-meet-code-programmatic-budget-notifications/"&gt;&lt;div class="uni-related-article-tout__inner-wrapper"&gt;&lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;&lt;div class="uni-related-article-tout__content-wrapper"&gt;&lt;div class="uni-related-article-tout__image-wrapper"&gt;&lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="uni-related-article-tout__content"&gt;&lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Costs meet code with programmatic budget notifications&lt;/h4&gt;&lt;p class="uni-related-article-tout__body"&gt;TL;DR - More than just alerts, budgets can also send notifications to Pub/Sub. Once they're in Pub/Sub, you can hook up all kinds of serv...&lt;/p&gt;&lt;div class="cta module-cta h-c-copy uni-related-article-tout__cta muted"&gt;&lt;span class="nowrap"&gt;Read Article&lt;svg class="icon h-c-icon" role="presentation"&gt;&lt;use xlink:href="#mi-arrow-forward" xmlns:xlink="http://www.w3.org/1999/xlink"&gt;&lt;/use&gt;&lt;/svg&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/a&gt;&lt;/section&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-related_article_tout"&gt;&lt;div class="uni-related-article-tout h-c-page"&gt;&lt;section class="h-c-grid"&gt;&lt;a class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6 h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker" data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }' href="https://gweb-cloudblog-publish.appspot.com/topics/developers-practitioners/protect-your-google-cloud-spending-budgets/"&gt;&lt;div class="uni-related-article-tout__inner-wrapper"&gt;&lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;&lt;div class="uni-related-article-tout__content-wrapper"&gt;&lt;div class="uni-related-article-tout__image-wrapper"&gt;&lt;div class="uni-related-article-tout__image" style="background-image: url('https://storage.googleapis.com/gweb-cloudblog-publish/images/cost_optimization.max-500x500.jpg')"&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="uni-related-article-tout__content"&gt;&lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Protect your Google Cloud spending with budgets&lt;/h4&gt;&lt;p class="uni-related-article-tout__body"&gt;Budgets are the first and simplest way to get a handle on your cloud spend. In this post, we break down a budget and help you set up aler...&lt;/p&gt;&lt;div class="cta module-cta h-c-copy uni-related-article-tout__cta muted"&gt;&lt;span class="nowrap"&gt;Read Article&lt;svg class="icon h-c-icon" role="presentation"&gt;&lt;use xlink:href="#mi-arrow-forward" xmlns:xlink="http://www.w3.org/1999/xlink"&gt;&lt;/use&gt;&lt;/svg&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/a&gt;&lt;/section&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Tue, 13 Dec 2022 15:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/using-budgets-automate-cost-controls/</guid><category>Google Cloud</category><category>Developers &amp; Practitioners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Using budgets to automate cost controls</title><description>Do even more with Google Cloud budgets by setting up automated cost controls</description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/using-budgets-automate-cost-controls/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Mirchandani</name><title>Google Cloud Developer Advocate</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Terrence Ryan</name><title>Google Cloud Developer Advocate</title><department></department><company></company></author></item><item><title>Building out your support insights pipeline</title><link>https://cloud.google.com/blog/topics/developers-practitioners/building-out-your-support-insights-pipeline/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Getting into the details&lt;/h3&gt;&lt;p&gt;We wrote &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/how-spam-detection-taught-us-better-tech-support"&gt;previously&lt;/a&gt; about how we used clustering to connect requests for support (in text form) to the best tech support articles so we could answer questions faster and more efficiently. 
In a constantly changing environment (and in a very oddball couple of years) we wanted to make sure we stayed focused on preserving our people's productivity by isolating, understanding and responding to new support trends as fast as we can.&lt;/p&gt;&lt;p&gt;Now we'd like to get into a bit more detail about how we did all that and what went on behind the scenes of our process: &lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Support pipeline" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image2_ZZ0gk0v.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Extraction&lt;/h3&gt;&lt;p&gt;Google’s historical support ticket data and metadata are stored in BigQuery, as are the analysis results we generate from that data. We read and write that content using the &lt;a href="https://cloud.google.com/bigquery/docs/reference/rest"&gt;BigQuery API&lt;/a&gt;. However, many of these tickets contain information that is not useful to the ML pipeline and should not be included in the preprocessing and text modeling phases. For example, boilerplate generated from our case management tools must be stripped out using regex and other technologies in order to isolate the IT interaction between the technician and the user. &lt;/p&gt;&lt;p&gt;Furthermore, once all boilerplate has been removed, we use part-of-speech tagging to isolate only the nouns within the interaction, since nouns themselves proved to be the best features for modeling an interaction and differentiating a topic. Any one interaction could have 100+ nouns depending on the complexity. Using these nouns, we take one more step and use stemming and lemmatization to remove any suffix that may be placed on the noun (e.g., “computers” becomes “computer”). This allows any modification of the root word to be modeled as the same feature and reduces noise in our clustering results.&lt;/p&gt;&lt;p&gt;Once each interaction is transformed into a set of nouns (and a unique identifier), we can then move on to more advanced preprocessing techniques.&lt;/p&gt;&lt;h3&gt;Text Modeling&lt;/h3&gt;&lt;p&gt;To cluster the ticket set, it must first be converted into a robust feature space. The core technology underlying our featurization process is &lt;a href="https://www.tensorflow.org/tfx/transform/api_docs/python/tft/tfidf" target="_blank"&gt;TensorFlow transformers&lt;/a&gt;, which can be invoked using the &lt;a href="https://www.tensorflow.org/tfx/api_overview" target="_blank"&gt;TFX API&lt;/a&gt;. TensorFlow parses and annotates the tickets’ natural-language contents, and these annotations, once normalized and filtered, form a sparse feature space. The &lt;a href="https://cloud.google.com/dlp"&gt;Cloud Data Loss Prevention (DLP)&lt;/a&gt; API redacts several categories of sensitive information — e.g., person names — from the tickets’ contents, which both mitigates privacy leakage and prunes low-relevance tokens from the feature space.&lt;/p&gt;&lt;p&gt;Although clustering can be performed against a sparse space, it is typically more effective if the space is densified to prune excessive dimensionality. We accomplish this using the &lt;a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf" target="_blank"&gt;term frequency-inverse document frequency (TF-IDF)&lt;/a&gt; statistical technique with a predefined maximum feature count – we also investigated more heavy-duty densification strategies using trained embedding models, but found that the quality improvements over TF-IDF were marginal for our use case, at the cost of a substantial reduction in human interpretability.&lt;/p&gt;
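&lt;p&gt;To make the featurization step concrete, here's a toy sketch of the same TF-IDF idea using scikit-learn in place of the TensorFlow Transform tfidf we use in production (the ticket texts are invented, and we assume each ticket has already been reduced to its lemmatized nouns):&lt;/p&gt;&lt;pre&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer

# Each "document" is the lemmatized-noun bag for one support interaction.
tickets = [
    "laptop charger battery",
    "vpn login password token",
    "laptop docking station monitor",
]

# Capping the vocabulary prunes excessive dimensionality, as described above.
vectorizer = TfidfVectorizer(max_features=5000)
features = vectorizer.fit_transform(tickets)  # sparse matrix: tickets x terms
print(features.shape)
&lt;/code&gt;&lt;/pre&gt;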
&lt;h3&gt;Clustering&lt;/h3&gt;&lt;p&gt;The generated ticket feature set is partitioned into clusters using ClustOn. As this is an unsupervised learning problem, we arrived at the clustering process’s hyperparameter values via experimentation and human expert analysis. The trained parameters produced by the algorithm are persisted between subsequent runs of the pipeline in order to maintain consistent cluster IDs; this allows later operational systems to directly track and evaluate a cluster’s evolution in real time. &lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--small h-c-grid__col h-c-grid__col--2 h-c-grid__col--offset-5 "&gt;&lt;img alt="clustering" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image3_upI3dDx.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;The resulting cluster set is sanity-checked by some basic heuristic measures, such as a &lt;a href="https://en.wikipedia.org/wiki/Silhouette_(clustering)" target="_blank"&gt;silhouette score&lt;/a&gt;, and then rejoined with the initial ticket data for analysis. Moreover, for privacy purposes, each cluster whose ticket cohort size falls below a predefined threshold is omitted from the data set; this ensures that cluster metadata in the output, such as feature data used to characterize the cluster, cannot be traced with high confidence back to individual tickets.&lt;/p&gt;&lt;h3&gt;Scoring &amp;amp; Anomaly Detection&lt;/h3&gt;&lt;p&gt;Once a cluster has been identified, we need a way to automatically estimate how likely it is that the cluster has recently undergone a state change which might indicate an incipient event, as opposed to remaining in a steady state. “Anomalous” clusters — i.e. those which exhibit a sufficiently high likelihood of an event — can be flagged for later operational investigation, while the rest can be disregarded.&lt;/p&gt;&lt;p&gt;Modeling a cluster’s behavior over time is done by distributing its tickets into a histogram according to their time of creation — using 24-hour buckets, reflecting the daily business cycle — and fitting a zero-inflated Poisson regression to the bucket counts using &lt;a href="https://www.statsmodels.org/stable/generated/statsmodels.discrete.count_model.ZeroInflatedPoisson.html" target="_blank"&gt;statsmodels&lt;/a&gt;&lt;sup&gt;1&lt;/sup&gt;. However, our goal is not just to characterize a cluster’s state, but to detect a discrete change in that state. This is accomplished by developing two models of the same cluster: one of its long-term behavior, and the other of its short-term behavior. The distinction between “long-term” and “short-term” can be as simple as partitioning the histogram’s buckets at some age threshold.
But we chose a slightly more nuanced approach: both models are fitted to the entire histogram, but under two different weighting schemata; both sets of weights decay exponentially with age, but at different rates, so that recent buckets are weighted relatively more heavily in the short-term model than in the long-term one.&lt;/p&gt;&lt;p&gt;Both models are “optimized,” in that each achieves the maximum log-likelihood in its respective context. But if the long-term model is evaluated in the short-term context instead, its log-likelihood will show some amount of loss relative to the maximum achieved by the short-term model in the same context. This loss reflects the degree to which the long-term model fails to accurately predict the cluster’s short-term behavior — in other words, the degree to which the cluster’s short-term behavior deviates from the expectation established by its long-term behavior — and thus we refer to it as the &lt;b&gt;deviation score&lt;/b&gt;. This score serves as our key measure of anomaly; if it surpasses a defined threshold, the cluster is deemed anomalous.&lt;/p&gt;
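&lt;p&gt;Here is a stripped-down sketch of that deviation score. To keep the mechanics visible, it swaps the zero-inflated Poisson regression (and its exogenous global-volume term) for a plain weighted Poisson rate; the decay rates and example numbers are purely illustrative:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;import numpy as np

def deviation_score(counts, ages_days, decay_long=0.01, decay_short=0.1):
    """Log-likelihood loss of the long-term model in the short-term context."""
    counts = np.asarray(counts, dtype=float)
    ages = np.asarray(ages_days, dtype=float)
    w_long = np.exp(-decay_long * ages)    # slow decay: long-term view
    w_short = np.exp(-decay_short * ages)  # fast decay: short-term view

    # Weighted Poisson MLE of the rate: lambda = sum(w * x) / sum(w)
    lam_long = (w_long * counts).sum() / w_long.sum()
    lam_short = (w_short * counts).sum() / w_short.sum()

    def loglik(lam, w):
        # Weighted Poisson log-likelihood, dropping the constant log(x!) term
        return (w * (counts * np.log(lam) - lam)).sum()

    # How much worse the long-term rate explains recent behavior
    return loglik(lam_short, w_short) - loglik(lam_long, w_short)

# A cluster that is quiet for days and suddenly spikes scores high:
print(deviation_score([0, 1, 0, 2, 1, 0, 25], [6, 5, 4, 3, 2, 1, 0]))
&lt;/code&gt;&lt;/pre&gt;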
&lt;h3&gt;Operationalize&lt;/h3&gt;&lt;p&gt;Using the &lt;a href="https://developers.google.com/issue-tracker" target="_blank"&gt;IssueTracker API&lt;/a&gt;, bugs are auto-generated each time an anomalous cluster is detected. These bugs contain a summary of the tokens found within the cluster itself as well as a parameterized link to the &lt;a href="https://datastudio.google.com/u/0/" target="_blank"&gt;DataStudio dashboard&lt;/a&gt;. These dashboards show the size of the cluster over time, the deviation score and the underlying tickets. &lt;/p&gt;&lt;p&gt;These bugs are picked up by Techstop operations engineers and investigated to determine the root causes, getting boots on the ground more quickly for any outages that may be occurring and enabling a more harmonious flow of data between support operations and change and incident management teams.&lt;/p&gt;&lt;p&gt;Staying within the IssueTracker product, operations engineers create Problem Records in a separate queue detailing the problem, stakeholders and any solution content. These problem records are shared widely with frontline operations to help address any ongoing issues or outages.&lt;/p&gt;&lt;p&gt;However, the secret sauce does not stop there. Techstop then uses Google's Cloud AutoML engine to train a supervised model to classify any incoming support requests against known Problem Records (IssueTracker bugs). This model acts as a service for two critical functions:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;The model is called by our Chrome extension (see &lt;a href="https://developer.chrome.com/docs/extensions/" target="_blank"&gt;this handy guide&lt;/a&gt;) to recommend Problem Records to frontline techs based on the current ongoing chat. For a company like Google that has a global IT team, this recommendation engine allows for coverage and visibility of issues in near real time.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The model answers the “how big” question: many stakeholders want to know how big the problem was, how many end users it affected, and so on. By training an AutoML model we can now give good estimates of impact and, more importantly, measure the impact of project work that addresses these problems.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;/p&gt;&lt;h3&gt;Resampling &amp;amp; User Journey Mapping&lt;/h3&gt;&lt;p&gt;Going beyond incident response, we then semi-automatically extract user journeys from these trends by sampling each cluster to discover the proportion of user intents. These intents are then used to map user pitfalls and generate a sense of topic for each emerging cluster.&lt;/p&gt;&lt;p&gt;Since operations are constrained by tech evaluation time, we derived a way to limit the number of chats each agent needs to inspect while still maintaining the accuracy of the analysis. &lt;/p&gt;&lt;p&gt;User intents are the “goals” an employee may have when engaging with IT support: for example, “I want my cell phone to boot” or “I lost access to an internal tool.” We apply a two-step procedure to each cluster.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;First, we sample chats until the probability that we discover a new intent is small (say &amp;lt;5%, or whatever threshold we want). We can evaluate this probability at each step through the Good-Turing method.&lt;br/&gt;A simple Good-Turing estimate of this probability can be found as E(1) / N, where N is the number of sampled chats so far and E(1) is approximately the number of intents that have only been seen once so far. This number should be lightly smoothed for better accuracy; it’s easy to implement this smoothing on our own&lt;sup&gt;2&lt;/sup&gt; or call a library.&lt;br/&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Once we have finished, we take the intents that we consider representative (say there are k of them) and create one additional category for “other intents.” Then, we estimate the sample size for multinomial estimation (with k+1 categories) that we still need to reach a given composition accuracy (say, that each intent fraction is within 0.1 or 0.2 of the actual fraction). To do so, we use Thompson’s procedure&lt;sup&gt;3&lt;/sup&gt;, taking advantage of the data collected so far as a plug-in estimate for the possible values of the parameters; to be sufficiently conservative, we also consider a grid of parameter values within a confidence interval of the current plug-in estimate. The procedure is described on page 43 of this &lt;a href="https://www.jstor.org/stable/2684318" target="_blank"&gt;article&lt;/a&gt;, steps (1) and (2). The procedure is easy to implement and, under our current setup, &lt;a href="https://colab.corp.google.com/drive/1rqB9M3Y7LlD-5AqE__zOC0nUzcnD_xk1?authuser=1#scrollTo=LwcWcuWS8oHY" target="_blank"&gt;it will be a few lines of code&lt;/a&gt;. &lt;br/&gt;&lt;br/&gt;The procedure gives us the target sample size. If we have already reached this sample size in step 1, we are done. Otherwise, we sample a few more chats to reach this sample size.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;This work, along with the AutoML model, allows Google to understand not only the problem impact size, but also key information about user experiences and where users are struggling the most in their critical user journeys (CUJs). In many cases a problem record will contain multiple CUJs (user intents) with separate personas and root causes.
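&lt;/p&gt;&lt;p&gt;The step-1 stopping rule is tiny in code. Here's an unsmoothed sketch (as noted above, a lightly smoothed estimate is more accurate; the function and variable names are ours):&lt;/p&gt;&lt;pre&gt;&lt;code&gt;from collections import Counter

def prob_new_intent(sampled_intents):
    """Good-Turing estimate E(1) / N of discovering a brand-new intent next."""
    counts = Counter(sampled_intents)
    singletons = sum(1 for c in counts.values() if c == 1)  # E(1)
    return singletons / len(sampled_intents)                # N chats so far

# Sample chats until the discovery probability drops below 5%:
#   while not intents or prob_new_intent(intents) &gt;= 0.05:
#       intents.append(label_next_chat())  # hypothetical labeling step
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;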
&lt;/p&gt;&lt;h3&gt;Helping the business&lt;/h3&gt;&lt;p&gt;Once we can make good estimates for different user goals, we can work with domain experts to map clear user journeys, i.e., we can now use the data that this pipeline has generated to construct a user journey in a bottom-up approach. Doing this same work by hand, sifting through data, aggregating similar cases and estimating proportions of user goals, would take an entire team of engineers and case scrubbers. With this ML solution we can now get the same (if not better) results with much lower operational costs.&lt;/p&gt;&lt;p&gt;These user journeys can then be fed to internal dashboards for key decision makers to understand the health of their products and service areas. This allows for automated incident management and acts as a safeguard against unplanned changes or user-affecting changes that did not go through the proper change management processes. &lt;br/&gt;&lt;/p&gt;&lt;p&gt;Furthermore, it is critical for problem management and other core functions within our IT service. By having a small team of operational engineers review the output of this ML pipeline, we can create healthy problem records and keep track of our team's top user issues.&lt;br/&gt;&lt;/p&gt;&lt;h3&gt;How do I do this too?&lt;/h3&gt;&lt;p&gt;Want to make your own system for insights into your support pipeline? Here's a recipe to follow that will help you build all the parts you need:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Load your data into BigQuery - &lt;a href="https://cloud.google.com/bigquery/?utm_source=google&amp;amp;utm_medium=cpc&amp;amp;utm_campaign=na-US-all-en-dr-skws-all-all-trial-e-dr-1009892&amp;amp;utm_content=text-ad-none-any-DEV_c-CRE_526598862412-ADGP_Desk%20%7C%20SKWS%20-%20EXA%20%7C%20Txt%20~%20Data%20Analytics%20~%20BigQuery_Big%20Query-KWID_43700060008413254-aud-388092988201%3Akwd-47616965283&amp;amp;utm_term=KW_bigquery-ST_bigquery&amp;amp;gclid=CjwKCAjwt8uGBhBAEiwAayu_9bnwUdRT1MTXqXcoEBfiLSLYjGg_2XCo9GJ6RIptVVha598jFjFi2RoCzA8QAvD_BwE&amp;amp;gclsrc=aw.ds"&gt;Cloud BigQuery&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Vectorize it with TF-IDF - &lt;a href="https://www.tensorflow.org/tfx/transform/api_docs/python/tft/tfidf" target="_blank"&gt;TensorFlow Vectorizer&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Perform clustering - &lt;a href="https://www.tensorflow.org/api_docs/python/tf/compat/v1/estimator/experimental/KMeans" target="_blank"&gt;TensorFlow Clustering&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Score Clusters - &lt;a href="https://www.statsmodels.org/stable/generated/statsmodels.discrete.count_model.ZeroInflatedPoisson.html" target="_blank"&gt;Statsmodels Poisson Regression&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Automate with Dataflow - &lt;a href="https://cloud.google.com/dataflow"&gt;Cloud Dataflow&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Operationalize - &lt;a href="https://developers.google.com/issue-tracker" target="_blank"&gt;IssueTracker API&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;hr/&gt;&lt;p&gt;&lt;sup&gt;1.&lt;/sup&gt; &lt;sup&gt;When modeling a cluster, that cluster’s histogram serves as the regression’s endogenous variable. Additionally, the analogous histogram of the entire ticket set, across all clusters, serves as an exogenous variable. The latter histogram captures the overall ebb and flow in ticket generation rates due to cluster-agnostic business cycles (e.g. 
rates tend to be higher on weekdays than weekends), and its inclusion mitigates the impact of such cycles on each cluster’s individual model.&lt;/sup&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;sup&gt;2. Gale, William A., and Geoffrey Sampson. &lt;a href="https://drive.google.com/open?id=1QBtItLGTBrdSUM37kTlQxTEipkt8y6GI" target="_blank"&gt;"Good-Turing frequency estimation without tears."&lt;/a&gt; Journal of Quantitative Linguistics 2.3 (1995): 217-237.&lt;/sup&gt;&lt;/p&gt;&lt;p&gt;&lt;sup&gt;3. Thompson, Steven K. &lt;a href="https://drive.google.com/open?id=1tXdI2sCyH0S_7qWD_dVPA4tcu_5Z8tDW" target="_blank"&gt;"Sample size for estimating multinomial proportions."&lt;/a&gt; The American Statistician 41.1 (1987): 42-46.&lt;/sup&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Mon, 12 Dec 2022 14:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/building-out-your-support-insights-pipeline/</guid><category>AI &amp; Machine Learning</category><category>Google Cloud</category><category>Developers &amp; Practitioners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Building out your support insights pipeline</title><description>Here's how we used clustering to connect requests for support (in text form) to the best tech support articles so we could answer questions faster and more efficiently.</description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/building-out-your-support-insights-pipeline/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nicholaus Jackson</name><title>Business Analyst</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Max Saltonstall</name><title>Developer Relations Engineer</title><department></department><company></company></author></item><item><title>How StreamNative facilitates integrated use of Apache Pulsar through Google Cloud</title><link>https://cloud.google.com/blog/products/data-analytics/streamnative-and-google-cloud-on-the-use-of-apache-pulsar/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;a href="https://streamnative.io/about/" target="_blank"&gt;StreamNative&lt;/a&gt;, a company founded by the original developers of &lt;a href="https://pulsar.apache.org/" target="_blank"&gt;Apache Pulsar&lt;/a&gt; and &lt;a href="https://bookkeeper.apache.org/" target="_blank"&gt;Apache BookKeeper&lt;/a&gt;, is partnering with Google Cloud to build a streaming platform on open source technologies. We are dedicated to helping businesses generate maximum value from their enterprise data by offering effortless ways to realize real-time data streaming. Following the release of &lt;a href="https://streamnative.io/streamnativecloud/" target="_blank"&gt;StreamNative Cloud&lt;/a&gt; in August 2020, which provides scalable and reliable Pulsar-Cluster-as-a-Service, we introduced &lt;a href="https://streamnative.io/cloudforkafka/" target="_blank"&gt;StreamNative Cloud for Kafka&lt;/a&gt; to enable a seamless switch between the Kafka API and Pulsar.
We then launched &lt;a href="https://streamnative.io/platform/" target="_blank"&gt;StreamNative Platform&lt;/a&gt; to support global event streaming data platforms in multi-cloud and hybrid-cloud environments.&lt;/p&gt;&lt;p&gt;By leveraging our fully-managed Pulsar infrastructure services, our enterprise customers can easily build their event-driven applications with Apache Pulsar and get real-time value from their data. There are solid reasons why Apache Pulsar has become one of the most popular messaging platforms in modern cloud environments, and we strongly believe in its ability to simplify building complex event-driven applications. The most prominent benefits of using Apache Pulsar to manage real-time events include:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Single API&lt;/b&gt;: Building a complex event-driven application traditionally requires linking multiple systems to support queuing, streaming and table semantics. Apache Pulsar frees developers from the headache of managing multiple APIs by offering a single API that supports all messaging-related workloads.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Multi-tenancy&lt;/b&gt;: With the built-in multi-tenancy feature, Apache Pulsar enables secure data sharing across different departments with one global cluster. This architecture not only helps reduce infrastructure costs, but also avoids data silos.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Simplified application architecture&lt;/b&gt;: Pulsar clusters can scale to millions of topics while delivering consistent performance, which means that developers don’t have to restructure their applications when the number of topic-partitions surpasses hundreds. The application architecture can therefore be simplified.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Geo-replication&lt;/b&gt;: Apache Pulsar supports both synchronous and asynchronous geo-replication out-of-the-box, which makes building event-driven applications in multi-cloud and hybrid-cloud environments very easy.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Facilitating integration between Apache Pulsar and Google Cloud&lt;/h3&gt;&lt;p&gt;To allow our customers to fully enjoy the benefits of Apache Pulsar, we’ve been working on expanding the Apache Pulsar ecosystem by improving the integration between Apache Pulsar and powerful cloud platforms like Google Cloud. In mid-2022, we added two connectors to the Apache Pulsar ecosystem: &lt;a href="https://streamnative.io/blog/release/2022-6-24-announcing-the-google-cloud-pub-sub-connector-for-apache-pulsar/" target="_blank"&gt;Google Cloud Pub/Sub Connector for Apache Pulsar&lt;/a&gt;, which enables seamless data replication between &lt;a href="https://cloud.google.com/pubsub"&gt;Pub/Sub&lt;/a&gt; and Apache Pulsar, and &lt;a href="https://streamnative.io/blog/release/2022-8-3-announcing-the-google-cloud-bigquery-sink-connector-for-apache-pulsar/" target="_blank"&gt;Google Cloud BigQuery Sink Connector for Apache Pulsar&lt;/a&gt;, which synchronizes Pulsar data to &lt;a href="https://cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt; in real time.&lt;/p&gt;&lt;p&gt;Google Cloud Pub/Sub Connector for Apache Pulsar uses Pulsar IO components to realize fully-featured messaging and streaming between Pub/Sub and Apache Pulsar, each of which has its own distinctive features. Using Pub/Sub and Apache Pulsar at the same time enables developers to realize comprehensive data streaming features in their applications.
However, it requires significant development effort to establish seamless integration between the two tools, because data synchronization between different messaging systems depends on the functioning of applications: when an application stops working, the message data cannot be passed on to the other system.&lt;/p&gt;&lt;p&gt;Our connector solves this problem by fully integrating with Pulsar’s system. There are two ways to import and export data between Pub/Sub and Pulsar. The first is the Google Cloud Pub/Sub source, which feeds data from Pub/Sub topics and writes it to Pulsar topics. Alternatively, the Google Cloud Pub/Sub sink can pull data from Pulsar topics and persist it to Pub/Sub topics. Using Google Cloud Pub/Sub Connector for Apache Pulsar brings three key advantages:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Code-free integration&lt;/b&gt;: No code-writing is needed to move data between Apache Pulsar and Pub/Sub.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;High scalability&lt;/b&gt;: The connector can be run on both standalone and distributed nodes, which allows developers to build reactive data pipelines in real time to meet operational needs.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Fewer DevOps resources required&lt;/b&gt;: The DevOps workloads of setting up data synchronization are greatly reduced, which translates into more resources to invest in unleashing the value of data.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;By using the BigQuery Sink Connector for Apache Pulsar, organizations can write data from Pulsar directly to BigQuery. Previously, developers could only use the &lt;a href="https://github.com/streamnative/pulsar-io-cloud-storage" target="_blank"&gt;Cloud Storage Sink Connector for Pulsar&lt;/a&gt; to move data to &lt;a href="https://cloud.google.com/storage"&gt;Cloud Storage&lt;/a&gt;, and then query the imported data with external tables in BigQuery, an approach with many limitations, including low query performance and no support for clustered tables.&lt;/p&gt;&lt;p&gt;Pulling data from Pulsar topics and persisting it to BigQuery tables, our BigQuery sink connector supports real-time data synchronization between Apache Pulsar and BigQuery. Just like our Pub/Sub connector, Google Cloud BigQuery Sink Connector for Apache Pulsar is a low-code solution that supports high scalability and greatly reduces DevOps workloads. Furthermore, our BigQuery connector possesses the Auto Schema feature, which automatically creates and updates BigQuery table structures based on the Pulsar topic schemas to ensure smooth and continuous data synchronization.&lt;/p&gt;&lt;h3&gt;Simplifying Pulsar resource management on Kubernetes&lt;/h3&gt;&lt;p&gt;All the products of StreamNative are built on Kubernetes, and we’ve been developing tools that can simplify resource management on Kubernetes platforms like &lt;a href="https://cloud.google.com/kubernetes-engine"&gt;Google Kubernetes Engine&lt;/a&gt; (GKE).
In August 2022, we introduced &lt;a href="https://streamnative.io/blog/release/2022-08-15-introducing-pulsar-resources-operator-for-kubernetes/" target="_blank"&gt;Pulsar Resources Operator for Kubernetes&lt;/a&gt;, an independent controller that provides automatic full lifecycle management for Pulsar resources on Kubernetes.&lt;/p&gt;&lt;p&gt;Pulsar Resources Operator uses manifest files to manage Pulsar resources, which allows developers to get and edit resource policies through the Topic &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/" target="_blank"&gt;Custom Resources&lt;/a&gt; that render the full field information of Pulsar policies. It enables easier Pulsar resource management compared with using command line interface (CLI) tools, because developers no longer need to remember numerous commands and flags to retrieve policy information. Key advantages of using Pulsar Resources Operator for Kubernetes include:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Easy creation of Pulsar resources&lt;/b&gt;: By applying manifest files, developers can swiftly initialize basic Pulsar resources in their continuous integration (CI) workflows when creating a new Pulsar cluster.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Full integration with Helm&lt;/b&gt;: &lt;a href="https://helm.sh/" target="_blank"&gt;Helm&lt;/a&gt; is widely used as a package management tool in cloud-native environments. Pulsar Resources Operator can seamlessly integrate with Helm, which allows developers to manage their Pulsar resources through Helm templates.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="StreamNative 120922.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/StreamNative_120922.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;How you can contribute&lt;/h3&gt;&lt;p&gt;With the release of Google Cloud Pub/Sub Connector for Apache Pulsar, Google Cloud BigQuery Sink Connector for Apache Pulsar, and Pulsar Resources Operator for Kubernetes, we have unlocked the application potential of open tools like Apache Pulsar by making them simpler to build and easier to manage, and by extending their capabilities. Now, developers can build and run Pulsar clusters more efficiently and maximize the value of their enterprise data. &lt;/p&gt;&lt;p&gt;These three tools are community-driven services and have their source code hosted in the StreamNative GitHub repository. Our team welcomes all types of contributions for the evolution of our tools.
We’re always keen to receive feature requests, bug reports and documentation inquiries through &lt;a href="https://github.com/streamnative/pulsar-io-google-pubsub/issues/new/choose" target="_blank"&gt;GitHub&lt;/a&gt;, &lt;a href="https://lists.apache.org/list.html?dev@pulsar.apache.org" target="_blank"&gt;email&lt;/a&gt; or &lt;a href="https://twitter.com/streamnativeio" target="_blank"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Fri, 09 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/data-analytics/streamnative-and-google-cloud-on-the-use-of-apache-pulsar/</guid><category>Google Cloud</category><category>Data Analytics</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>How StreamNative facilitates integrated use of Apache Pulsar through Google Cloud</title><description>StreamNative, a company founded by the original developers of Apache Pulsar and Apache BookKeeper, is partnering with Google Cloud to build a streaming platform on open source technologies.</description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/data-analytics/streamnative-and-google-cloud-on-the-use-of-apache-pulsar/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Sijie Guo</name><title>Apache Pulsar PMC Member, Co-Founder and CEO of StreamNative</title><department></department><company></company></author></item><item><title>How to build comprehensive customer financial profiles with Elastic Cloud and Google Cloud</title><link>https://cloud.google.com/blog/products/data-analytics/build-comprehensive-customer-financial-profiles-with-elastic-cloud-and-google-cloud/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Financial institutions have vast amounts of data about their customers. However, many of them struggle to leverage data to their advantage. Data may be sitting in silos or trapped on costly mainframes. Customers may only have access to a limited quantity of data, or service providers may need to search through multiple systems of record to handle a simple customer inquiry. This creates a hazard for providers and a headache for customers. &lt;/p&gt;&lt;p&gt;Elastic and Google Cloud enable institutions to manage this information. Powerful search tools allow data to be surfaced faster than ever, whether it's card payments, ACH (Automated Clearing House), wires, bank transfers, real-time payments, or another payment method. This information can be correlated to customer profiles, cash balances, merchant info, purchase history, and other relevant information to support the customer or business objective. &lt;/p&gt;&lt;p&gt;This reference architecture enables these use cases:&lt;/p&gt;&lt;p&gt;&lt;b&gt;1. Offering a great customer experience&lt;/b&gt;: Customers expect immediate access to their entire payment history, with the ability to recognize anomalies, not just through digital channels but through omnichannel experiences (e.g. customer service interactions).&lt;/p&gt;&lt;p&gt;&lt;b&gt;2. 
Customer 360&lt;/b&gt;: Real-time dashboards that correlate transaction information across multiple variables, offering the business a better view into their customer base, and driving efforts for sales, marketing, and product innovation.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="1 comprehensive customer financial profiles 120922.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_comprehensive_customer_financial_profile.max-1000x1000.jpg"/&gt;&lt;figcaption class="article-image__caption "&gt;&lt;div class="rich-text"&gt;&lt;i&gt;&lt;b&gt;Customer 360&lt;/b&gt;: The dashboard above looks at 1.2 billion bank transactions and gives a breakdown of what they are, who executes them, where they go, when and more. At a glance we can see who our wealthiest customers are, which merchants our customers send the most money to, how many unusual transactions there are (based on transaction frequency and transaction amount), when folks spend money and what kind of spending and income they have.&lt;/i&gt;&lt;/div&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;b&gt;3. Partnership management&lt;/b&gt;: Merchant acceptance is key for payment providers. Having better access to present and historical merchant transactions can enhance relationships or provide leverage in negotiations. With that, banks can create and monetize new services.&lt;/p&gt;&lt;p&gt;&lt;b&gt;4. Cost optimization&lt;/b&gt;: Mainframes are not designed for internet-scale access. Alongside the technological limitations, cost becomes a prohibitive factor. While mainframes will not be replaced any time soon, this architecture helps avoid costly access to data to serve new applications.&lt;/p&gt;&lt;p&gt;&lt;b&gt;5. Risk reduction&lt;/b&gt;: By standardizing on the Elastic Stack, banks are no longer limited in the number of data sources they can ingest. With this, banks can better respond to call center delays and potential customer-facing impacts like natural disasters. By deploying machine learning and alerting features, banks can detect and stamp out financial fraud before it impacts member accounts.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="2 comprehensive customer financial profiles 120922.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_comprehensive_customer_financial_profile.max-1000x1000.jpg"/&gt;&lt;figcaption class="article-image__caption "&gt;&lt;div class="rich-text"&gt;&lt;i&gt;&lt;b&gt;Fraud detection&lt;/b&gt;: The &lt;a href="https://www.elastic.co/what-is/elasticsearch-graph"&gt;Graph&lt;/a&gt; feature of Elastic helped a financial services company identify additional cards that were linked via phone numbers and amalgamations of the original billing address on file with those two cards. 
The team realized that several credit unions, not just the original one where the alert originated, were being scammed by the same fraud ring.&lt;/i&gt;&lt;/div&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h2&gt;Architecture&lt;/h2&gt;&lt;p&gt;The following diagram shows the steps to move data from the mainframe to Google Cloud, process and enrich the data in BigQuery, then provide comprehensive search capabilities through Elastic Cloud.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="3 comprehensive customer financial profiles 120922.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_comprehensive_customer_financial_profile.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;This architecture includes the following components:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Move Data from Mainframe to Google Cloud&lt;/b&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Moving data from IBM z/OS to Google Cloud is straightforward with the &lt;a href="https://github.com/GoogleCloudPlatform/professional-services/tree/main/tools/bigquery-zos-mainframe-connector" target="_blank"&gt;Mainframe Connector&lt;/a&gt;: you follow a few simple steps and define the configuration. The connector runs in z/OS batch job steps and includes a shell interpreter and JVM-based implementations of the gsutil, bq and gcloud command-line utilities. This makes it possible to create and run a complete ELT pipeline from JCL, both for the initial batch data migration and ongoing delta updates.&lt;/p&gt;&lt;p&gt;A typical flow of the connector includes:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Reading the mainframe dataset&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Transcoding the dataset to ORC&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Uploading the ORC file to Cloud Storage&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Registering the ORC file as an external table or loading it as a native table&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Submitting a query job containing a MERGE DML statement to upsert incremental data into a target table, or a SELECT statement to append to or replace an existing table&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Here are the steps to install the BQ Mainframe Connector:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Copy the Mainframe Connector JAR to the Unix filesystem on z/OS&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Copy the BQSH JCL procedure to a PDS on z/OS&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Edit the BQSH JCL to set site-specific environment variables&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Please refer to the &lt;a href="https://cloud.google.com/blog/products/data-analytics/a-simple-way-to-migrate-mainframe-data-to-the-cloud"&gt;BQ Mainframe Connector blog&lt;/a&gt; for example configuration and commands.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Process and Enrich Data in BigQuery&lt;/b&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
href="https://cloud.google.com/bigquery?utm_source=google&amp;amp;utm_medium=cpc&amp;amp;utm_campaign=na-US-all-en-dr-bkws-all-all-trial-e-dr-1011347&amp;amp;utm_content=text-ad-none-any-DEV_c-CRE_621957121377-ADGP_Desk%20%7C%20BKWS%20-%20EXA%20%7C%20Txt%20~%20Data%20Analytics%20~%20BigQuery_Big%20Query-KWID_43700073023085501-kwd-327307220781&amp;amp;utm_term=KW_gcp%20bigquery-ST_gcp%20bigquery&amp;amp;gclid=Cj0KCQiA37KbBhDgARIsAIzce14zp0ElbazcFfTROEdaXRU4GjF-xAEl_frGnil2TIYq4bXEUExBz68aAlnCEALw_wcB&amp;amp;gclsrc=aw.ds"&gt;BigQuery&lt;/a&gt; is a completely serverless and cost-effective enterprise data warehouse. Its serverless architecture lets you use SQL language to query and enrich Enterprise scale data. And its scalable, distributed analysis engine lets you query terabytes in seconds and petabytes in minutes. An integrated BQML and BI Engine enables you to analyze the data and gain business insights. &lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Ingest Data from BQ to Elastic Cloud&lt;/b&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/dataflow"&gt;Dataflow&lt;/a&gt; is used here to ingest data from BQ to Elastic Cloud. It’s a serverless, fast, and cost-effective stream and batch data processing service. Dataflow provides an &lt;a href="https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#bigquery-to-elasticsearch"&gt;Elasticsearch Flex Template&lt;/a&gt; which can be easily configured to create the streaming pipeline. This &lt;a href="https://www.elastic.co/blog/ingest-data-directly-from-google-bigquery-into-elastic-using-google-dataflow" target="_blank"&gt;blog from Elastic&lt;/a&gt; shows an example on how to configure the template.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Cloud Orchestration from Mainframe&lt;/b&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;It's possible to load both BigQuery and Elastic Cloud entirely from a mainframe job, with no need for an external job scheduler.&lt;/p&gt;&lt;p&gt;To launch the Dataflow flex template directly, you can invoke the &lt;code&gt;gcloud dataflow flex-template run&lt;/code&gt; command in a z/OS batch job step.&lt;/p&gt;&lt;p&gt;If you require additional actions beyond simply launching the template, you can instead invoke the &lt;code&gt;gcloud pubsub topics publish&lt;/code&gt; command in a batch job step after your BigQuery ELT steps are completed, using the &lt;code&gt;--attribute&lt;/code&gt; option to include your BigQuery table name and any other template parameters. The pubsub message can be used to trigger any additional actions within your cloud environment.&lt;/p&gt;&lt;p&gt;To take action in response to the pubsub message sent from your mainframe job, create a &lt;a href="https://cloud.google.com/build/docs/automate-builds-pubsub-events"&gt;Cloud Build Pipeline with a pubsub trigger&lt;/a&gt; and include a Cloud Build Pipeline step that uses the &lt;a href="https://cloud.google.com/build/docs/cloud-builders#supported_builder_images_provided_by"&gt;gcloud builder&lt;/a&gt; to invoke &lt;code&gt;gcloud dataflow flex-template run&lt;/code&gt; and launch the template using the parameters copied from the pubsub message. 
If you need to use a custom Dataflow template rather than the public template, you can use the &lt;a href="https://cloud.google.com/build/docs/cloud-builders#supported_builder_images_provided_by"&gt;git builder&lt;/a&gt; to check out your code, followed by the &lt;a href="https://cloud.google.com/build/docs/building/build-java#using_the_maven_image"&gt;maven builder to compile and launch a custom dataflow pipeline&lt;/a&gt;. Additional pipeline steps can be added for any other actions you require.&lt;/p&gt;&lt;p&gt;The pubsub messages sent from your batch job can also be used to trigger a &lt;a href="https://cloud.google.com/run/docs/tutorials/pubsub"&gt;Cloud Run service&lt;/a&gt; or a &lt;a href="https://cloud.google.com/eventarc/docs/gke/quickstart-pubsub"&gt;GKE service via Eventarc&lt;/a&gt; and may also be consumed directly by a Dataflow pipeline or any other application.&lt;/p&gt;&lt;h2&gt;Mainframe Capacity Planning&lt;/h2&gt;&lt;p&gt;CPU consumption is a major factor in mainframe workload cost. In the basic architecture design above, the Mainframe Connector runs on the JVM, which executes on zIIP processors. Relative to simply uploading data to Cloud Storage, ORC encoding consumes much more CPU time. When processing large amounts of data, it's possible to exhaust zIIP capacity and spill workloads onto GP processors. You may apply the following advanced architecture to reduce CPU consumption and avoid increased z/OS processing costs.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Remote Dataset Transcoding on Compute Engine VM&lt;/b&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="4 comprehensive customer financial profiles 120922.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_comprehensive_customer_financial_profile.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;To reduce mainframe CPU consumption, ORC file transcoding can be delegated to a GCE instance. A gRPC service is included with the Mainframe Connector specifically for this purpose, and instructions for setup can be found in the Mainframe Connector documentation. Using remote ORC transcoding will significantly reduce CPU usage of the Mainframe Connector batch jobs and is recommended for all production-level BigQuery workloads. Multiple instances of the gRPC service can be deployed behind a load balancer and shared by all Mainframe Connector batch jobs.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Transfer Data via FICON and Interconnect&lt;/b&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="5 comprehensive customer financial profiles 120922.jpg" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_comprehensive_customer_financial_profile.max-1000x1000.jpg"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Google Cloud technology partners offer products to enable transfer of mainframe datasets via FICON and 10G Ethernet to Cloud Storage.
Obtaining a hardware FICON appliance and Interconnect is a practical requirement for workloads that transfer in excess of 500 GB daily. This architecture is ideal for integrating z/OS and Google Cloud because it largely eliminates data-transfer-related CPU utilization concerns.&lt;/p&gt;&lt;hr/&gt;&lt;i&gt;&lt;sup&gt;We really appreciate Jason Mar from Google Cloud, who provided rich context and technical guidance regarding the Mainframe Connector; Eric Lowry from Elastic, for his suggestions and recommendations; and the Google Cloud and Elastic team members who contributed to this collaboration.&lt;/sup&gt;&lt;/i&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Fri, 09 Dec 2022 17:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/products/data-analytics/build-comprehensive-customer-financial-profiles-with-elastic-cloud-and-google-cloud/</guid><category>BigQuery</category><category>Google Cloud</category><category>Data Analytics</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>How to build comprehensive customer financial profiles with Elastic Cloud and Google Cloud</title><description>Google Cloud and Elastic Cloud reference architecture for financial transaction search.</description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/data-analytics/build-comprehensive-customer-financial-profiles-with-elastic-cloud-and-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yang Li</name><title>Staff Cloud Solutions Architect, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Dimitri Marx</name><title>Partner Solutions Architecture Lead, Elastic</title><department></department><company></company></author></item><item><title>Google’s Virtual Desktop of the Future</title><link>https://cloud.google.com/blog/topics/developers-practitioners/googles-virtual-desktop-future/</link><description>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;Did you know that most Google employees rely on virtual desktops to get their work done? This represents a paradigm shift in client computing at Google, and was especially critical during the pandemic and the remote work revolution. We’re excited to continue enabling our employees to be productive, &lt;i&gt;anywhere&lt;/i&gt;! This post covers the history of virtual desktops and details the numerous benefits Google has seen from their implementation. &lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Virtual Desktop- Inline image" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image3_6PhPZT5.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Background&lt;/h3&gt;&lt;p&gt;In 2018, Google began the development of virtual desktops in the cloud. A &lt;a href="https://research.google/pubs/pub47055/" target="_blank"&gt;whitepaper&lt;/a&gt; was published detailing how virtual desktops were created with Google Cloud, running on &lt;a href="https://cloud.google.com/compute"&gt;Google Compute Engine&lt;/a&gt;, as an alternative to physical workstations.
Further research showed that it was feasible to move our physical workstation fleet to these virtual desktops in the cloud. The research began with user experience analysis, looking into how employee satisfaction with cloud workstations compared with physical desktops. Researchers found that user satisfaction with cloud desktops was higher than with their physical counterparts! This was a monumental moment for cloud-based client computing at Google, and this discovery led to additional analyses of Compute Engine to understand whether it could become our preferred (virtual) workstation platform of the future.&lt;/p&gt;&lt;p&gt;Today, Google’s internal use of virtual desktops has increased dramatically. Employees all over the globe use a mix of virtual Linux and Windows desktops on Compute Engine to complete their work. Whether an employee is writing code, accessing production systems, troubleshooting issues, or driving productivity initiatives, virtual desktops provide them with the compute they need to get their work done. Access to virtual desktops is simple: some employees access their virtual desktop instances via Secure Shell (SSH), while others use Chrome Remote Desktop, a graphical access tool. &lt;/p&gt;&lt;p&gt;In addition to simplicity and accessibility, Google has realized a number of benefits from virtual desktops. We’ve seen an enhanced security posture, a boost to our sustainability initiatives, and a reduction in maintenance effort associated with our IT infrastructure. All these improvements were achieved while improving the user experience compared to our physical workstation fleet.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Google Cloud TPU" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_0EHHfvd.max-1000x1000.jpg"/&gt;&lt;figcaption class="article-image__caption "&gt;&lt;div class="rich-text"&gt;Example of Google Data Center&lt;/div&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;h3&gt;Analyzing Cloud vs Physical Desktops&lt;/h3&gt;&lt;p&gt;Let’s look deeper into the analysis Google performed to compare cloud virtual desktops and physical desktops. Researchers compared cloud and physical desktops on five core pillars: user experience, performance, sustainability, security, and efficiency.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-image_full_width"&gt;&lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;&lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3 "&gt;&lt;img alt="Google Cloud core pillars" src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image4_6gvUvXe.max-1000x1000.png"/&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="block-paragraph"&gt;&lt;div class="rich-text"&gt;&lt;p&gt;&lt;b&gt;User Experience&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Before the transition to virtual desktops got underway, user experience researchers wanted to know more about how they would affect employee happiness. They discovered that employees embraced the benefits that virtual desktops offered.
These included freeing up valuable desk space, an always-on, always-available compute experience accessible from anywhere in the world, and reduced maintenance overhead compared to physical desktops. &lt;/p&gt;&lt;p&gt;&lt;b&gt;Performance&lt;/b&gt;&lt;/p&gt;&lt;p&gt;From a performance perspective, cloud desktops are simply better than physical desktops. For example, running on &lt;a href="https://cloud.google.com/compute"&gt;Compute Engine&lt;/a&gt; makes it easy to spin up on-demand virtual instances with predictable compute and performance, a task that is significantly more difficult with a physical workstation vendor. Virtual desktops rely on a mix of Virtual Machine (VM) families that Google developed based on the performance needs of our users. These range from Compute Engine &lt;a href="https://cloud.google.com/compute/docs/machine-resource"&gt;E2 high-efficiency instances&lt;/a&gt;, which employees might use for day-to-day tasks, to higher-performance &lt;a href="https://cloud.google.com/compute/docs/machine-resource"&gt;N2/N2D instances&lt;/a&gt;, which employees might use for more demanding machine learning jobs. Compute Engine offers a VM shape for practically any computing workflow. Additionally, employees no longer have to worry about machine upgrades (to increase performance, for example) because our entire fleet of virtual desktops can be upgraded to new shapes (with more CPU and RAM) with a single config change and a simple reboot, all within a matter of minutes. Plus, Compute Engine continues to add features and new machine types, which means our capabilities only continue to grow in this space.&lt;/p&gt;
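&lt;p&gt;For a sense of what such an upgrade involves, here is the public Compute Engine equivalent of that config change, as a minimal sketch; the instance name, zone, and machine type are illustrative, not Google’s internal tooling.&lt;/p&gt;&lt;pre&gt;# Resize a virtual desktop to a larger shape; requires a brief
# stop/start cycle, typically completing within minutes.
gcloud compute instances stop vdesk-example --zone=us-central1-a
gcloud compute instances set-machine-type vdesk-example \
  --zone=us-central1-a --machine-type=n2-standard-8
gcloud compute instances start vdesk-example --zone=us-central1-a&lt;/pre&gt;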
&lt;p&gt;&lt;b&gt;Sustainability&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Google cares deeply about sustainability and has been &lt;a href="https://sustainability.google/" target="_blank"&gt;carbon neutral since 2007&lt;/a&gt;. Moving from physical desktops to virtual desktops on Compute Engine brings us closer to Google’s sustainability goal of a net-neutral desktop computing fleet. Our internal facilities team has praised virtual desktops as a win for future workspace planning, because a reduction in physical workstations could also mean a reduction in first-time construction costs of new buildings, significant (up to 30%) campus energy reductions, and even further reductions in costs associated with HVAC and circuit size needs at our campuses. Lastly, a reduction in physical workstations also contributes to a reduction in physical e-waste and in the carbon associated with transporting workstations from their factory of origin to office locations. At Google’s scale, these changes lead to an immense win from a sustainability standpoint. &lt;/p&gt;&lt;p&gt;&lt;b&gt;Security&lt;/b&gt;&lt;/p&gt;&lt;p&gt;By their very nature, virtual desktops limit a bad actor’s ability to exfiltrate data or otherwise compromise physical desktop hardware, since there is no desktop hardware to compromise in the first place. This means attacks such as USB attacks, evil maid attacks, and similar techniques for subverting security that require direct hardware access become worries of the past. Additionally, the transition to cloud-based virtual desktops also brings with it an enhanced security posture through the use of Google Cloud’s myriad security features, including &lt;a href="https://cloud.google.com/confidential-computing"&gt;Confidential Computing&lt;/a&gt;, &lt;a href="https://cloud.google.com/blog/products/identity-security/virtual-trusted-platform-module-for-shielded-vms-security-in-plaintext"&gt;vTPMs&lt;/a&gt;, and more. &lt;/p&gt;&lt;br/&gt;&lt;p&gt;&lt;b&gt;Efficiency&lt;/b&gt;&lt;/p&gt;&lt;p&gt;In the past, it was not uncommon for employees to spend days waiting for IT to deliver new machines or fix physical workstations. Today, cloud-based desktops can be created and resized on demand. They are always accessible, and virtually immune to maintenance-related issues. IT no longer has to deal with concerns like warranty claims, break-fix issues, or recycling. These time savings enable IT to focus on higher-priority initiatives while reducing their workload. With an enterprise the size of Google, these efficiency wins added up quickly. &lt;/p&gt;&lt;h3&gt;Considerations to Keep in Mind&lt;/h3&gt;&lt;p&gt;Although Google has seen significant benefits with virtual desktops, there are some considerations to keep in mind before deciding whether they are right for your enterprise. First, it’s important to recognize that migrating to a virtual fleet requires a consistently reliable and performant client internet connection. For remote/global employees, it’s important that they’re located geographically near a Google Cloud region (to minimize latency). Additionally, there are cases where physical workstations are still considered vital. These include users who need USB and other direct I/O access for testing/debugging hardware, and users who have ultra-low-latency graphics/video editing or CAD simulation needs. Finally, to ensure interoperability between these virtual desktops and the rest of our computing fleet, we did have to perform some additional engineering tasks to integrate our asset management and other IT systems with the virtual desktops. Whether your enterprise needs such features and integration should be carefully analyzed before considering a solution such as this. However, should you ultimately conclude that cloud-based desktops are the solution for your enterprise, we’re confident you’ll realize many of the benefits we have!&lt;/p&gt;&lt;h3&gt;Tying It All Together&lt;/h3&gt;&lt;p&gt;Although moving Google employees to virtual desktops in the cloud was a significant engineering undertaking, the benefits have been just as significant. Making this switch has boosted employee productivity and satisfaction, enhanced security, increased efficiency, and provided noticeable improvements in performance and user experience. In short, cloud-based desktops are helping us transform how Googlers get their work done. During the pandemic, we saw the benefits of virtual desktops at a critical time. Employees had access to their virtual desktops from anywhere in the world, which kept our workforce safer and reduced transmission vectors for COVID-19.
We’re excited for a future where more and more of our employees are computing in the cloud, as we continue to embrace the work-from-anywhere model and to add new features and enhanced capabilities to Compute Engine!&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</description><pubDate>Thu, 08 Dec 2022 13:00:00 -0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/googles-virtual-desktop-future/</guid><category>Google Cloud</category><category>Developers &amp; Practitioners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google’s Virtual Desktop of the Future</title><description>Dive into the history of virtual desktops and the numerous benefits Google has seen from implementing virtual desktops.</description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/googles-virtual-desktop-future/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nick Yeager</name><title>Manager, Google Computing</title><department></department><company></company></author></item></channel></rss>