Sometimes you can prepare your data for BigQuery with a simple step, such as exporting it in JSON format, but sometimes you need a more complex transformation. For example, your data might be stored in XML, and preparing it requires more than a simple export. This topic describes common scenarios where you may need to transform your data before loading it into BigQuery.
Transforming XML data
Large datasets are often represented using XML. BigQuery doesn't support directly loading XML files, but XML files can be easily converted to an equivalent JSON format or flat CSV structure.
When dealing with large XML files, it's important not to use a DOM-based parser, because it attempts to load the entire file as an in-memory tree. Instead, use a pull-based parser or a SAX parser, which process the document incrementally and are far more memory-efficient. Python makes this kind of streaming parse straightforward.
The following examples show a snippet of the Wikipedia revision history dataset (roughly 100 GB of XML), followed by Python code that parses it efficiently.
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.7/ http://www.mediawiki.org/xml/export-0.7.xsd" version="0.7" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.20wmf10</generator>
    <case>first-letter</case>
    ...
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <id>10</id>
    <revision>
      <id>233192</id>
      <contributor>
        <id>99</id>
      </contributor>
    </revision>
    <revision>
      ...
    </revision>
  </page>
</mediawiki>
import sys

from lxml import etree

# The MediaWiki export namespace, used to qualify every tag lookup.
NS = '{http://www.mediawiki.org/xml/export-0.7/}'

# Open the file and stream through all <page> elements without
# loading the entire tree into memory.
with open(sys.argv[1], 'rb') as xml_file:
    for _, element in etree.iterparse(xml_file, tag=NS + 'page'):
        title = element.find(NS + 'title').text
        id = element.find(NS + 'id').text
        wp_namespace = element.find(NS + 'ns')
        revisions = list(element.iter(NS + 'revision'))
        for revision_element in revisions:
            # We're confident that we always have a revision ID.
            revision_id = revision_element.find(NS + 'id').text
            # comment_text and contributor_id are optional.
            comment = revision_element.find(NS + 'comment')
            if comment is not None and comment.text is not None:
                comment_text = comment.text
            contributor = revision_element.find(NS + 'contributor')
            if contributor is not None:
                contributor_id = contributor.find(NS + 'id')
                if contributor_id is not None:
                    contributor_id_text = contributor_id.text
        # Release the parsed <page> subtree to keep memory usage flat.
        element.clear()
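Because BigQuery loads newline-delimited JSON directly, a natural next step is to serialize each parsed revision as one JSON object per line. The sketch below shows one way to do that; the field names and the `revision_to_json_line` helper are illustrative, not a fixed schema.

```python
import json

def revision_to_json_line(title, page_id, revision_id, comment_text, contributor_id):
    """Serialize one revision as a newline-delimited JSON row for BigQuery.

    Optional fields (comment_text, contributor_id) may be None; BigQuery
    treats JSON null as a NULL value in a nullable column.
    """
    row = {
        'title': title,
        'page_id': page_id,
        'revision_id': revision_id,
        'comment': comment_text,
        'contributor_id': contributor_id,
    }
    return json.dumps(row)

# Inside the parsing loop, each revision becomes one output line, e.g.:
line = revision_to_json_line('AccessibleComputing', '10', '233192', None, '99')
```

Appending these lines to a file produces a dataset you can load with the `NEWLINE_DELIMITED_JSON` source format.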