Thursday, April 30th, 2009
YQL execute now allows you to convert scraped data with server side JavaScript
I am a big fan of YQL, a terribly easy and fuss-free way to access APIs and mix data retrieved from them in a simple, SQL-style language. Say, for example, you want photos of Paris, France from Flickr that are licensed with Creative Commons Attribution. You can do this with a single command:
- select * from flickr.photos.info where photo_id in (select id from flickr.photos.search where woe_id in (select woeid from geo.places where text='paris,france') and license=4)
Try it out here and you’ll see what I mean.
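If you would rather call the query from your own script than from the console, a minimal sketch could look like the following. The endpoint path and the `format` and `callback` parameters are assumptions that do not appear in this post, and `buildYqlUrl` and `showPhotos` are hypothetical names used only for illustration.

```javascript
// Assumption: the public YQL REST endpoint; verify against the official docs.
var YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

// Build a request URL for a YQL statement.
// format: "json" or "xml"; callback (optional) asks for a JSONP response.
function buildYqlUrl(statement, format, callback) {
  var url = YQL_ENDPOINT +
    "?q=" + encodeURIComponent(statement) +
    "&format=" + encodeURIComponent(format || "json");
  if (callback) {
    url += "&callback=" + encodeURIComponent(callback);
  }
  return url;
}

// The Paris photo query from above, split over lines for readability.
var parisPhotos =
  "select * from flickr.photos.info where photo_id in " +
  "(select id from flickr.photos.search where woe_id in " +
  "(select woeid from geo.places where text='paris,france') and license=4)";

// "showPhotos" is a hypothetical JSONP callback name.
console.log(buildYqlUrl(parisPhotos, "json", "showPhotos"));
```

The resulting URL can be used as the `src` of a script element for JSONP, or fetched server-side when no callback is given.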
The next step of this interface was to open it out to the public. You can define an “Open Table” as a simple XML schema and bring your own API into this interface with that.
One thing I’ve been burning to tell the world about has finally been released: YQL execute. Instead of making the YQL language itself much more complex (and thus running in circles), we now allow you to embed JavaScript in the Open Table XML that will run on the YQL server and allow you to access other web services, authenticate and scrape HTML with JavaScript and E4X. As Simon Willison put it:
This is nuts (in a good way). Yahoo!’s intriguing universal SQL-style XML/JSONP web service interface now supports JavaScript as a kind of stored procedure language, meaning you can use JavaScript and E4X to screen-scrape web pages, then query the results with YQL.
Using this, you can extend the original functionality of YQL to whatever you need. For example, you could already scrape HTML with YQL using XPath, but there was no way to use CSS selectors. Using an open table that invokes James Padolsey’s css2xpath JavaScript on the server side, this is now possible.
- use 'http://yqlblog.net/samples/data.html.cssselect.xml' as data.html.cssselect; select * from data.html.cssselect where url="www.yahoo.com" and css="#news a"
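To get a feel for what a CSS-to-XPath conversion does, here is a deliberately tiny sketch. It is not James Padolsey’s css2xpath; `simpleCssToXpath` is a hypothetical helper that only understands tag names, `#id`, `.class` and the descendant combinator (a space), and it throws on anything else.

```javascript
// Hypothetical, minimal CSS-to-XPath converter (NOT the css2xpath library).
// Supported: tag, #id, .class, and descendant combinators (whitespace).
function simpleCssToXpath(css) {
  var parts = css.trim().split(/\s+/);
  var steps = parts.map(function (part) {
    // Optional tag name (or *), followed by any number of #id / .class tokens.
    var match = /^([a-zA-Z][\w-]*|\*)?((?:[#.][\w-]+)*)$/.exec(part);
    if (!match) {
      throw new Error("Unsupported selector part: " + part);
    }
    var tag = match[1] || "*";
    var predicates = "";
    var tokenPattern = /([#.])([\w-]+)/g;
    var token;
    while ((token = tokenPattern.exec(match[2])) !== null) {
      if (token[1] === "#") {
        predicates += "[@id='" + token[2] + "']";
      } else {
        // Match one class name inside a space-separated class attribute.
        predicates += "[contains(concat(' ', normalize-space(@class), ' '), ' " +
          token[2] + " ')]";
      }
    }
    return tag + predicates;
  });
  // Each whitespace-separated part becomes a descendant step.
  return "//" + steps.join("//");
}

console.log(simpleCssToXpath("#news a")); // //*[@id='news']//a
```

A real converter also has to handle child and sibling combinators, attribute selectors and pseudo-classes, which is where most of the work lies.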
The data table is pretty easy:
- <?xml version="1.0" encoding="UTF-8" ?>
- <table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
- <meta>
- <samplequery>select * from {table} where url="www.yahoo.com" and css="#news a"</samplequery>
- </meta>
- <bindings>
- <select itemPath="" produces="XML">
- <urls>
- <url></url>
- </urls>
- <inputs>
- <key id="url" type="xs:string" paramType="variable" required="true" />
- <key id="css" type="xs:string" paramType="variable" />
- </inputs>
- <execute><![CDATA[
- //include css to xpath convert function
- y.include("http://james.padolsey.com/scripts/javascript/css2xpath.js");
- var query = null;
- if (css) {
- var xpath = CSS2XPATH(css);
- y.log("xpath "+xpath);
- query = y.query("select * from html where url=@url and xpath=\""+xpath+"\"",{url:url});
- } else {
- query = y.query("select * from html where url=@url",{url:url});
- }
- response.object = query.results;
- ]]></execute>
- </select>
- </bindings>
- </table>
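The execute block above concatenates the XPath expression directly into the query string, which can break if the expression itself contains a double quote. A possible alternative, sketched below, binds the XPath as a second variable. This assumes `y.query` substitutes `@xpath` the same way it substitutes `@url` in the listing, which has not been verified against the service here; `buildHtmlQuery` is a hypothetical helper.

```javascript
// Hypothetical helper: build the statement and the bound parameters
// for the html table, with or without an XPath filter.
function buildHtmlQuery(url, xpath) {
  var statement = "select * from html where url=@url";
  var params = { url: url };
  if (xpath) {
    // Assumption: @xpath is bound by y.query just like @url.
    statement += " and xpath=@xpath";
    params.xpath = xpath;
  }
  return { statement: statement, params: params };
}

// Inside <execute>, the result would be handed to y.query, for example:
//   var q = buildHtmlQuery(url, css ? CSS2XPATH(css) : null);
//   response.object = y.query(q.statement, q.params).results;
```

Keeping the string building in a plain function also means the branch logic can be tried out locally, without the `y` and `response` objects that only exist on the YQL server.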
Check the official Yahoo Developer Network blog post on YQL execute for more examples, including authentication examples for Flickr and Netflix.

So I can use Yahoo!’s servers to screen scrape anything I want?
That’s terrific. But what prevents abuse, such as huge attacks on some poor guy’s $5 a month data-limited hosted account?
@Nosredna YQL access is limited to a cap that would prevent that:
However, what prevents me from curling his page every second? I don’t need YQL to scrape people’s pages. What YQL does, though, is cache the results, which actually means fewer hits for the scraped page.
I did not know James Padolsey’s function, but it seems quite incomplete compared with the one I created for vice-versa.
Here is the specific function, via experiments and the document.query.css2xpath function
Maybe James and I could collaborate to create a complete and stable function (mine at least passes every CSS selector used in the SlickSpeed test ;) )
Thanks for the answer Chris. Agreed that it’s always been possible to scrape. It’s the ease of doing it and the indirection through Yahoo! servers that I was thinking of.
The caching is nice.
yep, tested right now, and James Padolsey’s function is both incomplete and buggy (with results as well) … James, give me a shout if you read me.
@Chris, Awesome work! I recommend using WebReflection’s converter though; as mentioned it’s more complete (and less buggy) than mine.
All,
One of the main reasons we made use of James’ CSS/XPath converter was to show how easy it is to plug useful JS functions and libraries into a table to get new functionality that people want in YQL.
Why not create a better CSS selector open data table and submit it to github for others to use and share? The sample ones aren’t part of the community repository (datatables.org), so that seems a good place for a better version to go.
Jonathan
JonathanT, I partially agree about a better version, but I do not get the “should be part of datatables.org” part … I mean, what’s wrong with my or James’ website/project? I’d rather see a specific one outside of whatever box … what do you think?
@Nosredna
Three words:
YQL honors robots.txt
unfortunately you can’t safely scrape everything on the web, because there are some conversion quirks
the main problem is that YQL returns well-formed XML, but the web is often a mess of both HTML and XHTML (also notice you can only scrape what’s inside the body tag)
look at this sample page I made (valid HTML 4): http://www.playquery.it/sandbox/yql/test3.html
this is how YQL parses it
some conversion errors:
– some HTML entities are converted to the corresponding character code (nbsp and reg), and some other not (amp,lt,gt)
– an anchor with a name=”top” now has also an id=”top”
– the textarea has some whitespace inside, but in the YQL result it is empty
– the table really freaks out (some p tags added, the form on the bottom of the page is put inside a td tag, the table is moved under the main paragraph)
and, as I noted on James’ blog some days ago (http://james.padolsey.com/javascript/using-yql-with-jsonp/), if you are forced to use the JSONP format instead of XML, it is even worse
but, anyway, if you know the source of your query very well, and it’s well-formed XHTML, I think YQL could be really awesome
I looked at getting Sizzle running to do the CSS selectors before we launched YQL Execute and in order to use it you need a DOM. In order to get a DOM in Rhino you need env.js which currently runs to about 8k lines of code.
This means in order to get Sizzle working you need about 9k lines of JS. CSS2XPath currently weighs in at under 100 lines of code. XPath is natively implemented in Rhino and doesn’t require any additional code.
So, from the perspective of speed, it’s 9k lines of interpretation vs. 100 lines, and given the execution cycle limits YQL has, you can spend those cycles on processing data, not creating a DOM.