RDFgrid: Map/Reduce-based Linked Data Processing with Hadoop

RDFgrid is a simple framework for map/reduce-based batch processing of RDF data with Hadoop and Amazon Elastic MapReduce.

http://github.com/datagraph/rdfgrid

Features

Processes RDF data in the line-oriented, whitespace-separated N-Triples format.
Provides RDF statement manipulation using RDF.rb's object model; no manual parsing or serialization involved.
Provides built-in aggregate combiners/reducers for the common sum, min, max, and avg operations.
Compatible with Hadoop Streaming and Amazon's Elastic MapReduce service.
Available as a prepackaged archive with all dependencies included, simplifying deployments using Hadoop's distributed cache.
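
To make the four aggregate operations listed above concrete, here is a minimal plain-Ruby sketch of what each one computes over the values collected for a single key. It does not use RDFgrid's built-in combiner/reducer classes, whose names are not shown in this document; the `AGGREGATES` table and the `aggregate` helper are hypothetical names introduced only for illustration.

```ruby
# Hypothetical helpers, not part of RDFgrid: plain-Ruby versions of the
# sum, min, max, and avg operations applied to one key's values.
AGGREGATES = {
  sum: ->(numbers) { numbers.sum },
  min: ->(numbers) { numbers.min },
  max: ->(numbers) { numbers.max },
  avg: ->(numbers) { numbers.sum.to_f / numbers.size },
}.freeze

# Values are given as strings here, as they would be when read from a
# text stream, so they are converted to integers before aggregating.
def aggregate(operation, values)
  numbers = values.map { |value| Integer(value) }
  raise ArgumentError, 'no values to aggregate' if numbers.empty?
  AGGREGATES.fetch(operation).call(numbers)
end

values = %w[3 1 4 1 5]
%i[sum min max avg].each do |operation|
  puts "#{operation}: #{aggregate(operation, values)}"
end
```

In this sketch, sum, min, and max can safely be applied to partial results and then applied again to those results, which is what makes them usable as combiners. An average of partial averages is generally not the overall average, so a combiner for avg would need to carry the count along with the running sum.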

Examples

A mapper for counting RDF predicate usage (`doc/examples/mapper.rb`)

#!/usr/bin/ruby -Ilib
require 'rdfgrid'

class PredicateCounter < RDFgrid::Mapper::StatementMapper
  # Emits the statement's predicate as the key, with a count of 1.
  def process(statement)
    yield statement.predicate, 1
  end
end

PredicateCounter.process!

A reducer for summing up RDF predicate usage (`doc/examples/reducer.rb`)

#!/usr/bin/ruby -Ilib
require 'rdfgrid'

class PredicateSummer < RDFgrid::Reducer
  # Emits the total of the given values, converting each to an integer.
  def process(values)
    yield values.inject(0) { |sum, value| sum + value.to_i }
  end
end

PredicateSummer.process!

Running the mapper and reducer pipeline with a local N-Triples dataset

$ cat data.nt | ruby mapper.rb | sort | ruby reducer.rb
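
The data flow of that pipeline can be sketched in a single Ruby process without RDFgrid or Hadoop. This is an illustration under stated assumptions, not RDFgrid's implementation: it assumes tab-separated key/value lines (the Hadoop Streaming default), and it pulls out the predicate with a simplified regular expression where RDFgrid relies on RDF.rb for parsing. The method names `map_predicates` and `reduce_counts` and the sample data are hypothetical.

```ruby
# Sketch only: mimics `mapper | sort | reducer` in one process.

# Map step: emit "predicate<TAB>1" for every statement line.
# The regex is a simplification, not a complete N-Triples parser.
def map_predicates(lines)
  lines.map do |line|
    # Skip blank lines and comments.
    next if line.strip.empty? || line.lstrip.start_with?('#')
    match = line.match(/\A\s*\S+\s+(<[^>]*>)\s/)
    "#{match[1]}\t1" if match
  end.compact
end

# Reduce step: the input is sorted, so lines with equal keys are adjacent
# and can be grouped in a single pass before summing their values.
def reduce_counts(sorted_lines)
  sorted_lines
    .map { |line| line.chomp.split("\t", 2) }
    .chunk_while { |a, b| a[0] == b[0] }
    .map { |group| [group[0][0], group.sum { |_, value| value.to_i }] }
end

# Hypothetical sample data standing in for data.nt.
data = <<~NT
  <http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
  <http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
  <http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
NT

# The sort between the two steps plays the role of the shell's `sort`.
counts = reduce_counts(map_predicates(data.lines).sort)
counts.each { |predicate, count| puts "#{predicate}\t#{count}" }
```

The sort in the middle is what lets the reduce step work in one pass: without it, lines for the same predicate could be scattered through the input and each group would be counted separately.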

Documentation

RDFgrid::Mapper
RDFgrid::Reducer

Dependencies

RDF.rb (>= 0.1.2)

Installation

The recommended installation method is via RubyGems. To install the latest official release, run:

% [sudo] gem install rdfgrid

Download

To get a local working copy of the development repository, run:

% git clone git://github.com/datagraph/rdfgrid.git

Alternatively, you can download the latest development version as a tarball:

% wget http://github.com/datagraph/rdfgrid/tarball/master

Author

Arto Bendiken - http://ar.to/

License

RDFgrid is free and unencumbered public domain software. For more information, see http://unlicense.org/ or the accompanying UNLICENSE file.