Streaming MapReduce with Scalding and Storm
Latest commit 11adbcb Jun 23, 2016 Pankaj Gupta Setting version to 0.11.0-SNAPSHOT

Summingbird

Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.

While a word-counting aggregation in pure Scala might look like this:

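  import scala.collection.mutable.{ Map => MutableMap }

  // Adds 1 to the stored count for every word in every sentence.
  def wordCount(source: Iterable[String], store: MutableMap[String, Long]) =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.foreach { case (k, v) => store.update(k, store.getOrElse(k, 0L) + v) }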

Counting words in Summingbird looks like this:

  def wordCount[P <: Platform[P]]
    (source: Producer[P, String], store: P#Store[String, Long]) =
      source.flatMap { sentence =>
        toWords(sentence).map(_ -> 1L)
      }.sumByKey(store)

The logic is exactly the same, and the code is almost the same. The main difference is that you can execute the Summingbird program in "batch mode" (using Scalding), in "realtime mode" (using Storm), or on both Scalding and Storm in a hybrid batch/realtime mode that offers your application very attractive fault-tolerance properties.
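
To make the comparison concrete, here is a minimal, self-contained sketch of the pure Scala version that runs with only the standard library. The `toWords` helper is not defined in this README, so the tokenizer below is an assumption made for illustration, and the `WordCountSketch` object name is hypothetical. The update uses `getOrElse(k, 0L)` so that a word seen for the first time starts from zero.

```scala
import scala.collection.mutable.{ Map => MutableMap }

object WordCountSketch {
  // Hypothetical tokenizer (an assumption, not Summingbird's own):
  // lowercases the sentence and splits on runs of non-word characters.
  def toWords(sentence: String): Seq[String] =
    sentence.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Same shape as the pure Scala wordCount above: emit (word, 1L) pairs,
  // then fold each pair into the mutable store.
  def wordCount(source: Iterable[String], store: MutableMap[String, Long]): Unit =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.foreach { case (k, v) => store.update(k, store.getOrElse(k, 0L) + v) }

  def main(args: Array[String]): Unit = {
    val store = MutableMap.empty[String, Long]
    wordCount(List("the quick brown fox", "the lazy dog"), store)
    // Print the counts in alphabetical order of the words.
    store.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w -> $n") }
  }
}
```

In the Summingbird version, the `foreach` that mutates the map is what `sumByKey(store)` replaces; the `flatMap` step is unchanged.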

Summingbird provides you with the primitives you need to build rock-solid production systems.

Getting Started: Word Count with Twitter

The summingbird-example project allows you to run the word-count program above on a sample of Twitter data using a local Storm topology and a local Memcached instance. You can find the actual job definition in ExampleJob.scala.

First, make sure you have memcached installed locally. If you don't and you're on OS X, you can get it by installing Homebrew and running this command in a shell:

brew install memcached

When this is finished, run the memcached command in a separate terminal.

Now you'll need to set up access to the Twitter Streaming API. This blog post has a great walkthrough; follow it, then head over to https://dev.twitter.com/ to get your keys and tokens. Once you have these, clone the Summingbird repository:

git clone https://github.com/twitter/summingbird.git
cd summingbird

Then open StormRunner.scala in your editor and replace the dummy values in the config variable with your auth tokens:

lazy val config = new ConfigurationBuilder()
    .setOAuthConsumerKey("mykey")
    .setOAuthConsumerSecret("mysecret")
    .setOAuthAccessToken("token")
    .setOAuthAccessTokenSecret("tokensecret")
    .setJSONStoreEnabled(true) // required for JSON serialization
    .build
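
If you would rather not paste secrets into a source file, one option is to read the four values from environment variables first. The sketch below is only an illustration under stated assumptions: the `TwitterCredentials` object and the four variable names are hypothetical and not part of Summingbird or the example project.

```scala
object TwitterCredentials {
  // Hypothetical environment variable names (an assumption for this sketch),
  // listed in the order: consumer key, consumer secret, token, token secret.
  val Required: Seq[String] = Seq(
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET"
  )

  // Returns Right(values in the order above) when every variable is set and
  // non-empty, or Left(names of the missing variables) otherwise.
  def load(env: Map[String, String]): Either[Seq[String], Seq[String]] = {
    val missing = Required.filterNot(name => env.get(name).exists(_.nonEmpty))
    if (missing.nonEmpty) Left(missing) else Right(Required.map(env))
  }

  def main(args: Array[String]): Unit =
    load(sys.env) match {
      case Right(_)      => println("All four Twitter credentials are set.")
      case Left(missing) => println("Missing: " + missing.mkString(", "))
    }
}
```

On a `Right`, the four values can be passed to the `setOAuth...` calls shown above in place of the string literals.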

You're all ready to go! Now it's time to unleash Storm on your Twitter stream. Make sure the memcached terminal is still open, then start Storm from the summingbird directory:

./sbt "summingbird-example/run --local"

Storm should print a burst of log output, then settle down and appear to hang. This means that Storm is updating your local Memcached instance with counts of every word that it sees in each tweet.

To query the aggregate results in Memcached, you'll need to open an sbt REPL in a new terminal:

./sbt summingbird-example/console

Once the REPL has launched, run the following:

scala> import com.twitter.summingbird.example._
import com.twitter.summingbird.example._

scala> StormRunner.lookup("i")
<memcache store loading elided>
res0: Option[Long] = Some(5)

scala> StormRunner.lookup("i")
res1: Option[Long] = Some(52)

Boom. Counts for the word "i" are growing in realtime.
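
Instead of re-running the lookup by hand, you could sample it in a loop. The following is a small sketch, not part of the example project: `PollSketch` and `poll` are hypothetical names, and the function takes any `String => Option[Long]` so it can be tried without Storm or Memcached running.

```scala
object PollSketch {
  // Calls `lookup(word)` `samples` times, sleeping `intervalMs` between
  // calls, and returns the observed values in the order they were read.
  def poll(
    lookup: String => Option[Long],
    word: String,
    samples: Int,
    intervalMs: Long
  ): Vector[Option[Long]] =
    (1 to samples).map { i =>
      val observed = lookup(word)
      // No need to sleep after the final sample.
      if (i < samples) Thread.sleep(intervalMs)
      observed
    }.toVector

  def main(args: Array[String]): Unit = {
    // Stand-in for a live store: the count grows by 5 on every read.
    var count = 0L
    val fakeLookup: String => Option[Long] = _ => { count += 5; Some(count) }
    poll(fakeLookup, "i", samples = 3, intervalMs = 10L).foreach(println)
  }
}
```

Assuming the sketch is pasted into the REPL session above, a call such as `PollSketch.poll(w => StormRunner.lookup(w), "i", 5, 1000L)` would take five readings one second apart.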

See the wiki page for a more detailed explanation of the configuration required to get this job up and running and some ideas for where to go next.

Community and Documentation

This, and all github.com/twitter projects, are under the Twitter Open Source Code of Conduct. Additionally, see the Typelevel Code of Conduct for specific examples of harassing behavior that are not tolerated.

To learn more and find links to tutorials and information around the web, check out the Summingbird Wiki.

The latest ScalaDocs are hosted on Summingbird's GitHub Project Page.

Discussion occurs primarily on the Summingbird mailing list. Issues should be reported on the GitHub issue tracker. Simpler issues appropriate for first-time contributors looking to help out are tagged "newbie".

IRC: freenode channel #summingbird

Follow @summingbird on Twitter for updates.

Please feel free to use the beautiful Summingbird logo artwork anywhere.

Maven

Summingbird modules are published on Maven Central. The current groupId for all modules is "com.twitter" and the current version is 0.9.1.

The currently published artifacts are:

  • summingbird-core_2.11
  • summingbird-core_2.10
  • summingbird-batch_2.11
  • summingbird-batch_2.10
  • summingbird-client_2.11
  • summingbird-client_2.10
  • summingbird-storm_2.11
  • summingbird-storm_2.10
  • summingbird-scalding_2.11
  • summingbird-scalding_2.10
  • summingbird-builder_2.11
  • summingbird-builder_2.10

The suffix denotes the Scala version.
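
As an illustration, an sbt build could pull in some of these modules as sketched below. The choice of modules is an assumption made for the example; depend only on the ones you need. sbt's `%%` operator appends the Scala version suffix (`_2.10` or `_2.11`) to the artifact name for you.

```scala
// build.sbt fragment (a sketch, not the project's own build definition)
val summingbirdVersion = "0.9.1"

libraryDependencies ++= Seq(
  "com.twitter" %% "summingbird-core"     % summingbirdVersion,
  "com.twitter" %% "summingbird-storm"    % summingbirdVersion,
  "com.twitter" %% "summingbird-scalding" % summingbirdVersion
)
```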

Authors (alphabetically)

License

Copyright 2013 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0