A Comparison of Approaches to Large-Scale Data Analysis
Andrew Pavlo, Brown University
Erik Paulson, University of Wisconsin
Alexander Rasin, Brown University
Daniel J. Abadi, Yale University
David J. DeWitt, Microsoft Inc.
Samuel Madden, M.I.T. CSAIL
Michael Stonebraker, M.I.T. CSAIL
Overview
The following information documents how others can recreate the benchmark trials used in our SIGMOD 2009 paper. Our experiments were conducted on a 100-node cluster at the University of Wisconsin-Madison; each node had a single 2.40 GHz Intel Core 2 Duo processor running 64-bit Red Hat Enterprise Linux 5 (kernel version 2.6.18) with 4GB RAM and two 250GB SATA-I hard disks.
Publications
- M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "MapReduce and Parallel DBMSs: Friends or Foes?," Communications of the ACM, vol. 53, iss. 1, pp. 64-71, 2010. [PDF] [BIBTEX]
@article{stonebraker2010cacm,
author = {Stonebraker, Michael and Abadi, Daniel and DeWitt, David J. and Madden, Sam and Paulson, Erik and Pavlo, Andrew and Rasin, Alexander},
title = {MapReduce and Parallel DBMSs: Friends or Foes?},
journal = {Communications of the ACM},
volume = {53},
number = {1},
year = {2010},
issn = {0001-0782},
pages = {64--71},
doi = {http://doi.acm.org/10.1145/1629175.1629197},
publisher = {ACM},
address = {New York, NY, USA},
url = {http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf},
}
- A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, "A comparison of approaches to large-scale data analysis," in SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of data, New York, NY, USA, 2009, pp. 165-178. [PDF] [BIBTEX]
@inproceedings{pavlo09,
author = {Pavlo, Andrew and Paulson, Erik and Rasin, Alexander and Abadi, Daniel J. and DeWitt, David J. and Madden, Samuel and Stonebraker, Michael},
title = {A comparison of approaches to large-scale data analysis},
booktitle = {SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of data},
year = {2009},
isbn = {978-1-60558-551-2},
pages = {165--178},
location = {Providence, Rhode Island, USA},
doi = {http://doi.acm.org/10.1145/1559845.1559865},
publisher = {ACM},
address = {New York, NY, USA},
url = {http://database.cs.brown.edu/papers/benchmarks-sigmod09.pdf}
}
Updates
- August 24th, 2011: The repository has been moved to a new permanent location: https://database.cs.brown.edu/svn/mr-benchmarks/
- October 13th, 2009: The following is the benchmarks.conf configuration that we used for the upcoming CACM paper. Based on feedback that we received from the Hadoop community, we were able to get SequenceFiles with compression to run slightly faster than plain text files.
COMBINE_RESULTS=1 ## Combine Final Results into Single Output File
COMPRESS_DATA=1   ## Compress Input Data and Intermediate Results
SEQUENCE_FILE=1   ## Use SequenceFile Input Formats
- April 14th, 2009: Previously we incorrectly stated that we used the -client option in our Hadoop configuration; please be aware that we used -server in all of the trials reported in the paper. Note that if you do not enable the JVM reuse option, then you will want to keep the default mapred.child.java.opts as -client instead of -server.
Source Code
All of the data generation and Hadoop source code used in the benchmark study is available via our public SVN repository:
svn co https://database.cs.brown.edu/svn/mr-benchmarks/
There is also a WebSVN interface to browse the source code online.
Hadoop Configuration
We changed the following properties in $HADOOP_CONF_DIR/hadoop-site.xml for our Hadoop 0.19 installation:
dfs.block.size                  268435456   # 256MB
mapred.child.java.opts          -Xmx512m -server
mapred.compress.map.output      true
mapred.job.reuse.jvm.num.tasks  -1
io.sort.factor                  100
io.sort.mb                      256
tasktracker.http.threads        60
topology.script.file.name       /path/to/rack-mapper.pl
Data Generation
There are two data sets that need to be generated: (1) the Grep data and (2) the HTML documents and their corresponding data tables. Please note that although the scripts that we provide here assume the network layout and infrastructure of the University of Wisconsin’s DB cluster, they should be easily adaptable to other configurations.
Grep Data Sets
We use a custom Map/Reduce job that generates the Grep data set in HDFS from directly inside of the map task. We use a modified version of Yahoo!’s TeraGen data generator that was also used for the TeraByte benchmark. Because the SQL systems cannot operate on files stored in HDFS, we provide scripts to download the data into separate files on each node and store them on the local file systems.
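To see the shape of the data this job produces, the sketch below builds TeraGen-style records in plain Python: each record is 100 bytes, with a key in the first 10 bytes followed by a 90-byte random value. This is only an illustration of the record layout; the actual generator is a Java Map/Reduce job in the repository, and the exact character alphabet here is an assumption.

```python
import random
import string

def make_record(rng: random.Random) -> str:
    """Build one 100-character Grep record: a 10-character key
    followed by a 90-character random value (alphabet is an assumption)."""
    key = "".join(rng.choice(string.ascii_uppercase) for _ in range(10))
    value = "".join(rng.choice(string.ascii_uppercase + string.digits)
                    for _ in range(90))
    return key + value

rng = random.Random(42)  # seeded for repeatability
records = [make_record(rng) for _ in range(1000)]

# The Grep task later scans the value field for a rare substring:
matches = [r for r in records if "XYZ" in r[10:]]
```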
To generate the data, execute the following instructions. Please note that our code assumes that there are 100 nodes in the cluster; you can modify teragen.pl to tune the parameters for how much data to create. This will create four different data sets (535MBx100, 10GBx100, 20GBx50, and 40GBx25) in the HDFS directory /data/SortGrep/.
svn co https://database.cs.brown.edu/svn/mr-benchmarks/
cd datagen/teragen
start-dfs.sh
start-mapred.sh
perl ./teragen.pl
You can then use the script teragen-grabber.pl to pull down the separate data files on each node. Again, please note that this script assumes that each node's hostname is formatted as d-###, but this can easily be adapted to your environment. You can also use our pusher utility (similar to pssh) to execute teragen-grabber.pl in parallel on each node. This program assumes that there is a shared network filesystem available on each node (e.g., NFS or AFS), even though our benchmarks did not use network storage.
pusher "$HOME/MapReduce/benchmarks/datagen/teragen/teragen-grabber.pl /data/SortGrep/ /path/to/local/directory/" \ --debug --hosts=$HADOOP_CONF_DIR/slaves
Analysis Benchmarks Data Sets
The SVN repository also contains code to generate the random HTML data sets. All you need to do is execute the following command on a single node in your cluster, and it will fire off the generation program on each node in parallel (again, we assume that a shared file system is available on each node). Make sure to modify config.txt first in order to change the parameters according to your cluster configuration.
cd htmlgen/teragen
nano config.txt
python ./generateData.py
By default, this program generates a set of random HTML documents and derives the Rankings data set (1GB/node) and the UserVisits data set (20GB/node). The SQL schema for loading these tables into a database is in CACM_schema.sql.
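As a quick reference for the two derived tables, the sketch below builds one illustrative row of each in Python. The field names and types follow the paper's schema, but the sample values are made up; consult CACM_schema.sql in the repository for the authoritative definitions.

```python
from datetime import date

# Rankings(pageURL, pageRank, avgDuration) -- one row per generated page
ranking = {
    "pageURL": "url1.html",
    "pageRank": 42,
    "avgDuration": 5,
}

# UserVisits -- one row per simulated visit; destURL joins to Rankings.pageURL
visit = {
    "sourceIP": "158.112.27.3",
    "destURL": "url1.html",
    "visitDate": date(2000, 1, 17),
    "adRevenue": 133.70,
    "userAgent": "Mozilla/4.0",
    "countryCode": "USA",
    "languageCode": "en-US",
    "searchWord": "databases",
    "duration": 9,
}

# The join task (Benchmark #3) connects the two tables on this key:
assert visit["destURL"] == ranking["pageURL"]
```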
Hadoop Benchmarks
Execution of the Hadoop benchmarks is completely automated; the main controller of the benchmark execution is the runner.sh shell script. This program takes a configuration file (see benchmarks.conf-sample) that specifies which benchmarks and cluster configurations to execute. You will also need to modify the global configuration file (see runner.conf-sample) to correspond with your local cluster configuration. Once these files have been changed, and after you have downloaded the Grep and Analytical benchmark data onto the local filesystem of each node, you then need to build the JAR files used by the benchmark runner. Once this is completed, you can execute runner.sh to run all of the benchmarks:
cd mapreduce
cp runner.conf-sample runner.conf
nano runner.conf
cp benchmarks.conf-sample benchmarks.conf
nano benchmarks.conf
ant jars
./runner.sh ./benchmarks.conf
By default, the runner will automatically reformat your HDFS installation for each cluster size (e.g., 10 nodes, 25 nodes, etc.) and start/stop the Hadoop HDFS Master and JobTracker each time. This ensures that the data is distributed uniformly across all nodes in the cluster.
You can modify the options in your benchmarks.conf configuration file to change this behavior. The other key options for the benchmarks are found at the top of this file (to enable or disable a particular feature in a benchmark trial, set its value to either 1 or 0, respectively):
COMBINE_RESULTS=1 ## Combine Final Results into Single Output File
COMPRESS_DATA=0   ## Compress Input Data and Intermediate Results
TUPLE_DATA=0      ## Use custom tuples for multi-column values
SEQUENCE_FILE=0   ## Use SequenceFile Input Formats
SPLIT_DATA=0      ## Split input files into separate, smaller subsets
INPUT_LIMIT=0     ## How many lines to limit for Input Files when loading
You will also need to set the SLAVES_CONF_DIR option in benchmarks.conf to point to a directory containing the different slave host configuration files (one for each of the different cluster sizes tested):
## Directory of slave files to use
## This directory should contain the following files:
##   slaves-001 (1 host)
##   slaves-010 (10 hosts)
##   slaves-025 (25 hosts)
##   slaves-050 (50 hosts)
##   slaves-100 (100 hosts)
SLAVES_CONF_DIR="$HADOOP_CONF_DIR/slaves-files"
SQL Commands
As taken from the paper, the following are the SQL commands that we used for each of the benchmark tasks. You may need to modify them slightly to get them to work on different database systems.
Original MR Benchmark:
SELECT * FROM Data WHERE field LIKE '%XYZ%';
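On the MR side, this task is a pure scan with no reduce phase. The sketch below illustrates the map logic in plain Python; the actual benchmark implementation is a Java Map/Reduce job in the SVN repository, and the function name here is ours, not the repository's.

```python
def grep_map(record: str, pattern: str = "XYZ"):
    """Map phase of the Grep task: emit the record iff the pattern
    occurs in it (an identity filter; there is no reduce phase)."""
    if pattern in record:
        yield record

# Tiny in-memory stand-in for the HDFS input split:
data = ["abcXYZdef", "nomatch", "XYZ at start"]
hits = [r for rec in data for r in grep_map(rec)]
# hits == ["abcXYZdef", "XYZ at start"]
```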
Benchmark #1 – Select Task
SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;
Benchmark #2 – Aggregate Task
SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
Benchmark #2a – Aggregate Task (Variant)
SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
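Both aggregate variants amount to a group-by-and-sum in the reduce phase; the only difference is the grouping key (the full sourceIP versus its seven-character prefix, per the SUBSTR above). A minimal Python sketch of that logic, with illustrative inputs (the real implementation is Java, in the repository):

```python
from collections import defaultdict

def aggregate(visits, prefix_len=None):
    """Sum adRevenue per sourceIP; with prefix_len=7 this mirrors
    the SUBSTR(sourceIP, 1, 7) grouping of Benchmark #2a."""
    totals = defaultdict(float)
    for source_ip, ad_revenue in visits:
        key = source_ip[:prefix_len] if prefix_len else source_ip
        totals[key] += ad_revenue
    return dict(totals)

visits = [("158.112.27.3", 1.0), ("158.112.99.9", 2.0), ("10.0.0.1", 0.5)]
by_ip = aggregate(visits)                    # Benchmark #2: per full sourceIP
by_prefix = aggregate(visits, prefix_len=7)  # Benchmark #2a: per IP prefix
# by_prefix["158.112"] == 3.0
```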
Benchmark #3 – Join Task
SELECT INTO Temp sourceIP,
AVG(pageRank) as avgPageRank,
SUM(adRevenue) as totalRevenue
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY UV.sourceIP;
SELECT sourceIP, totalRevenue, avgPageRank
FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
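The join task combines a date filter, a join, and two aggregates before picking the top sourceIP by revenue. A hash-join sketch of the same logic in Python, with illustrative rows (the real Hadoop implementation lives in the SVN repository; sample values and dicts-as-rows are our simplification):

```python
from datetime import date

def join_task(rankings, user_visits, start, end):
    """Benchmark #3: join Rankings with date-filtered UserVisits,
    then report the sourceIP with the highest total adRevenue."""
    rank_by_url = {r["pageURL"]: r["pageRank"] for r in rankings}  # build side
    stats = {}  # sourceIP -> [revenue_sum, rank_sum, visit_count]
    for uv in user_visits:
        if start <= uv["visitDate"] <= end and uv["destURL"] in rank_by_url:
            s = stats.setdefault(uv["sourceIP"], [0.0, 0, 0])
            s[0] += uv["adRevenue"]
            s[1] += rank_by_url[uv["destURL"]]
            s[2] += 1
    top_ip = max(stats, key=lambda ip: stats[ip][0])
    rev, rank_sum, n = stats[top_ip]
    return top_ip, rev, rank_sum / n  # (sourceIP, totalRevenue, avgPageRank)

rankings = [{"pageURL": "a.html", "pageRank": 10},
            {"pageURL": "b.html", "pageRank": 50}]
visits = [
    {"sourceIP": "1.1.1.1", "destURL": "a.html",
     "visitDate": date(2000, 1, 16), "adRevenue": 3.0},
    {"sourceIP": "2.2.2.2", "destURL": "b.html",
     "visitDate": date(2000, 1, 20), "adRevenue": 5.0},
    {"sourceIP": "1.1.1.1", "destURL": "b.html",
     "visitDate": date(2000, 2, 1), "adRevenue": 9.0},  # outside date range
]
result = join_task(rankings, visits, date(2000, 1, 15), date(2000, 1, 22))
# result == ("2.2.2.2", 5.0, 50.0)
```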
Benchmark #4 – UDF Aggregate
SELECT INTO Temp F(contents) FROM Documents;

SELECT url, SUM(value) AS pageRank FROM Temp GROUP BY url;
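Here F(contents) is a user-defined function that parses each document and emits (url, value) pairs for its outgoing links, which the second query then sums per URL. A minimal sketch of this shape in Python; the naive href regex and the uniform value of 1 per link are our simplifying assumptions, not the benchmark's actual UDF.

```python
import re
from collections import defaultdict

LINK_RE = re.compile(r'href="([^"]+)"')  # simplistic link extractor (assumption)

def F(contents: str):
    """UDF stand-in: emit (url, 1) for every outgoing link in the document."""
    for url in LINK_RE.findall(contents):
        yield url, 1

def udf_aggregate(documents):
    """Equivalent of SELECT url, SUM(value) ... GROUP BY url over F's output."""
    totals = defaultdict(int)
    for doc in documents:
        for url, value in F(doc):
            totals[url] += value
    return dict(totals)

docs = ['<a href="x.html">x</a> <a href="y.html">y</a>',
        '<a href="x.html">again</a>']
page_rank = udf_aggregate(docs)
# page_rank == {"x.html": 2, "y.html": 1}
```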