1
LSRS 2016
(Some) pitfalls of distributed learning
Yves Raimond, Algorithms Engineering, Netflix
2
Some background
3
2006
4
5
Netflix scale
▪ > 83M members
▪ > 190 countries
▪ > 3B hours/month
▪ > 1000 device types
▪ 36% of peak US downstream traffic
6
Recommendations @ Netflix
7
Goal
Help members find content to watch and enjoy, to maximize member satisfaction and retention
8
9
Two potential reasons to try distributed learning
10
Reason 1: minimizing training time
[Timeline diagram for Model 1: collecting the dataset, then training, then serving; the elapsed time from the start of data collection to serving is the time-to-serve delay]
11
Training time vs online performance
▪ Most (all?) recommendation algorithms need to predict future behavior from past information
▪ If model training takes days, it might miss out on important changes
▪ New items being introduced
▪ Popularity swings
▪ Changes in underlying feature distributions
▪ Time-to-serve can be a key component of how good the recommendations are online
12
Training time vs experimentation speed
▪ Faster training time
=> more offline experiments and iterations
=> better models
▪ Many other factors at play (like modularity of the ML framework), but training time is a key one
▪ How quickly can you iterate through e.g. model architectures if training a model takes days?
13
Reason 2: increasing dataset size
▪ If your model is complex enough (trees, DNNs, …), more data could help
▪ … but this will have an impact on training time
▪ … which in turn could have a negative impact on time-to-serve delay and experimentation speed
▪ Hard limits: a single machine's memory and compute cap how much data you can use
14
Let’s distribute!
15
Topic-sensitive PageRank
▪ Popular graph diffusion algorithm
▪ Captures vertex importance with regard to a particular vertex
▪ Easy to distribute using Spark and GraphX
▪ Fast distributed implementation contributed by Netflix (coming up in Spark 2.1!)
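As a quick sketch of the recurrence being distributed (with \alpha the restart probability and e_s the indicator vector of the seed vertex, symbols mine rather than the deck's), each iteration updates the rank vector as

\pi^{(t+1)} = \alpha\, e_s + (1 - \alpha)\, P^{\top} \pi^{(t)}

where P is the row-normalized adjacency (transition) matrix: mass either follows outbound edges or teleports back to the seed.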
16
Iteration 0
We start by activating a single node
[Graph diagram: an example vertex “Seattle” with edges to neighbors labeled “related to”, “shot in”, “featured in”, and “cast”]
17
Iteration 1
With some probability, we follow outbound edges; otherwise we go back to the origin
18
Iteration 2
Vertex accumulates higher mass
19
Iteration 3
And again, until convergence
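As a concrete, deliberately naive single-machine sketch of these iterations in Scala (all names are mine; this is not the Spark/GraphX implementation):

  // Minimal topic-sensitive PageRank by power iteration.
  // outLinks(v) lists the outbound neighbors of vertex v; alpha is the
  // probability of jumping back to the seed at each step.
  def topicSensitivePageRank(outLinks: Array[Array[Int]],
                             seed: Int,
                             alpha: Double = 0.15,
                             iterations: Int = 50): Array[Double] = {
    val n = outLinks.length
    var rank = Array.fill(n)(0.0)
    rank(seed) = 1.0                      // iteration 0: activate the seed
    for (_ <- 0 until iterations) {
      val next = Array.fill(n)(0.0)
      next(seed) = alpha                  // restart mass returns to the seed
      for (v <- 0 until n if outLinks(v).nonEmpty) {
        val share = (1 - alpha) * rank(v) / outLinks(v).length
        outLinks(v).foreach(u => next(u) += share)  // follow outbound edges
      }
      rank = next                         // (mass at dangling vertices is
    }                                     //  simply dropped in this sketch)
    rank
  }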
20
Latent Dirichlet Allocation
▪ Popular clustering / latent factors model
▪ Uncollapsed Gibbs sampler is fairly easy to distribute
[Plate diagram: per-topic word distributions, per-document topic distributions, and the topic label for document d and word w]
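To see why the uncollapsed sampler distributes well, here is a sketch of its conditionals, using \theta_d for the per-document topic distribution, \phi_k for the per-topic word distribution, and z_{d,w} for the topic label as in the diagram (\alpha and \beta are symmetric Dirichlet priors, an assumption on my part):

p(z_{d,w} = k \mid \theta, \phi) \propto \theta_{d,k}\, \phi_{k,w}

\theta_d \sim \mathrm{Dir}(\alpha + n_{d,\cdot}), \qquad \phi_k \sim \mathrm{Dir}(\beta + n_{\cdot,k})

where n_{d,k} counts the words in document d currently assigned topic k, and n_{w,k} counts assignments of word w to topic k. Every update only touches local vertex attributes and edge counts, which is what makes a graph formulation natural.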
21
Distributed Gibbs Sampler
[Figure: a distributed parameterized bipartite graph for LDA with 3 topics; document vertices d1, d2 carry per-document topic distributions, word vertices w1, w2, w3 carry per-topic word distributions, and an edge connects a word to each document it appears in]
22
Distributed Gibbs Sampler
[Figure: the same graph; a categorical distribution over topics is built for a (document, word) triplet from the two vertex attributes]
23
Distributed Gibbs Sampler
[Figure: categorical distributions computed for all triplets]
24
Distributed Gibbs Sampler
[Figure: a topic is sampled for every edge (here 1, 1, 2, 0)]
25
Distributed Gibbs Sampler
[Figure: neighborhood aggregation builds per-vertex topic histograms from the sampled edge topics]
26
Distributed Gibbs Sampler
[Figure: new vertex attributes are realized by sampling from the Dirichlet posteriors to update the graph]
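Pulling these steps together, a deliberately naive single-machine sketch of one sweep in Scala (every name here is illustrative; the real implementation runs the same three steps as distributed GraphX operations):

  import scala.util.Random

  object LdaGibbsSketch {
    case class Edge(doc: Int, word: Int, var topic: Int = 0)

    // Marsaglia-Tsang gamma sampler; a Dirichlet draw is a normalized
    // vector of gamma draws.
    def sampleGamma(shape: Double, rng: Random): Double =
      if (shape < 1.0)  // boost trick for shape < 1
        sampleGamma(shape + 1.0, rng) * math.pow(rng.nextDouble(), 1.0 / shape)
      else {
        val d = shape - 1.0 / 3.0
        val c = 1.0 / math.sqrt(9.0 * d)
        var r = -1.0
        while (r < 0) {
          var x = 0.0
          var v = 0.0
          while (v <= 0) { x = rng.nextGaussian(); v = 1.0 + c * x }
          v = v * v * v
          val u = rng.nextDouble()
          if (u < 1 - 0.0331 * x * x * x * x ||
              math.log(u) < 0.5 * x * x + d * (1 - v + math.log(v))) r = d * v
        }
        r
      }

    def sampleDirichlet(params: Array[Double], rng: Random): Array[Double] = {
      val g = params.map(sampleGamma(_, rng))
      val s = g.sum
      g.map(_ / s)
    }

    def sampleCategorical(weights: Array[Double], rng: Random): Int = {
      val u = rng.nextDouble() * weights.sum
      var acc = 0.0
      var k = 0
      while (k < weights.length - 1 && { acc += weights(k); acc < u }) k += 1
      k
    }

    // One Gibbs sweep over the bipartite document/word graph:
    // (1) sample a topic for every edge from the categorical built out of
    //     the two endpoint attributes, (2) aggregate topic histograms on
    //     each side, (3) realize new attributes from Dirichlet posteriors.
    def sweep(edges: Seq[Edge],
              theta: Array[Array[Double]],  // per-document topic distributions
              phi: Array[Array[Double]],    // per-topic word distributions
              alpha: Double, beta: Double, rng: Random): Unit = {
      val numTopics = phi.length
      val numWords = phi(0).length
      for (e <- edges)
        e.topic = sampleCategorical(
          Array.tabulate(numTopics)(k => theta(e.doc)(k) * phi(k)(e.word)), rng)
      val docCounts = Array.fill(theta.length, numTopics)(0.0)
      val wordCounts = Array.fill(numTopics, numWords)(0.0)
      for (e <- edges) {                    // neighborhood aggregation
        docCounts(e.doc)(e.topic) += 1
        wordCounts(e.topic)(e.word) += 1
      }
      for (d <- theta.indices)              // realize samples from Dirichlet
        theta(d) = sampleDirichlet(docCounts(d).map(_ + alpha), rng)
      for (k <- 0 until numTopics)
        phi(k) = sampleDirichlet(wordCounts(k).map(_ + beta), rng)
    }
  }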
27
Now, is it faster?
28
Topic-sensitive PageRank
▪ Distributed Spark/GraphX implementation
▪ Available in Spark 2.1
▪ Propagates multiple vertices at once
▪ Alternative implementation
▪ Single-threaded and single-machine for one source vertex
▪ Works on the full graph adjacency
▪ Scala/Breeze, horizontally scaled with Spark to propagate multiple vertices at once
▪ Dimension: number of vertices for which we compute a ranking
29
Open-source DBpedia dataset
[Chart: training time vs number of source vertices]
Sublinear rise in time with Spark/GraphX vs linear rise in the horizontally scaled version
Doubling the cluster size: 2.0× speedup in the horizontally scaled version vs 1.2× in Spark/GraphX
30
Latent Dirichlet Allocation
▪ Distributed Spark/GraphX implementation
▪ Alternative implementation
▪ Single machine, multi-threaded Java code
▪ NOT horizontally scaled
▪ Dimension: training set size
31
Netflix dataset, number of topics = 100
Spark/GraphX setup: 8× the resources of the multi-core setup
Wikipedia dataset, 100-topic LDA; cluster: 16 × r3.2xl (source: Databricks)
[Charts: training time vs training set size for both datasets]
For very large datasets, Spark/GraphX outperforms multi-core
32
Other comparisons
▪ Frank McSherry’s blog post comparing different distributed PageRank implementations with a single-threaded Rust implementation on his laptop
▪ 1.5B edges for twitter_rv, 3.7B for uk_2007_05
▪ “If you are going to use a big data system for yourself, see if it is faster than your laptop.”
33
Other comparisons
▪ GraphChi, a single-machine large-scale graph computation engine developed at CMU, reports similar findings
34
Now, is it faster?
No, unless your problem or dataset is huge :(
35
To conclude...
▪ When distributing an algorithm, there are two opposing forces:
▪ 1) Communication overhead (shifting data from node to node)
▪ 2) More raw computing power available
▪ Whether one outweighs the other depends on the size of your problem
▪ Single-machine ML can be very efficient!
▪ Smarter algorithms can beat brute force
▪ Better data structures, input data formats, caching, optimization algorithms, etc. can all make a huge difference
▪ A good core implementation is a prerequisite to distribution
▪ Easy to get large machines!
36
To conclude...
▪ However, distribution lets you easily throw more hardware at a problem
▪ Also, some algorithms/methods are better than others at minimizing the communication overhead
▪ Iterative distributed graph algorithms can be inefficient in that respect
▪ Can your problem fit on a single machine?
▪ Can your problem be partitioned?
▪ For SGD-like algos, parameter servers can be used to distribute while keeping this overhead to a minimum (see the sketch below)
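For illustration, a minimal sketch of that parameter-server pattern; the interface and names are mine, not any particular system's API:

  // Workers pull current weights, compute a gradient on their data
  // shard, and push it back; only the model and the gradients cross
  // the network, never the training data itself.
  trait ParameterServer {
    def pull(keys: Seq[Int]): Map[Int, Double]   // fetch current weights
    def push(grads: Map[Int, Double]): Unit      // send back gradients
  }

  // Toy in-memory server applying asynchronous SGD updates; in a real
  // system this state would be sharded across server nodes.
  class InMemoryServer(dim: Int, lr: Double) extends ParameterServer {
    private val w = Array.fill(dim)(0.0)
    def pull(keys: Seq[Int]): Map[Int, Double] =
      keys.map(k => k -> w(k)).toMap
    def push(grads: Map[Int, Double]): Unit =
      for ((k, g) <- grads) w(k) -= lr * g       // apply the SGD step
  }

Communication then scales with the model size (or the active features per minibatch) rather than with the dataset.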
37
Questions?
(Yes, we’re hiring)
Many thanks to @EhtshamElahi!
