This document discusses some potential pitfalls of distributed learning. While distributing training across multiple machines can reduce training time and allow larger datasets to be processed, the communication overhead between machines can also slow training down. Using two algorithms, Topic-Sensitive PageRank and Latent Dirichlet Allocation, distributed with Spark, it shows that a single-machine implementation can often outperform a distributed one on smaller problems and datasets thanks to lower communication costs. It concludes that distribution is best suited to problems and datasets too large to fit or train on a single machine.
10.
Reason 1: minimizing training time
[Timeline diagram for Model 1: collecting the dataset, then training, then serving; the elapsed time from data collection to serving is the time-to-serve delay]
11.
Training time vs online performance
▪ Most (all?) recommendation algorithms need to predict future behavior from past information
▪ If model training takes days, it might miss out on important changes:
▪ New items being introduced
▪ Popularity swings
▪ Changes in underlying feature distributions
▪ Time-to-serve can be a key component of how good the recommendations will be online
12.
Training time vs experimentation speed
▪ Faster training time => more offline experiments and iterations => better models
▪ Many other factors are at play (like the modularity of the ML framework), but training time is a key one
▪ How quickly can you iterate through e.g. model architectures if training a model takes days?
13.
Reason 2: increasing dataset size
▪ If your model is complex enough (trees, DNNs, …), more data could help
▪ … But this will have an impact on the training time
▪ … Which in turn could have a negative impact on time-to-serve delay and experimentation speed
▪ Hard limits: eventually the dataset is too large to fit or train on a single machine
15.
Topic-sensitive PageRank
▪ Popular graph diffusion algorithm
▪ Captures vertex importance with respect to a particular vertex
▪ Easy to distribute using Spark and GraphX
▪ Fast distributed implementation contributed by Netflix (coming up in Spark 2.1!)
16.
Iteration 0
We start by activating a single node, "Seattle".
[Graph diagram: the "Seattle" vertex and its neighborhood, with edges labeled "related to", "shot in", "featured in", and "cast"]
17.
Iteration 1
With some probability we follow outbound edges; otherwise we go back to the origin (see the code sketch below).
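To make the diffusion above concrete, here is a minimal single-machine sketch of topic-sensitive PageRank for one source vertex, written as a power iteration over an adjacency array. The names and defaults (adjacency, alpha, iterations) and the dangling-vertex handling are assumptions for illustration, not the Netflix/Spark implementation mentioned in the slides.

```scala
// Minimal sketch of single-source topic-sensitive PageRank as a power iteration.
// All names here are illustrative, not taken from the actual implementation.
object TopicSensitivePageRank {

  /** adjacency(v) lists the outbound neighbors of vertex v (dense 0-based ids). */
  def rank(adjacency: Array[Array[Int]],
           source: Int,
           alpha: Double = 0.85,   // probability of following an outbound edge
           iterations: Int = 50): Array[Double] = {
    val n = adjacency.length
    var r = Array.tabulate(n)(i => if (i == source) 1.0 else 0.0)

    for (_ <- 0 until iterations) {
      val next = new Array[Double](n)
      next(source) += 1.0 - alpha        // with prob. (1 - alpha) we jump back to the origin
      var v = 0
      while (v < n) {
        val outs = adjacency(v)
        if (outs.isEmpty) {
          next(source) += alpha * r(v)   // dangling vertex: return its mass to the origin
        } else {
          val share = alpha * r(v) / outs.length
          outs.foreach(u => next(u) += share)
        }
        v += 1
      }
      r = next
    }
    r                                    // r(u) = importance of u with respect to the source
  }
}
```

Since each source vertex defines an independent computation, many such rankings can be run in parallel, which is what the horizontally scaled version described later exploits.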
20.
Latent Dirichlet Allocation
▪ Popular clustering / latent-factors model
▪ Uncollapsed Gibbs sampler is fairly easy to distribute (sketched below)
[Plate diagram: per-topic word distributions, per-document topic distributions, and the topic label for document d and word w]
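As a rough illustration of the uncollapsed sampler, here is a minimal sketch of the per-token step: given the per-document topic distribution theta(d) and the per-topic word distributions phi (both drawn in the previous sweep), the topic label for word w of document d is resampled proportionally to theta(d)(k) * phi(k)(w). All function and variable names are assumptions for illustration.

```scala
import scala.util.Random

// Minimal sketch of the topic-label resampling step of an uncollapsed LDA
// Gibbs sampler. theta(d)(k): per-document topic distribution;
// phi(k)(w): per-topic word distribution. Names are illustrative only.
object LdaGibbsStep {

  /** Sample the topic label z_{d,w} with probability proportional to
    * theta(d)(k) * phi(k)(w). */
  def sampleTopic(theta: Array[Array[Double]],
                  phi: Array[Array[Double]],
                  d: Int, w: Int, rng: Random): Int = {
    val numTopics = theta(d).length
    val weights = Array.tabulate(numTopics)(k => theta(d)(k) * phi(k)(w))
    var u = rng.nextDouble() * weights.sum
    var k = 0
    while (k < numTopics - 1 && u > weights(k)) { u -= weights(k); k += 1 }
    k
  }
}
```

Because this step only reads theta for one document and the phi entries for that document's words, documents can be resampled independently on different workers; the global theta and phi draws then have to be synchronized between sweeps, which is roughly where the communication overhead discussed later comes in.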
28.
Topic-Sensitive PageRank
▪ Distributed Spark/GraphX implementation
▪ Available in Spark 2.1
▪ Propagates multiple vertices at once
▪ Alternative implementation
▪ Single-threaded and single-machine for one source vertex
▪ Works on full graph adjacency
▪ Scala/Breeze, horizontally scaled with Spark to propagate multiple vertices at once (see the sketch below)
▪ Dimension: number of vertices for which we compute a ranking
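A minimal sketch of what "horizontally scaled with Spark" could look like, reusing the single-machine ranker from the earlier sketch: the full adjacency is broadcast to every executor and each source vertex becomes an independent task. The names (loadAdjacency, sources, TopicSensitivePageRank.rank) are assumptions for illustration, not the actual Netflix code.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: horizontally scale a single-machine, single-source ranker by treating
// each source vertex as an independent Spark task.
object HorizontallyScaledPPR {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ppr").getOrCreate()
    val sc = spark.sparkContext

    // The full adjacency must fit on each executor; broadcast it once.
    val adjacency: Array[Array[Int]] = loadAdjacency()    // assumed helper
    val adjBc = sc.broadcast(adjacency)

    val sources: Seq[Int] = Seq(0, 1, 2)                   // vertices to rank from
    val ranks = sc.parallelize(sources)
      .map(src => src -> TopicSensitivePageRank.rank(adjBc.value, src))
      .collectAsMap()

    ranks.foreach { case (src, r) => println(s"source $src -> max score ${r.max}") }
    spark.stop()
  }

  // Placeholder: in the talk the graph is DBpedia; loading is out of scope here.
  def loadAdjacency(): Array[Array[Int]] = Array(Array(1), Array(2), Array(0))
}
```

Each task is embarrassingly parallel, so runtime grows roughly linearly with the number of source vertices, which is consistent with the benchmark on the next slide.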
29.
Open-source DBpedia dataset
▪ Sublinear rise in runtime with Spark/GraphX vs. a linear rise in the horizontally scaled version
▪ Doubling the cluster size: 2.0× speedup in the horizontally scaled version vs. 1.2× in Spark/GraphX
30.
Latent Dirichlet Allocation
▪ Distributed Spark/GraphX implementation
▪ Alternative implementation
▪ Single-machine, multi-threaded Java code (see the sketch below)
▪ NOT horizontally scaled
▪ Dimension: training set size
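The actual alternative implementation is multi-threaded Java and is not shown in the slides; as a rough illustration of the idea, here is a Scala sketch that resamples disjoint blocks of documents in parallel on one machine, since each document only reads the shared theta/phi from the previous sweep. The names (MultiThreadedSweep, docs, z, and the reuse of LdaGibbsStep from the earlier sketch) are assumptions for illustration.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import scala.util.Random

// Sketch: one multi-threaded Gibbs sweep over all documents on a single machine.
object MultiThreadedSweep {
  def sweep(docs: Array[Array[Int]],   // docs(d) = word ids of document d
            z: Array[Array[Int]],      // z(d)(i) = topic label of token i in doc d
            theta: Array[Array[Double]],
            phi: Array[Array[Double]],
            numThreads: Int = Runtime.getRuntime.availableProcessors()): Unit = {
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    // Split document indices into blocks; each block is resampled on its own thread.
    val futures = docs.indices.grouped(math.max(1, docs.length / numThreads)).map { block =>
      Future {
        val rng = new Random()
        for (d <- block; i <- docs(d).indices)
          z(d)(i) = LdaGibbsStep.sampleTopic(theta, phi, d, docs(d)(i), rng)
      }
    }.toSeq

    Await.result(Future.sequence(futures), Duration.Inf)
    pool.shutdown()
    // After the sweep, theta and phi would be redrawn from their Dirichlet
    // posteriors given the updated topic counts (omitted here).
  }
}
```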
31.
▪ Netflix dataset, number of topics = 100: the Spark/GraphX setup uses 8× the resources of the multi-core setup
▪ Wikipedia dataset, 100-topic LDA, cluster of 16 × r3.2xl (source: Databricks): Spark/GraphX outperforms the multi-core implementation for very large datasets
32.
Other comparisons
▪ Frank McSherry's blog post compares different distributed PageRank implementations with a single-threaded Rust implementation on his laptop
▪ 1.5B edges for twitter_rv, 3.7B for uk_2007_05
▪ "If you are going to use a big data system for yourself, see if it is faster than your laptop."
33.
Other comparisons
▪ GraphChi, a single-machine large-scale graph computation engine developed at CMU, reports similar findings
34.
Now, is it faster?
No, unless your problem or dataset is huge :(
35.
To conclude...
▪ When distributing an algorithm, there are two opposing forces:
▪ 1) Communication overhead (shifting data from node to node)
▪ 2) More raw computing power available
▪ Whether one overtakes the other depends on the size of your problem
▪ Single-machine ML can be very efficient!
▪ Smarter algorithms can beat brute force
▪ Better data structures, input data formats, caching, optimization algorithms, etc. can all make a huge difference
▪ Good core implementation is a prerequisite to distribution
▪ Easy to get large machines!
36.
To conclude...
▪ However, distribution lets you easily throw more hardware at a problem
▪ Also, some algorithms/methods are better than others at minimizing the communication overhead
▪ Iterative distributed graph algorithms can be inefficient in that respect
▪ Can your problem fit on a single machine?
▪ Can your problem be partitioned?
▪ For SGD-like algos, parameter servers can be used to distribute while keeping this overhead to a minimum