What is garbage collection visualization?
By Brian Moore
In short, “garbage collection visualization” (hereafter shortened to “gcviz”) is the process of turning gc.log[1]
into x/y time-series scatter plots (pictures). That is, turning garbage
collector (GC) logging output into two types of charts, GC event
charts:
A GC event occurs when one of the JVM-configured collectors
operates on the heap, and a heap size chart shows the size of the heap
after a GC event. Both of these types of information are present in
gc.log[1].
Setting the context for gcviz
Earlier, Shobana wrote
about the metadata that Video Metadata Services (VMS) maintains
in-memory for clients to access along extremely low-latency
(microseconds), high-volume (10^11 daily requests), large-data
(gigabytes) code paths. Being able to serve-up data in this setting
requires a deep understanding of Java garbage collection.
Netflix
is (mostly) a Java shop. Netflix deploys Java applications into Apache
Tomcat, an application server (itself written in Java). Tomcat, the
Netflix application, and all of the libraries that are dependencies of
the Netflix application allocate from the same heap. Netflix Java
applications are typically long-running (weeks/months) and have large
heaps (typically tens of gigabytes).
At
this scale, the overhead required to manage the heap is significant,
and garbage collection pauses that last more than 500 milliseconds will
almost certainly interfere with network activity.
Any
GC event that “stops the world” (STW) will pause Tomcat, the
application and the libraries the application needs to run. New inbound
connections cannot be established (the Tomcat accept thread is blocked)
and I/O on outbound connections will stall until the GC completes (each
java thread in a STW event is at a safepoint and unable to accomplish
meaningful work). It is therefore important to ensure that any required
GC pauses are kept as small as possible to ensure that an application
remains available. GC pauses are often seen as the cause of an issue
when in fact they are a side-effect of the allocation behavior of the
combination of Tomcat, the application and the libraries the application
uses.
Before gcviz was written, Netflix determined the influence (or absence) of GC events in outages via several methods:
- hand-crafting excel spreadsheets
- using PrintGCStats/PrintGCFixup/GCHisto
- visually skimming gc.log, looking for “long” stop-the-world events
While
these methods were effective, they were time-consuming and difficult
for many Netflix developers. Inside Netflix, gcviz has been incorporated
into Netflix’s base AMI and integrated into Asgard, making visual GC event analysis of any application trivial (a click of a button) for all Netflix developers.
Why is gcviz important?
gcviz is important for several reasons:
- Convenient: The developer loads the event and heap chart pages directly into their browser by clicking on a link in Asgard. gcviz does all the processing behind the scenes. This convenience allows us to quickly reinforce or reject the oft-repeated claim: bad application behavior is "because of long GC pauses".
- Visual: Images can convey time-series information more quickly and densely than text.
- Causation: gcviz can help application developers establish or refute a correlation between time-of-day-based observed application degradation and JVM-wide GC behavior.
- Clarity: The semantics of each of the different GC event types (concurrent-mark, ParNew) can be made implicit in the image instead of being given equal weight in the text of gc.log. This is useful because, for example, a long running concurrent-sweep should be no cause for alarm because it runs in parallel with the application program.
- Iterative: gcviz allows for quick interpretation of experimental results. What effect does modifying a particular setting or algorithm have on the heap usage or GC subsystem? gcviz allows one to quickly and visually understand any impact a GC-related change has made. gcviz is now being effectively used in canary deployments within Netflix where GC behavior is compared to an existing baseline.
Prior (and contemporary) Art
Netflix
is not the first organization that has seen the benefits of visualizing
garbage collection data. Other projects involved in the visualization
or interpretation of garbage collection data include the following:
- AppDynamics http://www.appdynamics.com/
- jClarity Censum http://www.jclarity.com/products/censum/
- PrintGCStats http://java.net/projects/printgcstats/
- verbosegcanalyzer http://code.google.com/p/verbosegcanalyzer/
Why develop another solution?
While
the tools mentioned above work well under their own conditions, Netflix
had other requirements and constraints to consider and realized that
none of the existing tools met all of Netflix’s needs. The following
requirements were considered and followed during the design and
implementation of gcviz. gcviz needs to:
- Run outside the context of the application under analysis, but on each instance on which the application runs.
- Use gc.log as the source-of-truth
- historical analysis (no need to attach to a running vm)
- filesystem/logs can be read even when HTTP port is busy and/or application threads are blocked
- Read gc logs with the JVM options that Netflix commonly (but not universally) uses:
- -verbose:gc
- -verbose:sizes
- -Xloggc:/apps/tomcat/logs/gc.log
- -XX:+PrintGCDetails
- -XX:+PrintGCDateStamps
- -XX:+PrintTenuringDistribution
- -XX:-UseGCOverheadLimit
- -XX:+ExplicitGCInvokesConcurrent
- -XX:+UseConcMarkSweepGC
- -XX:+CMSScavengeBeforeRemark
- -XX:+CMSParallelRemarkEnabled
- Gracefully handle any arguments passed to the JVM.
- Remain independent of any special terminal/display requirements (must be able to run inside and outside Netflix’s base AMI without modification)
- Be able to run standalone to leverage special display capabilities if they are present
- Be accessible from Asgard
- Be dead-simple to use, requiring no specialized gc knowledge
- Correlate netflix-internal events with gc activity. One example of a netflix-internal event is a cache refresh.
- Retain its reports over time to enable long-term comparative analysis of a given application.
A Small Example: GC Event Chart
To
understand how the GC event charts are constructed, it may be helpful
to consider an example. In the picture below, gcviz visualizes a gc.log
that contains three gc events:
- a DefNew GC event at 60 seconds after JVM boot,
- a DefNew GC event at 120 seconds after JVM boot, and
- a DefNew GC event at 180 seconds after JVM boot
These
three events took 100 milliseconds, 200 milliseconds, and 100
milliseconds, respectively. These three events would produce a chart
with three DefNew “dots” on it. The “x” value of the dot would be the
seconds since JVM boot, converted to absolute time, and the “y” value of
the dot would be the duration in seconds. The red color of the dot
indicates that DefNew is a “stop-the-world” GC event. Hopefully this is
more clear in pictorial form:
Real-world GC Event Charts
In
mid-2012, Netflix used gcviz to analyze some performance problems a
class of applications was having. These performance problems had 80-120
second garbage collection pauses as one of its symptoms. The gcviz event
chart looked like this:
The
pauses had been reduced from 120 seconds to under 5 seconds. In
addition, any Netflix engineer could quickly see that the symptom had
been eliminated without needing to dig through gc.log.
Real-world Heap Size Charts
Real-world Heap Size Charts
In
another mid-2012 event, the heap usage of a Netflix application went
higher than the designers of that application intended. The application
designers intended memory usage to peak around 15GB. The chart below
shows that heap usage peaked at 2.0 times 10^7 kilobytes (19.07
gigabytes) and it also shows “flumes” for periods of time where the heap
usage went above its previous/standard high-water mark:
after
that problem was identified and fixed the high-water mark returned to
1.6 times 10^7 kilobytes (15.23 gigabytes) and the flumes reduced and
the heap usage pattern became far more regular. In addition the
low-water mark dropped from 11.44 gigabytes to 9.54 gigabytes:
JVM collector compatibility
Currently, gcviz supports all of the HotSpot/JDK 7 collectors, with the exception of G1.
Additionally collected data
In
addition to collecting gc.log data, gcviz collects additional system
information (cpu usage, network usage, underlying virtual memory usage,
etc.) to help correlate gc events with application events. In addition
gcviz can be configured to capture a jmap histogram of live objects by
object count, bytes required and class name.
Open-sourcing details
gcviz has been open sourced under the Apache License, version 2.0 at https://github.com/Netflix/gcviz
Conclusion
As
a company, Netflix considers data visualization of paramount
importance. Most of Netflix’s major systems contain significant
visualization components (for example, the Hystrix Dashboard and Pytheas).
Most visualization at Netflix occurs on continuous data, but
visualizing discrete data has its place too. Being able to quickly
differentiate between a GC-indicated allocation problem and other types
of error conditions has been valuable in operating the many services
required to bring streaming TV and movies into living rooms all across
the world.
Notes
[1]
I didn’t want to lose you by quoting gc.log so early in the blog post!
gc.log is a file created by the JVM option (for example)
-Xloggc:/apps/tomcat/logs/gc.log. Specifying this option is recommended
and common inside Netflix. Its contents are something like the
following:
2013-01-01T18:30:16.651+0000: 2652735.877: [CMS-concurrent-sweep-start]
2013-01-01T18:30:21.777+0000: 2652741.003: [CMS-concurrent-sweep: 5.126/5.126 secs] [Times: user=5.13 sys=0.00, real=5.12 secs]
2013-01-01T18:30:21.777+0000: 2652741.004: [CMS-concurrent-reset-start]
2013-01-01T18:30:21.842+0000: 2652741.068: [CMS-concurrent-reset: 0.065/0.065 secs] [Times: user=0.06 sys=0.00, real=0.07 secs]
2013-01-01T19:28:47.041+0000:
2656246.267: [GC 2656246.267: [ParNewDesired survivor size 786432000
bytes,new threshold 15 (max 15)- age 1: 26395600 bytes, 26395600
total- age 2: 1376 bytes, 26396976 total- age 3: 4184
bytes, 26401160 total- age 4: 9591072 bytes, 35992232 total-
age 5: 747344 bytes, 36739576 total- age 6: 18239512 bytes,
54979088 total- age 7: 7398216 bytes, 62377304 total- age
8: 4702664 bytes, 67079968 total- age 9: 5584 bytes,
67085552 total- age 10: 3728 bytes, 67089280 total- age 11:
2416 bytes, 67091696 total- age 12: 10838496 bytes,
77930192 total- age 13: 1682368 bytes, 79612560 total- age 14:
17756736 bytes, 97369296 total- age 15: 6775352 bytes,
104144648 total: 6103985K->124729K(10752000K), 0.1872850 secs]
11541109K->5564874K(29184000K), 0.1874910 secs] [Times: user=0.72
sys=0.00, real=0.19 secs]
2013-01-01T19:28:47.229+0000:
2656246.456: [GC [1 CMS-initial-mark: 5440145K(18432000K)]
5583733K(29184000K), 0.1454260 secs] [Times: user=0.15 sys=0.00,
real=0.15 secs]
2013-01-01T19:28:47.375+0000: 2656246.602: [CMS-concurrent-mark-start]
2013-01-01T19:29:02.574+0000:
2656261.800: [CMS-concurrent-mark: 15.195/15.199 secs] [Times:
user=15.24 sys=0.03, real=15.19 secs]
2013-01-01T19:29:02.574+0000: 2656261.801: [CMS-concurrent-preclean-start]
2013-01-01T19:29:02.638+0000:
2656261.864: [CMS-concurrent-preclean: 0.061/0.064 secs] [Times:
user=0.06 sys=0.00, real=0.07 secs]
2013-01-01T19:29:02.638+0000:
2656261.865: [CMS-concurrent-abortable-preclean-start] CMS: abort
preclean due to time 2013-01-01T19:29:08.589+0000: 2656267.816:
[CMS-concurrent-abortable-preclean: 5.946/5.951 secs] [Times: user=5.96
sys=0.01, real=5.95 secs]






