This article is to inform you that I will be performing a review and trial of the Plumbr Java memory leak detector tool. The review will be available on this Blog, along with the sample (leaking) Java EE program I created for that purpose.
Plumbr tool creator’s mission
I can assure you that Java memory leaks can be quite complex to solve as there are different memory leak patterns and the analysis phase can be tricky; especially when you attempt to differentiate between leaking Java objects and a healthy memory footprint.
That being said, I agree with the Plumbr creators that Java memory leak analysis should not be reserved only for Java experts; this is one of the reasons I created this Blog in the first place: to help less experienced Java EE individuals learn key problem patterns such as Java memory leaks and how to resolve them.
Learning these advanced troubleshooting principles is always easier when we are equipped with the proper tools.
Plumbr’s primary objective is to assist you in your troubleshooting effort by learning about your Java / Java EE application and then detecting potential memory leak source(s) and location(s) within your implementation.
Now let’s begin...
Please bookmark this article and stay tuned for the full review and case study.
This article will provide you with a tutorial on how you can analyze a JVM memory leak problem by generating and analyzing a Sun HotSpot JVM HPROF Heap Dump file.
A real-life case study will be used for that purpose: a Weblogic 9.2 memory leak affecting the Weblogic Admin server.
Environment specifications
·Java EE server: Oracle Weblogic Server 9.2 MP1
·Middleware OS: Solaris 10
·Java VM: Sun HotSpot 1.5.0_22
·Platform type: Middle tier
Monitoring and troubleshooting tools
·Quest Foglight (JVM and garbage collection monitoring)
·jmap (hprof / Heap Dump generation tool)
·Memory Analyzer 1.1 via IBM support assistant (hprof Heap Dump analysis)
Step #1 – WLS 9.2 Admin server JVM monitoring and leak confirmation
The Quest Foglight Java EE monitoring tool was quite useful to identify a Java Heap leak from our Weblogic Admin server. As you can see below, the Java Heap memory is growing over time.
If you are not using any monitoring tool for your Weblogic environment, my recommendation is to at least enable verbose GC (-verbose:gc) on your HotSpot VM. Please see my Java 7 verbose:gc tutorial for more detailed instructions.
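For reference, a minimal set of HotSpot start-up arguments enabling verbose GC output could look like the following (the GC log path is an example to adapt to your environment):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:<path>/gc.log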
Step #2 – Generate a Heap Dump from your leaking JVM
Following the discovery of a JVM memory leak, the goal is to generate a Heap Dump file (binary format) by using the Sun JDK jmap utility.
** please note that jmap Heap Dump generation will cause your JVM to become unresponsive so please ensure that no more traffic is sent to your affected / leaking JVM before running the jmap utility **
<JDK HOME>/bin/jmap -heap:format=b <Java VM PID>
This command will generate a Heap Dump binary file (heap.bin) of your leaking JVM. The size of the file and the elapsed time of the generation process will depend on your JVM size and machine specifications / speed.
For our case study, a binary Heap Dump file of ~ 2 GB was generated in about 1 hour elapsed time.
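As a side note, on newer Sun JDK 1.6+ releases the equivalent command uses the -dump option; for example (the output file name is illustrative):

<JDK HOME>/bin/jmap -dump:format=b,file=heap.hprof <Java VM PID>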
With Sun HotSpot 1.5/1.6/1.7, a Heap Dump file will also be generated automatically on an OutOfMemoryError if you add -XX:+HeapDumpOnOutOfMemoryError to your JVM start-up arguments.
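For example, the following start-up arguments (the dump directory path is an example) will produce an HPROF file automatically on the next OutOfMemoryError:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<path to dump directory>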
Step #3 – Load your Heap Dump file in Memory Analyzer tool
It is now time to load your Heap Dump file in the Memory Analyzer tool. The loading process will take several minutes depending on the size of your Heap Dump and the speed of your machine.
Step #4 – Analyze your Heap Dump
The Memory Analyzer provides you with many features, including a Leak Suspect report. For this case study, the Java Heap histogram was used as a starting point to analyze the leaking objects and the source.
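As a side note, the Memory Analyzer also offers an OQL console if you prefer a query-based approach; a simple query such as the following (illustration only) lists all String instances for further exploration:

SELECT * FROM java.lang.String s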
For our case study, java.lang.String and char[] data were found to be the leaking objects. The question now is: what is the source of the leak, i.e., which objects hold references to these leaking objects? Simply right-click your leaking objects and select List Objects > with incoming references.
As you can see, javax.management.ObjectName instances were found to be the source of the leaking String and char[] data. The Weblogic Admin server communicates with and pulls stats from its managed servers via JMX / MBeans, which creates a javax.management.ObjectName for any MBean object type. The question now is why Weblogic 9.2 is not properly releasing such objects…
Following our Heap Dump analysis, a review of the Weblogic known issues was performed, which revealed the following Weblogic 9.2 bug:
·Weblogic Bug ID: CR327368
·Description: Memory leak of javax.management.ObjectName objects on the Administration Server used to cause OutOfMemory error on the Administration Server.
This finding was quite conclusive given the perfect match between our Heap Dump analysis, our WLS version and this known problem description.
Conclusion
I hope this tutorial, along with the case study, has helped you understand how to pinpoint the source of a Java Heap leak using jmap and the Memory Analyzer tool.
Please don’t hesitate to post any comment or question.
I also provide free Java EE consultation, so please simply email me a download link to your Heap Dump file so I can analyze it for you and create an article on this Blog describing your problem, root cause and resolution.
This article will provide you with a step-by-step tutorial on how you can pinpoint the root cause of Java class loader memory leak problems.
A recent class loader leak problem found in a Weblogic Integration 9.2 production system on AIX 5.3 (using the IBM Java VM 1.5) will be used as a case study, providing complete root cause analysis steps.
Environment specifications (case study)
·Java EE server: Oracle Weblogic Integration 9.2 MP2
·OS: AIX 5.3 TL9 64-bit
·JDK: IBM VM 1.5.0 SR6 - 32-bit
·Java VM arguments: -server -Xms1792m -Xmx1792m -Xss256k
·Platform type: Middle tier - Order Provisioning System
Monitoring and troubleshooting tools
·IBM Thread Dump (javacore.xyz.txt format)
·IBM AIX 5.3 svmon command
·IBM Java VM 1.5 Heap Dump file (heapdump.xyz.phd format)
Java class loader memory leaks can be quite hard to identify. The first challenge is to determine whether you are really facing a class loader leak vs. other Java Heap related memory problems. An OutOfMemoryError in your logs is often the first symptom, especially when the failing Thread is involved in a class loading call, creation of a Java Thread, etc.
If you are reading this article, chances are that you have already done some analysis and are suspecting a class loader leak as the source of your problem. I will still show you how to confirm that your problem is indeed due to a class loader leak.
Step #1 – AIX native memory monitoring and problem confirmation
Your first task is to determine if your memory problem and/or OutOfMemoryError is really caused by a depletion of your native memory segments. If you are not familiar with this, I suggest you first go through my other article, which explains how to monitor the native memory of your IBM Java VM process on AIX 5.3.
Using the AIX svmon command, the idea is to monitor the native memory and build a comparison matrix on a regular basis, as per below. In our case study production environment, the native memory capacity is 768 MB (3 segments of 256 MB).
As you can see below, the native memory is clearly leaking at a rate of 50-70MB daily.
Date        Weblogic Instance Name    Native Memory (MB)    Native memory delta increase (MB)
20-Apr-11   Node1                     530                   +54
20-Apr-11   Node2                     490                   +65
20-Apr-11   Node3                     593                   +70
20-Apr-11   Node4                     512                   +50

Date        Weblogic Instance Name    Native Memory (MB)
19-Apr-11   Node1                     476
19-Apr-11   Node2                     425
19-Apr-11   Node3                     523
19-Apr-11   Node4                     462
This approach will allow you to confirm that your problem is related to native memory and also understand the rate of the leak itself.
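For reference, the raw data points in the matrix above can be captured with a command along the following lines (the PID is an example; interpretation of the svmon output is covered in the referenced AIX monitoring article, and svmon typically reports memory in 4 KB pages, so convert the values to MB accordingly):

svmon -P <Java VM PID>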
Step #2 – Loaded classes and class loader monitoring
At this point, the next step is to determine whether your native memory leak is due to a class loader leak. Artifacts such as class descriptors, method names and Thread stacks are stored mainly in the native memory segments since they are more static in nature.
A simple way to keep track of your class loader stats is to generate a few IBM VM Thread Dumps on a daily basis, as explained below:
First, identify the PID of your Java process and generate a Thread Dump via the kill -3 command.
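For example (the PID shown is hypothetical):

ps -ef | grep java
kill -3 <Java VM PID>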
** kill -3 does not actually kill the process; it simply generates a snapshot of the running JVM and is very low risk for your production environment **
This command will generate an IBM VM format Thread Dump at the root of your Weblogic domain.
Now open the Thread Dump file with an editor of your choice and look for the following keyword:
CLASSES subcomponent dump routine
This section provides full detail on the number of class loaders in your Java VM, along with the number of class instances for each class loader.
You can keep track of the total count of class loader instances by running a quick grep command:
grep -c '2CLTEXTCLLOAD' <javacore.xyz.txt file>
You can keep track of the total count of class instances by running a quick grep command:
grep -c '3CLTEXTCLASS' <javacore.xyz.txt file>
For our case study, the number of class loaders and classes found was very large and increasing on a daily basis.
·3CLTEXTCLASS : ~ 52000 class instances
·2CLTEXTCLLOAD: ~ 21000 class loader instances
Step #3 – Loaded classes and class loader delta increase and culprit identification
This step requires you to identify the source of the increase. By looking at the class loader and class instances, you should be able to fairly easily identify a list of primary suspects that you can analyze further. These could be application class instances or even Weblogic classes. Leaking class instances could also be Java $Proxy instances created by dynamic class loading frameworks using the Java Reflection API.
In our scenario, we found an interesting increase in the number of $ProxyXYZ classes referenced by the Weblogic generic class loader.
This class type was by far the #1 contributor to our class instance count. Further monitoring of the native memory and class instances confirmed that the delta increase of class instances was due to a leak of $ProxyXYZ-related instances.
Such $Proxy instances are created by the Weblogic Integration 9.2 BPM (business process management) engine during our application business processes; they are normally managed via java.lang.ref.SoftReference data structures and garbage collected when necessary. The Java VM is guaranteed to clear any SoftReference prior to throwing an OutOfMemoryError, so such a leak could be a symptom of hard references still active on the associated temporary Weblogic generic class loader.
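To illustrate this point, the simplified sketch below (illustration only, not Weblogic Integration code; class and variable names are hypothetical) shows how a single surviving hard reference defeats the SoftReference safety net: the referent can no longer be cleared by the garbage collector, even under memory pressure.

import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.List;

// Illustration only: a lingering hard reference prevents a softly
// referenced object from ever being reclaimed.
public class SoftReferenceLeakDemo {

    // Hypothetical registry simulating a component that forgets to release its entries
    private static final List<Object> hardReferences = new ArrayList<Object>();

    public static void main(String[] args) {
        Object generatedArtifact = new byte[10 * 1024 * 1024]; // stand-in for a generated $Proxy graph
        SoftReference<Object> soft = new SoftReference<Object>(generatedArtifact);

        // As long as this hard reference survives, the SoftReference can never be
        // cleared and the memory is effectively leaked.
        hardReferences.add(generatedArtifact);

        generatedArtifact = null; // dropping the local reference alone is not enough
        System.out.println("Soft referent still reachable: " + (soft.get() != null));
    }
}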
The next step was to analyze the generated IBM VM Heap Dump file following an OutOfMemoryError condition.
Step #4 - Heap Dump analysis
A Heap Dump file is generated by default by the IBM Java VM 1.5 following an OutOfMemoryError. A Heap Dump contains information on your Java Heap memory, but it can also help you pinpoint class loader native memory leaks since it provides detail on the class loader objects, which act as pointers to the real native memory structures.
Find below a step-by-step Heap Dump analysis conducted using the ISA 4.1 tool (Memory Analyzer).
1) Open ISA, load the Heap Dump (heapdump.xyz.phd format) and select the Leak Suspects Report in order to have a look at the list of memory leak suspects
2) Once you find the source of the class instance leak, the easiest next step is to use the find-by-address function of the tool to deep dive further
** In our case, the key question was why the Weblogic class loader itself was still referenced and still keeping hard references to such $Proxy instances **
3) The final step was to deep dive into one sample Weblogic generic class loader instance (0x73C641C8) and attempt to pinpoint the culprit parent referrer
As you can see from the snapshot, the inner class weblogic/controls/container/ConfiguredServiceProvider$ProviderKey was identified as the primary suspect and potential culprit of the problem.
Potential root cause and conclusion
As per the Heap Dump analysis, this data structure appears to maintain a list of java.lang.ref.SoftReference objects for the generated class loader instances, but it also appears to hold hard references, preventing the Java VM from garbage collecting the unused Weblogic generic class loader instances and their associated $Proxy instances.
Further analysis of the Weblogic code by Oracle support will be required, along with some knowledge base research, as this could be a known issue of the WLI 9.2 BPM engine.
I hope this tutorial will help you in your class loader leak analysis when using an IBM Java VM. Please do not hesitate to post any comment or question on the subject.
Solution and next steps
We are currently discussing this problem with the Oracle support team, and I will keep you informed of the solution as soon as possible, so please stay tuned for further updates to this post.
This case study describes the complete root cause analysis of a Java Heap leak problem experienced with Oracle Weblogic 10 using the open source text search engine Apache Lucene 2.3.2.
Environment specifications
·Java EE server: Oracle Weblogic Portal 10.0
·OS: Solaris 10 64-bit
·JDK: Sun Java HotSpot 1.5.0_11-b03 32-bit
·Java VM arguments: -XX:PermSize=512m -XX:MaxPermSize=512m -XX:+UseParallelGC -Xms1536m -Xmx1536m -XX:+HeapDumpOnOutOfMemoryError
·Problem type: Java Heap memory leak observed via JConsole monitoring
A Java Heap leak was detected in our production environment following a capacity planning initiative for the Java EE environment and infrastructure, which involved close data monitoring and analysis.
This finding also explained why the support team had to restart the Weblogic environment on a weekly basis in order to avoid severe performance degradation. Degradation of a Java EE server's performance over time is often a symptom of a memory/resource leak.
Gathering and validation of facts
As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:
·What is the client impact? HIGH (if Weblogic is not restarted every week)
·Recent change of the affected platform? No
·Any recent traffic increase to the affected platform? No
·How long has this problem been observed? The problem had been identified for several months, but no corrective action had been taken until then
·Is the Java Heap depletion happening suddenly or over time? It was observed via VisualGC that the Java Heap (old generation space) is increasing over time with a full depletion rate of ~7 days
·Did a restart of the Weblogic server resolve the problem? No, Weblogic restarts are currently used as a mitigation strategy and workaround only
·Conclusion #1: The problem is related to a memory leak of the Java Heap space with a full Java Heap depletion / failure rate of roughly one week
Java Heap monitoring
The Java Heap old generation and Eden spaces were both monitored using the VisualGC 3.0 monitoring tool. The review of the VisualGC data was quite conclusive: our application is leaking in the Java Heap old generation space on a regular basis. The next logical step was the Heap Dump analysis.
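As a side note, if VisualGC is not available in your environment, the jstat utility shipped with the Sun JDK can provide similar old generation visibility from the command line; for example, sampling every 5 seconds (the PID is an example):

<JDK HOME>/bin/jstat -gcutil <Java VM PID> 5000

The O column reports the old generation utilization in percent; a value that keeps climbing after each full collection is consistent with the leak pattern described above.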
[VisualGC snapshots: JVM running for 6 days and 7 hours, 7 days and 8 hours, and 8 days and 7 hours]
Heap Dump analysis
Find below a step-by-step Heap Dump analysis conducted using the ISA 4.1 tool (Memory Analyzer).
1) Open ISA and load the Heap Dump (HPROF format)
2) Select the Leak Suspects Report in order to have a look at the list of memory leak suspects
3) As per below, the Apache Lucene org.apache.lucene.store.RAMInputStream object was identified as our primary leak suspect
4) The final step was to deep dive into one of the Lucene objects in order to identify the source of the leak
Conclusion
The primary Java Heap memory leak appears to originate from the Apache Lucene framework, due to java.lang.ThreadLocal variables still maintaining references to org.apache.lucene.store.RAMInputStream instances, each with a memory footprint of up to 30 MB.
Root cause and solution: Apache Lucene bug report #1383!
We did some research in the Apache issue tracking system and found bug LUCENE-1383, reported back in 2008, which correlated with our Heap Dump analysis findings. Find below a description of the problem:
"Java's ThreadLocal is dangerous to use because it is able to take a surprisingly very long time to release references to the values you store in it. Even when a ThreadLocal instance itself is GC'd, hard references to the values you had stored in it are easily kept for quite some time later.
While this is not technically a "memory leak", because eventually (when the underlying Map that stores the values cleans up its "stale" references) the hard reference will be cleared, and GC can proceed, its end behaviour is not different from a memory leak in that under the right situation you can easily tie up far more memory than you'd expect, and then hit unexpected OOM error despite allocating an extremely large heap to your JVM.”
“The patch adds CloseableThreadLocal. It's a wrapper around ThreadLocal that wraps the values inside a WeakReference, but then also holds a strong reference to the value (to ensure GC doesn't reclaim it) until you call the close method. On calling close, GC is then free to reclaim all values you had stored; regardless of how long it takes ThreadLocal's implementation to actually release its references.“
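To illustrate the pattern described in the patch notes above, find below a simplified sketch (an approximation only, not the actual Lucene CloseableThreadLocal source): values are handed to the ThreadLocal wrapped in a WeakReference, while a separate hard reference is kept per thread until close() is called, at which point the values become eligible for garbage collection regardless of when the ThreadLocal internals drop their stale entries.

import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Simplified approximation of the CloseableThreadLocal idea described above.
public class SimpleCloseableThreadLocal<T> {

    private final ThreadLocal<WeakReference<T>> local = new ThreadLocal<WeakReference<T>>();

    // Hard references kept per thread so the weakly referenced values are not
    // collected prematurely; cleared explicitly by close().
    private final Map<Thread, T> hardRefs =
            Collections.synchronizedMap(new HashMap<Thread, T>());

    public void set(T value) {
        local.set(new WeakReference<T>(value));
        hardRefs.put(Thread.currentThread(), value);
    }

    public T get() {
        WeakReference<T> ref = local.get();
        return ref == null ? null : ref.get();
    }

    public void close() {
        // Dropping the hard references lets GC reclaim the values even if the
        // ThreadLocal map still holds stale entries for a while.
        hardRefs.clear();
    }
}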
This problem was fixed starting in Apache Lucene version 2.4.
Solution and next steps
The solution will require an upgrade of Apache Lucene from version 2.3.2 to 3.1. The project is still in the early analysis phase and I will provide an update and results as soon as available.
This case study describes the complete steps from root cause analysis to resolution of a JDBC connection pool leak problem experienced with Oracle Weblogic 10.3 and using the open source Java persistence framework Hibernate 3.2.3.
This case study will also demonstrate the importance of best coding practices when using Hibernate and how a simple code problem can have severe consequences on a production environment.
JDBC Data Source Connection Pool depletion was observed in our production environment following an upgrade from Weblogic Portal 8.1 to Weblogic Portal 10.3.
Initial problem mitigation involved restarting all the Weblogic managed servers almost every hour, until the decision was taken to roll back to the previous Weblogic Portal 8.1 environment.
Gathering and validation of facts
As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:
·What is the client impact? HIGH
·Recent change of the affected platform? Yes, the application was migrated recently from Weblogic Portal 8.1 to Weblogic Portal 10.3. Also, the application code was migrated from EJB Entity Beans to Hibernate for the read and write operations to the Oracle database
·Any recent traffic increase to the affected platform? No
·How long has this problem been observed? Since right after the upgrade project deployment
·Is the JDBC Connection Pool depletion happening suddenly or over time? It was observed via our internal enterprise JMX client monitoring tool that the connection pool utilization is increasing over time at a fast rate. The current pool capacity is 25
·Did a restart of the Weblogic server resolve the problem? No, rollback to the previous environment was required
·Conclusion #1: The problem is related to a JDBC Connection Pool leak of the primary application JDBC data source
·Conclusion #2: This problem correlates with the Weblogic upgrade and the migration of our application code to Hibernate
JDBC Pool utilization
The history of JDBC Pool utilization was captured using our internal enterprise remote JMX monitoring tool. Such monitoring is achieved by connecting remotely to the Weblogic server and pulling detail from the runtime Weblogic MBeans:
StringBuffer canonicalObjNameBuffer = new StringBuffer();

// JDBC data source MBean metrics extraction
int activeConnectionCount = (Integer) adapter.getJMXService().getJMXConnection()
        .getMBeanAttribute(canonicalObjName, "ActiveConnectionsCurrentCount");
int leakedConnectionCount = (Integer) adapter.getJMXService().getJMXConnection()
        .getMBeanAttribute(canonicalObjName, "LeakedConnectionCount");
long reservedRequestCount = (Long) adapter.getJMXService().getJMXConnection()
        .getMBeanAttribute(canonicalObjName, "ReserveRequestCount");
int connectionDelayTime = (Integer) adapter.getJMXService().getJMXConnection()
        .getMBeanAttribute(canonicalObjName, "ConnectionDelayTime");
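The snippet above relies on our internal monitoring adapter API, so it will not compile outside that tool. For readers who want to reproduce a similar reading with the standard JMX remote API, find below a hedged sketch; the service URL, credentials and JDBCDataSourceRuntime ObjectName are examples that must be adapted to your own domain, and a WebLogic client library (e.g. wlclient.jar) is needed on the classpath.

import java.util.Hashtable;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import javax.naming.Context;

// Sketch only: reads ActiveConnectionsCurrentCount from a WebLogic runtime MBean server.
public class JDBCPoolMonitor {
    public static void main(String[] args) throws Exception {
        // Example values only; replace host, port and credentials with your own.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:iiop://localhost:7001/jndi/weblogic.management.mbeanservers.runtime");

        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.SECURITY_PRINCIPAL, "weblogic");
        env.put(Context.SECURITY_CREDENTIALS, "password");
        env.put(JMXConnectorFactory.PROTOCOL_PROVIDER_PACKAGES, "weblogic.management.remote");

        JMXConnector connector = JMXConnectorFactory.connect(url, env);
        try {
            MBeanServerConnection connection = connector.getMBeanServerConnection();

            // Hypothetical canonical name of the JDBC data source runtime MBean;
            // adapt it to the actual name exposed by your domain.
            ObjectName dataSourceRuntime = new ObjectName(
                "com.bea:ServerRuntime=ManagedServer1,Name=MyAppDataSource,Type=JDBCDataSourceRuntime");

            Integer activeConnections = (Integer) connection.getAttribute(
                dataSourceRuntime, "ActiveConnectionsCurrentCount");
            System.out.println("ActiveConnectionsCurrentCount=" + activeConnections);
        } finally {
            connector.close();
        }
    }
}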
The graph below represents a history of the ActiveConnectionsCurrentCount MBean metric. This corresponds to your current connection pool utilization.
The results were quite conclusive, as they revealed the connection pool leak along with some sudden surges leading to full depletion.
Error log review
The log review revealed the following error during these 2 episodes of connection leak surge. There was a very good correlation between the number of errors found in the logs and the increase in leaked connections observed in the graph. The error was thrown during execution of the Hibernate Session.flush() method due to null value injection.
Unexpected exception has occured:
org.hibernate.PropertyValueException: not-null property references a null or transient value: app.AppComponent.method
at org.hibernate.engine.Nullability.checkNullability(Nullability.java:72)
at org.hibernate.event.def.DefaultFlushEntityEventListener.scheduleUpdate(DefaultFlushEntityEventListener.java:263)
at org.hibernate.event.def.DefaultFlushEntityEventListener.onFlushEntity(DefaultFlushEntityEventListener.java:121)
at org.hibernate.event.def.AbstractFlushingEventListener.flushEntities(AbstractFlushingEventListener.java:196)
at org.hibernate.event.def.AbstractFlushingEventListener.flushEverythingToExecutions(AbstractFlushingEventListener.java:76)
at org.hibernate.event.def.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:26)
at org.hibernate.impl.SessionImpl.flush(SessionImpl.java:1000)
The exercise revealed the source of the leak in one of our application DAOs using Hibernate; the same DAO component that was generating errors in our logs due to null value injection.
Hibernate best practices and code review
Proper coding best practices are important when using Hibernate in order to ensure proper Session / Connection resource closure. Any Hibernate Session, along with its JDBC resources, must be closed in a finally{} block to properly handle any failure scenario. Such a finally{} block must also be shielded against any failure condition, by adding proper try{} catch{} blocks when applicable, in order to guarantee closure of the Hibernate session.
The code analysis revealed a problem with the Hibernate Session / Connection closure code: the execution of session.close() is bypassed whenever an Exception is thrown during execution of session.flush(), as shown in the snippet below.
finally {
if (session != null) {
session.flush();
session.close();
}
}
Root cause
The root cause of the connection leak was concluded to be a code defect within our DAO component: the code portion taking care of the Hibernate session flush was not embedded in a try{} catch{} block. Any failure of Session.flush() bypassed the closure of the Hibernate session, triggering a leak of the associated JDBC connection.
Solution and results
A code fix was applied to the problematic DAO component as per below and deployed to our production environment following proper testing.
[Code snapshots: BEFORE and AFTER the fix]
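For reference, find below a hedged sketch of the corrected closure logic (the surrounding DAO structure and variable names are illustrative, not the actual application code): session.flush() now runs inside the try block so that a flush failure can no longer bypass session.close(), and the close call itself is shielded inside the finally block.

// org.hibernate.Session / org.hibernate.HibernateException imports assumed;
// sessionFactory is an injected org.hibernate.SessionFactory (illustrative names)
Session session = null;
try {
    session = sessionFactory.openSession();
    // ... DAO read / write operations ...
    session.flush(); // a flush failure is now caught below instead of skipping close()
} catch (HibernateException he) {
    // log and/or rethrow as an application exception
    throw he;
} finally {
    if (session != null) {
        try {
            session.close(); // guaranteed closure; releases the underlying JDBC connection
        } catch (HibernateException closeFailure) {
            // log only; never let a close failure mask the original error
        }
    }
}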
The results were quite conclusive and did confirm a complete resolution of the JDBC connection pool leak.
Conclusion and recommendations
·Perform regular code walkthroughs of any new code, with an increased focus on code dealing with JDBC and third-party API implementations such as Hibernate. Always ensure that your JDBC and Hibernate resource closure code is bulletproof against any negative scenario.
·Perform regular monitoring of your production Weblogic JDBC Pool utilization to ensure no JDBC Pool leak is present (proactive monitoring). The same exercise should be done in your load testing environment before deploying a major release.
P-H is a Java EE consultant currently working for CGI Inc. in Canada. He has 10 years+ of experience developing and troubleshooting Java EE systems. Email: [email protected].