Wednesday, October 23, 2013

Using Java tools to diagnose and fix memory problems

In September 2011, I wrote the following post on my employer's internal blogging software because I wanted to use it as training material for offshore developers. I am replicating it here for my own use.

Introduction


A lot of the time, an application's performance is hampered by improper use of computing resources. These problems are difficult to diagnose because they are usually hidden when you test with small data sets, and it is hard to debug a performance problem when every test run takes a long time. Fortunately, Java provides some excellent tools that help you monitor and profile applications, which makes diagnosing these problems much easier.

In this post, I will present a case study of how we improved the performance of cashflow generation in CFP Lite using the out-of-the-box tools provided by Java.


Primer on CFP Lite Cashflow Generation


CFP Lite allows the user to generate cashflows by running loans through an Excel model. The loans that are imported into CFP Lite are fed into an Excel sheet, and the formulas on the sheet are executed. The formulas compute the cashflows, which CFP Lite extracts and saves into the database.

Since cashflow generation for each loan is independent of the others, CFP Lite generates cashflows in a thread pool. This is done to improve the throughput of the application. However, the problem was that the implementation wasn't designed with a large number of loans in mind, and performance would degrade much worse than linearly as the number of loans grew. The cashflow generation code went through two iterations of fixes. Here, I will present how we diagnosed and fixed cashflow generation.

To make things easier to understand, I wrote an application that mimics what CFP Lite was doing. This reduces the amount of code the reader has to digest; the intent is to make the issue easy to follow rather than getting bogged down in the specifics of CFP Lite. I have attached the sample applications to this post. The reader can run the application for him/herself and is encouraged to do so.

It is expected that the reader knows how thread pools work. For a detailed explanation of how thread pools work, please see my related blog post on thread pools.

Initial implementation

The initial implementation of CFP Lite had a main thread that started a thread pool. The main thread loaded the cohorts from the database and, for each cohort, submitted a task to the pool. The task would query the database for the loans belonging to the cohort, create an Excel model, generate cashflows using the model, and then return the cashflows.

The code

To mimic this behavior, I have implemented the application below:


import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.logging.Logger;

public class RunIt {
    static Logger logger = Logger.getLogger(RunIt.class.getName());

    /**
     * @param args
     */
    public static void main(String[] args) {
        long currentTime = System.currentTimeMillis();
        ExecutorService threadPool = Executors.newFixedThreadPool(5);
        List<Future<Cashflows>> futureCashflows = new ArrayList<Future<Cashflows>>(); // the list that contains the future cashflows

        logger.info("Submitting cohorts to executor");
        // submit all the cohorts to the thread pool
        for (long i = 0; i < 1L * 1000; i++) {
            Cohort cohort = getNextCohort(i);
            CashflowGenerator callable = new CashflowGenerator(cohort);
            Future<Cashflows> futureCashflow = threadPool.submit(callable);
            futureCashflows.add(futureCashflow);
        }
        logger.info("Submitted loans to queue.. now wait for results");
        // now the loans are processing.. keep getting the results in parallel and process them
        long i = 0;
        for (Future<Cashflows> futureCashflow : futureCashflows) {
            i++;
            Cashflows cashflow;
            try {
                cashflow = futureCashflow.get(); // this will block until the background threads have completed processing the task
                processCashflow(cashflow);
                if (i % 100 == 0) logger.info("Processed " + i + " cashflows");
            } catch (Exception e) {
                // log error and continue
            }
        }

        logger.info("Cashflow generation took " + (System.currentTimeMillis() - currentTime) + " ms");
        System.exit(0);
    }

    // ... helper methods (getNextCohort, processCashflow, etc.) omitted ...
}

public class CashflowGenerator implements Callable<Cashflows> {
    final Cohort cohort;

    public CashflowGenerator(Cohort cohort) {
        super();
        this.cohort = cohort;
    }

    public Cashflows call() throws Exception {
        // first load loans from database
        getLoansFromDatabase();

        // generate cashflows
        Cashflows cashflows = generateCashflowsUsingModel();

        return cashflows;
    }

    private Cashflows generateCashflowsUsingModel() {
        Cashflows cashflows = new Cashflows();
        initializeModel();
        while (balance > 0) {
            Cashflow cashflow = generateCashflow(r, monthlypayment, balance);
            cashflows.getCashflows().add(cashflow); // add a cashflow to the cashflows
            balance -= cashflow.getSched(); // remaining balance
        }
        return cashflows;
    }

    // ... helper methods (getLoansFromDatabase, initializeModel, generateCashflow, etc.) omitted ...
}

Here the RunIt class contains a main method that mimics the main thread of CFP Lite. It gets all the cohorts from the DB. For each cohort, it creates a CashflowGenerator task and submits the task to the thread pool. The task queries the database for the loans, creates the Excel model, then generates cashflows using the model and returns them. The returned cashflows are held by the executor in a Future object, and the main thread maintains a list of these Future objects. The main thread loops through all the futures and, as each result becomes available, gets the cashflows from it and stores them in the DB.

The problem

When you run this with 1000 cohorts, you get the following output


Sep 15, 2011 12:49:11 PM tryit.memoryleak.RunIt main
INFO: Submitting cohorts to executor
Sep 15, 2011 12:49:11 PM tryit.memoryleak.RunIt main
INFO: Submitted loans to queue.. now wait for results
Sep 15, 2011 12:49:15 PM tryit.memoryleak.RunIt main
INFO: Processed 100 cashflows
Sep 15, 2011 12:49:19 PM tryit.memoryleak.RunIt main
INFO: Processed 200 cashflows
Sep 15, 2011 12:49:22 PM tryit.memoryleak.RunIt main
INFO: Processed 300 cashflows
Sep 15, 2011 12:49:26 PM tryit.memoryleak.RunIt main
INFO: Processed 400 cashflows
Sep 15, 2011 12:49:30 PM tryit.memoryleak.RunIt main
INFO: Processed 500 cashflows
Sep 15, 2011 12:49:33 PM tryit.memoryleak.RunIt main
INFO: Processed 600 cashflows
Sep 15, 2011 12:49:37 PM tryit.memoryleak.RunIt main
INFO: Processed 700 cashflows
Sep 15, 2011 12:49:41 PM tryit.memoryleak.RunIt main
INFO: Processed 800 cashflows
Sep 15, 2011 12:49:44 PM tryit.memoryleak.RunIt main
INFO: Processed 900 cashflows
Sep 15, 2011 12:49:48 PM tryit.memoryleak.RunIt main
INFO: Processed 1000 cashflows
Sep 15, 2011 12:49:48 PM tryit.memoryleak.RunIt main
INFO: Cashflow generation took 36551 ms

It takes less than a second to submit the cohorts to the pool. It takes about 4 seconds to generate cashflows for 100 cohorts, and 1000 cohorts execute in about 36 s. (Please note that CFP Lite ran much slower than this. The mimic app runs much faster because it's doing less, and also I didn't want people to fall asleep. The intent is to show the problem, not to focus on actual performance numbers.)

Now, you would expect that 10K cohorts should take about 365 s, roughly 6 minutes, right? Let's see what happens with 10K cohorts. Here is what we get in the log after running it for 6 minutes:


Sep 15, 2011 1:18:29 PM tryit.memoryleak.RunIt main
INFO: Submitting cohorts to executor
Sep 15, 2011 1:18:30 PM tryit.memoryleak.RunIt main
INFO: Submitted loans to queue.. now wait for results
Sep 15, 2011 1:18:34 PM tryit.memoryleak.RunIt main
INFO: Processed 100 cashflows
Sep 15, 2011 1:18:37 PM tryit.memoryleak.RunIt main
INFO: Processed 200 cashflows
Sep 15, 2011 1:18:41 PM tryit.memoryleak.RunIt main
INFO: Processed 300 cashflows
Sep 15, 2011 1:18:45 PM tryit.memoryleak.RunIt main
INFO: Processed 400 cashflows
Sep 15, 2011 1:18:48 PM tryit.memoryleak.RunIt main
INFO: Processed 500 cashflows
Sep 15, 2011 1:18:52 PM tryit.memoryleak.RunIt main
INFO: Processed 600 cashflows
Sep 15, 2011 1:18:56 PM tryit.memoryleak.RunIt main
INFO: Processed 700 cashflows
Sep 15, 2011 1:19:00 PM tryit.memoryleak.RunIt main
INFO: Processed 800 cashflows
Sep 15, 2011 1:19:04 PM tryit.memoryleak.RunIt main
INFO: Processed 900 cashflows
Sep 15, 2011 1:19:08 PM tryit.memoryleak.RunIt main
INFO: Processed 1000 cashflows
Sep 15, 2011 1:19:12 PM tryit.memoryleak.RunIt main
INFO: Processed 1100 cashflows
Sep 15, 2011 1:19:16 PM tryit.memoryleak.RunIt main
INFO: Processed 1200 cashflows
Sep 15, 2011 1:19:20 PM tryit.memoryleak.RunIt main
INFO: Processed 1300 cashflows
Sep 15, 2011 1:19:24 PM tryit.memoryleak.RunIt main
INFO: Processed 1400 cashflows
Sep 15, 2011 1:19:27 PM tryit.memoryleak.RunIt main
INFO: Processed 1500 cashflows
Sep 15, 2011 1:19:31 PM tryit.memoryleak.RunIt main
INFO: Processed 1600 cashflows
Sep 15, 2011 1:19:35 PM tryit.memoryleak.RunIt main
INFO: Processed 1700 cashflows
Sep 15, 2011 1:19:40 PM tryit.memoryleak.RunIt main
INFO: Processed 1800 cashflows
Sep 15, 2011 1:19:44 PM tryit.memoryleak.RunIt main
INFO: Processed 1900 cashflows
Sep 15, 2011 1:19:48 PM tryit.memoryleak.RunIt main
INFO: Processed 2000 cashflows
Sep 15, 2011 1:19:52 PM tryit.memoryleak.RunIt main
INFO: Processed 2100 cashflows
Sep 15, 2011 1:19:56 PM tryit.memoryleak.RunIt main
INFO: Processed 2200 cashflows
Sep 15, 2011 1:20:00 PM tryit.memoryleak.RunIt main
INFO: Processed 2300 cashflows
Sep 15, 2011 1:20:04 PM tryit.memoryleak.RunIt main
INFO: Processed 2400 cashflows
Sep 15, 2011 1:20:08 PM tryit.memoryleak.RunIt main
INFO: Processed 2500 cashflows
Sep 15, 2011 1:20:12 PM tryit.memoryleak.RunIt main
INFO: Processed 2600 cashflows
Sep 15, 2011 1:20:16 PM tryit.memoryleak.RunIt main
INFO: Processed 2700 cashflows
Sep 15, 2011 1:20:20 PM tryit.memoryleak.RunIt main
INFO: Processed 2800 cashflows
Sep 15, 2011 1:20:25 PM tryit.memoryleak.RunIt main
INFO: Processed 2900 cashflows
Sep 15, 2011 1:20:29 PM tryit.memoryleak.RunIt main
INFO: Processed 3000 cashflows
Sep 15, 2011 1:20:33 PM tryit.memoryleak.RunIt main
INFO: Processed 3100 cashflows
Sep 15, 2011 1:20:38 PM tryit.memoryleak.RunIt main
INFO: Processed 3200 cashflows
Sep 15, 2011 1:20:42 PM tryit.memoryleak.RunIt main
INFO: Processed 3300 cashflows
Sep 15, 2011 1:20:46 PM tryit.memoryleak.RunIt main
INFO: Processed 3400 cashflows
Sep 15, 2011 1:20:50 PM tryit.memoryleak.RunIt main
INFO: Processed 3500 cashflows
Sep 15, 2011 1:20:55 PM tryit.memoryleak.RunIt main
INFO: Processed 3600 cashflows
Sep 15, 2011 1:20:59 PM tryit.memoryleak.RunIt main
INFO: Processed 3700 cashflows
Sep 15, 2011 1:21:12 PM tryit.memoryleak.RunIt main
INFO: Processed 3800 cashflows
Sep 15, 2011 1:21:17 PM tryit.memoryleak.RunIt main
INFO: Processed 3900 cashflows
Sep 15, 2011 1:22:15 PM tryit.memoryleak.RunIt main
INFO: Processed 4000 cashflows
Sep 15, 2011 1:24:14 PM tryit.memoryleak.RunIt main
INFO: Processed 4100 cashflows

It processed only 4100 cohorts in 6 minutes! What happened? It should be done by now! Also, notice that it starts off taking 4 seconds per 100 cohorts as usual. However, right around the 4-minute mark, it is suddenly taking a minute or more per 100 cohorts. Eventually, this thing slows down further and further. What happened after 3800 cohorts? Each task is doing exactly the same thing. It should take the same time.

ETA: I let it run while I was typing out this blog post, and 10K cohorts took 3851206 ms, about 64 minutes. That is more than a tenfold slowdown compared to the roughly 6 minutes you would expect from linear scaling.

The analysis

So, to investigate this problem, I fired up JConsole. JConsole is an application that ships with Java; if you have java on your path, you should have JConsole on your path too. It lets you look inside an application and tells you how it is behaving. When you start JConsole, it gives you the list of Java applications that you can monitor, identified by the name of the class that has the main thread. Note that JConsole is itself a Java application, so it shows itself. If you have a Java-based IDE running, it will show that too (in my case Eclipse). My main class is RunIt, so I select that and hit Connect. If you are monitoring a web app running in JBoss, it will show up as the JBoss bootstrap class.
jconsole-connect.JPG

After you connect, it shows you a dashboard that gives you an overview of what is happening under the hood.

jconsole-dashboard.JPG

The dashboard shows the CPU usage, the memory usage, the number of threads, and the number of loaded classes. The tabs at the top of the screen give a detailed view of each. Click on the Memory tab and you see this:

jconsole-memory.JPG

Here you see a plot of heap memory usage by default, and it also shows you how much memory is used by the eden space, survivor space, and old gen space (more on this later). Now, let's see what the heap memory looks like after running 10K cohorts.

jconsole-memory-full.JPG

Tada! 5 points for anyone who can guess what the problem is at this point

The memory is getting full. As you can see, the application started with low memory usage, but the memory started filling up quickly right after the 3-4 minute mark, at which point it is right under 1 GB. 1 GB is the maximum amount of memory that I gave to Java. Also, see that the Old Gen space is full (it's the third bar from the left). All these things indicate that the application is using too much memory. Another indicator is found by comparing the VM Summary and Memory screens:

VM Summary:
jconsole-vmsummarry.JPG

Memory:
jconsole GC time.JPG

Note that the Process CPU time says 20 minutes. This means the CPU spent 20 minutes in the application. Note that this isn't the time the app was running; that is the uptime. This is CPU time: if you have multiple threads running on a multi-CPU machine, JConsole adds up the time spent by all the CPUs, which is why Process CPU time can be more than uptime. Also see that the GC time in the Memory tab is 10 minutes. So the application had been running for 13 minutes, during which 20 minutes of CPU time were used, and 10 of those 20 minutes were spent in GC. That is excessive. The application would have been roughly twice as fast if no time had been spent in GC!
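As an aside, you can also read the same counters programmatically if you want to log them from inside the app. Below is a minimal sketch (not part of CFP Lite or the mimic app, just an illustration) that uses the standard java.lang.management MXBeans to print cumulative GC time and heap usage, the same numbers that JConsole displays.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Illustrative helper: prints the same GC and heap numbers that JConsole
// shows, using the standard JMX platform MXBeans.
public class GcStats {
    public static void dump() {
        long totalGcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            totalGcMillis += gc.getCollectionTime(); // cumulative time spent in this collector
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("Heap used " + (heap.getUsed() >> 20) + " MB of "
                + (heap.getMax() >> 20) + " MB, total GC time " + totalGcMillis + " ms");
    }
}

Calling something like GcStats.dump() every few hundred cashflows would have shown the GC time climbing, but JConsole gives you the same picture without touching the code.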
So, by looking at the memory usage and the GC time, we get a clue that the problem is excessive memory usage. The question is where the memory is going. But before we tackle that, why does excessive memory usage slow down the application? Shouldn't it just run out of memory? Why should I even think about reducing memory usage when the problem reported to me is that the application is slow? Shouldn't I try to optimize the database queries or increase the number of threads? Yes, those things might help, but they won't solve the core problem of memory. (Actually, when this problem was reported, Atul K did try increasing the threads, and what happened was that it took more time, not less. Yes, the application ran slower with more threads, which seems paradoxical.) Unless you open JConsole and look under the hood, you won't know exactly what is happening.

Why does using more memory slow down a Java application

The answer has to do with how GC works. GC is both the biggest boon and the biggest bane of Java. Before Java, when most high-performing applications were written in C/C++, the programmer had to manage his/her own memory. So, if memory was allocated, the code had to deallocate it. That led to huge problems, because the code would either a) forget to deallocate memory, which led to memory leaks, or b) use an object that was already deallocated, which led to crashes (this is also called the dangling pointer problem). It wasn't uncommon for an application to require a reboot every once in a while (to "clear" memory leaks), or to crash unexpectedly because of a dangling pointer. GC more or less solved these problems by handling the deallocation for the programmer. The code can allocate memory, but doesn't deallocate it. Once an object is out of scope and doesn't have any references from other objects, it becomes a candidate for deallocation. GC goes ahead and deallocates the memory when it runs.

The problem is that finding objects to deallocate is not a simple thing. Let's say you have a List A that contains an object B. Internally, B contains three objects C, D and E. Let's say D contains a back reference to its parent object B, and E contains a reference to itself. Now let's say you remove B from A. The programmer would expect B to get deallocated because there are no references to it. However, it's not that simple. GC has to figure out that when B is deallocated, it has to deallocate C, D and E too, so it has to walk the object tree to find all the related objects that should be deallocated. Also, B refers to D and D refers back to B: there is a circular reference to B. To properly deallocate B, GC has to detect the circular reference and ignore it. E has a reference to itself, so GC has to detect that circular reference too. This requires multiple passes through the object tree. Proper implementation of GC is not a simple topic, and there are still people in academia doing research on improving GC implementations. It would be out of scope (and of little use) for most of us programmers to know all the nitty-gritty details. Suffice it to say, what we as programmers need to know is this: GC is costly.
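To make that concrete, here is a small toy example (the class names simply mirror the A/B/C/D/E objects described above; this is not CFP Lite code). It uses a WeakReference to show that the whole cluster becomes collectable once the list stops referencing B, even though B and D point at each other and E points at itself.

import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

public class CycleDemo {
    static class C { }
    static class D { final B parent; D(B parent) { this.parent = parent; } } // back reference D -> B
    static class E { E self = this; }                                        // self reference
    static class B { C c = new C(); D d = new D(this); E e = new E(); }

    public static void main(String[] args) {
        List<B> a = new ArrayList<B>();
        B b = new B();
        a.add(b);

        WeakReference<B> watcher = new WeakReference<B>(b); // does not keep b alive
        a.remove(b);   // list A no longer references B
        b = null;      // drop our own strong reference as well

        System.gc();   // only a hint, but a collection usually runs here
        System.out.println("B collected? " + (watcher.get() == null));
        // Despite the B<->D cycle and E's self reference, nothing reachable
        // from a GC root points at this cluster anymore, so it can be reclaimed.
    }
}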
Another thing to note is that GC becomes slower as you use more memory. When GC needs to find objects to collect, it has to walk the object tree. If you have 1000 objects in memory that don't need to be collected, GC still has to walk through those objects. If you have 10000 objects, it's going to take roughly 10 times longer to run GC.
Since the people who design JVMs know that GC is costly, they make the JVM lazy about GC. In an application that is idle most of the time, GC runs periodically; it doesn't run every time an object goes out of scope. In an application that is actively creating and releasing objects, however, GC runs when it absolutely has to. What the JVM does is hold off on collecting objects until memory usage reaches a limit, and then run GC. If you look at the memory graph above, you see a pattern of memory increasing and then dropping slightly. Each drop is where GC ran a collection and released objects. This is done to improve performance. However, the side effect is that GC runs more often the more memory you use. If your memory usage is right under the maximum given to Java, the JVM hits the limit more often and runs GC more often. To draw an analogy, this is like bouncing a basketball: if you hold your hand at chest level and then start moving it down, the ball bounces faster and faster even though you control the speed of descent, because the ball has less distance to travel the closer it is to the ground. Similarly, GC "bounces" more often when memory usage is closer to the top.
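If you want to see that sawtooth without the real application, a toy like the one below will produce it (again, just an illustrative sketch, not CFP Lite code). It keeps retaining small byte arrays and drops the whole list once used heap crosses half of the maximum; run it with JConsole attached and the heap graph rises and falls in exactly this pattern.

import java.util.ArrayList;
import java.util.List;

// Illustrative toy: used heap climbs while the list retains arrays,
// then falls back after the list is cleared and the next collection runs.
public class Sawtooth {
    public static void main(String[] args) throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        List<byte[]> retained = new ArrayList<byte[]>();
        for (int i = 0; i < 100000; i++) {
            retained.add(new byte[100 * 1024]); // retain ~100 KB per iteration
            long usedMb = (rt.totalMemory() - rt.freeMemory()) >> 20;
            if (usedMb > (rt.maxMemory() >> 20) / 2) {
                retained.clear(); // everything in the list becomes garbage
            }
            if (i % 1000 == 0) {
                System.out.println("used ~" + usedMb + " MB");
                Thread.sleep(50); // slow it down so the graph is easy to watch
            }
        }
    }
}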
So, the end result of GC being costly, becoming costlier as you use more memory, and running more often is that the application slows down as it uses more memory. Eventually, if GC is not able to release any memory, an OutOfMemoryError is thrown. However, that happens long after the application has slowed to a crawl. The first clue for most memory problems in Java apps is almost always "my application is running slower." Sometimes this leads to unexpected side effects in typical applications. For example, let's say you have a bit of code that opens a connection and runs SQL queries in a loop. If the code inside the loop is using too much memory, it will get slower and slower with every iteration. At some point, the database connection times out, and the code gets a SQLException. The code logs the exception and exits. When it exits, all the memory that it allocated inside the loop becomes unreachable, and GC clears it up in the next collection. Suddenly the memory usage becomes low, and the memory problem "heals" itself. The sysadmin might look at the memory usage after the problem happens, see that it is low, and mistakenly conclude that memory is not a problem. When the problem is reported to the developer, the stack trace says "connection timed out," along with a note from the admin that memory usage is low. So the developer might look at the SQL queries, and may spend hours if not days tuning them. That might make things a bit faster, but it doesn't solve the root cause. The misleading stack trace leads the developer down the wrong rabbit hole.

The only way to catch this is for the developer to test performance under load, with JConsole running, before the application is released. Nothing beats running JConsole yourself.

The fix

After we determined that memory was an issue, we had to find out where exactly the memory was being used. Please note that the analysis done through JConsole doesn't pinpoint the exact problem, but it leads us down the correct path. There are several ways to find memory problems, and there are tools we can use to look inside the heap; I will cover those later. In this case, Atul K went through the code and found that the cashflows were being stored in the Future objects, and the Future objects aren't released until the entire cashflow processing is done. So, we changed the code so that the task writes the cashflows to a CSV file. The main thread takes the CSV files and uses SQL*Loader to load them into the database after all tasks are complete. This had two advantages:
a) the Future object no longer needs to hold the cashflows
b) part of the processing that was done by the main thread is now done inside the task. This means it is done in parallel, which makes the application faster

Actually, there was no need to keep a Future object anymore, and we could have used Runnable instead of Callable. However, since we were under a time constraint (it was the evening before the release), we decided to make the quick change that provided the most improvement and left the Future objects in place. A rough sketch of what the revised task looks like follows.
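This is only an illustrative sketch of the direction of the fix, based on the description above and built on the same mimic classes (Cohort, Cashflows, Cashflow); it is not the actual CFP Lite code, and the accessor cohort.getId() is invented for the example. The point is that the task writes its cashflows to a CSV file and returns only the file, so the Future no longer pins a large Cashflows object in memory.

import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.concurrent.Callable;

// Illustrative sketch only: the task no longer returns the cashflows
// themselves; it writes them to a CSV file and returns the file, so the
// Future held by the main thread retains almost nothing.
public class CsvCashflowGenerator implements Callable<File> {
    final Cohort cohort;

    public CsvCashflowGenerator(Cohort cohort) {
        this.cohort = cohort;
    }

    public File call() throws Exception {
        getLoansFromDatabase();                               // load loans for this cohort
        Cashflows cashflows = generateCashflowsUsingModel();  // same elided helpers as before

        // write the cashflows out instead of keeping them on the heap
        File csv = File.createTempFile("cashflows-" + cohort.getId() + "-", ".csv");
        PrintWriter out = new PrintWriter(new FileWriter(csv));
        try {
            for (Cashflow cashflow : cashflows.getCashflows()) {
                out.println(cohort.getId() + "," + cashflow.getSched());
            }
        } finally {
            out.close();
        }
        return csv;   // the main thread later feeds these files to SQL*Loader
    }

    // getLoansFromDatabase() and generateCashflowsUsingModel() are the same
    // elided helpers as in the original CashflowGenerator above.
    private void getLoansFromDatabase() { /* ... */ }
    private Cashflows generateCashflowsUsingModel() { /* ... */ return new Cashflows(); }
}

With this shape, the main thread's list of futures only holds small File references, so the heap stays flat no matter how many cohorts are submitted.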