In Sep 2011, I made the following blogpost on my employer's internal blogging software because I wanted to use this as training material for offshore developers. I am replicating it here for my own use
Introduction
A lot of times an application's performance is hampered because of improper use of computing resources. It becomes difficult to diagnose these problems because usually these problems are hidden when you test with small data sets. Also, it is hard to debug a performance problem when it takes time to execute a test. Fortunately, Java provides some excellent tools that help you monitor and profile applications. This helps in diagnosis of the problems.
In this post, I will present a case study regarding how we improved performance of cashflow generation in CFP Lite using out-of-the-box tools provided by Java
Primer on CFP Lite Cashflow Generation
CFP Lite allows the user to generate cashflows by running them through an excel model. The loans that are imported into CFP Lite are fed into an excel sheet, and the formulas on the excel sheet are executed. The formulas compute cashflows. CFP Lite extracts the cashflows and saves them into the database.
Since, cashflow generation for loans are independent of each other, CFP Lite generates cashflows in a thread pool. This is done to improve the throughput of the application. However, the problem was that the impelmentation wasn't designed to take large number of loans into account and performance would degrade expenentially with number of loans. The cashflow generation code went through 2 iterations of fixes. Here, I will present how we diagnosed and fixed cashflow generation.
To make things easier for the reader to understand, I made an appplication that mimics what CFP Lite was doing. This is done to reduce the amount of code that the reader has to understand. The intent is so that the application makes it easier for the reader to understand the issue instead of getting bogged down by the specifics of CFP Lite. I have attached the sample applications to this post. The reader can run the application for him/herself and is actually encouraged to do so.
It is exepcted that the reader knows how thread pools work. For detailed explanation of how thread pools work, please see my related blog post on thread pools
Initial implementation
The initial implementation of CFP Lite had a main thread that started a Thread pool. The main thread loaded the cohorts from the database, and for each cohort created a task to be executed to the pool. The task would query the database for the loans belonging to the cohorts, create an excel model, generate cashflows using the excel model and then return the cashflows
The code
To mimic this behavior, I have implemented the application below
- public class RunIt {
- static Logger logger = Logger.getLogger(RunIt.class.getName());
- /**
- * @param args
- */
- public static void main(String[] args) {
- long currentTIme = System.currentTimeMillis();
- ExecutorService threadPool = Executors.newFixedThreadPool(5);
- List
> futureCashflows = new ArrayList >(); // the list that contains the cashflows - logger.info("Submitting cohorts to executor");
- //submit all the cohorts to the thread pool
- for(long i=0; i< 1L*1000; i++) {
- Cohort cohort = getNextCohort(i);
- CashflowGenerator callable = new CashflowGenerator(cohort);
- Future
futureCashflow = threadPool.submit(callable); - futureCashflows.add(futureCashflow);
- }
- logger.info("Submitted loans to queue.. now wait for results");
- // now the loans are processing.. keep getting the results in parallel and process them
- long i=0;
- for(Future
futureCashflow: futureCashflows) { - i++;
- Cashflows cashflow;
- try {
- cashflow = futureCashflow.get();// this will block until the background threads have completed processing the task
- processCashflow(cashflow);
- if(i%100==0) logger.info("Processed " + i + " cashflows");
- } catch (Exception e) {
- // log error and continue
- }
- }
- logger.info("Cashflow generation took " + (System.currentTimeMillis()-currentTIme) + " ms");
- System.exit(0);
- }
- .
- .
- .
- .
- .
- }
- public class CashflowGenerator implements Callable
{ - final Cohort cohort;
- public CashflowGenerator(Cohort cohort) {
- super();
- this.cohort = cohort;
- }
- public Cashflows call() throws Exception {
- // first load loans from database
- getLoansFromDatabase();
- // generate cashflows
- Cashflows cashflows = generateCashflowsUsingModel();
- return cashflows;
- }
- private Cashflows generateCashflowsUsingModel() {
- Cashflows cashflows = new Cashflows();
- initializeModel();
- while(balance>0) {
- Cashflow cashflow = generateCashflow(r, monthlypayment, balance);
- cashflows.getCashflows().add(cashflow);// add a cashflow to the cashflows
- balance-=cashflow.getSched(); // remaining balance
- }
- return cashflows;
- }
- .
- .
- .
- .
- .
- }
Here the Runit class contains a main method that mimics the main thread of CFP Lite. It gets all the cohorts from the DB. For each cohort it creates aCashflowGenerator task and submits the task to the thread pool. The task queries the database for the loans, creates the excel model, then generates cashflows using the model and returns the cashflows. The cashflows are stored by the executor in a future object, and the main thread maintains a list of future objects. The main thread loops through all the future results, and when they are available, gets the cashflows from the result and stores it into the DB
The problem
When you run this with 1000 cohorts, you get the following output
- Sep 15, 2011 12:49:11 PM tryit.memoryleak.RunIt main
- INFO: Submitting cohorts to executor
- Sep 15, 2011 12:49:11 PM tryit.memoryleak.RunIt main
- INFO: Submitted loans to queue.. now wait for results
- Sep 15, 2011 12:49:15 PM tryit.memoryleak.RunIt main
- INFO: Processed 100 cashflows
- Sep 15, 2011 12:49:19 PM tryit.memoryleak.RunIt main
- INFO: Processed 200 cashflows
- Sep 15, 2011 12:49:22 PM tryit.memoryleak.RunIt main
- INFO: Processed 300 cashflows
- Sep 15, 2011 12:49:26 PM tryit.memoryleak.RunIt main
- INFO: Processed 400 cashflows
- Sep 15, 2011 12:49:30 PM tryit.memoryleak.RunIt main
- INFO: Processed 500 cashflows
- Sep 15, 2011 12:49:33 PM tryit.memoryleak.RunIt main
- INFO: Processed 600 cashflows
- Sep 15, 2011 12:49:37 PM tryit.memoryleak.RunIt main
- INFO: Processed 700 cashflows
- Sep 15, 2011 12:49:41 PM tryit.memoryleak.RunIt main
- INFO: Processed 800 cashflows
- Sep 15, 2011 12:49:44 PM tryit.memoryleak.RunIt main
- INFO: Processed 900 cashflows
- Sep 15, 2011 12:49:48 PM tryit.memoryleak.RunIt main
- INFO: Processed 1000 cashflows
- Sep 15, 2011 12:49:48 PM tryit.memoryleak.RunIt main
- INFO: Cashflow generation took 36551 ms
It takes less than a second to sumbit the cohorts to the pool. It takes about 4 secs to generate cashflows for 100 cohorts, and 1000 cohorts execute in 36s (please note that CFP Lite ran much slower than this. The mime app ruhs much faster because it's doing less and also I didn't want people to fall asleep. The intent is to show the problem, and not focus on actual performance numbers)
Now, you would expect that 10K loans should take 360s = 5 mins, right? Let's see what happens with 10K loans. Here is what we get in the log after running it for 6 minutes
- Sep 15, 2011 1:18:29 PM tryit.memoryleak.RunIt main
- INFO: Submitting cohorts to executor
- Sep 15, 2011 1:18:30 PM tryit.memoryleak.RunIt main
- INFO: Submitted loans to queue.. now wait for results
- Sep 15, 2011 1:18:34 PM tryit.memoryleak.RunIt main
- INFO: Processed 100 cashflows
- Sep 15, 2011 1:18:37 PM tryit.memoryleak.RunIt main
- INFO: Processed 200 cashflows
- Sep 15, 2011 1:18:41 PM tryit.memoryleak.RunIt main
- INFO: Processed 300 cashflows
- Sep 15, 2011 1:18:45 PM tryit.memoryleak.RunIt main
- INFO: Processed 400 cashflows
- Sep 15, 2011 1:18:48 PM tryit.memoryleak.RunIt main
- INFO: Processed 500 cashflows
- Sep 15, 2011 1:18:52 PM tryit.memoryleak.RunIt main
- INFO: Processed 600 cashflows
- Sep 15, 2011 1:18:56 PM tryit.memoryleak.RunIt main
- INFO: Processed 700 cashflows
- Sep 15, 2011 1:19:00 PM tryit.memoryleak.RunIt main
- INFO: Processed 800 cashflows
- Sep 15, 2011 1:19:04 PM tryit.memoryleak.RunIt main
- INFO: Processed 900 cashflows
- Sep 15, 2011 1:19:08 PM tryit.memoryleak.RunIt main
- INFO: Processed 1000 cashflows
- Sep 15, 2011 1:19:12 PM tryit.memoryleak.RunIt main
- INFO: Processed 1100 cashflows
- Sep 15, 2011 1:19:16 PM tryit.memoryleak.RunIt main
- INFO: Processed 1200 cashflows
- Sep 15, 2011 1:19:20 PM tryit.memoryleak.RunIt main
- INFO: Processed 1300 cashflows
- Sep 15, 2011 1:19:24 PM tryit.memoryleak.RunIt main
- INFO: Processed 1400 cashflows
- Sep 15, 2011 1:19:27 PM tryit.memoryleak.RunIt main
- INFO: Processed 1500 cashflows
- Sep 15, 2011 1:19:31 PM tryit.memoryleak.RunIt main
- INFO: Processed 1600 cashflows
- Sep 15, 2011 1:19:35 PM tryit.memoryleak.RunIt main
- INFO: Processed 1700 cashflows
- Sep 15, 2011 1:19:40 PM tryit.memoryleak.RunIt main
- INFO: Processed 1800 cashflows
- Sep 15, 2011 1:19:44 PM tryit.memoryleak.RunIt main
- INFO: Processed 1900 cashflows
- Sep 15, 2011 1:19:48 PM tryit.memoryleak.RunIt main
- INFO: Processed 2000 cashflows
- Sep 15, 2011 1:19:52 PM tryit.memoryleak.RunIt main
- INFO: Processed 2100 cashflows
- Sep 15, 2011 1:19:56 PM tryit.memoryleak.RunIt main
- INFO: Processed 2200 cashflows
- Sep 15, 2011 1:20:00 PM tryit.memoryleak.RunIt main
- INFO: Processed 2300 cashflows
- Sep 15, 2011 1:20:04 PM tryit.memoryleak.RunIt main
- INFO: Processed 2400 cashflows
- Sep 15, 2011 1:20:08 PM tryit.memoryleak.RunIt main
- INFO: Processed 2500 cashflows
- Sep 15, 2011 1:20:12 PM tryit.memoryleak.RunIt main
- INFO: Processed 2600 cashflows
- Sep 15, 2011 1:20:16 PM tryit.memoryleak.RunIt main
- INFO: Processed 2700 cashflows
- Sep 15, 2011 1:20:20 PM tryit.memoryleak.RunIt main
- INFO: Processed 2800 cashflows
- Sep 15, 2011 1:20:25 PM tryit.memoryleak.RunIt main
- INFO: Processed 2900 cashflows
- Sep 15, 2011 1:20:29 PM tryit.memoryleak.RunIt main
- INFO: Processed 3000 cashflows
- Sep 15, 2011 1:20:33 PM tryit.memoryleak.RunIt main
- INFO: Processed 3100 cashflows
- Sep 15, 2011 1:20:38 PM tryit.memoryleak.RunIt main
- INFO: Processed 3200 cashflows
- Sep 15, 2011 1:20:42 PM tryit.memoryleak.RunIt main
- INFO: Processed 3300 cashflows
- Sep 15, 2011 1:20:46 PM tryit.memoryleak.RunIt main
- INFO: Processed 3400 cashflows
- Sep 15, 2011 1:20:50 PM tryit.memoryleak.RunIt main
- INFO: Processed 3500 cashflows
- Sep 15, 2011 1:20:55 PM tryit.memoryleak.RunIt main
- INFO: Processed 3600 cashflows
- Sep 15, 2011 1:20:59 PM tryit.memoryleak.RunIt main
- INFO: Processed 3700 cashflows
- Sep 15, 2011 1:21:12 PM tryit.memoryleak.RunIt main
- INFO: Processed 3800 cashflows
- Sep 15, 2011 1:21:17 PM tryit.memoryleak.RunIt main
- INFO: Processed 3900 cashflows
- Sep 15, 2011 1:22:15 PM tryit.memoryleak.RunIt main
- INFO: Processed 4000 cashflows
- Sep 15, 2011 1:24:14 PM tryit.memoryleak.RunIt main
- INFO: Processed 4100 cashflows
It processed only 4100 cohorts in 6 mins! What happenned? Should be done by now! Also if you see it starts off by taking 4 secs per 100 cohorts like usual. However, right around the 4 min mark, it is suddenly taking 1 min or more per 100 cohorts. Eventually, this thing will slow down further and further. What ahppenned after 3800 cohorts? Each task is doing exactly the same thing. It should take the same time
ETA: I let it run while I wastyping out this blog post, and 10K loans took 3851206ms = 64 mins. A 12 times reduction in performance
The analysis
So, to investigate this problem, I fired up JConsole. Jconsole is an application that is available along with Java. If you have java in your class path, you should have JConsole in your class path. It helps you look inside the application and tells you how it is behaving. When you start JConsole, it gives you the ist of java applications that you can monitor. It gives the class namee of the class that has the main thread. Note that JConsole is itself s Java application, so it shows itself If you have Java based IDE running, it will show you that too (in my case eclipse). In my case my main class is Runit, and I select that and hit connect. If you are monitoring a web app running in JBoss, it will show up as JBoss bootstrap class
After you connect, it shows you a dashboard that gives you an overview of what is happenning under the hood
The dashboard shows the CPU, the memory, threads and number of classes. The tabs above the screen show you a detailed view of each. Click on the memory tab and you can see this
Here you see a plot of heap memory usage by default, and it also shows you how much memory is used by eden space, survivor space and old gen space (more on this later). Now, let's see how the heap memory looks like after you run 10K loans
Tada! 5 points for anyone who can guess what the problem is at this point
The memory is getting full. As you can see, the application started with low memory, but the memory started filling up quickly, right after the 3-4 min mark, at which point it is right under 1 GB. 1 GB is the maximum amount of memory that I gave to Java. Also, see that the Old Gen space is full( it's the 3rds bar from the left). All these things indicate that the application is using too much memory. Also, another indicator is found by comparing the VM summarry and Memory screen
VM summary | Memory |
---|---|
Note in the Process CPU time it says 20 mins. This means that the CPU spent 20 mins in the application. note that this isn't the time the app was running.. that is uptime. This is the CPU time. If you have multiple threads running in a multi-CPU machine, JConsole will add up all the time spent by all the CPUS.. which is why Process CPU time is more than uptime Also see that GC time in the memory tab is 10 minutes. Note:- the application was running for 13 mins, out of which 20 mins were used in CPU. 10 mins out of that 20 minutes was spent in GC. That is excessive. The application would have been 100% faster if there were no GC running!
So, by looking at the memory usage and the GC time, we get a clue that the problem is excessive memory usage. The question is where is the memory usage. But before we handle that the question is why does excessive memory usage slow down the application? SHouldn;t it just go out of memory? Why should I even think about reducing memory usage when the problem reported to me is that the application is slow? Shouldn;t I try to optimize the database queries or increase the number of threads? Yes those things might help but they won't solve the core problem of memory (Actually when this problem was reported, Atul K did try to increase threads, and what happenned was that it took more time.. not less. Yes!! application ran slower by increasing the threads, which seems paradoxical) Unless you don't open JConsole and look under the hood you won't know exactly what is happenning
Why does using more memory slow down a Java application
The answer has to do with how GC works. GC is the biggest boon and bane for Java. Before Java, when most high performing applications were written in C/C++, the programmer had to manage his/her own memory. o, if a memory was allocated, the code had to deallocate the memory. That led to huge problems because the code will either a) forget to deallocate memory which led to memory leaks or b) use an object that is already deallocated, which led to crashes(this is also called as the dangling pointer problem) It wasn't uncommon for application to require a reboot every once in a while (to "clear" memory leaks), or to crash unexpectedly because of a dangling pointer. GC more or less solved these problems by handling the deallocation for the programmer. The code can allocate memory, but doesn;t deallocate it. Once an object is out of scope, and doesn;t have any references from other objects, then it becomes a candidate for deallocation. GC goes ahead and deallocates the memory when it runs.
The problem is finding objects for deallocation is not a simple thing. Let's say you have a List A that contains an Object B. Internally B contains 3 objects C, D and E. Let's say D contains a back reference to the parent Object B, and E contains a reference to itself. Now let's say you remove B from A. The programmer would expect that B would get deallocated because there are no references to it. However, it;s not as simple. GC has to find that when B is deallocated, it has to deallocate C, D and E too, so it has to walk through the object tree to find all the related objects that should be deallocated. Also, from B --> D and D --> B. There is a circular reference to B. To properly deallocate B, it has to detect the circular reference and ignore it. Also, E has a reference to itself. It has to detect that circular reference too. This requires multiple passes through the object tree. Proper implementation of GC is not a simple topic, and there are still people doing in academia doing research on improving GC implementations. It would be out of scope and useless for any of us programmers to know the nitty gritty details. Suffice to say that what we as programmers need to know is this GC is costly
Also, another thing to note is that GC becomes slower as you use more memory. When GC needs to find objects to collect, it has to walk through the object tree. If you have 1000 objects in memory that don't need to be collected, GC still has to walk through those objects. If you have 10000 objects, it's going to 10 times longer to run GC
Since people who design JVMs know that GC is costly, they make the JVM so that GC is lazy. Normally in an application that is idle most of the time, GC runs periodically. It doesn;t run every time an object goes out of scope. However, in an active application that is actively creating objects and releasing objects, GC will run when it absolutely has to. What the JVM does is that it holds off on collection of the objects until the memory usage reaches a limit, and then runs GC. If you look at the memory graph above, you see patterns of memory increasing and then dropping slightly. The drop is where GC ran a collection and released objects. This is done to improve performance. However, the side effect of this is that GC runs more often when you use more memory. If you are running the memory right under the max memory given to Java, the JVM is going to hit the limit more often and run GC more often. To draw an anaology, this is like when you are bouncing a basketball on the ground. If you keep your hand at your chest level and then start moving it down, the ball bounces faster and faster even though you control the speed of descent of the ball. This is because the ball has less distance to travel the closer it is to the ground. Similarly, the GC "bounces" more often if the memory usage is closer to the top.
So, the end result of GC being costly + becoming costlier as you use more memory + running more often is what causes the application to slow down as it uses more memory. Eventually, if GC is not able to release any memory an Out of memory exception is thrown. However, that happens long after the application has slowed down to a crawl. The first clue for most memory problems in Java apps is almost always "My application is running slower" Sometimes, what this means is that there can be unexpected side effects in typical applications. For example, let's say you have a bit of code that opens a connection and runs sql queries in a loop. Now, if code inside the loop is using too much memory, it will start getting slower and slower in every iteration. At some point, the database connection times out, and the code gets a SQLException. The code logs the exception and exits. When it exits, all the memory that it allocated inside the loop gets deallocated, and GC clears it up in the next collection. Suddenly the memory becomes low, and the Out of memory problem "heals" itself. The sysadmin might look at the memory usage after the problem happens, and see that memory usage is low and hence mistakenly conclude that memory is not a problem. When the problem is reported to the developer, the stack trace says "Connection time out" along with a note from the admin that says memory usage is low. So, the developer might looking at the SQL queries, and may spend hours if not days tuning the queries. That does make things a bit faster, but doesn't solve the root cause. The stack trace is misleading that leads the developer down the wrong rabbit hole.
The only way to solve this problem is for the developer to test performance under load him/herself with JConsole running before the application is released. Nothing beats running JConsole yourself
The fix
After we determined that memory was an issue, we have to find out where exactly the memory was being used. Please note that the analysis done through JConsole doesn't help us find the exact problem, but leads us down the correct path. There are several ways we can find memory problems, and there are tools that we can use to look inside the heap. I will cover that later. In this case, Atul K went through the code and found that the Cashflows were being stored in the Future objects, and theFuture objects aren;t released until the entire cashflow processing was done. So, we changed the code so that the task writes the cashflows to a CSV file. The main thread takes the CSV files and uses sqlloader to load them into the database after all tasks are complete. This had 2 advantages
a) the Future object doesn't need to have cashflows
b) part of the processing that was done by main thread will now be done inside the task. This means that it will be done in parallel, which will make the application faster
Actually, there was no need to keep a Future object anymore, and also we could have used Runnable instead of Callable. However, since we were under a time constraint (it was evening before release) we decided to do a quick change that provided the most improvement and decided to not remove the Future objects