Why we built Illuminate and where we think APM is going next!
People have asked us a few questions about Illuminate following our recent major release. We’ll do our best to answer them here!
1. Why do we need yet another tool in the performance monitoring space?
Illuminate is a completely different type of performance tool. In our opinion it represents the future of tooling in the performance diagnostic space. Up until now, the APM space has been dominated by tools that are effectively dashboards displaying multitudes of metrics. Interpreting the metrics on those dials and charts into something meaningful can take a lot of time. It requires organizations to rely on a combination of highly specialised individuals and significant levels of cooperation between teams that may well be scattered across several time zones.
Just the logistics of managing all of this often results in long outages when problems occur. While dashboards look very flashy and often do contain all of the data needed to solve the problem, they only leave you with data, tons of data. The problem with having all this data is that it’s expensive to collect, expensive to store, rarely looked at, and wonderful at obscuring the helpful data. In our more than 40 years of combined experience tuning mission-critical Java applications across a wide variety of industries, we didn’t use tons of data, which led us to question the state of the art. We believe the industry can do better, and that is why we formed jClarity and started to build Illuminate.
The Java ecosystem has come a long way in terms of automating the build and deploy toolchain. You can now build software quickly (RAD frameworks such as Spring Boot), with reliable tests (JUnit, Spock and friends) and deploy on a daily basis (Chef, Puppet and pals). The missing piece for us is what happens when that software is up and running and does not behave as users would expect. It’s that very complex last mile which we want to help solve!
In short, we want to tell users the root cause of a performance problem and offer some suggestions on how to fix it, all within minutes, not weeks!
2. What makes Illuminate better?
In short, it’s our twin principles: analytics beat metrics, and gathering less data is better.
When we sat down together in mid-2012 it became apparent that there were a number of patterns common to all of the performance tuning engagements that we had been involved in. Kirk had already generalized a lot of the patterns into a methodology that he’d been teaching and using for quite some time. The lightbulb moment came when we realised that our patterns could also be folded into this methodology, making it stronger. The core methodology that we use in our engagements is very friendly to humans. It not only simplifies the tuning process but also makes it time predictable, and it’s deliberately light and very targeted in the data needed to drive it.
We knew that the methodology was not an imperative process and that it contains a fair degree of fuzziness and uncertainty. That means it had to be driven by a human, which implies yet another thing that someone on your team has to know. So we started to research machine learning techniques to see if we could turn this ‘fuzzy logic’ thinking into a workable software algorithm.
What we have done with Illuminate is combine machine learning with a battle-hardened performance methodology to create a diagnostic engine that drives the process. It’s this combination of technology and field experience that we think makes Illuminate a unique solution.
Another major advantage is that the methodology requires only small amounts of data, so we could design Illuminate to be extremely light. We wanted the diagnostic engine to have a minimal impact on running systems. While the original intent was to allow Illuminate to scale out to systems containing thousands of JVMs, it looks as if the same technology will also be able to run on IoT devices.
3. What is new in this release?
The previous release required users to trigger the diagnostic engine. With this release users can set Service Level Agreements (SLAs) to trigger a diagnosis. For example, let’s say you need logins to respond in less than 1 second. You would give this information to Illuminate and, if logins do take longer than 1 second, Illuminate will start running a diagnostic. The SLA violation data is now also fed into the diagnostic engine, which helps provide the end user with a better characterization of the problem, be it a rogue O/S process, Java’s Garbage Collection, an external database or web service, or just plain old slow code.
Illuminate delivers a report into your inbox within seconds, so there is no more hunting around for the needle in the haystack! Elemica and Clareity Security were two early adopters of this engine and were able to find issues within minutes that had eluded them for months.
4. How does it work?
Illuminate is delivered as Software as a Service. Users download and install an Illuminate Daemon using a simple installer, which starts up a small standalone Java process. The Daemon sits quietly unless it is asked to start gathering SLA data and/or to trigger a diagnosis. Users can set SLAs via the dashboard and can opt to collect latency measurements of their transactions manually (using our library) or by asking Illuminate to automatically instrument their code (Servlet and JDBC based transactions are currently supported).
SLA latency data for transactions is collected on a short cycle. When the moving average of latency measurements goes above the SLA value (e.g. 150ms), a diagnosis is triggered. The diagnosis is very quick, gathering key data from the O/S, JVM(s), virtualisation layer and other areas of the system. The data is then run through the machine-learned algorithm, which quickly narrows down the possible causes and gathers a little extra data if needed.
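To make the trigger rule concrete, here is a minimal sketch of a moving-average SLA check. It is an illustration only, not Illuminate’s actual code: the class name SlaMonitor, the fixed-size sample window and the choice to wait until the window is full before triggering are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a moving-average SLA check.
 * Names and window policy are assumptions, not Illuminate's implementation.
 */
final class SlaMonitor {
    private final double slaMillis;   // SLA threshold, e.g. 150 ms
    private final int windowSize;     // number of samples in the moving average
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;               // running sum of the samples in the window

    SlaMonitor(double slaMillis, int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.slaMillis = slaMillis;
        this.windowSize = windowSize;
    }

    /**
     * Records one latency sample and reports whether the moving average
     * over the last windowSize samples now exceeds the SLA.
     * Returns false until the window has filled (an assumed policy that
     * avoids triggering on a single early outlier).
     */
    boolean record(double latencyMillis) {
        window.addLast(latencyMillis);
        sum += latencyMillis;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();  // evict the oldest sample
        }
        return window.size() == windowSize && sum / windowSize > slaMillis;
    }
}
```

With a 150ms SLA and a window of three samples, readings of 100, 120 and 140 keep the average at 120, so nothing happens. A fourth reading of 400 pushes the average of the last three samples to 220, which would trigger a diagnosis.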
Once Illuminate has determined the root cause of the performance problem, the diagnosis report is sent back to the dashboard and an alert is sent to the user. That alert contains a link to the result of the diagnosis which the user can share with colleagues. Illuminate has all sorts of backoff strategies to ensure that users don’t get too many alerts of the same type in rapid succession!
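The post does not describe the backoff strategies in detail, so the following is only a sketch of one common approach: exponential backoff per alert type, where each alert that gets through doubles the quiet period for that type up to a cap. The class name AlertThrottle, the doubling rule and the interval parameters are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical per-type alert throttle using capped exponential backoff.
 * This is an assumed strategy, not Illuminate's actual implementation.
 */
final class AlertThrottle {
    /** Per-alert-type state: when the next alert is allowed and the current quiet period. */
    private static final class State {
        long nextAllowedAtMillis;
        long intervalMillis;
    }

    private final long baseIntervalMillis;
    private final long maxIntervalMillis;
    private final Map<String, State> states = new HashMap<>();

    AlertThrottle(long baseIntervalMillis, long maxIntervalMillis) {
        this.baseIntervalMillis = baseIntervalMillis;
        this.maxIntervalMillis = maxIntervalMillis;
    }

    /** Returns true if an alert of this type should be sent at time nowMillis. */
    boolean shouldSend(String alertType, long nowMillis) {
        State s = states.get(alertType);
        if (s == null) {
            // First alert of this type: send it and start with the base quiet period.
            s = new State();
            s.intervalMillis = baseIntervalMillis;
            s.nextAllowedAtMillis = nowMillis + s.intervalMillis;
            states.put(alertType, s);
            return true;
        }
        if (nowMillis < s.nextAllowedAtMillis) {
            return false;  // still inside the quiet period: suppress
        }
        // Repeat alert: send it, then double the quiet period up to the cap.
        s.intervalMillis = Math.min(s.intervalMillis * 2, maxIntervalMillis);
        s.nextAllowedAtMillis = nowMillis + s.intervalMillis;
        return true;
    }
}
```

Each alert type is tracked separately, so a burst of garbage collection alerts would not suppress a database alert. A production version would probably also reset the quiet period once a problem type has stayed silent for long enough; that is left out here to keep the sketch short.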
The communications all work over SSL-secured WebSockets, so generally speaking there should be no fiddling with firewalls or other annoying configuration. Illuminate can also run in-house for those users who have policies forbidding externally hosted services.
There are a host of small advancements in the Linux, JVM and virtualisation space that we are (or will shortly be) taking full advantage of, such as memory mapping internal communications via Chronicle, profiling honestly with the Honest Profiler and a few things we are keeping under wraps for now. We’ll be posting further technical details and interviews with industry leaders on this blog.
5. How does it impact performance when running in production?
With the JVM there is an unavoidable cost when instrumenting transactions to get latency data. We have looked at this problem very carefully and use our knowledge of JVM safepointing and other internal behaviours to carefully weave in our stop/start hooks. We also use some backoff and filtering strategies to minimise the impact of the collection. These measurements can also be dynamically switched on and off at any time in case of emergencies.
Next week a change will go in that sends the latency measurements completely outside of the JVM process in which they are captured, which will reduce the impact even further.
6. Can it still work in tandem with other APM products?
We’ve tested Illuminate against most of the other agent-based tools on the market and have not seen any showstoppers to date.
7. Why should we trust you?
We dislike blind trust and we are certainly not going to say Illuminate is perfect! It represents an important step, but it is one of many in our roadmap. We also like to say “Measure, don’t guess®” and we love brutally honest feedback from our users and community. By way of some extra background: the team is heavily involved in the Java, Performance Tuning and Open Source communities. Kirk, Ben and Martijn are all recognized Java Champions, work on OpenJDK (Java itself), and have authored leading Java titles such as The Well-Grounded Java Developer and the most recent Java in a Nutshell. We also have the popular Friends of jClarity performance tuning community, which has about 1000 or so friendly experts in this space.
8. How can I demo/trial it?
Illuminate is available for a 14-day free trial and works on Linux-based systems for Java/JVM language applications. It has a default SLA of 1000ms set, so all you have to do is switch on the auto instrumenting (if you have Servlet and/or JDBC based transactions) or use our simple stopwatch library. Once traffic starts flowing through your application, Illuminate will highlight the transaction times for you in the dashboard and trigger a diagnosis if the SLA is breached.
You can of course change the default SLA and add new ones. For the really impatient, you can simply trigger a diagnosis manually. The manual trigger naturally comes with a caveat: if your application is performing well, then you should take the diagnosis with a grain of salt!
Martijn (CEO) and the jClarity Team!