August 15, 2015

Software Instrumentation - Balancing Data Collection and Impact on Performance

Written by Gen

Performance monitoring

Real-time monitoring of web applications requires logging events and metrics and reporting anomalies in a timely manner. The great temptation is to "collect everything," but doing so will slow your app's performance to a crawl. Instrumenting Java bytecode is particularly tricky until you master the language's profiling tools for troubleshooting performance and memory-allocation problems.

The system-failure-response two-step goes like this: in step one, you detect the problem and alert the appropriate parties; in step two, you get the system back online (or back to operational levels), track down the source of the glitch, and apply a permanent fix. Sounds simple, right?

Any system manager will tell you the monitoring and troubleshooting process is anything but simple. Today, code is instrumented to log and report all events so you can respond quickly to any anomaly. The prevailing sentiment is to "collect everything" because data is cheap and the price of an extended outage is high.

However, instrumentation affects overall system performance, so collecting literally everything would create tremendous overhead. The response to a Stack Overflow question posted in January 2012 points out that instrumentation imposes a bigger performance hit than interrupt-based sampling. Still, the benefits of the data you collect may justify the extra bandwidth and CPU cycles.

Logging and monitoring are both intended to keep web applications running, but the two techniques approach troubleshooting in different ways. Matt Makai explains the differences on Full Stack Python. With logging, an exception handler reports errors explicitly through code when the event occurs. By contrast, monitoring uses an agent to instrument the code and collect data about function performance in addition to the actual exception being logged.
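To make the distinction concrete, here is a minimal sketch of the logging side using the JDK's built-in java.util.logging; the class and method names are hypothetical. The error is reported explicitly in code, at the point where the exception occurs:

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class CheckoutService {
        private static final Logger LOG = Logger.getLogger(CheckoutService.class.getName());

        public void placeOrder(String orderId) {
            try {
                chargeCard(orderId);
            } catch (RuntimeException e) {
                // Logging: the developer decides in code what gets reported and when.
                LOG.log(Level.SEVERE, "Payment failed for order " + orderId, e);
                throw e;
            }
        }

        private void chargeCard(String orderId) {
            // Payment logic omitted; assume it can throw at runtime.
        }
    }

A monitoring agent, by contrast, would time placeOrder() and capture the exception without a single logging call appearing in the application source.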

On the OS and network level of the web stack, you need to monitor CPU and memory use; persistent storage in use and available; and network bandwidth and latency. At the application level, monitoring encompasses 500-level HTTP warnings; code performance; template rendering speed; how quickly the app renders in a browser; and database query speed.
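Several of these numbers can be polled from inside the JVM through the standard java.lang.management beans. Below is a minimal sketch that reads system load, heap usage, and disk space; a real monitor would sample these on a schedule and ship them to a collector rather than print them:

    import java.io.File;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class MetricsSnapshot {
        public static void main(String[] args) {
            // System load average over the last minute (-1.0 if the platform
            // does not expose it).
            double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
            System.out.println("Load average: " + load);

            // Heap memory: bytes used versus the configured maximum.
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.printf("Heap: %d / %d bytes%n", heap.getUsed(), heap.getMax());

            // Persistent storage: usable versus total space on the working partition.
            File root = new File(".");
            System.out.printf("Disk: %d / %d bytes free%n",
                    root.getUsableSpace(), root.getTotalSpace());
        }
    }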

The limitations of Java bytecode instrumentation

Java bytecode instrumentation lets you debug and profile your code at runtime. Since the introduction of the Java agent interface in JDK 1.5, you can write Java code that the class loader invokes as each class is loaded, rather than having to adjust the class loader hierarchy manually. This lets you manipulate the bytecode of each individual class.
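A minimal agent built on that interface looks roughly like the sketch below; the class name LoggingAgent and the jar name are illustrative. The transformer sees every class as the class loader defines it and could rewrite the bytecode before returning it:

    import java.lang.instrument.ClassFileTransformer;
    import java.lang.instrument.Instrumentation;
    import java.security.ProtectionDomain;

    // Launch with: java -javaagent:myagent.jar -jar app.jar
    // The agent jar's manifest must declare "Premain-Class: LoggingAgent".
    public class LoggingAgent {
        public static void premain(String agentArgs, Instrumentation inst) {
            inst.addTransformer(new ClassFileTransformer() {
                @Override
                public byte[] transform(ClassLoader loader, String className,
                                        Class<?> classBeingRedefined,
                                        ProtectionDomain protectionDomain,
                                        byte[] classfileBuffer) {
                    // Every class passes through here before the JVM defines it.
                    // Returning null leaves the bytecode unchanged; a profiler
                    // would rewrite classfileBuffer (e.g., with ASM) to add probes.
                    System.out.println("Loading: " + className);
                    return null;
                }
            });
        }
    }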

In a Dr. Dobb's article, Ian Formanek and Gregg Sporar explain how dynamic bytecode instrumentation facilitates profiling to spot performance and memory-allocation glitches. Because profiling can be limited to specific application components, only relevant information is reported and the monitoring process's impact on performance is minimized.
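In agent terms, that selectivity amounts to a filter inside the transformer: classes outside the components of interest pass through untouched and carry no instrumentation overhead. A sketch, using a hypothetical application package:

    import java.lang.instrument.ClassFileTransformer;
    import java.security.ProtectionDomain;

    // Only touches classes in one application package, leaving the JDK's
    // own classes and third-party libraries uninstrumented.
    public class ScopedTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            // JVM-internal class names use '/' as the package separator.
            if (className == null || !className.startsWith("com/example/shop/")) {
                return null; // null means "leave this class as-is"
            }
            // A real profiler would rewrite classfileBuffer here to insert
            // entry/exit timing probes; returning null keeps the sketch inert.
            return null;
        }
    }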

The JVM Tool Interface's "redefineClasses(Class[] classes, byte[][] newBC)" method instructs the profiler to insert the required instrumentation the first time the selected class or method is invoked. Profiling overhead can be adjusted while the application is running, and the instrumentation can even be removed entirely.
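The same runtime swap is exposed to Java agents through java.lang.instrument. A sketch, assuming the agent's manifest declares "Can-Redefine-Classes: true" and that newBytecode holds a valid class file with the probes already woven in:

    import java.lang.instrument.ClassDefinition;
    import java.lang.instrument.Instrumentation;
    import java.lang.instrument.UnmodifiableClassException;

    public class Redefiner {
        public static void redefine(Instrumentation inst, Class<?> target,
                                    byte[] newBytecode)
                throws ClassNotFoundException, UnmodifiableClassException {
            // Swap the class's bytecode in place; subsequent invocations of its
            // methods run the new, instrumented versions.
            inst.redefineClasses(new ClassDefinition(target, newBytecode));
        }
    }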

Formanek and Sporar provide examples of bytecode instrumentation in action. To illustrate CPU performance profiling, the NetBeans Profiler is applied to the Java2D demo program. When the entire program is instrumented, the app's start time slows from 2 seconds to 10 seconds. When a filter is applied to remove all "java.*" classes, performance improves greatly, but important information is no longer being reported.


Filtering out "java.*" classes excludes important information from the profile (top) that is included when all classes are profiled. Source: Dr. Dobb's

When the filter is replaced by the single root method "java2d.AnimatingSurface.run()", the app performs well and only the three threads that invoked the "run()" method are reported by the profiler.

By using a single root method in place of the "java.*" class filter, the app's performance was maintained and full details of the "run()" method were returned. Source: Dr. Dobb's