Show HN: FlowTracker – Track data flowing through Java programs

Track data flowing through Java programs, gain new understanding at a glance.

FlowTracker is a Java agent that tracks how a program reads, manipulates, and writes data.
By watching a program run, it can show what file and network I/O happened and, more importantly, connect the program's inputs to its outputs to show where the output came from.
This helps you understand what any Java program’s output means and why it wrote it.

This proof-of-concept explores what insights we get by looking at program behaviour from this perspective.

Spring PetClinic is a demo application for the Spring framework.
To demonstrate FlowTracker’s abilities, we let it observe PetClinic handling an HTTP request and generating an HTML page based on a template and data from a database.
You can use this demo in your browser, without installing anything.
Open the FlowTracker PetClinic demo, or watch the video below.

petclinic.mp4

You see the HTTP response that FlowTracker saw PetClinic send over the network.
Click on a part of the contents of the HTTP response to see in the bottom view where that part came from.
You can select another tracked origin/input or sink/output in the tree on the left (or bottom left button on mobile).

Exploring this HTTP response, we navigate through multiple layers of the software stack:

HTTP handling: FlowTracker shows what code produced what output.
Click on “HTTP/1.1” or the HTTP headers. You see that this part of the response was generated by Apache Coyote (classes in the org.apache.coyote package), pointing you to exactly where each header came from.
Thymeleaf templates: FlowTracker shows how the input the program reads (the HTML templates) corresponds to the output.
Click on an HTML tag name, like “html” or “head”. You see the layout.html file, where this part of the HTML page comes from.
If you click on layout.html, and then on the colorful + button at the bottom, then everything coming from that file will be marked in the same color.
Scrolling down you’ll then notice part of the response comes from a different file, ownerDetails.html.
Click on a < or > to see that those characters were written by the Thymeleaf templating library.
Database: The HTML page contains a table with information that comes from the database.
Clicking on George in that table shows more than that the value came from the database.
FlowTracker traced it all the way back to the SQL script that inserted that value into the database in the first place.

In that demo, the tracking up to the SQL script works because it was using an in-memory database.
The database content never left the JVM, so FlowTracker could fully keep track of it.
When we run the same demo with a MySQL database, we can track those values only up to the database connection: we see the SQL query that was sent to produce them, and details of how the MySQL JDBC driver talks to the database.
See FlowTracker PetClinic mysql demo.
Notice that FlowTracker intercepts the decrypted contents sent over the SSL connection to the database.

This Spring PetClinic demo is just an example.
FlowTracker does not depend on your application using any particular framework or library.

Another demo shows how, by watching the Java compiler, FlowTracker helps you understand the format of the generated class file and the bytecode in it:
javac demo, video.

Warning:
In its current state, FlowTracker is closer to a proof of concept than production-ready.
It has proven itself to work well on a number of example programs, but it will not work well for everything; your mileage may vary.
Also be aware that it adds a lot of overhead, making programs run much slower.

Download the FlowTracker agent jar from the GitHub releases page (flowtracker-*.jar under "Assets").
Add the agent to your java command line: -javaagent:path/to/flowtracker.jar.
Disable some JVM optimizations that disrupt FlowTracker by also adding the output of java -jar flowtracker.jar jvmopts to the command line.
By default, FlowTracker starts a webserver on port 8011, so open http://localhost:8011/ in your browser.

For more detailed instructions, including configuration options, see USAGE.md.

FlowTracker is an instrumenting agent.
The agent injects its code into class files (bytecode) when the JVM loads them.
That code maintains a mapping of in-memory data to its origin, while the program reads, passes around, and writes data.
The focus is on tracking textual and binary data (like Strings, char and byte arrays), not on numerical, structured or computed data.

This is achieved with a combination of:

Replacing some calls to JDK methods with calls to FlowTracker's version of those methods.
Injecting code into key places in the JDK, mostly to track input and output.
Dataflow analysis and deeper instrumentation within methods to track local variables and values on the stack.
Adding code before and after method invocations, and at the start and end of invoked methods, to track method arguments and return values using ThreadLocals.
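
To make the agent mechanism concrete, here is a minimal sketch of an instrumenting agent built on the standard java.lang.instrument API. It is not FlowTracker's actual entry point: the class names SketchAgent and CountingTransformer are made up for illustration, and the transformer only counts classes instead of rewriting them.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical minimal agent; not FlowTracker's actual entry point.
class SketchAgent {
  /** Number of classes offered to the transformer so far. */
  static final AtomicInteger seenClasses = new AtomicInteger();

  /** Called by the JVM before the application's main method when started with -javaagent. */
  public static void premain(String agentArgs, Instrumentation inst) {
    inst.addTransformer(new CountingTransformer());
  }

  static class CountingTransformer implements ClassFileTransformer {
    @Override
    public byte[] transform(ClassLoader loader, String className,
        Class<?> classBeingRedefined, ProtectionDomain protectionDomain,
        byte[] classfileBuffer) {
      seenClasses.incrementAndGet();
      // An instrumenting agent would rewrite classfileBuffer here (for example with ASM)
      // and return the modified bytes. Returning null keeps the class unchanged.
      return null;
    }
  }
}
```

For the JVM to call premain, the agent jar's manifest must name the class in a Premain-Class entry.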

Core classes and concepts of FlowTracker's data model:

Tracker: holds information about a tracked object's content and source:
- content: the data that passed through it, e.g. all bytes passed through an InputStream or OutputStream.
- source: associates ranges of its content with their source ranges in other trackers. For example, the tracker for the bytes of a String could point to a range in the tracker of the FileInputStream that the String was read from, telling us which file it came from and where exactly in that file.
TrackerRepository: holds a large global Map that associates interesting objects with their tracker.
TrackerPoint: Pointer to a position in a tracker, representing a single primitive value being tracked, e.g. the source of one byte.
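
As a rough illustration of these three concepts, the sketch below uses simplified stand-ins. ByteTracker, Point and Repository are hypothetical names, not FlowTracker's real classes, and the source-range bookkeeping is left out. The one property it does demonstrate is that the repository must look objects up by identity, so two arrays with equal contents keep separate trackers.

```java
import java.io.ByteArrayOutputStream;
import java.util.IdentityHashMap;
import java.util.Map;

// Simplified, hypothetical stand-ins for the concepts above; not FlowTracker's real classes.
class DataModelSketch {
  /** Records the content that passed through one tracked object, e.g. a stream. */
  static final class ByteTracker {
    final String description;
    private final ByteArrayOutputStream content = new ByteArrayOutputStream();

    ByteTracker(String description) {
      this.description = description;
    }

    void append(byte[] data, int offset, int length) {
      content.write(data, offset, length);
    }

    byte[] content() {
      return content.toByteArray();
    }
  }

  /** Pointer to one position in a tracker: the source of a single byte. */
  static final class Point {
    final ByteTracker tracker;
    final int index;

    Point(ByteTracker tracker, int index) {
      this.tracker = tracker;
      this.index = index;
    }
  }

  /** Associates objects with their tracker by identity (==), not by equals(). */
  static final class Repository {
    private final Map<Object, ByteTracker> trackers = new IdentityHashMap<>();

    synchronized void set(Object obj, ByteTracker tracker) {
      trackers.put(obj, tracker);
    }

    synchronized ByteTracker get(Object obj) {
      return trackers.get(obj);
    }
  }
}
```

This sketch holds strong references to every tracked object. A long-running agent would presumably need weak references so that tracked objects can still be garbage collected; that is an assumption, not something stated above.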

To keep Trackers up-to-date, our instrumentation inserts calls to hook methods in FlowTracker when some specific JDK methods are being called.

The simplest example of that is for System.arraycopy.
We intercept that on the caller's side: Calls to java.lang.System.arraycopy are replaced with calls to com.coekie.flowtracker.hook.SystemHook.arraycopy.
For this and other instrumentation, we use the ASM bytecode manipulation library.
In SystemHook we call the real arraycopy, get the Trackers of the source and destination arrays from the TrackerRepository, and update the target Tracker to point to its source.

For example, given this code:

char[] abc = ...; char[] abcbc = new char[5];
System.arraycopy(abc, 0, abcbc, 0, 3);
System.arraycopy(abc, 1, abcbc, 3, 2);

This gets rewritten to the following.
Note that instrumentation happens on bytecode, not source code, but we show equivalent source code here because that's much easier to read.

char[] abc = ...; char[] abcbc = new char[5];
SystemHook.arraycopy(abc, 0, abcbc, 0, 3);
SystemHook.arraycopy(abc, 1, abcbc, 3, 2);

After executing this, the tracker for abcbc would look like: {[0-2]: {tracker: abcTracker, sourceIndex: 0, length: 3}, [3-4]: {tracker: abcTracker, sourceIndex: 1, length: 2}}
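
The following toy model shows how such a hook could do its bookkeeping. It is a sketch under simplifying assumptions, with hypothetical names (ArrayCopySketch, Origin, originOf): it stores one origin per element instead of ranges, and it records only the direct source array rather than following that array's own origins.

```java
import java.lang.reflect.Array;
import java.util.IdentityHashMap;
import java.util.Map;

// Toy model, with hypothetical names, of a caller-side arraycopy hook.
class ArrayCopySketch {
  /** Where one array element came from: an index in some source array. */
  static final class Origin {
    final Object sourceArray;
    final int sourceIndex;

    Origin(Object sourceArray, int sourceIndex) {
      this.sourceArray = sourceArray;
      this.sourceIndex = sourceIndex;
    }
  }

  // Per-array origin table, keyed by identity. One entry per element keeps the sketch short;
  // storing ranges, as in the tracker shown above, would be more compact.
  private static final Map<Object, Origin[]> ORIGINS = new IdentityHashMap<>();

  /** Drop-in replacement for System.arraycopy that also records where elements came from. */
  static synchronized void arraycopy(Object src, int srcPos, Object dest, int destPos, int length) {
    // Do the real copy first; it also performs all type and bounds checks.
    System.arraycopy(src, srcPos, dest, destPos, length);
    Origin[] destOrigins = ORIGINS.computeIfAbsent(dest, d -> new Origin[Array.getLength(d)]);
    for (int i = 0; i < length; i++) {
      destOrigins[destPos + i] = new Origin(src, srcPos + i);
    }
  }

  /** Returns the recorded origin of array[index], or null if nothing was recorded. */
  static synchronized Origin originOf(Object array, int index) {
    Origin[] origins = ORIGINS.get(array);
    return origins == null ? null : origins[index];
  }
}
```

With the two copies from the example, originOf(abcbc, 3) would point to index 1 of abc, matching the second range in the tracker above.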

That was an example of a hook on the caller side.
But most calls to hook methods are added on the callee side, inside the methods in the JDK.
For example, take FileInputStream.read(byte[]), which reads data from a file and stores the result in the provided byte[].
We add the call to our hook method (FileInputStreamHook.afterReadByteArray) at the end of the FileInputStream.read(byte[]) method.
We have our own instrumentation micro-framework for that, driven by annotations, implemented using ASM's AdviceAdapter.

That way we add hooks to a number of classes in the JDK responsible for input and output, such as java.io.FileInputStream, java.io.FileOutputStream, and internal classes like sun.nio.ch.FileChannelImpl, sun.nio.ch.IOUtil, sun.nio.ch.NioSocketImpl and more.
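
The effect of a callee-side hook can be imitated at source level with a wrapper stream. This is only an analogy: FlowTracker injects the hook call into the JDK method's bytecode, whereas this sketch overrides read in a subclass, and TrackedInputStream, ReadEvent and afterRead are hypothetical names.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Source-level imitation of a callee-side hook; all names are hypothetical.
class TrackedInputStream extends FilterInputStream {
  /** One recorded read: which range of the stream ended up where in which array. */
  static final class ReadEvent {
    final byte[] target;
    final long streamOffset;
    final int targetOffset;
    final int length;

    ReadEvent(byte[] target, long streamOffset, int targetOffset, int length) {
      this.target = target;
      this.streamOffset = streamOffset;
      this.targetOffset = targetOffset;
      this.length = length;
    }
  }

  final List<ReadEvent> events = new ArrayList<>();
  private long position; // number of bytes read through this stream so far

  TrackedInputStream(InputStream in) {
    super(in);
  }

  // FilterInputStream.read(byte[]) delegates to this method, so both variants are covered.
  // Single-byte read() is deliberately not tracked in this sketch.
  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n = super.read(buf, off, len);
    afterRead(buf, off, n); // the "hook at the end of the method"
    return n;
  }

  private void afterRead(byte[] buf, int off, int n) {
    if (n > 0) {
      events.add(new ReadEvent(buf, position, off, n));
      position += n;
    }
  }
}
```

Each ReadEvent carries what a tracker would need: the target array, the offset in the stream, and the length of the range that was read.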

Implementation:
SystemHook,
FileInputStreamHook,
and other classes in the hook package.

Primitive values, dataflow analysis

A bigger challenge is tracking primitive values.
Consider this example:

byte[] x; byte[] y;
// ...
byte b = x[1];
// ...
y[2] = b;

When that code is executed, we would need to update the Tracker of y, to remember that the value at index 2 comes from the value at index 1 in x.
If those had been String[]s and b was a String instead of a byte, then we wouldn't need to modify code like this, because the TrackerRepository would know what the Tracker of the String is, and keeps that association no matter how that String object is passed around.
But the TrackerRepository can't keep a mapping of primitive values like bytes to Trackers, because primitive values don't have an identity: any Map having a byte as key would mix up different occurrences of the same byte.
Instead, we store the association of b to its tracker in a local variable in the method itself.
The code gets rewritten to roughly something like this:

byte[] x; byte[] y;
// ...
byte b = x[1];
TrackerPoint bTracker = ArrayHook.getElementTracker(x, 1);
// ...
y[2] = b;
ArrayHook.setElementTracker(y, 2, bTracker);
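
A runnable toy version of that rewritten code is sketched below. The method names mirror the example, but the implementation is hypothetical: it keeps one Point per array element in an identity-keyed map, and it treats an element with no recorded source as its own origin, which is a simplification chosen for this sketch.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Toy version of the rewritten code above; the implementation is hypothetical.
class PrimitiveFlowSketch {
  /** Points at one element of a byte array: the source of a single byte. */
  static final class Point {
    final byte[] array;
    final int index;

    Point(byte[] array, int index) {
      this.array = array;
      this.index = index;
    }
  }

  // Arrays are objects with identity, so they can be map keys; the bytes in them cannot.
  private static final Map<byte[], Point[]> SOURCES = new IdentityHashMap<>();

  /** Source of array[index]: the recorded origin if there is one, else the element itself. */
  static synchronized Point getElementTracker(byte[] array, int index) {
    Point[] points = SOURCES.get(array);
    return (points != null && points[index] != null) ? points[index] : new Point(array, index);
  }

  /** Records that array[index] now holds a value that came from the given source. */
  static synchronized void setElementTracker(byte[] array, int index, Point source) {
    SOURCES.computeIfAbsent(array, a -> new Point[a.length])[index] = source;
  }

  /** What the instrumented method from the example effectively does. */
  static void copyOneByte(byte[] x, byte[] y) {
    byte b = x[1];
    Point bTracker = getElementTracker(x, 1); // travels next to b in a local variable
    y[2] = b;
    setElementTracker(y, 2, bTracker);
  }
}
```

After copyOneByte(x, y), getElementTracker(y, 2) points at index 1 of x. The extra local variable bTracker is what carries the association for a value that has no identity of its own.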

To do that FlowTracker needs to understand how exactly values flow through a method.
We build upon ASM's analysis support to analyze the code (symbolic interpretation).
That way we construct a model of where values in local variables and on the stack come from at every point in the method, and where they end up.

This is implemented in:

FlowValue and its subclasses (e.g. ArrayLoadValue) that model where values come from, and can generate the instructions that create the TrackerPoints that point to that source.
A particularly interesting one is MergedValue, which handles situations where because of control flow (e.g. if-statements, loops) a value can come from multiple possible places.
FlowInterpreter: extension of ASM's Interpreter, interprets bytecode instructions, creates the appropriate FlowValues.
Store and its subclasses (e.g. ArrayStore) that represent places that FlowValues go to, that consume the TrackerPoints.
FlowTransformer: drives the whole analysis and instrumentation process. See its docs for a more detailed walkthrough of how this all fits together.

We don't track the source of all primitive values.
The focus is on byte and char values, and to a lesser extent ints and longs.

The dataflow analysis from the previous section is limited to handling flow of primitive values within a single method.
Those values also flow into other methods, as arguments and return values of method invocations.
We model that in Invocation, which stores TrackerPoints for arguments and return values.
The Invocation is stored in a ThreadLocal just before a method invocation, and retrieved at the start of the implementation of the method.

For example, take this code passing a primitive value to a "write" method:

void caller() {
  byte b = ...;
  out.write(b);  
}

...

class MyOutputStream {
  void write(byte value) {
    ... // do something with value
  }
}

To get the TrackerPoint of b into the write method, the code is instrumented like this:

void caller() {
  byte b = ...;
  TrackerPoint bTracker = ...;
  Invocation.create("write(byte)")
    .setArg(0, bTracker)
    // this puts the Invocation in the ThreadLocal
    .calling(); 
  out.write(b);  
}

...

class MyOutputStream {
  void write(byte value) {
    // this extracts the Invocation from the ThreadLocal
    Invocation invocation = Invocation.start("write(byte)");
    TrackerPoint valueTracker = invocation.getArg0();
    ... // do something with value & valueTracker
  }
}
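
The handoff through a ThreadLocal can be modelled in a few lines. ToyInvocation below is a hypothetical, much simpler stand-in for the real Invocation class: it carries plain Objects instead of TrackerPoints and ignores return values. The signature check in start is one plausible way to avoid picking up a stale entry left by a call to an uninstrumented method; whether FlowTracker does exactly this is an assumption.

```java
// Toy model of passing tracking information alongside a method call through a ThreadLocal.
// Hypothetical and much simpler than FlowTracker's Invocation class.
class ToyInvocation {
  private static final ThreadLocal<ToyInvocation> PENDING = new ThreadLocal<>();

  private final String signature;
  private final Object[] argTrackers;

  private ToyInvocation(String signature, int argCount) {
    this.signature = signature;
    this.argTrackers = new Object[argCount];
  }

  /** Caller side: start describing a call to the method with this signature. */
  static ToyInvocation create(String signature, int argCount) {
    return new ToyInvocation(signature, argCount);
  }

  /** Caller side: attach the tracker that belongs to the argument at this index. */
  ToyInvocation setArg(int index, Object tracker) {
    argTrackers[index] = tracker;
    return this;
  }

  /** Caller side: publish this invocation just before the actual call. */
  void calling() {
    PENDING.set(this);
  }

  /** Callee side: take the pending invocation, but only if it was meant for this method. */
  static ToyInvocation start(String signature) {
    ToyInvocation pending = PENDING.get();
    PENDING.remove(); // consume it, so a later unrelated call cannot see it
    return (pending != null && pending.signature.equals(signature)) ? pending : null;
  }

  /** Callee side: the tracker the caller attached to this argument, or null. */
  Object getArg(int index) {
    return argTrackers[index];
  }
}
```

Because the entry lives in a ThreadLocal, concurrent calls on other threads do not interfere, and no method signature has to change to carry the extra information.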

Implementation:
Invocation,
InvocationArgStore,
InvocationArgValue,
InvocationReturnStore,
InvocationReturnValue,
InvocationOutgoingTransformation,
InvocationIncomingTransformation

There are two main types of tracked origins of data.
There is I/O, which is tracked as explained in the "Basic instrumentation" section.
And there are values coming from the code itself, such as primitive and String constants ('a', "abc").
For those, we create a tracker for each class (a ClassOriginTracker), that contains a textual