Using JNI for calculations

I've had plenty of free time lately. Out of boredom, I decided to try something new and interesting, something I probably wouldn't use at my job. Something like JNI.

JNI enables programmers to call native code from Java and vice versa. It can be used to improve the performance of calculations and to manipulate hardware directly in ways that are sometimes impossible in pure Java (for instance, working directly with the graphics adapter).

I've always wondered how to connect Java with code written in other languages to improve performance. So I decided to write a very simple JNI-based program for multiplying matrices and run my own benchmarks comparing JNI and pure Java performance.

First of all, it's necessary to write a Java class that has native methods. In our case it's MatrixMultiplier. You can find this class, as well as the whole example project, in this GitHub repository.

Let's start by declaring the native method. It can't have a body, because it's implemented in a different language in a different file. (In GWT it's possible to insert JavaScript code inside specially formatted comments, but that trick works only because Google provides its own compiler for GWT.) I'm going to implement the native method in C, so there's no body in the Java code.

private native int[][] multiply(int[][] a, int[][] b);

The pureJavaMultiply method does the same, but in pure Java.

private int[][] pureJavaMultiply(int[][] a, int[][] b) {
    int[][] res = new int[a.length][a.length];
    for (int i = 0; i < a.length; i++) {
        for (int j = 0; j < a.length; j++) {
            res[i][j] = 0;
            for (int k = 0; k < a.length; k++) {
                res[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    return res;
}

To call the native method it's necessary to load the shared library from a file. The System.load method is used for this purpose; it requires an absolute path, hence getAbsolutePath below. On Windows, replace .so with .dll.

File sharedLibrary = new File("net_sukharevd_jni_example_MatrixMultiplier.so");
System.load(sharedLibrary.getAbsolutePath());
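
To put the pieces together, the declaration and the loading code might sit in the class roughly like this. This is a sketch, not the exact class from the repository: the static-initializer placement and the os.name check are my own additions.

package net.sukharevd.jni.example;

import java.io.File;

public class MatrixMultiplier {

    static {
        // Assumption: choose the library extension by OS; the post itself just
        // says to replace .so with .dll on Windows.
        String ext = System.getProperty("os.name").toLowerCase().contains("win") ? ".dll" : ".so";
        File sharedLibrary = new File("net_sukharevd_jni_example_MatrixMultiplier" + ext);
        System.load(sharedLibrary.getAbsolutePath());
    }

    private native int[][] multiply(int[][] a, int[][] b);

    // pureJavaMultiply and the benchmarking code also live here (see the repository).
}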

The rest of the MatrixMultiplier class is dedicated to measuring the performance of the two multiplication methods.
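
The exact benchmarking code is in the repository; a minimal sketch of such a timing loop could look like the following. The randomMatrix helper and the loop bounds are illustrative assumptions, not necessarily what the project uses.

public static void main(String[] args) {
    MatrixMultiplier multiplier = new MatrixMultiplier();
    for (int n = 250; n <= 4000; n += 250) {
        int[][] a = randomMatrix(n);
        int[][] b = randomMatrix(n);

        long start = System.currentTimeMillis();
        multiplier.multiply(a, b);                    // native implementation
        long tJni = System.currentTimeMillis() - start;

        start = System.currentTimeMillis();
        multiplier.pureJavaMultiply(a, b);            // pure Java implementation
        long tPureJava = System.currentTimeMillis() - start;

        System.out.println(n + "\t" + tJni + "\t" + tPureJava);
    }
}

// Hypothetical helper: fills an n-by-n matrix with small random integers.
private static int[][] randomMatrix(int n) {
    java.util.Random rnd = new java.util.Random();
    int[][] m = new int[n][n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            m[i][j] = rnd.nextInt(100);
    return m;
}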

Now it's necessary to compile this Java class

javac src/net/sukharevd/jni/example/MatrixMultiplier.java

and generate a C header file from it using javah (-d sets the output directory for the generated header, -cp the classpath)

javah -d src/ -cp src/ net.sukharevd.jni.example.MatrixMultiplier

As a result, a new header file will appear in the src folder, declaring the multiply function

/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class net_sukharevd_jni_example_MatrixMultiplier */

#ifndef _Included_net_sukharevd_jni_example_MatrixMultiplier
#define _Included_net_sukharevd_jni_example_MatrixMultiplier
#ifdef __cplusplus
extern "C" {
#endif
/*
 * Class:     net_sukharevd_jni_example_MatrixMultiplier
 * Method:    multiply
 * Signature: ([[I[[I)[[I
 */
JNIEXPORT jobjectArray JNICALL Java_net_sukharevd_jni_example_MatrixMultiplier_multiply
  (JNIEnv *, jobject, jobjectArray, jobjectArray);

#ifdef __cplusplus
}
#endif
#endif

The next step is to write an implementation for this header, in a file with the same name but a ".c" extension. Note the generated function name: Java_, then the fully qualified class name with dots replaced by underscores, then the method name. I hope the comments will help in understanding this code.

#include <stdlib.h>
#include <jni.h>
#include "net_sukharevd_jni_example_MatrixMultiplier.h"

// Multiplies matrix A by matrix B.
JNIEXPORT jobjectArray JNICALL Java_net_sukharevd_jni_example_MatrixMultiplier_multiply
  (JNIEnv* env, jobject obj, jobjectArray a, jobjectArray b) {

     jint size = (*env)->GetArrayLength(env, a);

     // READ MATRIX B (JVM object) TO BUFFER (native C array)
     int k;
     jintArray* b_sub = malloc(size * sizeof(jintArray*));    // Subarray objects of B,
                                                              // stored for cleaning up.
     jint** buf_b = malloc(size * sizeof(jint*));             // Buffer for matrix B.
     for (k = 0; k < size; k++) {
         b_sub[k] = (*env)->GetObjectArrayElement(env, b, k); // Fetch subarray of B.
         buf_b[k] = (*env)->GetIntArrayElements(env, b_sub[k], NULL); // Read subarray to buffer.
     }

     // CREATE RESULT ARRAY
     jclass intArrCls = (*env)->FindClass(env, "[I");         // Find int[] class.
     if (intArrCls == NULL)
         return NULL; /* exception thrown */
     jobjectArray result = (*env)->NewObjectArray(env, size, intArrCls, NULL);
                                                              // Create array of int[].
     if (result == NULL)
         return NULL; /* out of memory error thrown */
     (*env)->DeleteLocalRef( env, intArrCls );                // Delete reference to int[] class.

     // MULTIPLY MATRICES
     int i, j;
     jint* buf_res_row = (jint*)malloc(size * sizeof(jint));  // Buffer for one row of the result.

     for (i = 0; i < size; i++) {
         // READ I-TH ROW FROM A TO BUFFER
         jintArray a_i = (*env)->GetObjectArrayElement(env, a, i);
         jint* buf_a_i = (*env)->GetIntArrayElements(env, a_i, NULL);

         // CALCULATE I-TH ROW OF RESULT
         for (j = 0; j < size; j++) {
             buf_res_row[j] = 0;
             for (k = 0; k < size; k++) {
                 buf_res_row[j] += buf_a_i[k] * buf_b[k][j];
             }
         }

         (*env)->ReleaseIntArrayElements(env, a_i, buf_a_i, 0);
         (*env)->DeleteLocalRef(env, a_i);

         // COPY I-TH ROW TO RESULT OBJECT
         jintArray res_row = (*env)->NewIntArray(env, size);
         if (res_row == NULL)
             return NULL; /* out of memory error thrown */
         (*env)->SetIntArrayRegion(env, res_row, 0, size, buf_res_row); // Firstly copy to new
         (*env)->SetObjectArrayElement(env, result, i, res_row);        // array object, and
         (*env)->DeleteLocalRef(env, res_row);                          // then to matrix.
     }

     // CLEAN UP
     free(buf_res_row);
     for (k = 0; k < size; k++) {
         (*env)->ReleaseIntArrayElements(env, b_sub[k], buf_b[k], 0);
         (*env)->DeleteLocalRef(env, b_sub[k]);
     }
     free(buf_b);
     free(b_sub);

     return result;
}

When the implementation is ready, it can be compiled into a .so shared library file. On Linux this can be done with the following command

gcc -O3 -fPIC -shared -I /usr/lib/jvm/java-7-oracle/include/ \
    -I /usr/lib/jvm/java-7-oracle/include/linux/ \
    -o net_sukharevd_jni_example_MatrixMultiplier.so \
    src/net_sukharevd_jni_example_MatrixMultiplier.c
  • -O3 turns on the compiler's most aggressive optimizations for execution speed.
  • -shared means that the result should be a shared library.
  • -fPIC (Position Independent Code) makes the generated code use relative rather than absolute addressing, which is what shared libraries need.
  • -I tells the compiler where to look for header files (e.g. jni.h).
  • -o specifies where the output file should be saved.

To use gcc on Windows, it's necessary to install it; for Windows, gcc is supplied as part of MinGW (or MinGW-w64 for x64 systems). It's also convenient to have gcc on the PATH, so add MinGW's bin directory to it. Once you have gcc, execute the following in the command line (adjusting the path to your JDK):

gcc -O3 -shared^
    -I "C:\Program Files\Java\jdk1.7.0_25\include"^
    -I "C:\Program Files\Java\jdk1.7.0_25\include\win32"^
    -o net_sukharevd_jni_example_MatrixMultiplier.dll^
    src\net_sukharevd_jni_example_MatrixMultiplier.c

Finally, it's time to run the code

java -cp src/ net.sukharevd.jni.example.MatrixMultiplier

The Java class runs benchmarks for different matrix sizes (N). The results are shown in the table below, where Ratio = tPureJava / tJNI, i.e. the speedup gained from JNI.

   N   tJNI [ms]   tPureJava [ms]   Ratio
 250          16               28    1.75
 500         257              480    1.87
 750       1,288            1,801    1.40
1000       4,126            4,638    1.12
1250       9,513           13,040    1.37
1500      16,267           22,560    1.39
1750      26,781           40,954    1.53
2000      43,172           64,025    1.48
2250      62,540           89,050    1.42
2500      82,934          135,473    1.63
2750     113,912          180,310    1.58
3000     156,210          245,301    1.57
3250     201,585          300,490    1.49
3500     246,014          413,289    1.68
3750     308,626          460,653    1.49
4000     397,392          630,182    1.59

The table shows that the native method gave roughly a 1.5-times speedup, which means it was worth writing all this C code for the sake of performance.

But this isn't the first version of the method; the former one was much less efficient. In fact, the only difference between them is how subarrays are fetched from the matrices. The former version didn't buffer the second matrix; instead, it fetched elements one by one as they were needed

// CALCULATE I-TH ROW OF RESULT
jint buf_b_k[1];                                               // One-element buffer for a single element of B.
for (j = 0; j < size; j++) {
    buf_res_row[j] = 0;
    for (k = 0; k < size; k++) {

        // !!! NOW B ISN'T BUFFERED ENTIRELY !!!
        jintArray b_k = (*env)->GetObjectArrayElement(env, b, k); // Fetch subarray of B
        (*env)->GetIntArrayRegion(env, b_k, j, 1, buf_b_k);       // and copy its j-th element alone.
        (*env)->DeleteLocalRef(env, b_k);

        buf_res_row[j] += buf_a_i[k] * buf_b_k[0];
    }
}

Such a seemingly small change caused serious performance problems, shown in the table below; the Ratio column now shows how many times slower the JNI version is than pure Java.

   N   tJNI [ms]   tPureJava [ms]   Ratio
 250       1,199               30   42.82
 500      10,231              511   20.02
 750      35,587            1,920   18.53
1000      85,362            5,158   16.55
1250     167,264           12,560   13.32
1500     293,473           22,748   12.90
1750     463,501           39,921   11.61
2000     694,096           62,484   11.11
2250     996,630           91,282   10.92
2500   1,361,381          120,622   11.29
2750   1,818,452          189,720    9.58
3000   2,382,053          240,613    9.90

I couldn't have imagined that this change would affect performance by more than a factor of ten! As mentioned in this IBM article, looking up objects in the JVM from native code can be very expensive. In the former version this lookup was done N times more often than necessary. JVM objects should be cached to avoid redundant and expensive lookups, and the amount of work done through them should be minimized. Ignoring these recommendations cost me a tenfold drop in performance.

Using JNI can significantly improve the performance of Java applications. But writing JNI programs is much more involved than writing pure Java implementations. Moreover, there are no guarantees of optimality for such solutions: a poor implementation can produce the opposite result and decrease overall performance.
