Using JNI for calculations

I've had plenty of free time lately. Out of boredom, I decided to try something new and interesting, something I probably wouldn't use at my job. Something like JNI.

JNI enables programmers to call native code from Java and vice versa. It can be used to improve the performance of calculations and to manipulate hardware directly in ways that are sometimes impossible in pure Java (for instance, working directly with the graphics adapter).

I've always wondered how to connect Java with code written in other languages to improve performance. So I decided to write a very simple JNI-based program for multiplying matrices and run my own benchmarks comparing JNI and pure Java performance.

First of all, it's necessary to write a Java class that has native methods. In our case it's MatrixMultiplier. You can find this class, as well as the whole example project, in this GitHub repository.

Let's start by declaring the native method. It can't have a body, because it's implemented in a different language in a different file. (In GWT it's possible to insert JavaScript code inside specially formatted comments, but that trick works only because Google provides its own compiler for GWT.) I'm going to implement the native method in C, so there's no body in the Java code.

private native int[][] multiply(int[][] a, int[][] b);

The pureJavaMultiply method does the same, but in pure Java.

private int[][] pureJavaMultiply(int[][] a, int[][] b) {
    int[][] res = new int[a.length][a.length];
    for (int i = 0; i < a.length; i++) {
        for (int j = 0; j < a.length; j++) {
            res[i][j] = 0;
            for (int k = 0; k < a.length; k++) {
                res[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    return res;
}

To call the native method it's necessary to load the shared library from a file. The System.load method is used for this purpose; it requires an absolute path, hence getAbsolutePath below. On Windows, replace .so with .dll.

File sharedLibrary = new File("net_sukharevd_jni_example_MatrixMultiplier.so");
System.load(sharedLibrary.getAbsolutePath());
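
To put the pieces together, the declaration and the loading code might sit in the class roughly like this. This is a sketch, not the exact class from the repository: the static-initializer placement and the os.name check are my own additions.

package net.sukharevd.jni.example;

import java.io.File;

public class MatrixMultiplier {

    static {
        // Assumption: choose the library extension by OS; the post itself just
        // says to replace .so with .dll on Windows.
        String ext = System.getProperty("os.name").toLowerCase().contains("win") ? ".dll" : ".so";
        File sharedLibrary = new File("net_sukharevd_jni_example_MatrixMultiplier" + ext);
        System.load(sharedLibrary.getAbsolutePath());
    }

    private native int[][] multiply(int[][] a, int[][] b);

    // pureJavaMultiply and the benchmarking code also live here (see the repository).
}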

The rest of the MatrixMultiplier class is dedicated to measuring the performance of the two multiplication methods.
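
The exact benchmarking code is in the repository; a minimal sketch of such a timing loop could look like the following. The randomMatrix helper and the loop bounds are illustrative assumptions, not necessarily what the project uses.

public static void main(String[] args) {
    MatrixMultiplier multiplier = new MatrixMultiplier();
    for (int n = 250; n <= 4000; n += 250) {
        int[][] a = randomMatrix(n);
        int[][] b = randomMatrix(n);

        long start = System.currentTimeMillis();
        multiplier.multiply(a, b);                    // native implementation
        long tJni = System.currentTimeMillis() - start;

        start = System.currentTimeMillis();
        multiplier.pureJavaMultiply(a, b);            // pure Java implementation
        long tPureJava = System.currentTimeMillis() - start;

        System.out.println(n + "\t" + tJni + "\t" + tPureJava);
    }
}

// Hypothetical helper: fills an n-by-n matrix with small random integers.
private static int[][] randomMatrix(int n) {
    java.util.Random rnd = new java.util.Random();
    int[][] m = new int[n][n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            m[i][j] = rnd.nextInt(100);
    return m;
}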

Now it's necessary to compile this Java class

javac src/net/sukharevd/jni/example/MatrixMultiplier.java

and generate a C header file from it using javah (-d sets the output directory for the generated header, -cp the classpath)

javah -d src/ -cp src/ net.sukharevd.jni.example.MatrixMultiplier

As a result, a new header file will appear in the src folder, declaring the multiply function

/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class net_sukharevd_jni_example_MatrixMultiplier */

#ifndef _Included_net_sukharevd_jni_example_MatrixMultiplier
#define _Included_net_sukharevd_jni_example_MatrixMultiplier
#ifdef __cplusplus
extern "C" {
#endif
/*
 * Class:     net_sukharevd_jni_example_MatrixMultiplier
 * Method:    multiply
 * Signature: ([[I[[I)[[I
 */
JNIEXPORT jobjectArray JNICALL Java_net_sukharevd_jni_example_MatrixMultiplier_multiply
  (JNIEnv *, jobject, jobjectArray, jobjectArray);

#ifdef __cplusplus
}
#endif
#endif

The next step is to write an implementation for this header, in a file with the same name but a ".c" extension. Note the generated function name: Java_, then the fully qualified class name with dots replaced by underscores, then the method name. I hope the comments will help in understanding this code.

#include <stdlib.h>
#include <jni.h>
#include "net_sukharevd_jni_example_MatrixMultiplier.h"

// Multiplies matrix A by matrix B.
JNIEXPORT jobjectArray JNICALL Java_net_sukharevd_jni_example_MatrixMultiplier_multiply
  (JNIEnv* env, jobject obj, jobjectArray a, jobjectArray b) {

     jint size = (*env)->GetArrayLength(env, a);

     // READ MATRIX B (JVM object) TO BUFFER (native C array)
     int k;
     jintArray* b_sub = malloc(size * sizeof(jintArray*));    // Subarray objects of B,
                                                              // stored for cleaning up.
     jint** buf_b = malloc(size * sizeof(jint*));             // Buffer for matrix B.
     for (k = 0; k < size; k++) {
         b_sub[k] = (*env)->GetObjectArrayElement(env, b, k); // Fetch subarray of B.
         buf_b[k] = (*env)->GetIntArrayElements(env, b_sub[k], NULL); // Read subarray to buffer.
     }

     // CREATE RESULT ARRAY
     jclass intArrCls = (*env)->FindClass(env, "[I");         // Find int[] class.
     if (intArrCls == NULL)
         return NULL; /* exception thrown */
     jobjectArray result = (*env)->NewObjectArray(env, size, intArrCls, NULL);
                                                              // Create array of int[].
     if (result == NULL)
         return NULL; /* out of memory error thrown */
     (*env)->DeleteLocalRef( env, intArrCls );                // Delete reference to int[] class.

     // MULTIPLY MATRICES
     int i, j;
     jint* buf_res_row = (jint*)malloc(size * sizeof(jint));  // Buffer for one row of the result.

     for (i = 0; i < size; i++) {
         // READ I-TH ROW FROM A TO BUFFER
         jintArray a_i = (*env)->GetObjectArrayElement(env, a, i);
         jint* buf_a_i = (*env)->GetIntArrayElements(env, a_i, NULL);

         // CALCULATE I-TH ROW OF RESULT
         for (j = 0; j < size; j++) {
             buf_res_row[j] = 0;
             for (k = 0; k < size; k++) {
                 buf_res_row[j] += buf_a_i[k] * buf_b[k][j];
             }
         }

         (*env)->ReleaseIntArrayElements(env, a_i, buf_a_i, 0);
         (*env)->DeleteLocalRef(env, a_i);

         // COPY I-TH ROW TO RESULT OBJECT
         jintArray res_row = (*env)->NewIntArray(env, size);
         if (res_row == NULL)
             return NULL; /* out of memory error thrown */
         (*env)->SetIntArrayRegion(env, res_row, 0, size, buf_res_row); // Firstly copy to new
         (*env)->SetObjectArrayElement(env, result, i, res_row);        // array object, and
         (*env)->DeleteLocalRef(env, res_row);                          // then to matrix.
     }

     // CLEAN UP
     free(buf_res_row);
     for (k = 0; k < size; k++) {
         (*env)->ReleaseIntArrayElements(env, b_sub[k], buf_b[k], 0);
         (*env)->DeleteLocalRef(env, b_sub[k]);
     }
     free(buf_b);
     free(b_sub);

     return result;
}

When the implementation is ready, it can be compiled into a .so shared library file. On Linux this can be done with the following command

gcc -O3 -fPIC -shared -I /usr/lib/jvm/java-7-oracle/include/ \
    -I /usr/lib/jvm/java-7-oracle/include/linux/ \
    -o net_sukharevd_jni_example_MatrixMultiplier.so \
    src/net_sukharevd_jni_example_MatrixMultiplier.c
  • -O3 turns on the compiler's most aggressive optimizations for execution speed.
  • -shared means that the result should be a shared library.
  • -fPIC (Position Independent Code) makes the generated code use relative rather than absolute addressing, which is what shared libraries need.
  • -I tells the compiler where to look for header files (e.g. jni.h).
  • -o specifies where the output file should be saved.

To use gcc on Windows, it's necessary to install it; for Windows, gcc is supplied as part of MinGW (or MinGW-w64 for x64 systems). It's also convenient to have gcc on the PATH, so add MinGW's bin directory to it. Once you have gcc, execute the following in the command line (adjusting the path to your JDK):

gcc -O3 -shared^
    -I "C:\Program Files\Java\jdk1.7.0_25\include"^
    -I "C:\Program Files\Java\jdk1.7.0_25\include\win32"^
    -o net_sukharevd_jni_example_MatrixMultiplier.dll^
    src\net_sukharevd_jni_example_MatrixMultiplier.c

Finally, it's time to run the code

java -cp src/ net.sukharevd.jni.example.MatrixMultiplier

The Java class runs benchmarks for different matrix sizes (N). The results are shown in the table below, where Ratio = tPureJava / tJNI, i.e. the speedup gained from JNI.

   N   tJNI [ms]   tPureJava [ms]   Ratio
 250          16               28    1.75
 500         257              480    1.87
 750       1,288            1,801    1.40
1000       4,126            4,638    1.12
1250       9,513           13,040    1.37
1500      16,267           22,560    1.39
1750      26,781           40,954    1.53
2000      43,172           64,025    1.48
2250      62,540           89,050    1.42
2500      82,934          135,473    1.63
2750     113,912          180,310    1.58
3000     156,210          245,301    1.57
3250     201,585          300,490    1.49
3500     246,014          413,289    1.68
3750     308,626          460,653    1.49
4000     397,392          630,182    1.59

The table shows that the native method gave roughly a 1.5-times speedup, which means it was worth writing all this C code for the sake of performance.

But this isn't the first version of the method; the former one was much less efficient. In fact, the only difference between them is how subarrays are fetched from the matrices. The former version didn't buffer the second matrix; instead, it fetched elements one by one as they were needed

// CALCULATE I-TH ROW OF RESULT
jint buf_b_k[1];                                               // One-element buffer for a single element of B.
for (j = 0; j < size; j++) {
    buf_res_row[j] = 0;
    for (k = 0; k < size; k++) {

        // !!! NOW B ISN'T BUFFERED ENTIRELY !!!
        jintArray b_k = (*env)->GetObjectArrayElement(env, b, k); // Fetch subarray of B
        (*env)->GetIntArrayRegion(env, b_k, j, 1, buf_b_k);       // and copy its j-th element alone.
        (*env)->DeleteLocalRef(env, b_k);

        buf_res_row[j] += buf_a_i[k] * buf_b_k[0];
    }
}

Such a seemingly small change caused serious performance problems, shown in the table below; the Ratio column now shows how many times slower the JNI version is than pure Java.

   N   tJNI [ms]   tPureJava [ms]   Ratio
 250       1,199               30   42.82
 500      10,231              511   20.02
 750      35,587            1,920   18.53
1000      85,362            5,158   16.55
1250     167,264           12,560   13.32
1500     293,473           22,748   12.90
1750     463,501           39,921   11.61
2000     694,096           62,484   11.11
2250     996,630           91,282   10.92
2500   1,361,381          120,622   11.29
2750   1,818,452          189,720    9.58
3000   2,382,053          240,613    9.90

I couldn't have imagined that this change would affect performance by more than a factor of ten! As mentioned in this IBM article, looking up objects in the JVM from native code can be very expensive. In the former version this lookup was done N times more often than necessary. JVM objects should be cached to avoid redundant and expensive lookups, and the amount of work done through them should be minimized. Ignoring these recommendations cost me a tenfold drop in performance.

Using JNI can significantly improve the performance of Java applications. But writing JNI programs is much more involved than writing pure Java implementations. Moreover, there are no guarantees of optimality for such solutions: a poor implementation can produce the opposite result and decrease overall performance.
