AI Mobile Tensor
Collecting my thoughts some… One of the projects I have been working on lately is a workflow that reads audio off of an Android microphone and feeds it to a TensorFlow model for inference.
In theory this is really simple. Kotlin makes it easy to grab chunks of audio off a listening device and hand them to a model. All of the libraries are easy to use and debug, and the results are easy to handle.
So the question is, why didn’t I do it this way?
It’s about memory management. Under the covers Kotlin runs on the JVM (on Android, its close cousin ART). The JVM is a notorious beast, renowned for its ability to handle memory management all by itself. In the general case this is very true - the Java Virtual Machine is a wonderful thing for your everyday computing tasks.
Unfortunately this task is not your everyday task. This one involves reading large amounts of data that is constantly changing - streaming, in fact. Memory is constantly getting allocated, used, and then deallocated, and the garbage-collection overhead this creates for our friend the JVM is huge.
I was talking to a brilliant mind a few months ago at our Techstars demo day. He shook his head when he heard of the problem that I was facing. Solemnly, he said, to paraphrase, that there were two ways to handle this - the easy way or the right way. The easy way was to leave all of this on the Java side and deal with the fact your app is slow and prone to crashing. The right way was to take this into C++.
In Android, most development is done in a high-level language. There is lots of tooling and a rich set of libraries, and with the latest developments in Kotlin and Jetpack Compose it is almost plug-and-play. I can literally create a new, functional app with a nice usable UI in a day. This is a relatively new development; before, everything was driven by an arcane View inflation system that blended a pile of XML configuration files and code. It was a mess.
So, we have this wonderful high-level environment, but under the covers is a whole basement of hell you can put yourself into. This layer is called the NDK. It uses JNI wrappers around a C interface so that compiled native binaries can be called from an Android app.
That’s right, we are calling out of Java into C/C++. Not just any C++, but super-modern C++17. Awesome stuff. Super powerful. It’s like development dynamite in that I can now blow holes in mountains, or I can blow myself up.
So, now I’m close to the metal, streaming data into vector buffers straight from the NDK and feeding it into a speech recognition library I linked in via CMakeLists. That all took about two weeks to throw together, and it’s fast, memory efficient, and beautiful. Real computer programming at long last.
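For flavor, here is a minimal sketch of what the JNI entry point can look like. The package and class on the Kotlin side (com.example.audio.AudioBridge) and the function name are invented for illustration; the point is only to show the shape of the bridge.

#include <jni.h>
#include <vector>

// Hypothetical JNI entry point: the Kotlin side records audio into a ShortArray
// and hands it to native code, which copies it into a vector it owns.
extern "C" JNIEXPORT void JNICALL
Java_com_example_audio_AudioBridge_feedSamples(JNIEnv *env, jobject /* this */,
                                               jshortArray samples) {
    const jsize count = env->GetArrayLength(samples);
    std::vector<jshort> buffer(count);  // jshort is a 16-bit int, same as int16_t here
    // Copy out of the managed heap so the GC never has to care about this data again.
    env->GetShortArrayRegion(samples, 0, count, buffer.data());
    // ... feed `buffer` into the native audio pipeline ...
}

The key point is that the samples land in native memory the moment they arrive, so the managed heap never holds long-lived audio buffers.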
What is next? How about we load up a neural network model and get it to classify stuff. There are plenty of sound classifiers out there; in fact I have written a few myself. Bringing that power onto an edge device is right about where the bleeding edge of tech is at the moment. Edge AI.
So, we just slap a trained TensorFlow model right onto our device and we’re done, right? The NDK even has a Neural Networks API that we can theoretically leverage. Not so fast.
That fricking API is verbose. It wants you to define everything about a model yourself. How about you waste a week trying to get that to work.
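To give a sense of why, here is a rough sketch of the flavor of that API. This is not working code, just the ceremony around declaring a single float32 input operand; you then repeat it for every weight, bias, intermediate tensor, and operation in the graph.

#include <android/NeuralNetworks.h>
#include <cstdint>

// Sketch only: one operand of a model that NNAPI expects you to describe by hand.
void sketchNnapiModel() {
    ANeuralNetworksModel *model = nullptr;
    ANeuralNetworksModel_create(&model);

    uint32_t dims[2] = {1, 15600};
    ANeuralNetworksOperandType inputType;
    inputType.type = ANEURALNETWORKS_TENSOR_FLOAT32;
    inputType.dimensionCount = 2;
    inputType.dimensions = dims;
    inputType.scale = 0.0f;   // only meaningful for quantized types
    inputType.zeroPoint = 0;
    ANeuralNetworksModel_addOperand(model, &inputType);  // operand 0

    // ...addOperand / setOperandValue / addOperation for the entire graph,
    // then identify inputs and outputs, finish, compile, and execute. No thanks.
    ANeuralNetworksModel_free(model);
}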
OK, so maybe we can just use the TensorFlow library to make this work. Sure. This is the right path, but it takes a little bit of work to get going.
The first issue is that standard TensorFlow models won’t run on the edge. We have to take our TensorFlow model and convert it to a TFLite model. Some models already have a TFLite version; otherwise there is a slew of steps to follow to convert it yourself.
Second issue: how do you load the model? There are a few different ways. The first is to do a lot of configuration and build your own C++ libraries for TensorFlow Lite. A lot of work. The other option is to not do a lot of work and use the C API bindings that TensorFlow Lite ships, which pretty much just work out of the box.
### The header
#pragma once

#include <cstdint>
#include <string>
#include <vector>

#include "tensorflow/lite/c/c_api.h"

// Model constants: samples per inference window and number of output classes.
const int INPUT_TENSOR_SIZE = 15600;
const int OUTPUT_TENSOR_SIZE = 521;
const float DETECT_THRESHOLD = 0.5f;

typedef std::vector<int16_t> AudioBuffer;
typedef std::vector<float> InputTensor;
typedef std::vector<float> OutputTensor;

class AudioDetectorModel {
public:
    AudioDetectorModel(std::string modelPath, float modelThreshold);
    ~AudioDetectorModel();

    // Scale raw PCM samples, run inference, and return the matches as a string.
    std::string accept_waveform(AudioBuffer dataBuffer);
    bool infer(InputTensor input, OutputTensor &output);

private:
    float _threshold;
    TfLiteModel *_model;
    TfLiteInterpreterOptions *_options;
    TfLiteInterpreter *_interpreter;
};
### The implementation
#include "audio_detector_model.h"  // the header above (file name assumed)

#include <algorithm>  // std::min, used in accept_waveform

AudioDetectorModel::AudioDetectorModel(std::string modelPath, float modelThreshold) :
        _threshold(modelThreshold) {
    // Load the .tflite flatbuffer and configure the interpreter.
    _model = TfLiteModelCreateFromFile(modelPath.c_str());
    _options = TfLiteInterpreterOptionsCreate();
    TfLiteInterpreterOptionsSetNumThreads(_options, 2);
    // Create the interpreter.
    _interpreter = TfLiteInterpreterCreate(_model, _options);
    TfLiteInterpreterAllocateTensors(_interpreter);
}

AudioDetectorModel::~AudioDetectorModel() {
    // Tear down in reverse order of creation.
    TfLiteInterpreterDelete(_interpreter);
    TfLiteInterpreterOptionsDelete(_options);
    TfLiteModelDelete(_model);
}
Not too bad; now we have a model loaded. What about the inputs and outputs?
That’s easy as well. You run TensorFlow Lite’s visualize.py script to find your input and output tensors, match your data types, and make sure your buffers are the right size.
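You can also sanity-check those numbers at runtime, since the C API reports a tensor’s type, byte size, and shape directly. A quick sketch (the helper name is mine; it assumes an interpreter like the one built above):

#include <cstdio>

#include "tensorflow/lite/c/c_api.h"

// Print the input tensor's type, byte size, and shape so they can be checked
// against what visualize.py reported.
void dumpInputTensorInfo(const TfLiteInterpreter *interpreter) {
    const TfLiteTensor *t = TfLiteInterpreterGetInputTensor(interpreter, 0);
    std::printf("type=%d bytes=%zu dims=", TfLiteTensorType(t), TfLiteTensorByteSize(t));
    for (int i = 0; i < TfLiteTensorNumDims(t); i++)
        std::printf("%d ", TfLiteTensorDim(t, i));
    std::printf("\n");
}

With the shapes confirmed, the inference call itself is short: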
bool AudioDetectorModel::infer(InputTensor input, OutputTensor &output) {
    // Copy the input samples into the model's input tensor.
    TfLiteTensor *input_tensor = TfLiteInterpreterGetInputTensor(_interpreter, 0);
    if (TfLiteTensorCopyFromBuffer(input_tensor, input.data(),
                                   input.size() * sizeof(float)) != kTfLiteOk)
        return false;
    // Execute inference.
    if (TfLiteInterpreterInvoke(_interpreter) != kTfLiteOk)
        return false;
    // Extract the output tensor data.
    const TfLiteTensor *output_tensor = TfLiteInterpreterGetOutputTensor(_interpreter, 0);
    return TfLiteTensorCopyToBuffer(output_tensor, output.data(),
                                    output.size() * sizeof(float)) == kTfLiteOk;
}
std::string AudioDetectorModel::accept_waveform(AudioBuffer dataBuffer) {
    // Fill the input tensor from the raw PCM samples.
    InputTensor inTensor = InputTensor(INPUT_TENSOR_SIZE);
    int size = std::min(dataBuffer.size(), inTensor.size());
    for (int i = 0; i < size; i++)
        inTensor[i] = dataBuffer[i] / INT16_MAX;

    OutputTensor outTensor = OutputTensor(OUTPUT_TENSOR_SIZE);
    infer(inTensor, outTensor);

    // Collect every class index whose score clears the threshold.
    std::vector<int> matches;
    for (int i = 0; i < OUTPUT_TENSOR_SIZE; i++)
        if (outTensor[i] >= _threshold)
            matches.push_back(i);

    // Format the matches as "[index, score]" pairs.
    std::string result;
    for (const auto &i : matches) {
        if (!result.empty())
            result += ", ";
        result += "[" + std::to_string(i) + ", " + std::to_string(outTensor[i]) + "]";
    }
    return "[" + result + "]";
}
So I say it works now. “Are you sure?” you may ask. “It’s all running on an Android device, and there is no good way to test without some convoluted chain of shipping libraries and running test wrappers,” you may say. Well, all hope is not lost. You can compile your own local copy of TensorFlow Lite and test the model off the device.
As a bonus challenge, can you spot the bug in the above code? The answer is below.
git clone https://github.com/tensorflow/tensorflow.git tensorflow_src
mkdir tflite_build
cd tflite_build
cmake ../tensorflow_src/tensorflow/lite/c -DCMAKE_BUILD_TYPE=Debug
cmake --build . -j
Now you can link against the resulting libtensorflowlite_c.so and test with your favorite framework, like GoogleTest (gtest).
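For example, a smoke test along these lines will exercise the whole class off-device. The model path and test name are placeholders; point it at wherever you keep a local copy of the .tflite file.

#include <gtest/gtest.h>

#include "audio_detector_model.h"  // header shown above (file name assumed)

// Smoke test: the model loads and inference runs on an all-zero input window.
TEST(AudioDetectorModelTest, InferRunsOnZeroInput) {
    AudioDetectorModel model("testdata/sound_classifier.tflite", 0.5f);
    InputTensor input(INPUT_TENSOR_SIZE, 0.0f);
    OutputTensor output(OUTPUT_TENSOR_SIZE);
    EXPECT_TRUE(model.infer(input, output));
}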
Still more issues rear their heads on the horizon: calling back to the app across JNI, loading the model file, keeping models up to date. A slew of tasks, but in the end it was the right way.
At least it seems that way so far. A few more days of testing; hopefully will not turn to weeks.
Next up, event streaming to the cloud. Timeseries databases. Welcome to the future.
For the curious, the bug is:
inTensor[i] = dataBuffer[i] / INT16_MAX;
should be
inTensor[i] = (float)dataBuffer[i] / INT16_MAX;
Not casting causes all the values to come out as zero: dataBuffer[i] is a 16-bit integer and INT16_MAX is an integer constant, so the division happens in integer arithmetic and truncates to zero for every sample smaller in magnitude than INT16_MAX. When I went to test, the model just said ‘Silence’. Oops, easy obvious fix.