March 9, 2023
I kind of liked the kanji dictionaries in which you could draw out a kanji and have it identified, such as this one. This is good, but what if it could be done on a relatively cheap handheld device that isn't a smartphone? This is arguably a pretty dumb idea, since a smartphone would make the whole thing easier and probably more convenient. Either way, I wondered if it could be done on a cheap microcontroller.
Firstly, I did some research on whether this kind of thing would be easier to implement as a hardcoded algorithm or as a neural network. Long story short, writing such an algorithm by hand is not a straightforward task and requires a lot of math and computer science knowledge that I do not have.
The second, more popular approach is just to train a neural network to do this. While this still requires a lot of complicated math, there are several frameworks that help automate this process so that even a layperson like me can train such a neural network.
Now that I had decided on using a neural network for identifying handwritten kanji, I had to pick a framework. PyTorch is quite popular, but I wasn't sure how well it supports low-powered devices with limited resources such as microcontrollers. The other choice was Tensorflow. Tensorflow has a reasonably well-known offshoot, Tensorflow Lite, made for smaller devices such as Raspberry Pis, and it even has a version made specifically for microcontrollers. Because it is better known (and honestly for not much else), I chose to use Tensorflow for this project.
But hold up, where am I going to get training data? After searching on Google for handwritten kanji datasets, I concluded that either I didn't search hard enough, or I wasn't going to find what I needed. This one that I found is alright, but since my model will mainly be reading handwritten copies of printed kanji, which look quite different from these truly handwritten samples, I figured I could make my own dataset.
I didn't have the time to manually draw the thousands of images I needed, so I thought I could just generate them with some code. The approach I ended up using was basically to take a bunch of different typefaces and generate a series of distorted images of each kanji.
Firstly, I had to get a list of kanji to use for the dataset. I ended up literally just copying and pasting this page, then using Notepad++ to get rid of the unnecessary stuff. This is the resulting text file.
Long story short, the program takes each character from the list and renders it to an image using a randomly chosen font. It then applies a randomized (within some limits) matrix transform, converts the image to grayscale, applies a binary threshold, and saves it. This is repeated a number of times for each character.
For mine, I ended up using eight different fonts, 2,500 kanji, and 300 images per kanji, so 750,000 images total. It would be quite hard to find a dataset this large. (Although I am probably overfitting my model here, and typefaces aren't really representative of handwriting, so it's not actually that impressive.)
(Since Tensorflow changes so often, and because I don't really know what I am doing, don't expect to be able to copy and paste any code from here and have it work without modification. Also, these code blocks may be incomplete.)
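For what it's worth, a stripped-down sketch of the kind of generator described above might look something like this. I'm using Pillow here, and the font paths, file names, image size, and distortion limits are all placeholder choices rather than exactly what my script does:

import os
import random
from PIL import Image, ImageDraw, ImageFont

FONT_PATHS = ["font1.ttf", "font2.ttf"]  # placeholder paths to the typefaces
OUTPUT_DIR = "dataset"
IMG_SIZE = 64
IMAGES_PER_KANJI = 300

# Read the kanji list (one long string of characters, ignoring whitespace)
with open("kanji.txt", encoding="utf-8") as f:
    kanji_list = [c for c in f.read() if not c.isspace()]

for kanji in kanji_list:
    class_dir = os.path.join(OUTPUT_DIR, kanji)
    os.makedirs(class_dir, exist_ok=True)

    for i in range(IMAGES_PER_KANJI):
        # Render the character in a randomly chosen font, black on white
        font = ImageFont.truetype(random.choice(FONT_PATHS), size=48)
        img = Image.new("L", (IMG_SIZE, IMG_SIZE), color=255)
        ImageDraw.Draw(img).text((IMG_SIZE // 2, IMG_SIZE // 2), kanji,
                                 font=font, fill=0, anchor="mm")

        # Apply a small randomized affine transform (scale/shear/shift)
        coeffs = (1 + random.uniform(-0.1, 0.1), random.uniform(-0.2, 0.2), random.uniform(-4, 4),
                  random.uniform(-0.2, 0.2), 1 + random.uniform(-0.1, 0.1), random.uniform(-4, 4))
        img = img.transform((IMG_SIZE, IMG_SIZE), Image.AFFINE, coeffs, fillcolor=255)

        # Binary threshold and save
        img = img.point(lambda p: 255 if p > 128 else 0)
        img.save(os.path.join(class_dir, f"{i}.png"))

The important part is that each character gets its own folder of images, which is the layout the training code below expects.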
Using the Tensorflow image classification tutorial, I was able to fairly easily build a model that could distinguish between the images used in the tutorial. (I suppose that is to be expected of a tutorial.)
The training data can be organized like this (the dataset generator script formats its output this way natively):
root
    class1
        data1
        data2
        ...
    class2
        data1
        data2
        ...
    ...
The names of the folders serve as the "labels" here.
A collection of images arranged like this can then be loaded into Tensorflow datasets (one for training, one for validation) using something like this:
train_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_PATH,
    labels='inferred',
    validation_split=0.2,
    subset="training",
    color_mode='grayscale',
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size,
    shuffle=True
)

val_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_PATH,
    labels='inferred',
    validation_split=0.2,
    subset="validation",
    color_mode='grayscale',
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size,
    shuffle=True
)
Prefetch the data (from the Tensorflow example):
train_ds = train_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
val_ds = val_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
As for the model itself, I messed around with its configuration quite a bit. Long story short (again, I do not really know what I am doing!), it seems to be better to have a model with more Conv2D layers rather than bigger ones. After some experimentation, I ended up with something like this:
model = Sequential([
    layers.Rescaling(scale=1.0 / 255, offset=0),
    layers.Conv2D(32, kernel_size=3, activation='relu', padding='same'),
    layers.MaxPooling2D(),
    layers.Dropout(0.1),
    layers.Conv2D(32, kernel_size=3, activation='relu', padding='same'),
    layers.MaxPooling2D(),
    layers.Dropout(0.1),
    layers.Conv2D(32, kernel_size=3, activation='relu', padding='same'),
    layers.MaxPooling2D(),
    layers.Dropout(0.1),
    layers.Conv2D(32, kernel_size=3, activation='relu', padding='same'),
    layers.MaxPooling2D(padding='same'),
    layers.Dropout(0.1),
    layers.Conv2D(128, kernel_size=3, activation='relu', padding='same'),
    layers.MaxPooling2D(padding='same'),
    layers.Dropout(0.1),
    layers.Flatten(),
    layers.Dense(num_classes)
])
The model can then be compiled, trained, and saved.
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

epochs = 5
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs
)

model.save("model.keras")
When I tried this, I noticed that sometimes the model literally just wouldn't "learn" anything from training: the loss usually wouldn't go down, and the whole thing just did not work. I tried generating a new dataset with only 25 different kanji, and with that, it trained perfectly fine. After much head scratching, I eventually landed on a workaround. It turns out that if I train the model on just those 25 classes, save it, then in a different script open the saved model, replace the final Dense layer (since it has one node per class), and train it again with all 2,500 classes, it seems to work just fine.
Modifying the saved model (num_classes is now 2500):
# Load the 25-class model and chop off its final Dense layer
pretrained_model = keras.models.load_model("model.keras")
last_layer_removed_model = keras.Model(inputs=pretrained_model.input,
                                       outputs=pretrained_model.layers[-2].output)

# Attach a new Dense layer sized for all 2500 classes
output = layers.Dense(num_classes)(last_layer_removed_model.output)
model = keras.Model(inputs=pretrained_model.input, outputs=output)
The dataset can be generated and loaded the same way as before (except with all 2,500 kanji), and the model can then be compiled, trained, and saved just like before.
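One thing worth spelling out: the rebuilt model above is a brand new Keras Model object, so it has to be compiled again before training. The retraining step is otherwise the same as before, something like:

# The rebuilt model has no compile state, so compile it again before training
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_ds, validation_data=val_ds, epochs=5)
model.save("model.keras")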
The model we have saved is a full Tensorflow/Keras model, which is normally meant for desktop machines with reasonably good hardware. Since we are trying to run this on a cheap microcontroller, we must convert it to a TFLite model. In addition, all the weights in the model are 32-bit floating point numbers, which are accurate but take up four bytes of memory each and are horrendously slow to compute, especially on low-cost chips. For this reason, the whole model is converted to one with 8-bit integer weights.
This can be done with something like the following, using the 25-kanji dataset from before as a representative sample:
import tensorflow as tf
import keras

DATA_PATH = "25-image-dataset"
batch_size = 64
img_height = 64
img_width = 64

train_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_PATH,
    color_mode='grayscale',
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size,
    shuffle=True
)
train_ds = train_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

# The converter runs a few batches through the model to calibrate the
# int8 quantization ranges
def representative_data_gen():
    for input_value, labels in train_ds.take(50):
        yield [input_value]

model = keras.models.load_model("model.keras")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8  # or tf.uint8
converter.inference_output_type = tf.int8  # or tf.uint8

tflite_model = converter.convert()

with open('output.tflite', 'wb') as f:
    f.write(tflite_model)
Just to get a feel for how good or bad the model actually is, I wrote a short script to test it out. It can be downloaded here.
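The gist of it is just running a few test images through the quantized model with the Python TFLite interpreter. A rough sketch of that (not necessarily the same as the downloadable script, and with placeholder file names) looks like this:

import numpy as np
import tensorflow as tf

# Load the quantized model into the TFLite interpreter
interpreter = tf.lite.Interpreter(model_path="output.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Load a 64x64 grayscale test image as a float array in the 0-255 range
img = tf.keras.utils.load_img("test_kanji.png", color_mode="grayscale",
                              target_size=(64, 64))
x = tf.keras.utils.img_to_array(img)

# The converted model expects int8 inputs, so quantize using the stored parameters
scale, zero_point = input_details["quantization"]
x = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
x = np.expand_dims(x, axis=0)  # add the batch dimension

interpreter.set_tensor(input_details["index"], x)
interpreter.invoke()
scores = interpreter.get_tensor(output_details["index"])[0]

# Class indices correspond to the alphabetically sorted folder names
print("Predicted class index:", int(np.argmax(scores)))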
As for using the model, I mostly followed this tutorial, as well as looking around on the GitHub page of Zack Freedman's Somatic project. Using these, I was able to get my model to run on a few different microcontrollers.
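One detail worth noting: TFLite Micro examples generally embed the .tflite file directly in the firmware as a C byte array (often generated with xxd -i). If you don't have xxd handy, a few lines of Python can do roughly the same thing (array and file names here are arbitrary):

# Dump the .tflite flatbuffer as a C array so it can be compiled into the firmware
with open("output.tflite", "rb") as f:
    data = f.read()

with open("model_data.h", "w") as out:
    # alignas keeps the flatbuffer word-aligned, which is generally recommended
    out.write("alignas(8) const unsigned char model_data[] = {\n")
    for i in range(0, len(data), 12):
        line = ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
        out.write(f"  {line},\n")
    out.write("};\n")
    out.write(f"const unsigned int model_data_len = {len(data)};\n")

That array is what gets handed to the TFLite Micro interpreter on the microcontroller side.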
My original configuration for the model used 128x128 inputs instead of 64x64, and its Conv2D layers were quite a bit larger. It therefore required several megabytes of RAM, which did not fit in any microcontroller's internal memory. Instead, I tried using an ESP32-WROVER module with 4 MB of accessible PSRAM. While this did work, the ESP32's PSRAM is quite slow compared to actual SRAM, and the model ended up taking over two minutes to run. That was not acceptable, so I reduced the input size to the current 64x64 and shrank the Conv2D layers to their current sizes.
This new model fit in just under 170 kB of RAM, which meant it could run on the RPi Pico, a board I'm more familiar with. It turned out that on the RP2040, the new model could run inference in under three seconds, quite the improvement from the two minutes it took before. With a casual 288% overclock, it ran in under 800 ms, suitable for use in an actual device.
With a bunch of C++ code and a 2.4" TFT LCD display, a basic device can be made.