
InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write

Abstract

Digital note-taking is gaining popularity, offering a durable, editable, and easily indexable way of storing notes in the vectorized form, known as digital ink. However, a substantial gap remains between this way of note-taking and traditional pen-and-paper note-taking, a practice still favored by a vast majority. Our work, InkSight, aims to bridge the gap by empowering physical note-takers to effortlessly convert their work (offline handwriting) to digital ink (online handwriting), a process we refer to as Derendering. Prior research on the topic has focused on the geometric properties of images, resulting in limited generalization beyond their training domains. Our approach combines reading and writing priors, allowing training a model in the absence of large amounts of paired samples, which are difficult to obtain. To our knowledge, this is the first work that effectively derenders handwritten text in arbitrary photos with diverse visual characteristics and backgrounds. Furthermore, it generalizes beyond its training domain into simple sketches. Our human evaluation reveals that 87% of the samples produced by our model on the challenging HierText dataset are considered as a valid tracing of the input image and 67% look like a pen trajectory traced by a human.

Publication
ArXiv

🎧 Podcast Summary of the Paper

Listen to an audio summary of this paper, generated by NotebookLM.

Overview

For centuries, handwritten notes have been a powerful tool for personal expression and information storage. In today’s digital world, handwritten notes offer a nostalgic charm but lack the convenience of digital formats—durability, easy indexing, and seamless integration with other digital content.

Now, with InkSight, a system built upon a vision-language model through a collaboration between Google Research and EPFL, we’re taking a significant step toward converting offline handwriting to digital ink (online handwriting).

Left: Offline handwriting. Right: Output digital ink (online handwriting). In every word, character colors transition from red to purple, following the rainbow sequence, ROYGBIV. Within each stroke, the shade progresses from darker to lighter.

What is InkSight?

InkSight is designed to “derender” handwriting from an image into a sequence of digital ink strokes. This conversion allows a digital pen-and-paper experience from a simple photo of handwritten text, avoiding the need for specialized hardware like smart pens or digital paper.

InkSight full-page result on handwritten notes about mass-energy equivalence

Obtaining large datasets of handwritten text with exact stroke information is challenging, so our solution instead combines insights from how people read and write. Let’s first dive into the results of InkSight, and then explore how it works.

✨ Word-level Samples

help

帮助 (help)

Inksight

InkSight

google

谷歌 (google)

google

洛桑 (Lausanne)

CHRISTIANS

CHRISTIANS

October

October

WELCOME

WELCOME

i love you

我爱你 (I love you)

letter

letter

though

though

priming

PRIMING

The

The

regards

regards

about

about

thoughts

thoughts

experiment

experiment

得 (get)

math

math

福 (fortune)

你 (you)

🎨 Sketch Samples

We study the performance of our models on out-of-domain samples: simple sketches. We use the Vanilla Derender inference mode to obtain the inks. We observe that our models generalize to simple sketches, although performance varies across samples.

Eagle

Eagle

Cat

Cat

Cup

Cup

Penguin

Penguin

📝 Full-page Samples

Danke

Danke

Multilingual

Multilingual

Unsplash Frame

Unsplash Frame

Korean example from Unsplash

Korean example from Unsplash



How Does InkSight Work?

InkSight’s model operates on both reading and writing “priors”—knowledge or tendencies humans apply to interpret and recreate text. These priors allow it to generalize across diverse handwriting styles and appearances, which are challenging to standardize in training data.

  1. Reading Prior: The model learns to identify textual elements within varied and complex images, and can be aided by general text recognition capabilities, including OCR.
  2. Writing Prior: This ensures that the output digital ink aligns with natural handwriting dynamics, capturing the order of strokes in an authentic, human-like way.
InkSight Diagram: Detailed explanations for each component are provided below.

By integrating these priors, InkSight can produce robust digital inks that maintain both the semantic (content) and geometric (structure) properties of the handwritten input, making it uniquely adaptable to a wide range of visual conditions, from lighting variations to complex backgrounds.

Comparison between GVS (baseline) and three variants of the InkSight model across various types of input.

The InkSight model architecture combines a Vision Transformer (ViT) encoder with an mT5 encoder-decoder Transformer, resembling the structure of the Pathways Language and Image model (PaLI):

How the InkSight word-level model outputs both text and digital ink through "Recognize and Derender" inference. [gif]
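
To make the data flow concrete, here is a minimal sketch of this PaLI-style forward pass in Python. All names (vit_encoder, mt5, tokenizer) are hypothetical placeholders, not the actual implementation, which is built on T5X/Flax:

def derender(image, task_prompt, vit_encoder, mt5, tokenizer):
    # Encode the image into a sequence of dense patch embeddings.
    image_embeddings = vit_encoder(image)          # [num_patches, d_model]
    # Embed the task-specific input text (e.g. a derendering prompt).
    prompt_tokens = tokenizer.encode(task_prompt)
    # The mT5 encoder consumes both modalities; the decoder then
    # autoregressively predicts a mixed sequence of text and ink tokens.
    output_tokens = mt5.generate(encoder_inputs=(image_embeddings, prompt_tokens))
    return tokenizer.decode(output_tokens)         # text and/or digital ink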

The model is trained using a multi-task setup that includes five task types, allowing it to handle handwriting inputs of various complexities: two derendering tasks (ink output), two recognition tasks (text output), and one mixed task (text-and-ink output). Each task type uses a task-specific input text, enabling the model to distinguish between tasks during both training and inference.

Training tasks mixture.
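
As an illustration of how task-specific input texts could look, the sketch below maps each of the five task types to a distinct prompt. The exact strings are assumptions made for illustration; only the one-prompt-per-task conditioning scheme is taken from the paper:

# Hypothetical task prompts: one distinct prefix per task type, so the
# model can be conditioned on the desired output at inference time.
TASK_PROMPTS = {
    "vanilla_derender":       "Derender the ink.",                 # ink output
    "textual_derender":       "Derender the ink: {text}",          # ink output, text-conditioned
    "recognize_full":         "Recognize the text.",               # text output
    "recognize_region":       "Recognize the text in {region}.",   # text output
    "recognize_and_derender": "Recognize and derender.",           # text + ink output
}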

A further necessary step, and one unique to this modality, is ink tokenization, which represents the sequential nature of digital ink in a format friendly to a large language model (LLM) or a vision-language model (VLM).

To this end, we propose a novel ink tokenizer that converts ink strokes into a sequence of discrete tokens. Each digital ink stroke is normalized by resampling it at a fixed rate, reducing the sequence length with the Ramer-Douglas-Peucker algorithm, and centering it on a fixed-size canvas.
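
A minimal sketch of this normalization step, assuming the third-party rdp package for Ramer-Douglas-Peucker simplification; the sampling distance, epsilon, and canvas size below are illustrative, not the paper’s values:

import numpy as np
from rdp import rdp  # pip install rdp

def normalize_stroke(points, sample_dist=2.0, epsilon=1.0, canvas_size=224):
    pts = np.asarray(points, dtype=np.float32)  # stroke as [n, 2] (x, y) points
    # 1. Resample the stroke at a fixed arc-length rate.
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])
    t_new = np.linspace(0.0, t[-1], max(int(t[-1] / sample_dist) + 1, 2))
    resampled = np.stack(
        [np.interp(t_new, t, pts[:, 0]), np.interp(t_new, t, pts[:, 1])], axis=1)
    # 2. Reduce the sequence length with Ramer-Douglas-Peucker.
    simplified = rdp(resampled, epsilon=epsilon)
    # 3. Center the stroke on a fixed-size canvas.
    center = (simplified.min(axis=0) + simplified.max(axis=0)) / 2.0
    return simplified - center + canvas_size / 2.0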

Ink tokenization starts with a “beginning of stroke” token, followed by tokens encoding the x and y locations of each sampled point. The token dictionary size, which balances rounding error against vocabulary size, is determined by a parameter $N$.

Ink tokenization for a single stroke begins with a “b” token, followed by x and y coordinate tokens for sampled points (1–7) along the stroke, ordered by a color gradient. [gif]
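
The sketch below shows one plausible realization of this scheme; the exact vocabulary layout (a shared “b” token followed by N x-tokens and N y-tokens) is an assumption made for illustration:

def tokenize_stroke(stroke, canvas_size=224, N=224):
    # Assumed vocabulary layout: token 0 is "b" (beginning of stroke),
    # tokens 1..N are quantized x coordinates, tokens N+1..2N are
    # quantized y coordinates. A larger N means finer resolution (less
    # rounding error) but a larger vocabulary.
    BEGIN_STROKE = 0
    tokens = [BEGIN_STROKE]
    for x, y in stroke:
        xi = min(int(x / canvas_size * N), N - 1)  # x bin index
        yi = min(int(y / canvas_size * N), N - 1)  # y bin index
        tokens.append(1 + xi)       # x token
        tokens.append(1 + N + yi)   # y token
    return tokens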

One of the major achievements of InkSight is its ability to move from word-level derendering to full-page derendering. This allows the model to handle entire pages of handwritten notes, identifying and derendering each word individually before seamlessly combining them into a cohesive digital ink document.

Pipeline to scale up to full page. [gif]
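
In outline, the full-page pipeline detects words, derenders each crop with the word-level model, and pastes the resulting strokes back into page coordinates. The component names below (word_detector, word_model) are hypothetical placeholders:

import numpy as np

def derender_page(page_image, word_detector, word_model):
    digital_ink = []
    for box in word_detector(page_image):      # word boxes (left, top, right, bottom)
        crop = page_image.crop(box)            # PIL-style crop of one word
        strokes = word_model.derender(crop)    # word-level derendering
        for stroke in strokes:                 # stroke: [n, 2] array of (x, y)
            # Translate from crop coordinates back to page coordinates.
            digital_ink.append(stroke + np.array([box[0], box[1]]))
    return digital_ink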

Evaluations

InkSight’s performance is measured by both human evaluation and automated metrics. We used human-traced data on the HierText dataset as the control group and the output of our model on the same samples as the experiment group. Evaluators were shown the original image alongside a digital ink that was either model-generated or human-traced; the evaluators were not told which.

Human evaluators rated 87% of InkSight’s outputs as accurate tracings of the input, while 67% were considered realistic enough to have been drawn by a human. The automated metrics further support these results, aligning closely with human judgments.

Human evaluation user interface.
Human evaluation result.

Limitations and Future Directions

While InkSight demonstrates strong capabilities in converting offline handwriting to digital ink, it encounters challenges in certain scenarios. The model can struggle with thick or variable stroke widths and highly ambiguous or distorted text. In full-page derendering, InkSight relies on accurate segmentation to avoid misalignment issues, especially on intricate page layouts. Additionally, minor details like punctuation can sometimes be omitted or duplicated, affecting the fidelity of the digital ink. These limitations highlight areas for future refinement, aiming to boost InkSight’s precision and adaptability across diverse handwriting styles and conditions.

Model Card

Datasets and Training

In-house Training Mixture

Training tasks mixture for in-house models; left: derendering, right: recognition.

Small-p Training Mixture

Small-p training tasks mixture; left: derendering, right: recognition.
Task Type     Dataset                          Number of Samples
Derendering   DeepWriting (words)                       89,565
              DeepWriting (lines)                       33,933
              DeepWriting (characters)                 359,643
              VNonDB                                    66,991
              SCUT-COUCH Chinese characters          1,998,784
              SCUT-COUCH Chinese pinyin                156,535
OCR           IAM word-level (train)                    53,839
              IMGUR5k (train)                          181,792
              RIMES word-level (train)                  51,738
              HierText (train)                           5,978
              ICDAR-2015 (train)                         1,535

Model and Training Summary

Model Architecture A multimodal sequence-to-sequence Transformer model with the mT5 encoder-decoder architecture. It takes text tokens and ViT dense image embeddings as inputs to an encoder and autoregressively predicts discrete text and ink tokens with a decoder.
Input(s) An image paired with input text.
Output(s) Generated digital ink and text.
Usage Application: The model is a research prototype; a public version has been released and is publicly available.
Known Caveats: None.
System Type System Description: This is a standalone model.
Upstream Dependencies: None.
Downstream Dependencies: None.
Implementation Frameworks Hardware & Software: Hardware: TPU v5e.
Software: T5X, JAX/Flax, Flaxformer.
Compute Requirements: We train all of our models for 340k steps with batch size 512. With frozen ViT encoders, the training of Small-i takes ∼33h on 64 TPU v5e chips and the training of Large-i takes ∼105h on 64 TPU v5e chips.
Data Overview Training Datasets: The ViT encoder of Small-p is pretrained on ImageNet-21k; the mT5 encoder and decoder are initialized from scratch. The entire model is trained on the mixture of publicly available datasets described in the previous section.
Evaluation Results Evaluation Methods: Human evaluation (reported in Section 4.5.1 of the paper) and automated evaluations (reported in Section 4.5.2 of the paper).
Model Usage & Limitations Sensitive Use: The model is capable of converting images to digital inks. This model should not be used for any of the privacy-intruding use cases, e.g., forging handwritings.
Known Limitations: Reported in Appendix I of the paper.
Ethical Considerations & Potential Societal Consequences: Reported in Sections 6.1 and 6.2 of the paper.

Acknowledgements

The authors thank Leandro Kieliger, Philippe Schlattner, Anastasiia Fadeeva, Mircea Trăichioiu, Efi Kokiopoulou, Diego Antognini, Henry Rowley, Reeve Ingle, Manuel Drazyk, Sebastian Goodman, Jialin Wu, Xiao Wang, Tom Duerig, and Tomáš Ižo for their help and support.



📃 Citation

If you find our work useful for your research or applications, please consider citing it using the following BibTeX:


@article{mitrevski2024inksight,
  title={InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write},
  author={Mitrevski, Blagoj and Rak, Arina and Schnitzler, Julian and Li, Chengkun and Maksai, Andrii and Berent, Jesse and Musat, Claudiu},
  journal={arXiv preprint arXiv:2402.05804},
  year={2024}
}

Feedback

This blog post was written by Chengkun Li on behalf of all authors.

For questions, feedback, or issues with this post, please contact Chengkun at: lichengkun0805@gmail.com

For questions regarding the paper, please contact corresponding author Andrii at: amaksai@google.com
