Digital note-taking is gaining popularity, offering a durable, editable, and easily indexable way of storing notes in vectorized form, known as digital ink. However, a substantial gap remains between this way of note-taking and traditional pen-and-paper note-taking, a practice still favored by a vast majority. Our work, InkSight, aims to bridge the gap by empowering physical note-takers to effortlessly convert their work (offline handwriting) to digital ink (online handwriting), a process we refer to as Derendering. Prior research on the topic has focused on the geometric properties of images, resulting in limited generalization beyond their training domains. Our approach combines reading and writing priors, allowing us to train a model in the absence of large amounts of paired samples, which are difficult to obtain. To our knowledge, this is the first work that effectively derenders handwritten text in arbitrary photos with diverse visual characteristics and backgrounds. Furthermore, it generalizes beyond its training domain to simple sketches. Our human evaluation reveals that 87% of the samples produced by our model on the challenging HierText dataset are considered a valid tracing of the input image and 67% look like a pen trajectory traced by a human.
For centuries, handwritten notes have been a powerful tool for personal expression and information storage. In today’s digital world, handwritten notes offer a nostalgic charm but lack the convenience of digital formats—durability, easy indexing, and seamless integration with other digital content.
Now, with InkSight, a system built upon a vision-language model and developed in collaboration between Google Research and EPFL, we are taking a significant step toward converting offline handwriting to digital ink (online handwriting).
InkSight is designed to “derender” handwriting from an image into a sequence of digital ink strokes. This conversion allows a digital pen-and-paper experience from a simple photo of handwritten text, avoiding the need for specialized hardware like smart pens or digital paper.
This solution combines insights from how people read and write, even though obtaining large datasets of handwritten text with exact stroke information is challenging. Let’s first dive into the results of InkSight, and then explore how it works.
[Word-level derendering examples: 帮助 (help), InkSight, 谷歌 (Google), 洛桑 (Lausanne), CHRISTIANS, October, WELCOME, 我爱你 (I love you), letter, though, PRIMING, The, regards, about, thoughts, experiment, 得 (get), math, 福 (fortune), 你 (you)]
We study the performance of our models on out-of-domain samples: simple sketches. We use the Vanilla Derender inference mode to obtain the inks. We observe that our models generalize to simple sketches, although performance varies across samples.
[Sketch examples: Eagle, Cat, Cup, Penguin]
[Additional examples: Danke (thank you), Multilingual, Unsplash Frame, Korean example from Unsplash]
InkSight’s model operates on both reading and writing “priors”—knowledge or tendencies humans apply to interpret and recreate text. These priors allow it to generalize across diverse handwriting styles and appearances, which are challenging to standardize in training data.
By integrating these priors, InkSight can produce robust digital inks that maintain both the semantic (content) and geometric (structure) properties of the handwritten input, making it uniquely adaptable to a wide range of visual conditions, from lighting variations to complex backgrounds.
The InkSight model architecture combines a Vision Transformer (ViT) encoder with an mT5 encoder-decoder Transformer, resembling the structure of the Pathways Language and Image Model (PaLI):
The model is trained using a multi-task setup with five task types, allowing it to handle handwriting inputs of various complexities: two derendering tasks (ink output), two recognition tasks (text output), and one mixed task (text-and-ink output). Each task uses a task-specific input text, enabling the model to distinguish between tasks during both training and inference.
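As a rough illustration, the task-specific input text can be thought of as a prompt that selects the output modality. The task names and prompt strings below are hypothetical placeholders chosen to mirror the task split described above, not the exact prompts used to train InkSight:

```python
# Hypothetical task prompts -- placeholder wording, not the strings from the paper.
# Each training example pairs an image with one of these input texts; the target
# is ink tokens, text tokens, or both, depending on the task.
TASK_PROMPTS = {
    # Derendering tasks: the target output is digital ink tokens.
    "derender_vanilla":       "Derender the ink in this image.",
    "derender_with_text":     "Derender the ink for the text: {text}",
    # Recognition tasks: the target output is text tokens.
    "recognize_full":         "Recognize all the text in this image.",
    "recognize_word":         "Recognize the word in this image.",
    # Mixed task: the target output contains both text and ink tokens.
    "recognize_and_derender": "Recognize and derender the ink in this image.",
}

def build_input(task: str, image, text: str = "") -> dict:
    """Pair an image with its task-specific input text (sketch only)."""
    return {"image": image, "input_text": TASK_PROMPTS[task].format(text=text)}
```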
A further necessary step, and a unique one for this modality, is the ink tokenizer, which represents the sequential nature of digital ink in a format that is friendly to a large language model (LLM) or a vision-language model (VLM).
To this end, we propose a novel ink tokenizer that converts ink strokes into a sequence of discrete tokens. Each digital ink stroke is normalized by resampling it at a fixed rate, reducing the sequence length with the Ramer-Douglas-Peucker algorithm, and centering it on a fixed-size canvas.
Ink tokenization starts with a “beginning of stroke” token, followed by tokens encoding the x and y locations of each sampled point. The token dictionary size, which balances rounding error with vocabulary size, is determined by a parameter $N$.
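Below is a minimal sketch of this tokenization scheme. It assumes the points have already been resampled, simplified with Ramer-Douglas-Peucker, and normalized to a fixed canvas; the token layout and the value of `N` are illustrative assumptions, not the exact vocabulary used in the paper.

```python
# Illustrative token layout (an assumption for this sketch):
#   token 0           -> "beginning of stroke"
#   tokens 1 .. N     -> quantized x coordinate (N bins)
#   tokens N+1 .. 2N  -> quantized y coordinate (N bins)
N = 224           # number of coordinate bins; larger N -> less rounding error, bigger vocabulary
BOS_STROKE = 0    # "beginning of stroke" token id

def tokenize_stroke(points, canvas_size=1.0):
    """Convert one stroke (list of (x, y) points normalized to [0, canvas_size])
    into a flat list of discrete tokens."""
    tokens = [BOS_STROKE]
    for x, y in points:
        # Quantize each coordinate into one of N bins.
        xi = min(int(x / canvas_size * N), N - 1)
        yi = min(int(y / canvas_size * N), N - 1)
        tokens.append(1 + xi)          # x tokens occupy ids 1 .. N
        tokens.append(1 + N + yi)      # y tokens occupy ids N+1 .. 2N
    return tokens

def detokenize(tokens, canvas_size=1.0):
    """Inverse mapping: recover strokes as lists of (x, y) points,
    up to quantization error (bin centers)."""
    strokes, current = [], None
    it = iter(tokens)
    for tok in it:
        if tok == BOS_STROKE:
            current = []
            strokes.append(current)
            continue
        x_tok, y_tok = tok, next(it)
        x = (x_tok - 1 + 0.5) / N * canvas_size
        y = (y_tok - 1 - N + 0.5) / N * canvas_size
        current.append((x, y))
    return strokes

# Example: a single two-point stroke on a unit canvas.
stroke = [(0.10, 0.20), (0.50, 0.80)]
print(tokenize_stroke(stroke))
print(detokenize(tokenize_stroke(stroke)))
```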
One of the major achievements of InkSight is its ability to move from word-level derendering to full-page derendering. This allows the model to handle entire pages of handwritten notes, identifying and derendering each word individually before seamlessly combining them into a cohesive digital ink document.
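Conceptually, the page-level pipeline crops each detected word, derenders it with the word-level model, and shifts the resulting strokes back into page coordinates. The sketch below assumes two hypothetical helpers, `detect_words` and `derender_word`, standing in for a text detector and the word-level model; it is not the released implementation.

```python
def derender_page(image, detect_words, derender_word):
    """Derender a full page word by word and stitch the strokes together.

    `image` is assumed to be a numpy-style array indexed as image[y, x];
    `detect_words(image)` yields word bounding boxes (x0, y0, x1, y1);
    `derender_word(crop)` returns strokes in the crop's local coordinates.
    """
    page_ink = []
    for (x0, y0, x1, y1) in detect_words(image):
        crop = image[y0:y1, x0:x1]                 # word-level image patch
        local_strokes = derender_word(crop)        # strokes in crop coordinates
        for stroke in local_strokes:
            # Translate each stroke back into page coordinates so all words
            # line up on the original page layout.
            page_ink.append([(x + x0, y + y0) for (x, y) in stroke])
    return page_ink
```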
InkSight’s performance is measured both by human evaluation and automated metrics. We used human-traced data on the HierText dataset as the control group and the output of our model on these samples as the experiment group. Evaluators were shown the original image alongside a digital ink that was either model-generated or human-traced, without being told which.
Human evaluators rated 87% of InkSight’s outputs as accurate tracings of the input, while 67% were considered realistic enough to have been drawn by a human. The automated metrics further support these results, aligning closely with human judgments.
While InkSight demonstrates strong capabilities in converting offline handwriting to digital ink, it encounters challenges in certain scenarios. The model can struggle with thick or variable stroke widths and highly ambiguous or distorted text. In full-page derendering, InkSight relies on accurate segmentation to avoid misalignment issues, especially on intricate page layouts. Additionally, minor details like punctuation can sometimes be omitted or duplicated, affecting the fidelity of the digital ink. These limitations highlight areas for future refinement, aiming to boost InkSight’s precision and adaptability across diverse handwriting styles and conditions.
| Task Type | Dataset | Number of Samples |
|---|---|---|
| Derendering | DeepWriting (words) | 89,565 |
| | DeepWriting (lines) | 33,933 |
| | DeepWriting (characters) | 359,643 |
| | VNonDB | 66,991 |
| | SCUT-COUCH Chinese characters | 1,998,784 |
| | SCUT-COUCH Chinese pinyin | 156,535 |
| OCR | IAM word-level (train) | 53,839 |
| | IMGUR5k (train) | 181,792 |
| | RIMES word-level (train) | 51,738 |
| | HierText (train) | 5,978 |
| | ICDAR-2015 (train) | 1,535 |
| Model Architecture | A multimodal sequence-to-sequence Transformer model with the mT5 encoder-decoder architecture. It takes text tokens and ViT dense image embeddings as inputs to an encoder and autoregressively predicts discrete text and ink tokens with a decoder. |
|---|---|
| Input(s) | A pair of image and text. |
| Output(s) | Generated digital ink and text. |
| Usage | Application: The model is a research prototype; a public version has been released and is available to the public. Known Caveats: None. |
| System Type | System Description: This is a standalone model. Upstream Dependencies: None. Downstream Dependencies: None. |
| Implementation Frameworks | Hardware & Software: Hardware: TPU v5e. Software: T5X, JAX/Flax, Flaxformer. Compute Requirements: We train all of our models for 340k steps with batch size 512. With frozen ViT encoders, the training of Small-i takes ∼33h on 64 TPU v5e chips and the training of Large-i takes ∼105h on 64 TPU v5e chips. |
| Data Overview | Training Datasets: The ViT encoder of Small-p is pretrained on ImageNet-21k; the mT5 encoder and decoder are initialized from scratch. The entire model is trained on the mixture of publicly available datasets described in the previous section. |
| Evaluation Results | Evaluation Methods: Human evaluation (reported in Section 4.5.1 of the paper) and automated evaluations (reported in Section 4.5.2 of the paper). |
| Model Usage & Limitations | Sensitive Use: The model is capable of converting images to digital inks. This model should not be used for any privacy-intruding use cases, e.g., forging handwriting. Known Limitations: Reported in Appendix I of the paper. Ethical Considerations & Potential Societal Consequences: Reported in Sections 6.1 and 6.2 of the paper. |
The authors thank Leandro Kieliger, Philippe Schlattner, Anastasiia Fadeeva, Mircea Trăichioiu, Efi Kokiopoulou, Diego Antognini, Henry Rowley, Reeve Ingle, Manuel Drazyk, Sebastian Goodman, Jialin Wu, Xiao Wang, Tom Duerig, and Tomáš Ižo for their help and support.
If you find our work useful for your research or applications, please consider citing it using the following BibTeX:
@article{mitrevski2024inksight,
title={InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write},
author={Mitrevski, Blagoj and Rak, Arina and Schnitzler, Julian and Li, Chengkun and Maksai, Andrii and Berent, Jesse and Musat, Claudiu},
journal={arXiv preprint arXiv:2402.05804},
year={2024}
}
This blog post was written by Chengkun Li on behalf of all authors.
For questions, feedback, or issues with this post, please contact Chengkun at: lichengkun0805@gmail.com
For questions regarding the paper, please contact corresponding author Andrii at: amaksai@google.com