<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Correspondence | Chengkun Li</title>
    <link>https://charlieleee.github.io/authors/a-hrefmailtoamaksai@google.com-stylecolor-%230073e6sup/supcorrespondence/a/</link>
      <atom:link href="https://charlieleee.github.io/authors/a-hrefmailtoamaksai@google.com-stylecolor-%230073e6sup/supcorrespondence/a/index.xml" rel="self" type="application/rss+xml" />
    <description>Correspondence</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© 2026 Chengkun Li</copyright><lastBuildDate>Fri, 20 Jun 2025 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://charlieleee.github.io/images/icon_hu_8f13c06844116d5a.png</url>
      <title>Correspondence</title>
      <link>https://charlieleee.github.io/authors/a-hrefmailtoamaksai@google.com-stylecolor-%230073e6sup/supcorrespondence/a/</link>
    </image>
    
    <item>
      <title>InkSight: Offline-to-Online Handwriting Conversion by Teaching Vision-Language Models to Read and Write</title>
      <link>https://charlieleee.github.io/publication/inksight/</link>
      <pubDate>Fri, 20 Jun 2025 00:00:00 +0000</pubDate>
      <guid>https://charlieleee.github.io/publication/inksight/</guid>
      <description>
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-this-work-was-a-collaboration-between-google-researchhttpsresearchgoogle-and-epflhttpsepflchen&#34;&gt;



  &lt;img src=&#34;cor.svg&#34; alt=&#34;&#34; width=&#34;55%&#34; &gt;



  
  
  &lt;figcaption&gt;
    This work was a collaboration between 
&lt;a href=&#34;https://research.google&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Google Research&lt;/a&gt; and 
&lt;a href=&#34;https://epfl.ch/en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;EPFL&lt;/a&gt;
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;div style=&#34;border: 1px solid #d3d3d3; padding: 10px; background-color: #fdf9f3; width: 100%; margin-top: 0px; border-radius: 6px; box-shadow: 0px 2px 6px rgba(0, 0, 0, 0.05); color: #444; font-size: 0.85em; line-height: 1.3;&#34;&gt;
    &lt;h4 style=&#34;margin-top: 0; font-size: 1em; color: #222; font-weight: bold;&#34;&gt;Feedback and correspondence&lt;/h4&gt;
    &lt;p style=&#34;margin: 5px 0;&#34;&gt; For questions, feedback, or issues with this post and the open-source code, please contact Chengkun at &lt;a href=&#34;mailto:chengkun.li@epfl.ch&#34; style=&#34;color: #0073e6; &#34;&gt;chengkun.li@epfl.ch&lt;/a&gt;. For questions regarding the paper, please contact the corresponding author, Andrii, via
        &lt;a href=&#34;https://www.linkedin.com/in/andrii-maksai-7199961b5/&#34; style=&#34;color: #0073e6; &#34;&gt;his LinkedIn profile&lt;/a&gt;.
    &lt;/p&gt;
&lt;/div&gt;
&lt;h1 id=&#34;overview&#34;&gt;Overview&lt;/h1&gt;
&lt;p&gt;For centuries, handwritten notes have been a powerful tool for personal expression and information storage. In today&amp;rsquo;s digital world, handwritten notes offer a nostalgic charm
but lack the convenience of digital formats—durability, easy indexing, and seamless integration with other digital content.&lt;/p&gt;
&lt;p&gt;Now, with &lt;strong&gt;InkSight&lt;/strong&gt;, a system built upon vision-language models, we&amp;rsquo;re taking a significant step toward converting offline handwriting to digital ink formats (online handwriting).
&lt;figure class=&#34;width-medium&#34; id=&#34;figure-left-offline-handwriting-right-output-digital-ink-online-handwriting-in-every-word-character-colors-transition-from-red-to-purple-following-the-rainbow-sequence-roygbiv-within-each-stroke-the-shade-progresses-from-darker-to-lighter&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;inksight.gif&#34; data-caption=&#34;Left: Offline handwriting. Right: Output digital ink (online handwriting). In every word, character colors transition from red to purple, following the rainbow sequence, ROYGBIV. Within each stroke, the shade progresses from darker to lighter.&#34;&gt;


  &lt;img src=&#34;inksight.gif&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Left: Offline handwriting. Right: Output digital ink (online handwriting). In every word, character colors transition from red to purple, following the rainbow sequence, ROYGBIV. Within each stroke, the shade progresses from darker to lighter.
  &lt;/figcaption&gt;


&lt;/figure&gt;&lt;/p&gt;
&lt;h1 id=&#34;what-is-inksight&#34;&gt;What is InkSight?&lt;/h1&gt;
&lt;p&gt;InkSight is designed to &amp;ldquo;&lt;strong&gt;derender&lt;/strong&gt;&amp;rdquo; handwriting from an image into a sequence of digital ink strokes. This conversion allows a digital pen-and-paper experience from a simple photo of handwritten text, avoiding the need for specialized hardware like smart pens or digital paper.
&lt;figure class=&#34;width-medium&#34; id=&#34;figure-inksight-full-page-result-of-handwritten-notes-of-mass-energy-equivalence&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://charlieleee.github.io/publication/inksight/wiki_hu_725187dcc646fd3d.jpg&#34; data-caption=&#34;InkSight full-page result of handwritten notes of mass-energy equivalence.&#34;&gt;


  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/wiki_hu_725187dcc646fd3d.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;1624&#34;&gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    InkSight full-page result of handwritten notes of mass-energy equivalence.
  &lt;/figcaption&gt;


&lt;/figure&gt;&lt;/p&gt;
&lt;p&gt;This solution combines intuitions about how people read and write, sidestepping the difficulty of obtaining large datasets of handwritten text with exact stroke information. Let&amp;rsquo;s first dive into the results of InkSight, and then explore how it works.&lt;/p&gt;
&lt;h3 id=&#34;-word-level-samples&#34;&gt;✍️ Word-level Samples&lt;/h3&gt;
&lt;style&gt;
.gallery-container {
  display: grid;
  grid-template-columns: repeat(auto-fill, minmax(140px, 1fr)); /* Reduced from 160px */
  gap: 16px; /* Reduced from 24px */
  padding: 12px 0; /* Reduced from 20px */
}

.gallery-item {
  text-align: center;
}

.gallery-item img {
  width: 100%;
  height: auto;
  border-radius: 6px; /* Slightly reduced from 8px */
  margin-bottom: 4px; /* Reduced from 8px */
  box-shadow: 0 2px 4px rgba(0,0,0,0.1); /* Slightly reduced shadow */
}

.gallery-item p {
  font-size: 13px; /* Slightly reduced from 14px */
  font-weight: 500;
  margin: 4px 0; /* Adjusted margins */
  color: #333;
  line-height: 1.3; /* Added to keep text compact */
}
&lt;/style&gt;
&lt;div class=&#34;gallery-container&#34;&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/help.gif&#34; alt=&#34;help&#34;&gt;
        &lt;p&gt;帮助 (help)&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/InkSight_gif.gif&#34; alt=&#34;Inksight&#34;&gt;
        &lt;p&gt;InkSight&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/lausanne.gif&#34; alt=&#34;lausanne&#34;&gt;
        &lt;p&gt;洛桑 (Lausanne)&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/christians.gif&#34; alt=&#34;CHRISTIANS&#34;&gt;
        &lt;p&gt;CHRISTIANS&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/october.gif&#34; alt=&#34;October&#34;&gt;
        &lt;p&gt;October&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/welcome.gif&#34; alt=&#34;WELCOME&#34;&gt;
        &lt;p&gt;WELCOME&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/wan.gif&#34; alt=&#34;i love you&#34;&gt;
        &lt;p&gt;我爱你 (I love you)&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/though.gif&#34; alt=&#34;though&#34;&gt;
        &lt;p&gt;though&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/priming.gif&#34; alt=&#34;priming&#34;&gt;
        &lt;p&gt;PRIMING&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/The.gif&#34; alt=&#34;The&#34;&gt;
        &lt;p&gt;The&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/regards.gif&#34; alt=&#34;regards&#34;&gt;
        &lt;p&gt;regards&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/thoughts.gif&#34; alt=&#34;thoughts&#34;&gt;
        &lt;p&gt;thoughts&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/experiment.gif&#34; alt=&#34;experiment&#34;&gt;
        &lt;p&gt;experiment&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/math.gif&#34; alt=&#34;math&#34;&gt;
        &lt;p&gt;math&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/fu.gif&#34; alt=&#34;福&#34;&gt;
        &lt;p&gt;福&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;gallery-item&#34;&gt;
        &lt;img src=&#34;figures/gifs/ni.gif&#34; alt=&#34;你&#34;&gt;
        &lt;p&gt;你&lt;/p&gt;
    &lt;/div&gt;
&lt;/div&gt;
&lt;h3 id=&#34;-full-page-samples&#34;&gt;📝 Full-page Samples&lt;/h3&gt;
&lt;p&gt;
&lt;figure class=&#34;width-medium&#34; id=&#34;figure-ood-input----danke-written-on-the-sand&#34;&gt;



  &lt;img src=&#34;figures/full_page/danke.svg&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;



  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    OOD input &amp;ndash; &amp;ldquo;Danke&amp;rdquo; written in the sand.
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;figure class=&#34;width-medium&#34; id=&#34;figure-multilingual-chinese-english-french-input&#34;&gt;



  &lt;img src=&#34;figures/full_page/multilingual.svg&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;



  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Multilingual (Chinese, English, French) input.
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;figure class=&#34;width-medium&#34; id=&#34;figure-texts-written-in-a-frame-from-unsplash&#34;&gt;



  &lt;img src=&#34;figures/full_page/unsplash_frame.svg&#34; alt=&#34;&#34; width=&#34;100%&#34; &gt;



  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Texts written in a frame, from Unsplash.
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;figure class=&#34;width-medium&#34; id=&#34;figure-korean-example-from-unsplash&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://charlieleee.github.io/publication/inksight/figures/full_page/korean_hu_2742e256307bfbe1.jpg&#34; data-caption=&#34;Korean example from Unsplash.&#34;&gt;


  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/figures/full_page/korean_hu_2742e256307bfbe1.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;1675&#34;&gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Korean example from Unsplash.
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;figure class=&#34;width-medium&#34; id=&#34;figure-sticky-note-example&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://charlieleee.github.io/publication/inksight/figures/full_page/sticky_note_hu_9f6411279d62d549.jpg&#34; data-caption=&#34;Sticky note example.&#34;&gt;


  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/figures/full_page/sticky_note_hu_9f6411279d62d549.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;587&#34;&gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Sticky note example.
  &lt;/figcaption&gt;


&lt;/figure&gt;&lt;/p&gt;
&lt;h1 id=&#34;how-does-inksight-work&#34;&gt;How Does InkSight Work?&lt;/h1&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    We released the inference code and dataset at our 
&lt;a href=&#34;https://github.com/google-research/inksight&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;GitHub Repository&lt;/a&gt;.
  &lt;/div&gt;
&lt;/div&gt;
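&lt;p&gt;As a minimal sketch of what word-level inference can look like (the model path, the serving signature, its keyword names, and the prompt string below are illustrative assumptions; see the repository for the exact entry points):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Minimal sketch of word-level inference. The model path, the serving
# signature, and its keyword names are illustrative assumptions.
import numpy as np
import tensorflow as tf
from PIL import Image

model = tf.saved_model.load(&#34;path/to/InkSight-Small-p&#34;)  # local copy of the weights

# Preprocess a single word crop to the model input size.
image = Image.open(&#34;word.png&#34;).convert(&#34;RGB&#34;).resize((224, 224))
pixels = tf.constant(np.asarray(image)[None, ...], dtype=tf.float32) / 255.0

# &#34;Recognize and derender.&#34; is one of the task prompts discussed below.
outputs = model.signatures[&#34;serving_default&#34;](
    image=pixels, text=tf.constant([&#34;Recognize and derender.&#34;])
)
# The output is a sequence of discrete text and ink tokens; the repository
# utilities decode the ink tokens back into (x, y) strokes.
&lt;/code&gt;&lt;/pre&gt;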

&lt;p&gt;InkSight’s model operates on both &lt;strong&gt;reading&lt;/strong&gt; and &lt;strong&gt;writing&lt;/strong&gt; “priors”—knowledge or tendencies humans apply to interpret and recreate text. These priors allow it to generalize across diverse handwriting styles and appearances, which are challenging to standardize in training data.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Reading Prior:&lt;/strong&gt; The model learns to identify textual elements within varied and complex images, and can be aided by general text recognition capabilities, including OCR.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Writing Prior:&lt;/strong&gt; This ensures that the output digital ink aligns with natural handwriting dynamics, capturing the order of strokes in an authentic, human-like way.&lt;/li&gt;
&lt;/ol&gt;
&lt;figure class=&#34;width-medium&#34; id=&#34;figure-inksight-diagram-detailed-explanations-for-each-component-are-provided-below&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://charlieleee.github.io/publication/inksight/inksight_diagram_hu_3feec436c8697e5.jpg&#34; data-caption=&#34;InkSight Diagram: Detailed explanations for each component are provided below.&#34;&gt;


  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/inksight_diagram_hu_3feec436c8697e5.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;735&#34;&gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    InkSight Diagram: Detailed explanations for each component are provided below.
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;p&gt;By integrating these priors, InkSight can produce robust digital inks that maintain both the semantic (content) and geometric (structure) properties of the handwritten input, making it uniquely adaptable to a wide range of visual conditions, from lighting variations to complex backgrounds.
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-comparison-between-gvs-baselinehttpsmarkmohrgithubiovirtual_sketching-and-3-variants-of-inksight-model-across-various-types-of-input&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://charlieleee.github.io/publication/inksight/result_hu_a25b09ff8ab86d61.jpg&#34; data-caption=&#34;Comparison between 
&amp;lt;a href=&amp;#34;https://markmohr.github.io/virtual_sketching/&amp;#34; target=&amp;#34;_blank&amp;#34; rel=&amp;#34;noopener&amp;#34;&amp;gt;GVS (baseline)&amp;lt;/a&amp;gt; and 3 variants of InkSight model across various types of input.&#34;&gt;


  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/result_hu_a25b09ff8ab86d61.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;70%&#34; height=&#34;2060&#34;&gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Comparison between 
&lt;a href=&#34;https://markmohr.github.io/virtual_sketching/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;GVS (baseline)&lt;/a&gt; and 3 variants of InkSight model across various types of input.
  &lt;/figcaption&gt;


&lt;/figure&gt;&lt;/p&gt;
&lt;p&gt;The InkSight model architecture is based on combining a 
&lt;a href=&#34;https://arxiv.org/abs/2010.11929&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Vision Transformer (ViT)&lt;/a&gt; encoder with an 
&lt;a href=&#34;https://arxiv.org/abs/2010.11934&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;mT5&lt;/a&gt; encoder-decoder Transformer, resembling the structure of 
&lt;a href=&#34;https://sites.research.google/pali/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Pathways Language and Image Model (PaLI)&lt;/a&gt;:&lt;/p&gt;
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-how-the-inksight-word-level-model-outputs-both-text-and-digital-ink-through-recognize-and-derender-inference-gifinksight_animation_gifgif&#34;&gt;

&lt;video autoplay loop muted playsinline style=&#34;width:90%&#34;&gt;
    &lt;source src=&#34;inksight_animation.mp4&#34; type=&#34;video/mp4&#34;&gt;
    
&lt;/video&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    How the InkSight word-level model outputs both text and digital ink through &amp;ldquo;Recognize and Derender&amp;rdquo; inference. 
&lt;a href=&#34;inksight_animation_gif.gif&#34;&gt;[gif]&lt;/a&gt;
  &lt;/figcaption&gt;


&lt;/figure&gt;
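&lt;p&gt;Purely as an illustration of this PaLI-style wiring, here is a rough sketch built from off-the-shelf Hugging Face components; the checkpoints and the projection layer are stand-ins, not the authors&amp;rsquo; T5X/Flax implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Rough sketch of the wiring: a ViT encoder feeding an mT5 encoder-decoder.
# The checkpoints and projection layer are illustrative stand-ins.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration, ViTModel

vit = ViTModel.from_pretrained(&#34;google/vit-base-patch16-224-in21k&#34;)
mt5 = MT5ForConditionalGeneration.from_pretrained(&#34;google/mt5-small&#34;)
tok = AutoTokenizer.from_pretrained(&#34;google/mt5-small&#34;)

# Project ViT patch embeddings into the mT5 embedding space.
proj = torch.nn.Linear(vit.config.hidden_size, mt5.config.d_model)

pixels = torch.rand(1, 3, 224, 224)  # stands in for a preprocessed image
patches = proj(vit(pixel_values=pixels).last_hidden_state)

prompt = tok(&#34;Derender the ink.&#34;, return_tensors=&#34;pt&#34;)
text_emb = mt5.get_input_embeddings()(prompt.input_ids)

# Concatenate image and text embeddings as the encoder input; the decoder
# autoregressively predicts tokens (InkSight extends the vocabulary with
# ink tokens, which this sketch omits).
inputs_embeds = torch.cat([patches, text_emb], dim=1)
out = mt5.generate(inputs_embeds=inputs_embeds, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
&lt;/code&gt;&lt;/pre&gt;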
&lt;p&gt;The model is trained with a multi-task setup spanning five task types, allowing it to handle handwriting inputs of varying complexity: two derendering tasks (ink output), two recognition tasks (text output), and one mixed task (text-and-ink output). Each task type uses a task-specific input text, enabling the model to distinguish between tasks during both training and inference.
&lt;figure class=&#34;width-medium&#34; id=&#34;figure-training-tasks-mixture&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://charlieleee.github.io/publication/inksight/task_mixture_hu_d0b68ac9d3779d62.jpg&#34; data-caption=&#34;Training tasks mixture.&#34;&gt;


  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/task_mixture_hu_d0b68ac9d3779d62.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;825&#34;&gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Training tasks mixture.
  &lt;/figcaption&gt;


&lt;/figure&gt;&lt;/p&gt;
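&lt;p&gt;To make this concrete, the task mixture can be pictured as a small prompt table; the task names and strings below are paraphrases for illustration, not the verbatim inputs from the paper:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import random

# The five task types; names and prompt strings are illustrative
# paraphrases, not the verbatim task inputs from the paper.
TASKS = {
    # two derendering tasks (ink output)
    &#34;derender&#34;:           &#34;Derender the ink.&#34;,
    &#34;derender_with_text&#34;: &#34;Derender the ink given the text: {label}&#34;,
    # two recognition tasks (text output)
    &#34;recognize_image&#34;:    &#34;Recognize the text in the image.&#34;,
    &#34;recognize_ink&#34;:      &#34;Recognize the text in the ink.&#34;,
    # one mixed task (text-and-ink output)
    &#34;recognize_derender&#34;: &#34;Recognize and derender.&#34;,
}

def sample_task(rng=random):
    &#34;&#34;&#34;Pick a task uniformly when assembling a training batch.&#34;&#34;&#34;
    name = rng.choice(list(TASKS))
    return name, TASKS[name]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At inference time, the same task-specific prompts select the model&amp;rsquo;s behavior, which is what the inference modes discussed later correspond to.&lt;/p&gt;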
&lt;p&gt;A crucial and distinctive step for this modality is the ink tokenizer, which discretizes the 2D space and encodes the sequential nature of digital ink in a format compatible with large language models (LLMs) and vision-language models (VLMs).&lt;/p&gt;
&lt;p&gt;To this end, we propose a novel ink tokenizer that converts ink strokes into a sequence of discrete tokens.&lt;/p&gt;
&lt;h1 id=&#34;digital-ink-tokenization&#34;&gt;Digital Ink Tokenization&lt;/h1&gt;
&lt;p&gt;Digital ink is usually represented as a sequence of strokes $I = \{s_1, s_2, \cdots, s_n\}$, where each stroke $s_i$ consists of a sequence of $m_i$ (the length of the $i$-th stroke) coordinate-time triplets, $s_i = \{(x_j, y_j, t_j)\}_{j=1}^{m_i}$. Each digital ink stroke is normalized by resampling it at a fixed rate, reducing the sequence length with the 
&lt;a href=&#34;https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Ramer-Douglas-Peucker algorithm&lt;/a&gt;, and centering it on a fixed-size canvas.&lt;/p&gt;
&lt;p&gt;Ink tokenization starts with a &amp;ldquo;beginning of stroke&amp;rdquo; token, followed by tokens encoding the $x$ and $y$ locations of each sampled point. The token dictionary size is determined by a parameter $N$, which balances rounding error against vocabulary size.&lt;/p&gt;
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-ink-tokenization-for-a-single-stroke-begins-with-a-b-token-followed-by-x-and-y-coordinate-tokens-for-sampled-points-17-along-the-stroke-ordered-by-a-color-gradient-gifink_tokenizergif&#34;&gt;

&lt;video autoplay loop muted playsinline style=&#34;width:80%&#34;&gt;
    &lt;source src=&#34;ink_tokenizer.mp4&#34; type=&#34;video/mp4&#34;&gt;
    
&lt;/video&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Ink tokenization for a single stroke begins with a &amp;ldquo;b&amp;rdquo; token, followed by $x$ and $y$ coordinate tokens for sampled points (1–7) along the stroke, ordered by a color gradient. 
&lt;a href=&#34;ink_tokenizer.gif&#34;&gt;[gif]&lt;/a&gt;
  &lt;/figcaption&gt;


&lt;/figure&gt;
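&lt;p&gt;A minimal sketch of this scheme, assuming a single shared coordinate grid and omitting the fixed-rate resampling and Ramer-Douglas-Peucker steps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Minimal sketch of the ink tokenizer: normalize strokes onto a fixed
# canvas, then emit a beginning-of-stroke token followed by quantized
# x and y tokens. Resampling and RDP simplification are omitted.
N = 224          # grid resolution: trades rounding error for vocab size
BOS_STROKE = 0   # id of the &#34;b&#34; (beginning-of-stroke) token;
                 # x tokens use ids 1..N, y tokens use ids N+1..2N

def normalize(strokes, canvas=N - 1):
    &#34;&#34;&#34;Translate and uniformly scale all strokes onto the canvas.&#34;&#34;&#34;
    xs = [x for s in strokes for (x, y, t) in s]
    ys = [y for s in strokes for (x, y, t) in s]
    x0, y0 = min(xs), min(ys)
    scale = canvas / (max(max(xs) - x0, max(ys) - y0) or 1.0)
    return [[((x - x0) * scale, (y - y0) * scale) for (x, y, t) in s]
            for s in strokes]

def tokenize(strokes):
    tokens = []
    for stroke in normalize(strokes):
        tokens.append(BOS_STROKE)
        for x, y in stroke:
            tokens.append(1 + round(x))      # x token
            tokens.append(1 + N + round(y))  # y token
    return tokens

# A tiny two-stroke ink made of (x, y, t) triplets.
ink = [[(0.0, 0.0, 0.0), (5.0, 9.0, 0.1)], [(2.0, 3.0, 0.2)]]
print(tokenize(ink))
&lt;/code&gt;&lt;/pre&gt;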
&lt;p&gt;One of the major achievements of InkSight is its ability to move from word-level derendering to full-page derendering. This allows the model to handle entire pages of handwritten notes, identifying and derendering each word individually before seamlessly combining them into a cohesive digital ink document.&lt;/p&gt;
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-pipeline-to-scale-up-to-full-page-giffull_page_animationgif&#34;&gt;

&lt;video autoplay loop muted playsinline style=&#34;width:90%&#34;&gt;
    &lt;source src=&#34;full_page_animation.mp4&#34; type=&#34;video/mp4&#34;&gt;
    
&lt;/video&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Pipeline to scale up to full page. 
&lt;a href=&#34;full_page_animation.gif&#34;&gt;[gif]&lt;/a&gt;
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;p&gt;In this project, we use the 
&lt;a href=&#34;https://cloud.google.com/vision/docs/handwriting&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Google Cloud Vision Handwriting Text Detection API&lt;/a&gt; for word-level bounding box detection. However, there are free, open-source alternatives such as 
&lt;a href=&#34;https://github.com/tesseract-ocr/tesseract&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Tesseract OCR&lt;/a&gt; and 
&lt;a href=&#34;https://github.com/mindee/doctr&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;docTR&lt;/a&gt;. We provide code examples of using both alternatives for full-page derendering in our 
&lt;a href=&#34;https://github.com/google-research/inksight&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;
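&lt;p&gt;A sketch of that pipeline using one of the free alternatives (pytesseract for word boxes); &lt;code&gt;derender_word&lt;/code&gt; is a hypothetical stand-in for the word-level model call:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Sketch of the full-page pipeline: detect word boxes with Tesseract,
# crop each word, derender it with the word-level model, and place the
# resulting strokes back at the word&#39;s page coordinates.
import pytesseract
from pytesseract import Output
from PIL import Image

def derender_page(path, derender_word):
    page = Image.open(path)
    data = pytesseract.image_to_data(page, output_type=Output.DICT)
    page_ink = []
    for i, word in enumerate(data[&#34;text&#34;]):
        if not word.strip():
            continue  # skip empty detections
        left, top = data[&#34;left&#34;][i], data[&#34;top&#34;][i]
        width, height = data[&#34;width&#34;][i], data[&#34;height&#34;][i]
        crop = page.crop((left, top, left + width, top + height))
        # Hypothetical word-level model call, assumed to return
        # strokes as lists of (x, y) points in crop coordinates.
        strokes = derender_word(crop)
        # Translate the strokes back into page coordinates.
        page_ink.extend([[(x + left, y + top) for (x, y) in s] for s in strokes])
    return page_ink
&lt;/code&gt;&lt;/pre&gt;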
&lt;h1 id=&#34;highlighted-findings&#34;&gt;Highlighted Findings&lt;/h1&gt;
&lt;h2 id=&#34;human-evaluation-reveals-high-quality-digital-ink-generation&#34;&gt;Human Evaluation Reveals High-Quality Digital Ink Generation&lt;/h2&gt;
&lt;p&gt;Our evaluation combined human assessment with automated metrics to validate InkSight&amp;rsquo;s performance. In a blind study, evaluators compared model outputs against human-traced samples without knowing their source. The results were exciting: &lt;span style=&#34;color: rgb(40, 167, 69);&#34;&gt;87%&lt;/span&gt; of InkSight&amp;rsquo;s outputs were judged as valid tracings of the input text, and remarkably, &lt;span style=&#34;color: rgb(40, 167, 69);&#34;&gt;67%&lt;/span&gt; were deemed indistinguishable from human-generated digital ink.
&lt;figure class=&#34;width-medium&#34; id=&#34;figure-human-evaluation-result&#34;&gt;



  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/eval_hu_3859e115899ce729.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;1598&#34;&gt;



  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Human evaluation result.
  &lt;/figcaption&gt;


&lt;/figure&gt;&lt;/p&gt;
&lt;p&gt;These human evaluations were conducted with 16 digital ink experts, with each sample assessed by three independent raters to ensure reliability (inter-rater reliability κ: 0.44&amp;ndash;0.46).
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-human-evaluation-user-interface&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://charlieleee.github.io/publication/inksight/interface_hu_7d2f3a411fbeaeac.jpg&#34; data-caption=&#34;Human evaluation user interface.&#34;&gt;


  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/interface_hu_7d2f3a411fbeaeac.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;99%&#34; height=&#34;940&#34;&gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Human evaluation user interface.
  &lt;/figcaption&gt;


&lt;/figure&gt;&lt;/p&gt;
&lt;h2 id=&#34;the-role-of-recognition-tasks-reading-in-writing-quality&#34;&gt;The Role of Recognition Tasks (Reading) in Writing Quality&lt;/h2&gt;
&lt;p&gt;An interesting finding from our ablation studies concerns the relationship between recognition and writing quality. When recognition tasks were removed from training &lt;span style=&#34;color: rgb(234, 189, 134);&#34;&gt;(rows highlighted in yellow)&lt;/span&gt;, the model maintained reasonable geometric similarity to input images (F1 scores) but showed a marked decline in semantic consistency. On the HierText dataset, removing recognition tasks reduced accuracy from &lt;span style=&#34;color: rgb(40, 167, 69);&#34;&gt;0.45&lt;/span&gt; to &lt;span style=&#34;color: rgb(220, 53, 69);&#34;&gt;0.13&amp;ndash;0.38&lt;/span&gt;, while F1 scores remained relatively stable. This suggests that recognition training, or &amp;ldquo;reading,&amp;rdquo; plays a key role in producing semantically consistent writing, beyond simple visual pattern matching.&lt;/p&gt;
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-ablation-studies&#34;&gt;



  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/ablations_hu_2fdc6dd5bec0ee07.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;728&#34;&gt;



  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Ablation studies.
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;h2 id=&#34;handling-ambiguous-cases-the-impact-of-inference-mode&#34;&gt;Handling Ambiguous Cases: The Impact of Inference Mode&lt;/h2&gt;
&lt;p&gt;Analysis of challenging cases reveals the critical role of inference strategy in handling ambiguous handwriting. Using &amp;ldquo;DESiGNÉRS&amp;rdquo; as an example: while Vanilla Derender captures basic geometry, it struggles with character precision (rendering &amp;lsquo;E&amp;rsquo; and &amp;lsquo;ÉR&amp;rsquo; as rough strokes). In contrast, Derender with Text maintains precise character structure by leveraging OCR input, while Recognize and Derender produces plausible tracings aligned with its own text recognition (using lowercase &amp;lsquo;e&amp;rsquo; and &amp;lsquo;er&amp;rsquo;). This demonstrates how different levels of textual understanding yield distinct yet valid interpretations of ambiguous handwriting.&lt;/p&gt;
&lt;p&gt;
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-comparison-of-inksight-inference-modes-on-ambiguous-handwriting-samples-column-headers-show-ground-truth-labels-recognized-indicates-inksights-text-recognition-while-ocr-input-shows-external-ocr-system-recognition-used-as-input-to-derender-with-text-inference-mode&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://charlieleee.github.io/publication/inksight/inference_hu_f67b16489d7c11db.jpg&#34; data-caption=&#34;Comparison of InkSight inference modes on ambiguous handwriting samples. Column headers show ground truth labels; &amp;amp;lsquo;Recognized&amp;amp;rsquo; indicates InkSight&amp;amp;rsquo;s text recognition while &amp;amp;lsquo;OCR Input&amp;amp;rsquo; shows external OCR system recognition used as input to Derender with Text inference mode.&#34;&gt;


  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/inference_hu_f67b16489d7c11db.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;1014&#34;&gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Comparison of InkSight inference modes on ambiguous handwriting samples. Column headers show ground truth labels; &amp;lsquo;Recognized&amp;rsquo; indicates InkSight&amp;rsquo;s text recognition while &amp;lsquo;OCR Input&amp;rsquo; shows external OCR system recognition used as input to Derender with Text inference mode.
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;figure class=&#34;width-medium&#34; id=&#34;figure-comparison-of-inksight-inference-modes-on-more-ambiguous-handwriting-samples&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://charlieleee.github.io/publication/inksight/ambiguous_hu_baaf26eeac490ec9.jpg&#34; data-caption=&#34;Comparison of InkSight inference modes on more ambiguous handwriting samples.&#34;&gt;


  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/ambiguous_hu_baaf26eeac490ec9.jpg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;1502&#34;&gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Comparison of InkSight inference modes on more ambiguous handwriting samples.
  &lt;/figcaption&gt;


&lt;/figure&gt;&lt;/p&gt;
&lt;h2 id=&#34;a-novel-data-source-for-handwriting-recognition&#34;&gt;A Novel Data Source for Handwriting Recognition&lt;/h2&gt;
&lt;p&gt;Our experiments with handwriting recognition yielded informative results. Training with &lt;span style=&#34;color: rgb(192, 131, 88);&#34;&gt;IAM derendered&lt;/span&gt; ink produced a Character Error Rate of &lt;span style=&#34;color: rgb(192, 131, 88);&#34;&gt;7.8%&lt;/span&gt;, performing worse than the &lt;span style=&#34;color: rgb(111, 142, 187);&#34;&gt;6.1%&lt;/span&gt; baseline established using &lt;span style=&#34;color: rgb(111, 142, 187);&#34;&gt;IAMOnDB&lt;/span&gt;. However, when combining both &lt;span style=&#34;color: rgb(111, 142, 187);&#34;&gt;IAMOnDB&lt;/span&gt; and &lt;span style=&#34;color: rgb(192, 131, 88);&#34;&gt;IAM derendered&lt;/span&gt; data, the CER improved to &lt;span style=&#34;color: rgb(134, 174, 146);&#34;&gt;4.6%&lt;/span&gt;. This suggests that derendered ink, while not matching real data quality for training recognizer models in isolation, can serve as valuable complementary data for training recognition systems, opening new possibilities for improving handwriting recognition systems where high-quality digital ink data is scarce.&lt;/p&gt;
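&lt;p&gt;For reference, the Character Error Rate (CER) is the character-level edit distance between the prediction and the reference, normalized by the reference length; a minimal implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Character Error Rate: Levenshtein distance over characters divided
# by the reference length.
def cer(ref, hyp):
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(cer(&#34;handwriting&#34;, &#34;handwritng&#34;))  # one deletion, about 0.09
&lt;/code&gt;&lt;/pre&gt;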
&lt;p&gt;
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-online-handwriting-recognition-evaluation-results-on-iamondb-test-set-between-3-setups&#34;&gt;



  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/cer_hu_e5dab15b602d699f.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;95%&#34; height=&#34;719&#34;&gt;



  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Online handwriting recognition evaluation results on IAMOnDB test set between 3 setups.
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-visualization-of-two-data-sources&#34;&gt;



  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/preview_hu_d77820e8f49f9b5c.jpeg&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;200&#34;&gt;



  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Visualization of two data sources.
  &lt;/figcaption&gt;


&lt;/figure&gt;&lt;/p&gt;
&lt;h1 id=&#34;limitations-and-future-directions&#34;&gt;Limitations and Future Directions&lt;/h1&gt;
&lt;p&gt;While InkSight demonstrates strong capabilities in converting offline handwriting to digital ink, it encounters challenges in certain scenarios. The model can struggle with thick or variable stroke widths and highly ambiguous or distorted text. In full-page derendering, InkSight relies on accurate segmentation to avoid misalignment issues, especially on intricate page layouts. Additionally, minor details like punctuation can sometimes be omitted or duplicated, affecting the fidelity of the digital ink. These limitations highlight areas for future refinement, aiming to boost InkSight’s precision and adaptability across diverse handwriting styles and conditions.&lt;/p&gt;
&lt;h1 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;InkSight introduces a novel approach to converting offline handwritten text into online digital ink. The system achieves this without requiring paired training data, making it readily applicable in real-world scenarios. Our evaluation demonstrates the model&amp;rsquo;s ability to handle diverse inputs, from basic handwriting to simple sketches, while maintaining semantic consistency and natural stroke dynamics.&lt;/p&gt;
&lt;p&gt;Key aspects that set this work apart include its use of standard architecture components, well-designed training methodology, and ability to process full pages of handwritten notes. We make the 
&lt;a href=&#34;https://huggingface.co/Derendering/InkSight-Small-p&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Small-p model weights&lt;/a&gt;, 
&lt;a href=&#34;https://github.com/google-research/inksight?tab=readme-ov-file&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;inference code&lt;/a&gt;, and a curated 
&lt;a href=&#34;https://huggingface.co/datasets/Derendering/InkSight-Derenderings&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;dataset&lt;/a&gt; of synthetic and human-traced digital ink publicly available to support further research in this area.&lt;/p&gt;
&lt;p&gt;While current limitations include constraints on input length and sketch complexity, the framework establishes a foundation for bridging the gap between physical and digital note-taking, opening new possibilities for handwriting digitization and recognition systems.&lt;/p&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Please refer to our 
&lt;a href=&#34;https://www.alphaxiv.org/abs/2402.05804&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;paper&lt;/a&gt; for more details.
  &lt;/div&gt;
&lt;/div&gt;

&lt;h1 id=&#34;data-mixture-model-card-and-training&#34;&gt;Data Mixture, Model Card, and Training&lt;/h1&gt;
&lt;h2 id=&#34;in-house-training-mixture&#34;&gt;In-house Training Mixture&lt;/h2&gt;
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-training-tasks-mixture-for-in-house-models-left-derendering-right-recognition&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://charlieleee.github.io/publication/inksight/inhouse_mixture_hu_ec216b46b5749c27.png&#34; data-caption=&#34;Training tasks mixture for in-house models; left: derendering, right: recognition.&#34;&gt;


  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/inhouse_mixture_hu_ec216b46b5749c27.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;907&#34;&gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Training tasks mixture for in-house models; left: derendering, right: recognition.
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;h2 id=&#34;small-p-training-mixture&#34;&gt;Small-p Training Mixture&lt;/h2&gt;
&lt;figure class=&#34;width-normal&#34; id=&#34;figure-small-p-training-tasks-mixture-left-derendering-right-recognition&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://charlieleee.github.io/publication/inksight/mixture_hu_9eb8ce5d197e9078.png&#34; data-caption=&#34;Small-p training tasks mixture; left: derendering, right: recognition.&#34;&gt;


  &lt;img data-src=&#34;https://charlieleee.github.io/publication/inksight/mixture_hu_9eb8ce5d197e9078.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;100%&#34; height=&#34;840&#34;&gt;
&lt;/a&gt;


  
  
  &lt;figcaption data-pre=&#34;Figure &#34; data-post=&#34;:&#34; class=&#34;numbered&#34;&gt;
    Small-p training tasks mixture; left: derendering, right: recognition.
  &lt;/figcaption&gt;


&lt;/figure&gt;
&lt;table style=&#34;width:100%; border-collapse: collapse; font-family: Arial, sans-serif;&#34;&gt;
    &lt;tr&gt;
        &lt;th style=&#34;width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;Task Type&lt;/th&gt;
        &lt;th style=&#34;border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;Dataset&lt;/th&gt;
        &lt;th style=&#34;border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;Number of Samples&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th style=&#34;border: 1px solid #333; padding: 10px; text-align: center;&#34; rowspan=&#34;6&#34;&gt;Derendering&lt;/th&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;DeepWriting (words)&lt;/td&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px; text-align: right;&#34;&gt;89,565&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;DeepWriting (lines)&lt;/td&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px; text-align: right;&#34;&gt;33,933&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;DeepWriting (characters)&lt;/td&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px; text-align: right;&#34;&gt;359,643&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;VNonDB&lt;/td&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px; text-align: right;&#34;&gt;66,991&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;SCUT-COUCH Chinese characters&lt;/td&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px; text-align: right;&#34;&gt;1,998,784&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;SCUT-COUCH Chinese pinyin&lt;/td&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px; text-align: right;&#34;&gt;156,535&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th style=&#34;border: 1px solid #333; padding: 10px; text-align: center;&#34; rowspan=&#34;5&#34;&gt;OCR&lt;/th&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;IAM word-level (train)&lt;/td&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px; text-align: right;&#34;&gt;53,839&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;IMGUR5k (train)&lt;/td&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px; text-align: right;&#34;&gt;181,792&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;RIMES word-level (train)&lt;/td&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px; text-align: right;&#34;&gt;51,738&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;HierText (train)&lt;/td&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px; text-align: right;&#34;&gt;5,978&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;ICDAR-2015 (train)&lt;/td&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px; text-align: right;&#34;&gt;1,535&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;
&lt;h2 id=&#34;model-and-training-summary&#34;&gt;Model and Training Summary&lt;/h2&gt;
&lt;table style=&#34;width:100%; border-collapse: collapse; font-family: Arial, sans-serif;&#34;&gt;
    &lt;tr&gt;
        &lt;th style=&#34;width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;Model Architecture&lt;/th&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;A multimodal sequence-to-sequence Transformer model with the mT5 encoder-decoder architecture. It takes text tokens and ViT dense image embeddings as inputs to an encoder and autoregressively predicts discrete text and ink tokens with a decoder.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th style=&#34;width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;Input(s)&lt;/th&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;A pair of image and text.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th style=&#34;width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;Output(s)&lt;/th&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;Generated digital ink and text.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th style=&#34;width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;Usage&lt;/th&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;
            &lt;strong&gt;Application:&lt;/strong&gt; The model is a research prototype; the public version is &lt;a href=&#34;https://huggingface.co/Derendering/InkSight-Small-p&#34;&gt;released&lt;/a&gt; and available to the public.&lt;br&gt;
            &lt;strong&gt;Known Caveats:&lt;/strong&gt; None.
        &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th style=&#34;width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;System Type&lt;/th&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;
            &lt;strong&gt;System Description:&lt;/strong&gt; This is a standalone model.&lt;br&gt;
            &lt;strong&gt;Upstream Dependencies:&lt;/strong&gt; None.&lt;br&gt;
            &lt;strong&gt;Downstream Dependencies:&lt;/strong&gt; None.
        &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th style=&#34;width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;Implementation Frameworks&lt;/th&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;
            &lt;strong&gt;Hardware &amp; Software:&lt;/strong&gt; Hardware: TPU v5e.&lt;br&gt;
            Software: T5X, JAX/Flax, Flaxformer.&lt;br&gt;
            &lt;strong&gt;Compute Requirements:&lt;/strong&gt; We train all of our models for 340k steps with batch size 512. With frozen ViT encoders, the training of Small-i takes ∼33h on 64 TPU v5e chips and the training of Large-i takes ∼105h on 64 TPU v5e chips.
        &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th style=&#34;width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;Data Overview&lt;/th&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;
            &lt;strong&gt;Training Datasets:&lt;/strong&gt; The ViT encoder of Small-p is pretrained on ImageNet-21k, while the mT5 encoder and decoder are initialized from scratch. The entire model is trained on the mixture of publicly available datasets described in the previous section.
        &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th style=&#34;width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;Evaluation Results&lt;/th&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;
            &lt;strong&gt;Evaluation Methods:&lt;/strong&gt; Human evaluation (reported in Section 4.5.1 of the paper) and automated evaluations (reported in Section 4.5.2 of the paper).
        &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th style=&#34;width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;&#34;&gt;Model Usage &amp; Limitations&lt;/th&gt;
        &lt;td style=&#34;border: 1px solid #333; padding: 10px;&#34;&gt;
            &lt;strong&gt;Sensitive Use:&lt;/strong&gt; The model is capable of converting images to digital inks. This model should not be used for any of the privacy-intruding use cases, e.g., forging handwritings.&lt;br&gt;
            &lt;strong&gt;Known Limitations:&lt;/strong&gt; Reported in Appendix I of the paper.&lt;br&gt;
            &lt;strong&gt;Ethical Considerations &amp; Potential Societal Consequences:&lt;/strong&gt; Reported in Sections 6.1 and 6.2 of the paper.
        &lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;
&lt;h1 id=&#34;acknowledgements&#34;&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;The authors thank Leandro Kieliger, Philippe Schlattner, Anastasiia Fadeeva, Mircea Trăichioiu, Efi Kokiopoulou, Diego Antognini, Henry Rowley, Reeve Ingle, Manuel Drazyk, Sebastian Goodman, Jialin Wu, Xiao Wang, Tom Duerig, and Tomáš Ižo for their help and support.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
