ArchXAI project develops AI-powered handwritten text recognition for 19th- and 20th-century Russian
“I wonder what that says,” ponders a researcher unfamiliar with the Russian language and Cyrillic alphabet when exploring a part of the National Archives of Finland’s collection of archival materials. For instance, Orthodox parish church records before the 20th century were written in Russian, as were many documents from the highest authorities during the Grand Duchy of Finland (1809–1917). For genealogists and researchers who don’t know the language, using these sources can be difficult or even impossible.
The ArchXAI project (AI enhanced cross-border archives, 2025–2028) addresses this challenge. Funded by the EU’s Interreg Central Baltic program, the project brings together the national archives of Finland, Estonia, and Latvia, along with the South-Eastern Finland University of Applied Sciences. The goal is to improve archive accessibility and information availability using artificial intelligence. One solution we develop is a handwritten text recognition (HTR) model that reads the Cyrillic alphabet.
Handwritten text recognition is based on machine learning, a form of artificial intelligence. An HTR model is trained to recognize text using training data created by humans. Once trained, it is used to recognize handwritten historical text in digital images. Text-recognized materials become searchable, making them much easier to use. The National Archives of Finland has already developed an HTR model for Finnish and Swedish, which has been used to recognize text in court records from the 1600s to the 1900s.
How to train an HTR model
Developing an HTR model starts with creating training data—transcribing text word for word and character for character. So far, archival experts with Russian language skills have transcribed over 1,700 pages of various Russian-language archival materials, mainly from the 1800s and 1900s. We use the Transkribus platform to transcribe the text and create training data from many different hands and many types of material, so that the model learns to recognize them broadly.
Training of the first version of the Russian HTR model began in October. The training data is exported as XML files, which are used to extract text lines from the digital images and pair each line image with its corresponding transcription. Model training is run on these line images. Training HTR models requires significant computing power, so the fastest approach uses supercomputers. Through the project, we have access to supercomputers at both the Memory Lab at South-Eastern Finland University of Applied Sciences and CSC – IT Center for Science.
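The extraction step described above can be sketched roughly as follows. Transkribus exports follow the PAGE XML schema, in which each text line carries its polygon coordinates and transcription; the sample document and file names below are made-up illustrations, not project data.

```python
# Sketch: turn a PAGE XML export into (line bounding box, transcription) pairs,
# one pair per text line, ready for cropping line images for HTR training.
# Element names (TextLine, Coords, TextEquiv/Unicode) follow the PAGE XML schema
# that Transkribus exports; the sample below is a hypothetical miniature page.
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS = {"p": PAGE_NS}

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Coords points="100,200 900,200 900,260 100,260"/>
        <TextEquiv><Unicode>Метрическая книга</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

def extract_line_pairs(xml_text):
    """Return a list of (bounding_box, transcription) pairs, one per text line."""
    root = ET.fromstring(xml_text)
    pairs = []
    for line in root.iter(f"{{{PAGE_NS}}}TextLine"):
        points = line.find("p:Coords", NS).get("points")
        xs = [int(p.split(",")[0]) for p in points.split()]
        ys = [int(p.split(",")[1]) for p in points.split()]
        bbox = (min(xs), min(ys), max(xs), max(ys))  # crop box for the line image
        text = line.find("p:TextEquiv/p:Unicode", NS).text
        pairs.append((bbox, text))
    return pairs

print(extract_line_pairs(SAMPLE))
# [((100, 200, 900, 260), 'Метрическая книга')]
```

In a real pipeline the bounding box would be used to crop the line out of the page image; here it is returned as plain coordinates to keep the sketch self-contained.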
Once the first version is trained, we test the model’s functionality. When possible, we select test material that hasn’t been used in training to see how well the model reads completely new material. Testing also reveals how much the model still needs development and what kind of training data is needed for the next version.
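Keeping test material out of training, as described above, amounts to a held-out split at the page level. A minimal sketch of such a split, with entirely hypothetical page identifiers:

```python
# Sketch: reserve a fraction of pages as a held-out test set so the model is
# evaluated on handwriting it has never seen during training.
import random

def split_pages(page_ids, test_fraction=0.1, seed=42):
    """Shuffle page identifiers and set aside a fraction as a held-out test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(page_ids)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Hypothetical identifiers for the ~1,700 transcribed pages mentioned above.
pages = [f"page_{i:04d}" for i in range(1, 1701)]
train, test = split_pages(pages)
print(len(train), len(test))  # 1530 170
```

Splitting by whole pages rather than individual lines matters: lines from the same page share one writer's hand, so mixing them across train and test would overstate how well the model reads genuinely new material.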
Training an HTR model is a continuous development process. Performance can be evaluated by calculating the CER (Character Error Rate), which indicates how many characters out of a hundred the model misinterprets on average. The lower the character error rate, the better the text recognition result.
The ArchXAI project enables collaboration among three countries’ national archives and allows us to share expertise in model training and training data creation. The HTR model is trained to read Russian-language archive material from all participating national archives. Much of the material across different archives is similar, as Finland, Estonia, and Latvia share a common history under the Russian Empire.
At the National Archives of Finland, the model will be used to read at least the Orthodox church records and the acts of the Governor-General of Finland’s Office. The HTR model doesn’t translate Russian into other languages like Finnish, but machine transcription makes it much easier to read the text and use various translation tools. The machine-read materials and model source codes will be made freely available to each archive’s customers as the project progresses.