Historical informatics - rubric New methods and techniques of processing historical sources

Abstract: The subject of this research is software methods for the automated preprocessing of historical sources and the development of effective solutions to problems that arise when working with sources of personal origin. The article analyzes the current use of modern software methods. The authors set out the main arguments for treating such historical sources separately from a technical point of view. A methodological analysis of applying optical character recognition (OCR) to preprocessed data is carried out. Special attention is paid to the advantages of automated text processing and to the key parameters that determine the quality of the final result, including the subsequent use of OCR methods. The scientific novelty of the research lies in the proposal and detailed description of a software solution to this problem based on machine learning methods. The developed program processes digital copies of sources of personal origin in three phases. It is built on the OpenCV library and solves a number of tasks using the Hough transform. Overall, the study highlights the main advantages of automated preprocessing of scanned documents: reduced processing time, improved accuracy, correction of distortion and optimization of the workflow. The results of successful testing of the developed solution indicate the areas in which it can be applied effectively.
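
The abstract names the Hough transform but does not say which preprocessing task it solves. A common use in document preprocessing is skew estimation, and the sketch below illustrates only that idea. It is a dependency-free stand-in, not the authors' program: OpenCV provides the same technique as `cv2.HoughLines`, while the function name `dominant_line`, the one-degree angle step and the sample points are assumptions made for illustration.

```python
import math
from collections import Counter


def dominant_line(points):
    """Vote in (theta, rho) space and return the strongest line.

    Each point (x, y) votes for every line x*cos(theta) + y*sin(theta) = rho
    that passes through it, with theta sampled in one-degree steps.
    Returns (theta_degrees, rho) of the accumulator cell with most votes.
    """
    votes = Counter()
    for x, y in points:
        for theta_deg in range(180):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(theta_deg, rho)] += 1
    (theta_deg, rho), _count = votes.most_common(1)[0]
    return theta_deg, rho


# Hypothetical ink pixels lying on a perfectly horizontal text baseline.
baseline = [(x, 10) for x in range(0, 200, 5)]
theta_deg, rho = dominant_line(baseline)
skew = theta_deg - 90  # theta is the angle of the line's normal
```

For a horizontal baseline the normal points straight down the page, so `theta_deg` is 90 and the estimated skew is 0. A page would then be rotated by the negative of the estimated skew before OCR.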

DOI: 10.7256/2585-7797.2023.2.40601

Abstract: The subject of the study is a methodology for analyzing the electronic content of social networks (forums) as a historical source. The material for analysis was the discussion of the 1917 revolution during the centenary of that event. The aim of the study was to test approaches to working with large arrays of online texts and the possible combination of two of them: quantitative analysis tools (distant reading) and traditional methods of working with historical text (slow reading). For the "distant reading", topic modeling is performed with the LDA (latent Dirichlet allocation) and LSA (latent semantic analysis) algorithms in the R programming environment in RStudio (version 4.2.1). For the "slow reading", the entire text is analyzed directly. The novelty of the research lies in applying topic modeling in the R programming environment to such sources in conjunction with classical methods of analyzing historical texts. The study tests a methodology for analyzing the content of social networks (forums) that is aimed at large arrays of text which cannot physically be read in full, or even in significant part, by a researcher relying solely on traditional ways of working with a corpus of sources. A step-by-step research algorithm is proposed: the researcher first analyzes the text by "distant reading" methods, identifying topics that consist of terms (words). These keywords are then used to find the text fragments in which the identified topic was discussed most actively, and those fragments are analyzed in more detail using traditional methods of working with a text source.
The study also proposes a way to improve the quality with which the LDA algorithm identifies the topics a researcher needs in social networks and forums: a large text is first split into fragments, and the fragments are then analyzed by LDA as separate documents.
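
The two mechanical steps of this algorithm, splitting a large text into fragments and then locating the fragments where a topic's keywords are densest, can be sketched as follows. This is an illustration in Python, not the R code used in the study. The fragment size, the function names and the sample sentence are assumptions, and the LDA step that would produce the keywords is not shown.

```python
def split_into_fragments(text, words_per_fragment=200):
    """Cut a long text into consecutive fragments of a fixed word count.

    Each fragment can then be passed to a topic model as a separate document.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_fragment])
        for i in range(0, len(words), words_per_fragment)
    ]


def rank_fragments(fragments, keywords, top_n=3):
    """Return indices of the fragments that mention the keywords most often.

    Fragments with no keyword hits are left out; ties keep document order.
    """
    keys = {k.lower() for k in keywords}
    scored = []
    for idx, fragment in enumerate(fragments):
        tokens = (w.strip(".,;:!?\"'()") for w in fragment.lower().split())
        hits = sum(1 for token in tokens if token in keys)
        scored.append((hits, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for hits, idx in scored[:top_n] if hits > 0]


# Hypothetical forum text and topic keywords, for illustration only.
forum_text = "a b revolution c d e tsar revolution f"
fragments = split_into_fragments(forum_text, words_per_fragment=3)
to_read_closely = rank_fragments(fragments, ["revolution", "tsar"])
```

The indices in `to_read_closely` point to the fragments a researcher would then read in full during the "slow reading" stage.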

DOI: 10.7256/2585-7797.2020.2.33446

Abstract: The article studies the diachronic corpus of the Buryat language, compiled from annals written in Old Mongolian that are used to reconstruct the history and historical geography of the Buryat people. In this regard, the article discusses the main problems of semantic markup of corpus data. The size of the corpus currently exceeds 82,000 words. The research novelty is that classical Mongolian texts presented in Latin transliteration are addressed by computational linguistics methods for the first time. The author describes approaches to developing the ontological outline of the historical and cultural subject area and identifies the elements of the kinship and geographical context. A simulation experiment in MS Access and SQL demonstrates the advantages of the authority control methodology, in particular the “family” and “place” categories, for the initial analysis of corpus data and the formation of semantic clusters. The use of authority records has significantly accelerated the accumulation of empirical data for automating the substantive analysis of texts in the corpus. These experiments allowed the author to outline further steps for creating and improving the semantic markup tools of the Buryat diachronic corpus and for turning the corpus into a convenient tool for historical research.

DOI: 10.7256/2585-7797.2023.4.68943

Abstract: The paper describes the results of determining the haplogroups of two 12th-century burials from the middle reaches of the Klyazma. The data obtained make it possible to determine the Y-haplogroup and the mitochondrial haplogroup from the markers identified in the study. The article describes the bioinformatics methods used and the result obtained. With high probability, the result assigns Y-haplogroup I1-Z58 to burial No. 26 and mitochondrial haplogroup H1-146C (most probably H1m1) to burial No. 25. This work summarizes the initial stage of research undertaken in 2019-2020 and continued in other works by the team of authors. Some of the results have already been published; the mitochondrial DNA from burial No. 25 is published for the first time and completes the series of DNA data from the described group of burials from ancient Klyazma settlements published by the authors earlier. Modern technologies make it possible to extract DNA and test it by various methods, including determination of the Y-chromosome haplogroup and of the mitochondrial DNA haplogroup. The first evidence of mitochondrial haplogroup H1-146C (burial No. 25) and Y-haplogroup I1-Z58 (burial No. 26) among the Klyazma population of North-Eastern Russia in the 12th century not only confirms the presence of haplogroup H1 in medieval Russian lands (inhabited by descendants of the Eastern Slavs), but also indicates that some genetic unity with the western parts of the Slavic area may have existed at that time.

DOI: 10.7256/2585-7797.2023.4.69120

Abstract: The purpose of the database is to reconstruct the "ceremonial" and "non-official" portraits of students of the state labor reserves (using the example of visual images of students of educational institutions of the Sverdlovsk region in the 1940s and 1950s). Analysis of the information in the database will in future make it possible to answer a number of research questions and to identify important characteristics of the visual image and social portrait of students: types of activity, including various aspects of work and study and the "bodily activity" of students; acts of human interaction and non-verbal communication (gestures, facial expressions, body poses, etc.); the objects of material culture in use; everyday behavioral stereotypes reconstructed through series of photographs; and "atypical experience", that is, descriptions of deviant groups of students, irregular clothing and atypical behavior. The database design takes into account the concept of visual images in historical research by L.N. Mazur, the digitalization of visual anthropology by D. Zeitlyn, the nonverbal semiotics of G. Kreidlin, and the "thick description" of C. Geertz.
The results of the study are: 1) development and description of a database structure that accommodates the features of visual sources and is aimed at reconstructing the "ceremonial" and "non-official" portraits of students of educational institutions of the Sverdlovsk region in the 1940s and 1950s through a detailed description of the poses, gestures, visual behavior, spatial interaction, clothes and shoes of the persons depicted in the photographs; 2) primary analysis of 145 photographs from the official albums of 4 educational institutions documenting the results of their participation in the All-Union Socialist Competition in 1943-1945; 3) more accurate identification and systematization of the external behavioral practices of students on the basis of the database; 4) demonstration of how detailed description of images in the database can reveal individual aspects of the "non-official portrait" of students. The results can be used in studies of the everyday life and socio-cultural portrait of students in the Soviet period.

DOI: 10.7256/2585-7797.2024.1.69789

Abstract: The article discusses the problems of developing electronic scientific publications of sources and of creating electronic standards based on them. As a sample for such a publication, M. Yu. Andreicheva suggests turning to the "Tale of Bygone Years", a narrative monument that, on the one hand, can best demonstrate the features of a whole complex of sources (chronicles) and, on the other hand, has been studied well enough to serve as an example for demonstrating in detail the linguistic, source-critical and textological capabilities of the electronic publishing model being created. The author presents the concept of a portal dedicated to the electronic edition of the Primary Chronicle. The basis of the publication should be the hypertext of the Tale, that is, a text with a system of internal hyperlinks that make it possible to present visually its manuscript copies, translations and original handwritten form, its stratification in textological stemmata, and its textual and semantic intersections with other monuments of the era under study. The electronic scientific publication of "The Tale of Bygone Years" will take the form of an open semantic network whose content will be updated as the monument is studied further. In the future, an indexed electronic scientific journal may be created on its basis to publish works devoted to the study of the Tale (PVL) and the history of Ancient Rus'. Electronic scientific publishing worldwide has ultimately become a full-fledged research platform that takes the study of chronicle texts to a new research and technological level. Once a working model has been created, it can be tested on other types of sources. The project may result in a builder tool for electronic publications of various levels (scientific, popular science, etc.).

DOI: 10.7256/2585-7797.2018.1.25686

Abstract: The article addresses the transcription of handwritten materials from the 1950 Norwegian Population Census, which comprise 801,000 scanned double-sided questionnaires. Optical character recognition programs have been improving for over four decades, and researchers now aim to extend similar techniques to handwritten historical source material. The article analyzes studies carried out by the Center of Historical Documents at the University of Tromsø that address handwritten text recognition, and considers the use of various text recognition techniques for nominative sources. Since it is difficult to distinguish and separate individual handwritten characters, words are mathematically clustered according to image similarity or searched for within sources that have been transcribed earlier. After recognition quality control, the software uses the line numbers to place the information taken from the transcribed cells, which then becomes part of the census database. In addition, special software has been developed to process handwritten numerical codes, data on occupations and education, etc. The methods described in the article improve the quality of handwritten text transcription and can be used to recognize entries in nominative sources in Russia, for instance parish registers and vital records. The main goals remain the search for methods and algorithms that optimally link different variables and the rationalization of interactive proofreading methods.

DOI: 10.7256/2585-7797.2023.1.40387

Abstract: This article presents an attempt to apply NLP methods to optimize text recognition for historical sources. Any researcher who decides to use tools for recognizing scanned text will face limits on the accuracy of the pipeline (the sequence of recognition operations). Even well-trained models can produce significant errors because of the unsatisfactory state of the surviving source: cuts, bends, blots and erased letters all interfere with high-quality recognition. Our proposal is to take a predetermined set of words that mark the presence of the study topic and to use the fuzzy sets module from spaCy to restore words that were recognized with mistakes. To check the quality of the text recovery procedure on a sample of 50 issues of the newspaper, we estimated the number of words that would be excluded from semantic analysis because of incorrect recognition. All metrics were also calculated using fuzzy set patterns. It turned out that on average 119.6 words per issue (mean over 50 issues) contain misprints caused by incorrect recognition. Using fuzzy set algorithms, we managed to restore these words and include them in the semantic analysis.
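
The idea of restoring misrecognized topic words against a predetermined vocabulary can be illustrated with a short sketch. It does not reproduce the authors' spaCy-based pipeline: it substitutes the standard-library `difflib` similarity ratio for the fuzzy matching step, and the vocabulary, the OCR output and the 0.8 cutoff are all hypothetical values chosen for the example.

```python
import difflib


def restore_keywords(tokens, vocabulary, cutoff=0.8):
    """Replace tokens that closely resemble a vocabulary word with that word.

    Tokens with no sufficiently similar vocabulary entry are kept unchanged.
    """
    restored = []
    for token in tokens:
        match = difflib.get_close_matches(
            token.lower(), vocabulary, n=1, cutoff=cutoff
        )
        restored.append(match[0] if match else token)
    return restored


# Hypothetical topic vocabulary and OCR output with typical recognition errors.
topic_words = ["strike", "factory", "workers"]
ocr_tokens = ["The", "w0rkers", "at", "the", "factery", "went", "on", "str1ke"]
clean_tokens = restore_keywords(ocr_tokens, topic_words)
```

The cutoff is the main design choice: a lower value recovers more damaged words but also risks rewriting unrelated words into topic words, which would inflate the counts used in the semantic analysis.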

DOI: 10.7256/2585-7797.2020.1.32103

Abstract: A present-day task of historical GIS is to georeference old maps within the modern coordinate system. Such maps inevitably contain many inaccuracies, so there is a need for algorithms that account for these inaccuracies and allow sources to be positioned with the smallest deformations and defects. This task is also relevant for the Russian plans of the General Survey, whose distinctive feature is that they record accurate geodetic characteristics of plots. The research subject is a set of Nizhny Novgorod plans of the late 18th century, which served as the basis for a technique for reconstructing the city borders and land survey plans. The research methodology is based on the principles of historicism, systematicity and objectivity. The authors emphasize the role of statistical methods and apply specifically historical methods (historical-typological and historical-genetic), the geodetic method for processing and adjusting traverses, modeling and cartometry. The research novelty lies in the algorithm for reconstructing city borders and historical land survey plans, in the technological solutions for studying the object with geodetic software, and in new data on land management and the cartographic materials based on its results in a specific region of Russia. The main result is the positioning of the borders of Nizhny Novgorod in a conditional coordinate system. The traverses of the plots studied were found to have significant angular and linear errors: 3°29' and 1/31 for settlement plots, 2°49' and 1/80 for pasture plots, and 0°37' and 1/139 for Blagoveshchenskiy Monastery. A raster land survey plan of Nizhny Novgorod has been produced. It can be further used for georeferencing and for creating a historical GIS.
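
Error pairs such as 3°29' and 1/31 have the form of an angular misclosure and a relative linear misclosure of a closed traverse. The abstract does not give the authors' formulas, so the sketch below assumes the standard textbook definitions: the angular misclosure is the sum of the measured interior angles minus (n - 2)·180°, and the relative linear misclosure is the closing gap divided by the perimeter. The function names and the sample traverse are hypothetical.

```python
import math


def angular_misclosure(interior_angles_deg):
    """Sum of measured interior angles minus the theoretical (n - 2) * 180."""
    n = len(interior_angles_deg)
    return sum(interior_angles_deg) - (n - 2) * 180.0


def linear_misclosure(legs):
    """Closing error of a traverse given as (azimuth_deg, length) legs.

    Azimuths are measured clockwise from north. Returns the absolute gap
    between the start and end points and the denominator N of the relative
    error 1/N (perimeter divided by the gap).
    """
    d_east = sum(length * math.sin(math.radians(az)) for az, length in legs)
    d_north = sum(length * math.cos(math.radians(az)) for az, length in legs)
    gap = math.hypot(d_east, d_north)
    perimeter = sum(length for _az, length in legs)
    return gap, (perimeter / gap if gap else math.inf)


# Hypothetical four-sided plot whose last leg was recorded 2 units short.
plot = [(0, 100), (90, 100), (180, 100), (270, 98)]
gap, denominator = linear_misclosure(plot)
angle_error = angular_misclosure([90, 90, 90, 93.5])
```

Here the traverse fails to close by about 2 units over a perimeter of 398, a relative error of roughly 1/199, and the angles overshoot by 3.5°, that is 3°30'.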

DOI: 10.7256/2585-7797.2023.1.40440

Abstract: This article is devoted to the application of 3D laser scanning technology to pressing problems of modern museum work. It shows that the technology can be used to digitize objects of cultural and historical heritage for documentation, monitoring of the state of preservation, restoration, virtual reconstruction and the creation of copies. The article presents the results of practical work on creating high-precision copies of marble sculptures from the museums of St. Petersburg through the combined use of 3D scanning and numerically controlled stone-milling machines. In addition, the prospects of using laser additive technologies for the restoration and replication of historical monuments are shown.

DOI: 10.7256/2585-7797.2022.1.37719

Abstract: The article describes various approaches to the classification of occupations in historical research, using the example of the database "Victims of political terror in the USSR". It gives a brief overview of the methods previously used to solve this problem, from manual assignment of the occupations and professions of the repressed to the social groups that existed in the USSR in the 1930s to automatic clustering. A new method is then proposed: supervised machine learning, in which records already divided into groups during the author’s previous studies are used to train the algorithm, which then labels new records automatically. The best of the tested methods was the support vector machine, which showed an accuracy of 95% on the test sample. The advantages and limitations of such a classification are considered, the main limitation being that some social groups are systematically identified less well. Nevertheless, this technique made it possible to mark up 350 thousand new records from the database extremely quickly. Markup based on "training" data processed by the historian appears to be a promising direction for historical data science.
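
The supervised workflow described here (train on records a historian has already labeled, then label new records automatically) can be sketched with a very small linear support vector machine. This is a toy illustration only. It is not the author's model: it handles two classes rather than the full set of social groups, uses bag-of-words features and plain sub-gradient descent on the hinge loss, and the occupation strings, labels and hyperparameters are invented for the example.

```python
from collections import defaultdict


def train_linear_svm(samples, epochs=20, learning_rate=0.1, reg=0.01):
    """Fit word weights by sub-gradient descent on the regularized hinge loss.

    samples: list of (occupation_text, label) pairs with label +1 or -1.
    Returns a dict mapping each word to its weight (no bias term is used).
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in samples:
            words = set(text.lower().split())
            margin = label * sum(weights[w] for w in words)
            # L2 regularization shrinks every weight slightly at each step.
            for w in list(weights):
                weights[w] *= 1 - learning_rate * reg
            # The hinge loss only pushes on samples inside the margin.
            if margin < 1:
                for w in words:
                    weights[w] += learning_rate * label
    return dict(weights)


def predict(weights, text):
    """Return +1 or -1 according to the sign of the summed word weights."""
    score = sum(weights.get(w, 0.0) for w in set(text.lower().split()))
    return 1 if score >= 0 else -1


# Hypothetical hand-labeled records: +1 = "workers", -1 = "clergy".
labeled = [
    ("factory worker", 1),
    ("collective farm worker", 1),
    ("village priest", -1),
    ("parish priest", -1),
]
model = train_linear_svm(labeled)
new_labels = [predict(model, t) for t in ["railway worker", "old priest"]]
```

A real application would use a tested library implementation, a multi-class setup and a held-out test sample, which is how an accuracy figure such as the reported 95% is obtained.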

DOI: 10.7256/2585-7797.2023.2.43466

Abstract: The key task of this article is to test how topic modeling can be used to analyze the information potential of a collection of historical sources. Some modern collections of digitized historical materials number tens of thousands of documents, and an individual researcher can hardly cover the available holdings. Following a number of researchers, we suggest that topic modeling can become a convenient tool for preliminary assessment of the content of a collection of historical documents and for selecting only those documents that contain information relevant to the research tasks. In our case, the Birzhevye Vedomosti newspaper was chosen as the main collection of historical documents. At this stage, we can confirm that in our study topic modeling proved to be a productive way of optimizing the search for historical documents in a large collection of digitized historical materials. At the same time, it should be emphasized that in our work topic modeling was used exclusively as an applied tool for the primary assessment of the information potential of a document collection through the analysis of selected topics. Our experience has shown that, at least for Birzhevye Vedomosti, topic modeling with LDA does not allow us to draw conclusions from the standpoint of our content analysis methodology. The data from our models are too fragmentary and can only be used for an initial assessment of the topics that describe the information contained in the source.

DOI: 10.7256/2585-7797.2019.4.31588

Abstract: The article studies script as a material object, that is, as the system of traces left by a writing instrument on a writing material (paper or vellum). The traces of the writing instrument combine relief and a dye (for instance, ink). A text understood as a combination of such traces is characterized by differences in dye thickness and chemical composition at different levels of the text structure. These differences are determined by various aspects of writing skill and can be used to characterize it. The article aims to present the advantages of a new electro-optical spectrozonal examination of historical inks for the study of handwritten scripts. It discusses the technology of digital visualization of documents in the near-infrared region followed by computer processing of the image. The result of the work is a set of main research directions for studying the information potential of the text as a physical object (a system of traces) by means of spectrozonal visualization: the study of writing instrument traces to reconstruct the system of movements and the writing technique, the identification of zones written at different times, and the search for corrections.

DOI: 10.7256/2585-7797.2022.3.38752

Abstract: The subject of the research in this paper is the liberal agenda in the Russian press of the pre-Decembrist period. The object of the study is the newspapers published during this period. The novelty of the work lies in searching for pink noise in data obtained from the press of the first quarter of the 19th century, that is, in applying the tools of the theory of self-organized criticality to data from that period. Previously, the state of self-organized criticality had been found only in systems that arose at the end of the 19th century or later. The difficulty of the problem is that there are almost no mass sources for such an early historical period, and very few of the available ones lend themselves to formalization. For the analysis, statistics of publications in newspapers and magazines were collected, which served as a reflection of the liberal agenda relevant to the period of the genesis of the Decembrists. The paper shows that the sequence of publications on liberal issues in the Russian press in 1815-1825 contains pink noise; Fourier analysis was used to detect it in the time series. The main conclusion of the authors is that public consciousness in the pre-Decembrist period was in a state of self-organized criticality.
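
Detecting pink noise by Fourier analysis usually means checking that the power spectrum of the series falls off roughly as 1/f, which shows up as a slope close to -1 when log power is plotted against log frequency. The sketch below illustrates that check under this common definition. It is not the authors' procedure, and the function names and the synthetic test series are assumptions.

```python
import cmath
import math


def power_spectrum(series):
    """Return (frequency, power) pairs from a plain discrete Fourier transform.

    The mean is removed first; only positive frequencies below the Nyquist
    limit are returned.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [value - mean for value in series]
    spectrum = []
    for k in range(1, n // 2):
        coeff = sum(
            value * cmath.exp(-2j * math.pi * k * t / n)
            for t, value in enumerate(centered)
        )
        spectrum.append((k / n, abs(coeff) ** 2))
    return spectrum


def spectral_slope(series):
    """Least-squares slope of log(power) against log(frequency)."""
    points = [
        (math.log(f), math.log(p)) for f, p in power_spectrum(series) if p > 0
    ]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator


# Synthetic series whose harmonic amplitudes fall as 1/sqrt(k), so power ~ 1/f.
n = 64
synthetic = [
    sum(k ** -0.5 * math.cos(2 * math.pi * k * t / n + 0.7 * k)
        for k in range(1, 32))
    for t in range(n)
]
slope = spectral_slope(synthetic)
```

For real publication counts the slope would be estimated with noise, so a value near -1 is read as evidence of pink noise rather than proof, while a slope near 0 corresponds to white noise.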

DOI: 10.7256/2585-7797.2022.4.39224

Abstract: The article is devoted to modern digital methods of working with the handwritten heritage of Peter I. They were applied within the scientific project "Autographs of Peter the Great: Reading by artificial intelligence technologies". The project was initiated by the Russian Historical Society and implemented by specialists of the St. Petersburg Institute of History of the Russian Academy of Sciences and Sberbank PJSC. The article describes the methodology of preparing a data set for creating a program for machine reading of the manuscripts of Peter the Great ("Digital Peter"). The authors place special emphasis on the principles of transcribing the historical text developed during the project. In addition, they analyze Peter I's use of non-letter characters and the difficulties this caused in forming the data set. The article also presents the results produced by the algorithm and identifies the ways of arranging text used by Peter I that reduce recognition quality. The authors also discuss the electronic archive "Autographs of Peter I", which continues the project on machine reading of the manuscripts of the first Russian emperor. The archive, which is still being developed, contains digital copies of Peter's autographs, the results of their recognition by the Digital Peter program, and scholarly publications of these unique historical sources. The Internet portal "Autographs of Peter I" is linked to the resource "Biochronicle of Peter the Great day by day" (hosted on the HSE website). Connecting the two sites opens up additional opportunities for researchers: each digitized autograph is placed in its historical context.

DOI: 10.7256/2585-7797.2020.2.32961

Abstract: The article attempts to study the Latin text of the chronicle “Historia de regibus Gothorum, Wandalorum et Sueborum”, written by the famous 7th c. theologian and scholar Isidoro de Sevilla, by means of advanced methods of intelligent text analysis. The main goal is to verify the hypothesis that the author had a notion of a hierarchy of barbarians. The main focus is on clarifying the implicit semantic relationships between different parts of the chronicle in order to find out the author’s attitude to these three barbarian groups. The analysis of the text was performed in the R programming language. The specific method is latent semantic analysis, which makes it possible to compare and cluster texts in a semantic space constructed through the singular value decomposition of the term-document matrix. The research novelty is that this is the first time a full-cycle latent semantic analysis of a Medieval Latin text has been performed, covering text preprocessing, the creation of the semantic space and the calculation of the semantic similarity of texts using the cosine similarity measure. The results suggest that Isidoro de Sevilla did construct a hierarchy of the three barbarian groups, describing the Visigoths and the Suebi in more similar terms and setting the Vandals apart.
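
The last step of this cycle, comparing texts with the cosine similarity measure, can be shown in a few lines. The sketch is in Python rather than the R used in the study, and it works on raw word counts: the singular value decomposition that builds the latent semantic space is deliberately left out, so this is not a full latent semantic analysis. The Latin word strings are invented toy inputs, not quotations from the chronicle.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts.

    Returns 0.0 when either text is empty.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy word lists standing in for three sections of a chronicle.
section_one = "gothi reges bellum"
section_two = "gothi reges pax"
section_three = "wandali mare"
close_pair = cosine_similarity(section_one, section_two)
distant_pair = cosine_similarity(section_one, section_three)
```

In a full analysis the same measure would be applied to document vectors after projection into the reduced semantic space, where texts can be similar even without sharing exact words.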

DOI: 10.7256/2585-7797.2019.2.30126

Abstract: The historical population register of Norway contains data on the country's population from 1800 to 1964. Information on the population from 1964 to the present is collected in the Central Population Register. The historical register draws on parish registers and civil records, which fill the gaps between the population censuses conducted every ten years. In 1801 and from 1865 onward, these censuses were nominative, that is, they contained the names of people. This article is devoted to the problems of linking census records and parish registers (record linkage) from 1800 to 1920. Special attention is paid to the identification of individuals and the difficulties of linking records. The main problem is to identify a person across records from different years, given the large number of namesakes and the variation in how names and ages were recorded. The creation of stable identifiers for individuals and the procedure for linking records from various sources required the development of new software that combines automatic and manual methods. Analysis of local databases gives grounds to expect successful linking of between two-thirds and 90% of records, depending on the period and region of the country. The historical register of Norway is unique in its territorial coverage and in the variety of historical sources related to it.
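
The core difficulty named here, matching a person across sources despite spelling variants and imprecise ages, can be illustrated with a minimal linkage rule. The sketch is an assumption-laden toy, not the register's software: it compares names with the standard-library `difflib` ratio, allows a small tolerance in birth year, links a record automatically only when exactly one candidate qualifies, and leaves everything else for manual review. The records, field names and thresholds are hypothetical.

```python
import difflib


def is_probable_match(rec_a, rec_b, name_cutoff=0.85, year_tolerance=2):
    """True when names are similar enough and birth years are close."""
    name_score = difflib.SequenceMatcher(
        None, rec_a["name"].lower(), rec_b["name"].lower()
    ).ratio()
    years_apart = abs(rec_a["birth_year"] - rec_b["birth_year"])
    return name_score >= name_cutoff and years_apart <= year_tolerance


def link_records(census, parish):
    """Link each census record that has exactly one probable parish match.

    Returns (census_index, parish_index) pairs. Records with no candidate
    or with several candidates are not linked and need manual review.
    """
    links = []
    for i, census_rec in enumerate(census):
        candidates = [
            j for j, parish_rec in enumerate(parish)
            if is_probable_match(census_rec, parish_rec)
        ]
        if len(candidates) == 1:
            links.append((i, candidates[0]))
    return links


# Hypothetical records with a spelling variant and a one-year age discrepancy.
census = [
    {"name": "Ole Olsen", "birth_year": 1850},
    {"name": "Kari Hansdatter", "birth_year": 1862},
]
parish = [
    {"name": "Ole Olssen", "birth_year": 1851},
    {"name": "Ole Olsen", "birth_year": 1880},
    {"name": "Karen Nilsdatter", "birth_year": 1862},
]
links = link_records(census, parish)
```

Refusing to link ambiguous cases automatically trades coverage for reliability, which is one reason a combination of automatic and manual methods is needed when namesakes are common.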

DOI: 10.7256/2585-7797.2020.2.33330

Abstract: The article discusses methods of systematizing and visualizing codicological observations on an archival manuscript by means of geoinformatics. This solution makes it possible to summarize the information of a historical source and make it as accessible as possible to a wide range of Internet users. The resulting web project can be used for educational purposes as well as for research. The paper is based on the results of a codicological study of Semen Klushin’s 1542 Novgorod pistsovaya kniga (cadastral book) covering Vodskaya Pyatina (the manuscript is held in the Russian State Archive of Ancient Acts, RGADA). The physical medium of a historical text, i.e. the manuscript, is treated as a special space with its own reference system. This makes geoinformatics methods applicable for determining the topology (i.e. the mutual relationships) of its objects. Since the proposed approach is tested here for the first time, the main attention is paid to describing the most important stages of processing the codicological source materials and turning them into a GIS project based on a relational database. The resulting web resource visualizes a significant body of manuscript data. However, it should not be considered a map or a spatial model. It can be defined as a codicological GIS scheme of the manuscript, published as a web resource but without a map. The scheme is configured and controlled with tools used for working with databases and is not limited to the cartographic interface.

DOI: 10.7256/2585-7797.2021.2.35089

Abstract: In many Russian nature reserves, traditional landscapes are important objects of historical and cultural heritage. Preserving and restoring them requires a deep understanding of the processes of their development, formation and degradation. In the north of European Russia, agricultural landscapes often become overgrown with forest and lose their features when agricultural activity declines. However, the structural characteristics of these forests usually reflect their development and the particular course of succession. The study aims to create a technique for estimating the extent of former agricultural land use, modeling the historical transformation of agricultural landscapes and identifying plots of slash-and-burn, shifting, and two- and three-field agriculture from the structural characteristics of post-agrarian forests. Using GIS, the study compares raster copies of land demarcation plans of the second half of the 19th century with vector layers of present-day forests carrying attribute data on forest structure. The use of cartographic and inventory forest data for comparison with historical land management documents for the same plot has not been found in earlier studies. The high precision of present-day land management data makes them adequately comparable with old demarcation plans and allows inventory data to be used for inter-landscape differentiation of 19th-century agricultural landscapes. The study covers a model plot within Kenozero National Park (Arkhangelsk Region), using the 1861 demarcation plans and the 2014 forest GIS developed by the Arkhangelsk branch of Roslesinforg. GIS processing of the 19th-century and present-day plans makes it possible to model changes in the agricultural landscape for individual plots, to trace the influence of soil conditions and elements of agrarian use on topological and inventory changes in the emerging forests, and to reconstruct the biodiversity of past ecosystems.

DOI: 10.7256/2585-7797.2019.2.29770

Abstract: The authors assess how the 1897 Census papers stored in Russian and foreign archives are represented and preserved. The study of collections of primary census documents leads to the conclusion that the term “census papers” is heterogeneous: it covers several different forms, used depending on the type of household and the region, as well as the first, second and third copies of census forms. A distinctive feature of the article is the presentation of conclusions in the form of cartograms based on modern and historical maps. The study uses source criticism and spatial analysis, together with a comprehensive approach that treats census papers as a unified historical source irrespective of where they are stored. The research novelty is the identification and introduction into scholarly use of a body of nominative 1897 Census data. In addition, the authors propose an original approach that takes into account both the number of populated places and the number of census papers preserved for them, which made it possible to assess the degree of preservation of census materials in the uezds of the Russian Empire. The article concludes that census papers in varying states of preservation have been identified for 47% of guberniyas and 25.5% of uezds. The collections of census papers cover regions of European Russia and Siberia and, in part, the Caucasus and Central Asia. The volume of census paper data preserved and their "territorial spread" allow them to be considered a comprehensive source on the history of the population of the Russian Empire at the turn of the 19th and 20th centuries.