Litera - rubric Automatic language processing
Automatic language processing
Zhikulina C.P. - Alice's Tales: the transformation of the composition, fairy-tale formulas and contexts of the voice assistant in the skill "Let's come up with" pp. 45-64

DOI:
10.25136/2409-8698.2024.2.69760

EDN: AQYOMS

Abstract: The subject of the study is the text spontaneously generated by the voice assistant Alice as it composes a fairy tale together with the user; the purpose of the study is to examine the transformation of the composition, fairy-tale formulas and contexts in terms of how artificial intelligence technology selects linguistic elements and meanings. Particular attention is paid to the skill "Let's come up with", which became available to users in the spring of 2023. The collision and interaction of folklore canons with the realities of the 21st century provoke an ambivalent reaction to the interactive opportunity to play the role of a storyteller together with a voice assistant. The main research method was continuous sampling, which was used to break down the steps, stages and actions involved in creating a fairy-tale plot together with the voice assistant. In addition, comparative and contextual analyses were used to identify similarities and differences between traditional Russian fairy tales and a spontaneously generated fairy-tale plot. To obtain the data and subsequently analyze its components, a linguistic experiment with Yandex's voice assistant Alice was conducted and described. The rapid development of neural network language models accounts for the scientific novelty of the material under study, since this area remains unexplored and changes too quickly. It is important to emphasize that, to date, the texts of spontaneously generated fairy tales, their structural division and the correspondence of their fairy-tale formulas to folklore canons have not been studied. The main conclusion of the study is that the user's share in creating a fairy tale with the voice assistant Alice is greatly exaggerated.
Zaripova D.A., Lukashevich N.V. - Automatic Generation of Semantically Annotated Collocation Corpus pp. 113-125

DOI:
10.25136/2409-8698.2023.11.44007

EDN: QRBQOI

Abstract: Word Sense Disambiguation (WSD) is a crucial initial step in automatic semantic analysis. It involves selecting the correct sense of an ambiguous word in a given context, which can be challenging even for human annotators. Supervised machine learning models require large datasets with semantic annotation to be effective. However, manual sense labeling is a costly, labor-intensive, and time-consuming task. It is therefore important to develop and test automatic and semi-automatic methods of semantic annotation. Information about semantically related words, such as synonyms, hypernyms, hyponyms, and the collocations in which a word appears, can be used for these purposes. In this article, we describe our approach to generating a semantically annotated collocation corpus for the Russian language. Our goal was to create a resource that could be used to improve the accuracy of WSD models for Russian. The article outlines the process of generating the corpus and the principles used to select collocations. To disambiguate words within collocations, semantically related words defined on the basis of RuWordNet are utilized; the same thesaurus also serves as the source of sense inventories. The methods described in the paper yield an F1-score of 80% and allow approximately 23% of collocations containing at least one ambiguous word to be added to the corpus. Automatically generated collocation corpora with semantic annotation can simplify the preparation of datasets for developing and testing WSD models. Such corpora can also serve as a valuable source of information for knowledge-based WSD models.
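The abstract does not spell out the disambiguation procedure itself, but the general idea of resolving a word's sense inside a collocation by checking overlap with thesaurus-related words can be illustrated with a minimal sketch. The toy thesaurus, sense labels, and the disambiguate function below are hypothetical stand-ins for RuWordNet and the authors' actual method, not a reproduction of it.

```python
# Illustrative sketch: pick the sense of an ambiguous word in a collocation
# by comparing the collocation's other words against the related words
# (synonyms, hypernyms, hyponyms) listed for each candidate sense.
from typing import Optional

# Hypothetical mini-thesaurus: word -> {sense id -> related words}.
THESAURUS: dict[str, dict[str, set[str]]] = {
    "bank": {
        "bank#finance": {"credit", "deposit", "loan", "account"},
        "bank#river":   {"shore", "river", "coast", "embankment"},
    },
}

def disambiguate(word: str, collocation: list[str]) -> Optional[str]:
    """Return the sense whose related-word set overlaps most with the
    other words of the collocation, or None if there is no overlap."""
    context = {w.lower() for w in collocation if w.lower() != word}
    best_sense, best_overlap = None, 0
    for sense, related in THESAURUS.get(word, {}).items():
        overlap = len(context & related)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", ["bank", "account"]))  # bank#finance
print(disambiguate("bank", ["river", "bank"]))    # bank#river
```

Collocations for which no sense reaches a positive overlap would simply be left unannotated, which is consistent with the reported coverage of roughly 23% of collocations with an ambiguous word.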
Golikov A., Akimov D., Romanovskii M., Trashchenkov S. - Aspects of creating a corporate question-and-answer system using generative pre-trained language models pp. 190-205

DOI:
10.25136/2409-8698.2023.12.69353

EDN: FSTHRW

Abstract: The article describes various ways to use generative pre-trained language models to build a corporate question-and-answer system. A significant limitation of current generative pre-trained language models is the limit on the number of input tokens, which does not allow them to work "out of the box" with a large number of documents or with a single large document. To overcome this limitation, the paper considers indexing the documents and then running a search query and generating a response, using two of the most popular open-source solutions at the moment: the Haystack and LlamaIndex frameworks. It is shown that the open-source Haystack framework with the best settings yields more accurate answers for a corporate question-and-answer system than the open-source LlamaIndex framework, but on average requires somewhat more tokens. The article uses comparative analysis to evaluate the effectiveness of generative pre-trained language models in corporate question-and-answer systems built with the Haystack and LlamaIndex frameworks; the results were evaluated with the exact match (EM) metric. The main conclusions of the research are: 1. Hierarchical indexing is currently extremely expensive in terms of tokens (about 160,000 tokens versus roughly 30,000 on average for sequential indexing), since the response is generated by sequentially processing parent and child nodes. 2. Processing information with the Haystack framework at its best settings produces somewhat more accurate answers than the LlamaIndex framework (0.7 vs. 0.67 with the best settings). 3. The accuracy of the Haystack framework's responses is more invariant with respect to the number of tokens per chunk. 4. On average, the Haystack framework is more expensive in terms of the number of tokens (about 4 times) than the LlamaIndex framework. 5. The "create and refine" and "tree summarize" response generation modes of the LlamaIndex framework are approximately equal in answer accuracy, but the "tree summarize" mode requires more tokens.
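The indexing-then-retrieval-then-generation pattern the article evaluates can be sketched in a few lines with LlamaIndex, one of the two frameworks compared. This is a minimal sketch, not the authors' experimental setup: the document path, the question, and the EM helper are illustrative, the import paths follow the llama-index 0.10+ API line (they differ in older releases), and the default configuration assumes an OpenAI API key for embeddings and generation.

```python
# Sketch of "index documents, retrieve, generate an answer" with LlamaIndex.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1. Load corporate documents from a local folder (placeholder path).
documents = SimpleDirectoryReader("./corporate_docs").load_data()

# 2. Build a vector index over document chunks (sequential indexing;
#    chunk size affects both answer accuracy and token spend).
index = VectorStoreIndex.from_documents(documents)

# 3. Create a query engine; "tree_summarize" is one of the response modes
#    compared in the article, the other being the create-and-refine family.
query_engine = index.as_query_engine(response_mode="tree_summarize")

answer = query_engine.query("What is the standard vacation policy?")
print(answer)

# Exact-match (EM) scoring of the kind used for evaluation: 1 if the
# normalized prediction equals the normalized reference, else 0.
def exact_match(prediction: str, reference: str) -> int:
    return int(prediction.strip().lower() == reference.strip().lower())
```

A Haystack pipeline would follow the same retrieve-then-generate shape with its own component API; the article's comparison concerns the accuracy and token cost of these two implementations rather than the pattern itself.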
Zhikulina C.P. - Siri and the skills of encoding personal meanings in the context of English speech etiquette pp. 338-351

DOI:
10.25136/2409-8698.2023.12.69345

EDN: KZVBFU

Abstract: The subject of the study is the content of the personal meanings of greeting questions in the context of Siri's English communication formulas. The object of the study is the voice assistant's ability to simulate spontaneous dialogue with a person and the adaptation of artificial intelligence to natural speech. The purpose of the study is to identify the features and level of Siri's language skills in the process of communicating with users in English. The following aspects of the topic are considered in detail: the problem of understanding that exists in two types of communication, 1) between a person and a person and 2) between a machine and a person; the use of stable communication formulas by artificial intelligence as responses to the question "How are you?"; and the determination of the level of speech-making potential in the voice assistant's responses. The research employed descriptive, comparative and contextual methods as well as a linguistic experiment. The scientific novelty lies in the fact that the problems related to encoding the personal meanings of the Siri voice assistant have never been studied in detail in philology and linguistics. Due to the widespread use of voice systems in various spheres of social and public life, there is a need to analyze speech errors and describe communication failures in dialogues between voice assistants and users. The main conclusions of the study are: 1) the machine is not able to generate answers based on the experience of past impressions; 2) deviations from the norms of English speech etiquette in Siri's responses are insignificant but often lead to communicative failures; 3) one-sided encoding of personal meaning was found in the responses: from the machine to the person, but not vice versa.
Maikova T. - On the Concept of Translation Unit in a Machine Translation Framework pp. 352-360

DOI:
10.25136/2409-8698.2023.12.69470

EDN: LAWSMV

Abstract: The article examines whether the concept of the translation unit applies to machine translation and whether the size of the unit influences the quality of translation. While modern machine translation systems offer an acceptable level of quality, a number of problems, mainly related to the structural organization of the text, remain unresolved; hence the question posed in the paper. The article reviews modern readings of the concept and pays special attention to whether the scope of the term changes depending on whether the object of research is the target text or the translation process. The paper also gives a brief overview of research methods for both text-oriented and process-oriented approaches, such as comparative analysis of language pairs and Think Aloud Protocols. Based on a review of existing machine translation models, each is analyzed to determine whether a unit of translation can be defined for the given system and what its size is. It is concluded that a unit of translation can be viewed either as a unit of analysis or as a unit of processing, corresponding to text-oriented and process-oriented perspectives on the study of translation. The unit of translation has a dynamic character and influences the quality of the target text. In machine translation, the unit of translation as a unit of analysis is not applicable to systems based on probabilistic, non-linguistic methods. For rule-based machine translation systems, both readings of the concept are applicable, but hardly extend beyond a single sentence. Accordingly, at least one type of translation problem, the resolution of intra-textual relations, remains largely unaddressed in the present state of machine translation.