Please use this identifier to cite or link to this item:
https://hdl.handle.net/10321/4677
Title: | Word sense disambiguation pipeline framework for low resourced morphologically rich languages |
Authors: | Masethe, Mosima Anna; Masethe, Hlaudi Daniel; Ojo, Sunday Olusegun; Owolawi, Pius A. |
Keywords: | Corpus; Continuous bag of words; Natural Language Processing; SBERT; SkipGram; Word Sense Disambiguation |
Issue Date: | 2022 |
Publisher: | Elsevier BV |
Source: | Masethe, M.A et al. 2022. Word sense disambiguation pipeline framework for low resourced morphologically rich languages. SSRN Journal. doi:10.2139/ssrn.4332896 |
Journal: | SSRN |
Abstract: | Resolving ambiguity is a long-standing theoretical research challenge in natural language processing. Sesotho sa Leboa, the official name for the language also known as Sepedi or Northern Sotho, is one of the 11 official languages of South Africa and is spoken by 4.7 million people. It is an indigenous, morphologically rich, low-resourced South African language that is highly polysemous, with words that have numerous contextual meanings. Disambiguating polysemous words remains a challenging problem for computational linguistics research. Deficiencies in several polysemy assessments suggest that the problem of sense distinctiveness versus polysemy remains an unresolved academic issue. Word Sense Disambiguation is a practical problem in natural language processing applications, and it suffers drastically from shortcomings when working with ambiguous polysemous words. Word Sense Disambiguation research therefore seeks both academic and practical results. Many Word Sense Disambiguation applications give high accuracy for English and poor accuracy for Sesotho sa Leboa. In this research, a Word Sense Disambiguation pipeline framework is developed for Sesotho sa Leboa, a low-resourced, morphologically rich language, to address both the academic and the practical aspects of the polysemy problem. The proposed pipeline framework includes pre-processing modules that reduce ambiguity in the unstructured text corpus that serves as the source of input sentences.
Hence, the researchers compute the probability of Word Sense Disambiguation when polysemy and homonymy are observed, using cosine similarity measures with a sentence transformer (SBERT) and Word2Vec algorithms (Skip-Gram and Continuous Bag of Words). The cosine similarity results show that SBERT outperforms the other algorithms with a threshold of 87%, indicating strong similarity between the context and the sense definition. Continuous Bag of Words gives a cosine similarity threshold of 51%, outperforming Skip-Gram, whose threshold falls below 50%: its two vectors approach an orthogonal (90-degree) angle, indicating that their orientations do not match. |
URI: | https://hdl.handle.net/10321/4677 |
ISSN: | 1556-5068 (Online) |
DOI: | 10.2139/ssrn.4332896 |
Appears in Collections: | Research Publications (Academic Support) |
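
The abstract compares models by the cosine similarity between a context vector and a sense-definition vector. The following is a minimal sketch of that measure, not the authors' implementation: the function names, the toy three-dimensional vectors, and the sense labels are all hypothetical, and a real pipeline would obtain the vectors from SBERT or Word2Vec rather than hard-coding them. It also converts a similarity score to an angle, which illustrates why a score near zero corresponds to nearly perpendicular vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def angle_degrees(similarity):
    """Angle in degrees that corresponds to a cosine similarity score."""
    clamped = max(-1.0, min(1.0, similarity))  # guard against rounding drift
    return math.degrees(math.acos(clamped))


def best_sense(context_vec, sense_vecs):
    """Return (label, score) for the sense definition closest to the context."""
    scores = {
        label: cosine_similarity(context_vec, vec)
        for label, vec in sense_vecs.items()
    }
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical toy embeddings (assumption: stand-ins for model output).
context = [0.9, 0.1, 0.3]
senses = {
    "sense_a": [0.8, 0.2, 0.4],   # points in nearly the same direction
    "sense_b": [-0.1, 0.9, 0.0],  # nearly perpendicular to the context
}

label, score = best_sense(context, senses)
print(label, round(score, 3), round(angle_degrees(score), 1))
```

Under this reading, a score of 0.87 corresponds to an angle just under 30 degrees and a score of 0.51 to an angle just under 60 degrees, while a score approaching 0 corresponds to an angle approaching 90 degrees.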
Files in This Item:
File | Description | Size | Format
---|---|---|---
SSRN Copyright Clearance.docx | Copyright clearance | 189.3 kB | Microsoft Word XML
Masethe_Ojo_et al_2022.pdf | Article | 508.87 kB | Adobe PDF
Page view(s): | 221 (checked on Dec 22, 2024) |
Download(s): | 139 (checked on Dec 22, 2024) |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.