Please use this identifier to cite or link to this item: https://hdl.handle.net/10321/4677
Title: Word sense disambiguation pipeline framework for low resourced morphologically rich languages
Authors: Masethe, Mosima Anna 
Masethe, Hlaudi Daniel 
Ojo, Sunday Olusegun 
Owolawi, Pius A. 
Keywords: Corpus; Continuous bag of words; Natural Language Processing; SBERT; SkipGram; Word Sense Disambiguation
Issue Date: 2022
Publisher: Elsevier BV
Source: Masethe, M.A. et al. 2022. Word sense disambiguation pipeline framework for low resourced morphologically rich languages. SSRN Journal. doi:10.2139/ssrn.4332896
Journal: SSRN
Abstract: 
Resolving ambiguity is a long-standing theoretical research challenge in natural language
processing. Sesotho sa Leboa is the official name for Sepedi, or Northern Sotho, one of the
11 official languages of South Africa, spoken by 4.7 million people. Sesotho sa Leboa is an
indigenous, morphologically rich, low-resourced South African language that is highly
polysemous, with words that have numerous contexts. Disambiguating polysemous words
remains a challenging problem for computational linguistics research. Deficiencies in several
polysemy assessments suggest that the problem of sense distinctiveness versus polysemy
remains an unresolved academic issue.
Word Sense Disambiguation is a practical problem in natural language processing applications,
and it performs markedly worse when working with ambiguous polysemous words. Word Sense
Disambiguation research therefore seeks both academic and practical results.
Many Word Sense Disambiguation applications give high accuracy for the English language
but poor accuracy for Sesotho sa Leboa. In this research, a Word Sense Disambiguation
pipeline framework is developed for Sesotho sa Leboa, a low-resourced, morphologically rich
language, which addresses the academic and practical aspects of the polysemy problem. The
proposed Word Sense Disambiguation pipeline framework includes pre-processing modules
that reduce ambiguity in the unstructured text corpus that supplies the input sentences. The
researchers then compute the probability of Word Sense Disambiguation when polysemy and
homonymy are observed, using cosine similarity measures with a sentence transformer
(SBERT) and Word2Vec algorithms (Skip-Gram and Continuous Bag of Words). The cosine
similarity computation shows that SBERT outperforms the other algorithms with a threshold
of 87%, which indicates strong similarity between context and sense definition, while
Continuous Bag of Words gives a cosine similarity threshold of 51%, outperforming the
Skip-Gram algorithm, which has a threshold below 50%, with the two vectors approaching a
perpendicular (90-degree, orthogonal) angle, indicating that the orientations of the vectors
do not match.
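
As a rough illustration of the cosine similarity comparison described in the abstract, the
Python sketch below scores a context vector against candidate sense-definition vectors and
picks the closest sense. It is not the authors' implementation. The vectors are hand-made toy
values standing in for SBERT or Word2Vec embeddings, and the names cosine_similarity,
angle_degrees and best_sense are hypothetical helpers introduced only for this example.

```python
import math


def cosine_similarity(u, v):
    """Return cos(theta) = (u . v) / (|u| * |v|) for two non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def angle_degrees(similarity):
    """Convert a cosine similarity into the angle between the vectors."""
    # Clamp to [-1, 1] to guard against floating-point rounding.
    clamped = max(-1.0, min(1.0, similarity))
    return math.degrees(math.acos(clamped))


def best_sense(context_vec, sense_vecs):
    """Return the sense label with the highest similarity, plus all scores."""
    scores = {
        label: cosine_similarity(context_vec, vec)
        for label, vec in sense_vecs.items()
    }
    return max(scores, key=scores.get), scores


# Toy data (assumption): in practice these would be embeddings of the
# input sentence and of each candidate sense definition.
context = [1.0, 0.0, 1.0]
senses = {
    "sense_a": [2.0, 0.0, 2.0],  # same direction as the context
    "sense_b": [0.0, 3.0, 0.0],  # orthogonal to the context
}

label, scores = best_sense(context, senses)
for name, score in scores.items():
    print(name, round(score, 3), round(angle_degrees(score), 1))
print("selected:", label)
```

A similarity near 1 means the two vectors point in nearly the same direction, while a
similarity near 0 corresponds to an angle near 90 degrees, which is the orthogonal case the
abstract associates with non-matching orientations.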
URI: https://hdl.handle.net/10321/4677
ISSN: 1556-5068 (Online)
DOI: 10.2139/ssrn.4332896
Appears in Collections: Research Publications (Academic Support)

Files in This Item:
File                            Description           Size        Format
SSRN Copyright Clearance.docx   Copyright clearance   189.3 kB    Microsoft Word XML
Masethe_Ojo_et al_2022.pdf      Article               508.87 kB   Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.