Word sense disambiguation pipeline framework for low resourced morphologically rich languages

Masethe, Mosima Anna; Masethe, Hlaudi Daniel; Ojo, Sunday Olusegun; Owolawi, Pius A.

Files

SSRN Copyright Clearance.docx (189.3 KB)

Masethe_Ojo_et al_2022.pdf (508.87 KB)

Date

2022

Authors

Masethe, Mosima Anna

Masethe, Hlaudi Daniel

Ojo, Sunday Olusegun

Owolawi, Pius A.

Publisher

Elsevier BV

Abstract

Resolving ambiguity is a long-standing theoretical challenge in natural language processing research. Sesotho sa Leboa is the official name for the language also known as Sepedi or Northern Sotho, one of South Africa's 11 official languages, spoken by 4.7 million people. It is an indigenous, morphologically rich, low-resourced South African language that is highly polysemous, with words whose meaning varies across numerous contexts. Disambiguating polysemous words remains a challenging problem for computational linguistics research. Deficiencies in several polysemy assessments suggest that separating sense distinctiveness from polysemy remains an unresolved academic issue. Word Sense Disambiguation is a practical problem in natural language processing applications, and it performs poorly on ambiguous polysemous words. Word Sense Disambiguation therefore calls for both academic and practical results. Many Word Sense Disambiguation applications give high accuracy for English and poor accuracy for Sesotho sa Leboa. In this research, a Word Sense Disambiguation pipeline framework is developed for Sesotho sa Leboa, a low-resourced, morphologically rich language, to address both the academic and the practical sides of the polysemy problem. The proposed pipeline framework includes pre-processing modules that reduce ambiguity in the unstructured text corpus from which the input sentences are drawn. The researchers then compute the probability of Word Sense Disambiguation when polysemy and homonymy are observed, using cosine similarity measures with a sentence transformer (SBERT) and Word2Vec algorithms (Skip-Gram and Continuous Bag of Words). The cosine similarity results show that SBERT outperforms the other algorithms with a threshold of 87%, which indicates strong similarity between context and sense definition. Continuous Bag of Words gives a cosine similarity threshold of 51%, outperforming Skip-Gram, whose threshold is below 50%, with the two vectors approaching a perpendicular (90-degree) angle, which indicates that the orientations of the vectors do not match.
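The cosine similarity comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the vectors below are made-up toy values standing in for embeddings that SBERT or Word2Vec would produce, and the names `cosine_similarity`, `best_sense`, `sense_a` and `sense_b` are hypothetical. The sketch only shows how a context vector is scored against candidate sense-definition vectors, and why a score near zero corresponds to vectors that are close to perpendicular.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def best_sense(
    context_vec: List[float], sense_vecs: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Pick the sense whose definition vector is closest to the context."""
    scores = {
        label: cosine_similarity(context_vec, vec)
        for label, vec in sense_vecs.items()
    }
    label = max(scores, key=scores.get)
    return label, scores[label]


# Toy embeddings (assumed values, not taken from the paper).
context = [0.9, 0.1, 0.3]
senses = {
    "sense_a": [0.8, 0.2, 0.4],   # points in nearly the same direction
    "sense_b": [-0.1, 0.9, 0.0],  # nearly perpendicular to the context
}

label, score = best_sense(context, senses)
print(label, round(score, 3))
```

In this toy setup the first sense scores close to 1 and the second close to 0, mirroring the contrast the abstract draws between a strong match and vectors whose orientations do not agree.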

Keywords

Corpus, Continuous bag of words, Natural Language Processing, SBERT, SkipGram, Word Sense Disambiguation

Citation

Masethe, M.A. et al. 2022. Word sense disambiguation pipeline framework for low resourced morphologically rich languages. SSRN Journal. doi:10.2139/ssrn.4332896

URI

https://hdl.handle.net/10321/4677

DOI

10.2139/ssrn.4332896

Collections

Research Publications (Academic Support)
