Madhu, T. and Mallikarjun, M. and Teja, P. Charan and Rahitya, K. (2025) Semantic Document Clustering Using NLP. International Journal of Innovative Science and Research Technology, 10 (5): 25may1946. pp. 3415-3420. ISSN 2456-2165

IJISRT25MAY1946.pdf - Published Version

Abstract

This project explores a semantic document clustering system that groups documents by the similarity of their content. Unlike traditional keyword-based methods, which rely solely on word frequency, the system uses Natural Language Processing (NLP) to compare the meaning of documents. Pre-trained language models such as BERT and Sentence-BERT convert each document into a dense vector representation that captures its underlying meaning. Cosine similarity between these vectors measures how closely two documents are related, and clustering algorithms such as K-Means and DBSCAN then group the documents on that basis. Experimental results indicate that this approach produces more coherent and contextually relevant clusters than traditional techniques, making it an effective solution for applications in content organization, topic analysis, and information retrieval.
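
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are hypothetical stand-ins for Sentence-BERT embeddings (which in practice have hundreds of dimensions), the helper names `normalize`, `cosine_similarity`, and `kmeans_cosine` are invented for illustration, and the K-Means variant shown assigns points by cosine similarity with a naive "first k points" initialization.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def kmeans_cosine(vectors, k, iterations=20):
    """K-Means on unit vectors, assigning each point to its most similar centroid.

    Assumption: centroids start as the first k points, which keeps the
    sketch deterministic; real systems usually use a smarter initialization.
    """
    points = [normalize(v) for v in vectors]
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest dot product,
        # which equals cosine similarity because everything is unit length.
        labels = [
            max(range(k), key=lambda c: sum(x * y for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: re-normalized mean of each cluster's members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels


# Hypothetical embeddings: the first two documents point in one direction,
# the last two in another.
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
labels = kmeans_cosine(embeddings, k=2)
print(labels)
```

With real data, the toy vectors would be replaced by the output of an embedding model, and a density-based method such as DBSCAN could be substituted for K-Means when the number of clusters is not known in advance.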

Item Type: Article
Subjects: L Education > L Education (General)
Divisions: Faculty of Law, Arts and Social Sciences > School of Education
Depositing User: Editor IJISRT Publication
Date Deposited: 20 Jun 2025 10:56
Last Modified: 20 Jun 2025 10:56
URI: https://eprint.ijisrt.org/id/eprint/1324
