Document Type: Research Paper

Authors

1 Baghdad College of Medical Sciences, Baghdad, Iraq.

2 University of Information Technology and Communications, Baghdad, Iraq.

Abstract

Text classification has become an important research area because of the growing volume of text datasets and documents available in digital form. It is one of the main approaches for organizing digital information: dataset records or documents are automatically assigned to predefined classes according to their content. This paper proposes a technique that applies supervised machine learning algorithms, namely K-Nearest Neighbors (KNN), Decision Tree, Random Forest, Bernoulli Naive Bayes, and Multinomial Naive Bayes, to classify a dataset into distinct classes. The technique combines these classifiers with TF-IDF feature extraction, used as a vector space model, to obtain more precise classification results. It yields high accuracy, precision, recall, and F1-measure values for all implemented classifiers. A comparison of the results shows that Random Forest is the best of the tested classifiers for the textual dataset records, with the highest accuracy of 0.9995930.
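As an illustration of the four evaluation metrics named in the abstract, the following minimal sketch computes accuracy, precision, recall, and F1-measure from true and predicted labels. It is not the authors' evaluation code: the function name `classification_metrics`, the toy labels, and the choice of macro-averaging across classes are assumptions made here for illustration, since the abstract does not state how per-class scores were averaged.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro-averaging (an assumption of this sketch) gives every class
    equal weight regardless of how many records it contains.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical labels, used only to exercise the function.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Applying the same function to the predictions of each classifier on a shared test split is one simple way to produce a side-by-side comparison of the kind the abstract reports.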

Highlights

  • Automatic text classification has become an important research area because of the growing volume of text datasets and documents.
  • Any text-based problem must first be converted into a form that can be modeled.
  • The input text is converted into features using the Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction technique.
  • Five supervised classification methods are then used to classify the product’s textual keywords into individual classes.
  • The results show that Random Forest is the best classifier for categorizing the dataset, achieving the highest accuracy.
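The two steps described in the highlights (text to TF-IDF features, then supervised classification) can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the paper's implementation: the smoothed IDF formula is one common variant, only a KNN classifier with cosine similarity is shown out of the five methods, and the toy product keywords, class names, and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for each training term."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    # Assumed variant: idf = ln((1 + N) / (1 + df)) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Map a document to an L2-normalized sparse TF-IDF vector (term -> weight)."""
    tf = Counter(t for t in tokenize(doc) if t in idf)  # unseen terms are ignored
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(query_vec, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, v), y) for v, y in zip(train_vecs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical product keywords and classes, for illustration only.
train_docs = [
    "wireless bluetooth headphones",
    "usb charging cable",
    "bluetooth speaker wireless",
    "cotton t-shirt men",
    "denim jeans men",
    "cotton summer dress",
]
train_labels = ["electronics"] * 3 + ["clothing"] * 3

idf = fit_idf(train_docs)
train_vecs = [tfidf_vector(d, idf) for d in train_docs]

for query in ["wireless usb speaker", "cotton jeans"]:
    print(query, "->", knn_predict(tfidf_vector(query, idf), train_vecs, train_labels))
```

The other four classifiers would consume the same TF-IDF vectors; only the prediction step changes, which is what makes a shared feature extraction stage convenient for comparing methods.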
