Document Type: Research Paper


Computer Science Dept., University of Technology-Iraq, Alsina’a Street, 10066 Baghdad, Iraq.


Regardless of its source and type (text, numeric, images, etc.), raw data is usually unclean. The term "unclean" means that the data contains errors and inconsistencies that can strongly affect machine learning. Many factors influence the results of a machine learning algorithm on a given task, but the nature and characteristics of the input data are the most important determinants of its success. This paper examines the preprocessing applied to data before it is fed to a machine learning algorithm, and it explains the operations performed at each preprocessing stage to get the best results from the data set. Four machine learning models are used, including SVM, Multinomial Naive Bayes (NB), and Bernoulli NB. With the initial preprocessing, the Bernoulli NB model achieves the best accuracy, 89%. The preprocessing algorithm applied to the dirty data set is then improved, and the results are compared with those obtained before the improvement: the Bernoulli NB model reaches 91% accuracy, and the results of the other models also improve.
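As a rough illustration of what cleaning a dirty text record can involve, the sketch below applies a few common preprocessing steps and then builds the binary word-presence vector that a Bernoulli NB model takes as input. It is not the paper's algorithm: the choice and order of steps, the stop-word list, the sample sentence, and the names `preprocess` and `binary_features` are all assumptions made for this example.

```python
import re
import string

# Hypothetical stop-word list for illustration only; not taken from the paper.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Clean one raw document and return its list of tokens (assumed steps)."""
    text = text.lower()                          # normalize case
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML-like tags
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = text.translate(PUNCT_TABLE)           # replace punctuation with spaces
    tokens = text.split()                        # whitespace tokenization
    # Remove stop words and purely numeric tokens.
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]


def binary_features(tokens, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the tokens."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Made-up dirty record, used only to exercise the functions above.
raw = "<p>The product is GREAT!!! Visit http://example.com to buy 2 of it.</p>"
tokens = preprocess(raw)
features = binary_features(tokens, ["buy", "great", "poor", "product"])
print(tokens)
print(features)
```

A Multinomial NB model would use word counts rather than presence flags, so comparing the two on the same cleaned tokens only requires changing the feature function.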

Graphical Abstract


  • Of the four machine learning models used (including SVM, Multinomial NB, and Bernoulli NB), the Bernoulli NB model achieves the best accuracy, 89%.
  • After the preprocessing algorithm is improved, the Bernoulli NB model reaches 91% accuracy, and the results of the other models also improve.


