HMM-Based POS Tagging System for 8 Different Languages and Several Tagsets

In this paper, a Part-Of-Speech (POS) tagging system based on the Hidden Markov Model (HMM) is proposed for several languages. The HMM is implemented using the Viterbi algorithm for 8 languages: English, Hindi, Telugu, Bangla (Bengali), Marathi, Standard Chinese, Portuguese and Spanish. The data for these languages were taken from freely available corpora: the Brown, NPS-Chat, Indiana, Sinica, Floresta and CESS-ESP corpora. The HMM is among the most widely used learning methods in NLP applications, especially POS tagging. HMM taggers have been implemented by other researchers, but each such work typically addresses a single language. System testing is done by splitting each corpus into 99% training and 1% test data. This test is repeated 10 times, changing the training and test data each time. The accuracies (averaged over all 10 tests) for English (using two tagsets of 40 tags and 472 tags), English (NPS-Chat corpus), Hindi, Telugu, Bangla or Bengali, Marathi, Standard Chinese, Portuguese (using two tagsets of 32 tags and 269 tags), and Spanish (using two tagsets of 14 tags and 289 tags) are (95.3% & 92.39%), 87.17%, 81.3%, 74.03%, 72.01%, 69


INTRODUCTION
POS tagging is one of the most studied problems in the natural language processing (NLP) area. It is a very important task for many NLP applications such as machine translation (MT) and many others. POS tagging, or simply tagging, is the process of classifying words into their parts-of-speech and labeling them accordingly [1].
In such a task, we are given some observation(s) and our job is to determine which of a set of classes it belongs to. Part-of-speech tagging is generally treated as a sequence classification task: the observation is a sequence of words (e.g., a sentence), and our job is to assign it a sequence of part-of-speech tags [2]. To understand the tagging problem, suppose we try to classify (tag) a sequence of words w_1…w_n using a set of classes (tags) {t_1, …, t_m}. What is the best sequence of tags corresponding to this sequence of words? The Bayesian interpretation of this task starts by considering all possible sequences of tags. Out of this universe of tag sequences, we want to choose the one that is most probable given the observed sequence of these n words [2].
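The Bayesian view described above can be made concrete with a brute-force sketch: enumerate every possible tag sequence for a short sentence and keep the most probable one. The tag names and all probabilities below are invented for illustration only; a real tagger estimates them from an annotated corpus, and brute-force enumeration is exponential in sentence length (which is why the Viterbi algorithm is used in practice).

```python
from itertools import product

TAGS = ["DET", "NOUN", "VERB"]

# Hypothetical emission probabilities P(word | tag)
emit = {
    ("the", "DET"): 0.9,
    ("dog", "NOUN"): 0.4, ("dog", "VERB"): 0.01,
    ("barks", "VERB"): 0.3, ("barks", "NOUN"): 0.02,
}

# Hypothetical transition probabilities P(tag | previous tag); "<s>" marks sentence start
trans = {
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.1,
    ("DET", "NOUN"): 0.8, ("DET", "VERB"): 0.05, ("DET", "DET"): 0.05,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2, ("NOUN", "DET"): 0.1,
    ("VERB", "DET"): 0.4, ("VERB", "NOUN"): 0.2, ("VERB", "VERB"): 0.1,
}

def seq_prob(words, tags):
    """P(tags) * P(words | tags) under the bigram and emission assumptions."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((w, t), 0.0)
        prev = t
    return p

words = ["the", "dog", "barks"]
# Consider the whole universe of tag sequences and take the most probable one
best = max(product(TAGS, repeat=len(words)),
           key=lambda tags: seq_prob(words, tags))
print(best)  # ('DET', 'NOUN', 'VERB')
```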
A part-of-speech (POS) tagger assigns a POS label to each word of an input text. The tagger first obtains the set of possible POS tags for each word from a lexicon and then disambiguates between them based on the word's context [3]. Parts-of-speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset [1].
There are many approaches to tagging; one of them is the Hidden Markov Model (HMM). An HMM tags a complete sentence according to its context. In this work we implement an HMM tagger on several languages with several tests. Different corpora for the same language are used, and the same corpus, annotated with different tagsets, is also used. Finally, an executable application is provided which can tag any input sentence from any of the used languages.

HMM for Tagging
Often we want to consider a sequence of random variables that are not independent; rather, the value of each variable depends on previous elements in the sequence. For many such systems, it seems reasonable to assume that all we need to predict the next random variable is the value of the present one, and we do not need the values of all the past random variables in the sequence. This is called a Markov model. In an HMM, the state sequence that the model passes through is not known; only some probabilistic function of it is observed [4].
Using a Hidden Markov Model for part-of-speech tagging is a special case of Bayesian inference. Bayesian inference, or Bayesian classification, has been applied successfully to many language problems [2].
The Hidden Markov Model (HMM) is the most frequently used technique for POS tagging. It can tag one complete sentence at a time by selecting the most likely sequence of tags for its words [5]. It uses the formula [2]:

$$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n) \qquad (1)$$

where $P(t_1^n \mid w_1^n)$ is the probability of the tag sequence $t_1 \ldots t_n$ given the word sequence $w_1 \ldots w_n$, and $\hat{t}_1^n$ is the best tag sequence for the given words, i.e., the one for which $P(t_1^n \mid w_1^n)$ is maximal. Equation 1 cannot be computed directly; by Bayes' rule it becomes [2]:

$$\hat{t}_1^n = \arg\max_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n) \qquad (2)$$

The HMM tagger simplifies this formula with two assumptions. The first is that the probability of a word depends only on its own part-of-speech tag and is independent of the other words and the other tags around it [2]:

$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \qquad (3)$$

The second is that the probability of a tag depends only on the previous tag (the bigram assumption) [2]:

$$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \qquad (4)$$

From equations 3 & 4, we get:

$$\hat{t}_1^n \approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}) \qquad (5)$$

This is the first-order HMM. The second-order HMM uses the trigram assumption, where the current tag depends only on the two previous tags.
These parameters are estimated from training on an annotated corpus as follows:

$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})} \qquad P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$$

where $C(t_{i-1}, t_i)$ is the number of times tag $t_{i-1}$ is followed by tag $t_i$ in the training data, $C(t_i, w_i)$ is the number of times the word $w_i$ appears with the tag $t_i$, and $C(t)$ is the total count of tag $t$. The important thing here is that these estimates must be computed for all tags in the tagset, not only one tag.
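The count-based estimation above can be sketched as follows. The two-sentence corpus and the start-of-sentence marker `<s>` are illustrative assumptions; in the paper the counts come from full treebanks such as Brown or Floresta.

```python
from collections import Counter

# A tiny hand-tagged toy corpus (hypothetical, for illustration only)
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

tag_count = Counter()     # C(t)
bigram_count = Counter()  # C(t_{i-1}, t_i)
emit_count = Counter()    # C(t, w)

for sent in corpus:
    prev = "<s>"          # sentence-start pseudo-tag
    tag_count[prev] += 1
    for word, tag in sent:
        tag_count[tag] += 1
        bigram_count[(prev, tag)] += 1
        emit_count[(tag, word)] += 1
        prev = tag

def p_trans(prev, tag):
    """Maximum-likelihood estimate P(tag | prev) = C(prev, tag) / C(prev)."""
    return bigram_count[(prev, tag)] / tag_count[prev]

def p_emit(tag, word):
    """Maximum-likelihood estimate P(word | tag) = C(tag, word) / C(tag)."""
    return emit_count[(tag, word)] / tag_count[tag]

print(p_trans("DET", "NOUN"))  # 1.0: every DET in the toy corpus is followed by NOUN
print(p_emit("NOUN", "dog"))   # 0.5: "dog" is one of the two NOUN tokens
```

Note that unsmoothed maximum-likelihood estimates like these assign zero probability to unseen word/tag pairs; the smoothing used in the implementation section addresses this.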

The Used Languages and Corpora
Several well-known languages are used in our work: English (using the Brown and NPS-Chat corpora), Hindi (Indiana corpus), Bangla (Indiana corpus), Marathi (Indiana corpus), Telugu (Indiana corpus), Chinese (Sinica corpus), Portuguese (Floresta corpus) and Spanish (CESS-ESP corpus). In the implementation, we take two tagsets each for the English, Portuguese and Spanish languages.
English (Brown corpus) is a lightly inflected language compared with the other languages, and it is now the most widely used language in the world. The Brown and NPS-Chat corpora, both freely available, are used in our work. The Brown corpus contains 57,340 sentences (1,161,192 tagged tokens, about a million words). The Brown corpus is annotated with two tagsets, both of which are used in our tests. The NPS-Chat corpus contains 10,567 sentences (45,010 tagged tokens).
Hindi (Indiana corpus) is a standardized register of the Hindustani language and is one of the official languages of India. The Indiana corpus, which is freely available, contains 3,631 sentences (49,365 tagged tokens) covering four languages. It contains 541 sentences of Hindi (9,475 tagged tokens).
Telugu (Indiana corpus) has official status in parts of India. Telugu ranks third by the number of native speakers in India (74 million speakers). The Indiana corpus contains 994 sentences of Telugu (10,004 tagged tokens).
Bangla or Bengali (Indiana corpus) is native to Bangladesh, the Indian state of West Bengal, and parts of other Indian states. Bengali is one of the most spoken languages, ranked seventh in the world (250 million speakers). The Indiana corpus contains 899 sentences of Bangla (10,427 tagged tokens).
Marathi (Indiana corpus) is the official language of the Indian state of Maharashtra. Marathi has the fourth largest number of native speakers in India (73 million speakers). The Indiana corpus contains 1,197 sentences of Marathi (19,459 tagged tokens).
Standard Chinese (Sinica corpus) is a standardized variety of Chinese. It is the sole official language of the Republic of China. The Sinica corpus is designed for analyzing modern Chinese, and every text in the corpus is segmented. The freely available part of it contains 9,999 sentences of Standard Chinese (91,627 tagged tokens).
Portuguese is the official language of several countries, such as Portugal, Brazil, and others. The Floresta corpus is a publicly available treebank for the Portuguese language. It contains 9,266 sentences of Portuguese (211,852 tagged tokens). The Floresta corpus is annotated with two tagsets, both of which are used in our tests.

Spanish (CESS-ESP corpus) is the official language of Spain (406 million speakers). It is one of the six official languages of the United Nations. The CESS-ESP corpus is part of the CESS-ECE project. It contains 6,030 sentences of Spanish (192,685 tagged tokens). The CESS-ESP corpus is annotated with two tagsets, both of which are used in our tests.

Related Works
There are many POS tagging works on many languages, some of which use an HMM for a single language; we cannot list them all here due to the paper's length limit. We list the works that use the same approach on the same language and/or the same corpus.
Avinesh and Karthik [6] used CRF (Conditional Random Field) and TBL (Transformation-Based Learning) based POS taggers and achieved accuracies of about 77.37%, 78.66%, and 76.08% for Telugu, Hindi and Bengali respectively. They used the Indiana corpus, the same corpus used in our work, although the amount of data available to them was much larger than that available to us. The tagset is the same in both works. Singh et al. [7] used a trigram method (a second-order HMM) for tagger development on Marathi. They used a private test corpus of 2,000 sentences (48,635 words) and the IL POS tagset, which consists of 24 tags. The accuracy of their system was 91.63%.
Nisheeth et al. [8] used an HMM for the Hindi language. They used the IL POS tagset and achieved an accuracy of 92% on a corpus of 15,200 sentences (358,288 words). Rodrigues et al. [9] combined HMMs and character language models, applying them to Portuguese texts. In this approach, the emission probabilities for each hidden state of the HMM are estimated by a dedicated character language model. Their tagger was trained and tested on Bosque, a subset of the Floresta treebank. They reached 96.2% accuracy with a tagset of 39 tags and 92.0% with a tagset of 257 tags.
Chao-hung & Cheng-Der [10] used a first-order HMM tagger on Chinese with word-identification capability. They achieved an accuracy of 96% on a private corpus.
Our work differs from the other works in the following ways:
1- One approach is applied to many languages in order to record the behavior of this approach across these languages.
2- Very different languages are selected from the world's languages in order to record whether the same approach can reach high tagging accuracy on each of them.
3- More than one tagset is taken for the same corpus, which is useful for recording how tagset size affects the results.
4- Different corpora are used for the same language.
In summary, various testing conditions are reported, which constitutes the novelty of our work compared with the related works.

Implementation and Results
The used data is partitioned, for each corpus, into 100 parts: 99% is taken as training and 1% as test. This partitioning resembles 100-fold cross-validation, with one difference: the test is repeated 10 times rather than 100 times (see Figure 1 for more details on partitioning). The test samples are very small because some of the used corpora are very small, which leads to many unknown words and thus raises the error rate.
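The evaluation protocol just described can be sketched as follows. The function name, the shuffle seed, and the use of integers as stand-ins for sentences are assumptions for the sketch; the paper does not specify how the 100 parts were ordered or selected.

```python
import random

def ten_round_splits(sentences, parts=100, rounds=10, seed=0):
    """Split the corpus into `parts` equal pieces, then yield `rounds`
    (train, test) pairs, each holding out a different 1% piece as test
    data and training on the remaining 99%."""
    data = list(sentences)
    random.Random(seed).shuffle(data)
    size = len(data) // parts
    for r in range(rounds):
        test = data[r * size:(r + 1) * size]
        train = data[:r * size] + data[(r + 1) * size:]
        yield train, test

# Toy corpus of 1000 "sentences" (integers as placeholders)
corpus = list(range(1000))
for train, test in ten_round_splits(corpus):
    assert len(test) == 10 and len(train) == 990   # 1% test, 99% training
    assert not set(train) & set(test)              # folds never overlap
```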
The used data set comprises 6 corpora: the Brown, NPS-Chat, Indiana, Sinica, Floresta and CESS-ESP corpora. Each corpus contains one language except the Indiana corpus, which contains 4 languages; i.e., 8 languages are used in our work: English, Hindi, Telugu, Bangla or Bengali, Marathi, Standard Chinese, Portuguese and Spanish. A first-order HMM tagger was implemented using the Viterbi algorithm, with Laplace smoothing for sparse data and unknown words. For English, two corpora with three tagsets are used: the Brown corpus, which is annotated with two tagsets of 40 tags and 472 tags, and the NPS-Chat corpus. (In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. One subsample is retained as the validation data for testing the model, and the remaining k−1 subsamples are used as training data. The cross-validation process is then repeated k times, the folds, with each of the k subsamples used exactly once as the validation data.) The results of implementing the HMM tagger on the Brown corpus are shown in Tables 1 & 2 respectively; the results on the NPS-Chat corpus are shown in Table 3.
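A minimal sketch of the first-order Viterbi tagger with Laplace (add-one) smoothing follows. The hard-coded counts, the `<s>` start marker, and the assumed vocabulary size of 1000 are hypothetical; in the paper the counts come from the 99% training portion of each corpus, and log-probabilities are used to avoid numeric underflow on long sentences.

```python
import math
from collections import Counter

tags = ["DET", "NOUN", "VERB"]
vocab_size = 1000  # assumed vocabulary size for smoothing the emissions

# Counts taken from a hypothetical toy training set
tag_count = Counter({"DET": 2, "NOUN": 2, "VERB": 2, "<s>": 2})
bigram_count = Counter({("<s>", "DET"): 2, ("DET", "NOUN"): 2, ("NOUN", "VERB"): 2})
emit_count = Counter({("DET", "the"): 2, ("NOUN", "dog"): 1, ("NOUN", "cat"): 1,
                      ("VERB", "barks"): 1, ("VERB", "sleeps"): 1})

def log_trans(prev, tag):
    # Laplace-smoothed P(tag | prev): add 1 to every bigram count
    return math.log((bigram_count[(prev, tag)] + 1) / (tag_count[prev] + len(tags)))

def log_emit(tag, word):
    # Laplace-smoothed P(word | tag): unknown words get a uniform floor
    return math.log((emit_count[(tag, word)] + 1) / (tag_count[tag] + vocab_size))

def viterbi(words):
    # best[t] = (log-prob of the best tag path ending in t, that path)
    best = {t: (log_trans("<s>", t) + log_emit(t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        best = {t: max(((score + log_trans(prev, t) + log_emit(t, w), path + [t])
                        for prev, (score, path) in best.items()),
                       key=lambda x: x[0])
                for t in tags}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

Thanks to the add-one floor in `log_emit`, the tagger still produces a full tag path when a test sentence contains unknown words, which is exactly the sparse-data situation the small test samples create.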
The results of implementing the HMM tagger on the Hindi, Telugu, Bangla and Marathi languages are shown in Tables 4, 5, 6, and 7 respectively. These languages are taken from the Indiana corpus.
The Sinica corpus is used for the Standard Chinese language; the results of implementing the HMM on Standard Chinese are shown in Table 8. The Floresta corpus is used for the Portuguese language with two tagsets: it is annotated with a tagset of 32 tags and a tagset of 269 tags. The results of implementing the HMM tagger on the Floresta corpus are shown in Tables 9 & 10 respectively.
The CESS-ESP corpus is used for the Spanish language with two tagsets: it is annotated with a tagset of 14 tags and a tagset of 289 tags. The results of implementing the HMM tagger on the CESS-ESP corpus are shown in Tables 11 & 12 respectively.

Size of Training Data
If we use a small data set, many words of the language will not exist in it; these are unknown words. Unknown words strongly affect the accuracy of the system. A large data set, in contrast, gives good statistics to the learner. Certainly, increasing the size of the training data increases the accuracy; see the results in Tables 1 & 3.
Some tests show a drop in accuracy, such as (36.7% & 32.2%) in Table 6 and (22.7% & 20.2%) in Table 7. These are very low accuracies. After manually analyzing the training and test samples used, we found many unknown words within the same sentence. This leads to more errors accumulated from the context, dropping the accuracy. As future work, we suggest using trigram and quad-gram HMM taggers on the same tests. Other taggers, such as the Brill tagger and the maximum-entropy tagger, can also be applied to the same data, and unknown words can be handled with a more effective approach than Laplace smoothing.