Improving Laboratories Efficiency through Website Using Text Mining

Text mining is an emerging technology that can augment existing data in corporate databases by making unstructured text data available for analysis. This research aims to present a proposed text mining system customized to improve laboratory efficiency. The electronic comments and e-mails submitted to the organization's website are taken as inputs to the proposed mining system, so the proposed text miner is customized for e-mails and comments written to the organization. To that end, the basic text mining algorithm is modified by adding new steps and by customizing some existing steps through Natural Language Processing (NLP), data mining techniques, and the construction of the document database.

and each source has a separate storage and algebra), security and authority of the source (IBM is more likely to be an authorized source than my distant second cousin), ambiguity (word ambiguity, which involves pronouns (he, she, …), synonyms (buy, purchase, …) and words with multiple meanings ("bat" is related to baseball or to a mammal)), semantic ambiguity ("The king saw the rabbit with his glasses" has multiple meanings), noisy data (spelling mistakes, abbreviations and acronyms), text that is not well structured (e-mail and chat rooms, for example "r u available?"), and speech (multilingual) [6].
The problem is that a huge number of texts are submitted concerning the enhancement of the laboratories. This large amount of text is processed by extracting an abstract from the comments, in order to obtain the important, meaningful and desired words that are related to the enhancement process.

RELATED WORKS
Much research has been performed to enhance text mining. Some examples are:
1. A survey of text mining facilities in XML that explains how typical application tasks can be carried out using the proposed framework, and presents techniques for count-based analysis methods, text clustering, text classification and string kernels [7].
2. The application of neural networks in data mining has become wider. Although neural networks may have a complex structure, a long training time, and a representation of results that is not easily understood, they have a high tolerance for noisy data and high accuracy, and are preferable in data mining [8].

THE PROPOSED TEXT MINING
The proposed system consists of three steps (phases) that work simultaneously to fulfill the goal of this work:
1. The first step is the training phase.
2. The second step is the classification phase.
3. The third step is the clustering phase.
The structure of the proposed system design is shown in Figure (2).
Each input is treated as a document, and the following customized proposed algorithms are applied to it:
• Tokenize the documents and then apply part-of-speech tagging to all tokens (for example, read (verb)), as explained in Algorithm (1).

Output: tokens
Process:
1. while not EOF do
2. collect characters
3. if the character is a space or punctuation mark then consider the collected characters as a token
4. End {while}
end process
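
The tokenization process can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `tokenize` and the use of `string.punctuation` as the set of punctuation marks are assumptions, and the part-of-speech tagging step is not included.

```python
import string

def tokenize(text):
    """Split text into tokens at spaces and punctuation marks.

    Hypothetical helper: treating string.punctuation as the punctuation
    set is an assumption, not something the paper specifies.
    """
    delimiters = set(string.whitespace) | set(string.punctuation)
    tokens, current = [], []
    for ch in text:                          # read one character at a time until the end
        if ch not in delimiters:
            current.append(ch)               # collect characters of the current token
        elif current:
            tokens.append("".join(current))  # a delimiter closes the collected token
            current = []
    if current:                              # flush the last token at end of input
        tokens.append("".join(current))
    return tokens

print(tokenize("***The computer is infected with strong viruses***"))
```

Because every punctuation mark is a delimiter, noisy characters such as `*` and emoticons are dropped as a side effect, and contractions such as "it's" are split into two tokens.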
Step 3: Clustering. A customized hierarchical agglomerative clustering (HAC) algorithm is applied, and its output is a dendrogram of clusters. A dendrogram (from Greek dendron "tree", gramma "drawing") is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.

Example:
The following database is a general training database, see Table (1). Some example comments from users are given below; as is well known, comments may contain many noisy characters such as *, funny faces (emoticons) and unstructured sentences.
1. ***The computer is infected with strong viruses***
2. The manager of the laboratories not a good one :((
3. The courses that teaches in the laboratories is not updated
4. Teachers not good
5. The computers in the laboratories must replace with new one
6. Every laboratory must contain wire and wireless connection
7. My degree is bad in the laboratory
8. The computer is not working
In the computer science department, every laboratory contains the following:
1. Personal Computers (PC)
2. Wired or wireless connection
3. At least two lecturers for any subject
4. Printers
5. One manager for all the laboratories
6. One sheet for each subject which contains the experiments
portable, thus allowing for ease of use. This is due to the fact that they are fully integrated into Java, therefore enabling them to run on virtually any modern computing platform. Another element that contributes to the ease of use is the user-friendly graphical interface.
These platforms are comprehensive collections of data preprocessing techniques. That is why they can be used for this proposal.

DISCUSSION AND CONCLUSIONS
This research reached the following conclusions:
1. The success of text mining depends on the strength of the methodology used in natural language processing (NLP).
2. Basing feature selection on the words extracted from the text makes the proposal much more flexible, since the words differ from one text to another.
3. Using these techniques (two clustering methods and two classification methods) for feature selection gives support and strength to the obtained results.
Text classifier
Text classification is the problem of assigning a document $D$ to one of a set of $|C|$ predefined categories $C = \{c_1, c_2, \ldots, c_{|C|}\}$. Normally a supervised learning framework is used to train a text classifier, where a learning algorithm is provided with a set of $N$ labeled training examples $\{(d_i, c_i) : i = 1, \ldots, N\}$ from which it must produce a classification function $F : D \to C$ that maps documents to categories. Here $d_i$ denotes the $i$th training document and $c_i$ is the corresponding category label of $d_i$. We use the random variables $D$ and $C$ to denote the document and category values respectively. A popular learning algorithm for text classification is based on a simple application of Bayes' rule:

$$P(C = c \mid D = d) = \frac{P(C = c)\, P(D = d \mid C = c)}{P(D = d)} \tag{1}$$

where $d$ and $c$ are instances of $D$ and $C$. To simplify the presentation, we rewrite Eq. (1) as:

$$P(c \mid d) = \frac{P(c)\, p(d \mid c)}{p(d)} \tag{2}$$

Bayes' rule decomposes the computation of a posterior probability into the computation of a likelihood and a prior probability. In text classification, a document $d$ is normally represented by a vector of $K$ attributes $d = (v_1, v_2, \ldots, v_K)$. Computing $p(d \mid c)$ in this case is not generally trivial, since the space of possible documents $d = (v_1, v_2, \ldots, v_K)$ is vast. To simplify this computation, the naive Bayes model introduces an additional assumption that all of the attribute values, $v_j$, are independent given the category label, $c$. That is, for $i \neq j$, $v_i$ and $v_j$ are conditionally independent given $c$. This assumption greatly simplifies the computation by reducing Eq. (2) to:

$$P(c \mid d) = \frac{P(c) \prod_{j=1}^{K} p(v_j \mid c)}{p(d)} \tag{3}$$

Based on Eq. (3), a maximum a posteriori (MAP) classifier can be constructed by seeking the optimal category which maximizes the posterior $P(c \mid d)$:

$$c^{*} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} P(c) \prod_{j=1}^{K} p(v_j \mid c) \tag{4}$$
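
The naive Bayes MAP rule described above can be illustrated with a short Python sketch. Several choices here are assumptions rather than details from this paper: the attributes $v_j$ are taken to be the word tokens of the document, probabilities are estimated from word counts with add-one (Laplace) smoothing, and the category labels "hardware" and "staff" as well as the function names are hypothetical. The denominator $p(d)$ is dropped because it is the same for every category.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Estimate log P(c) and per-category word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    return log_prior, word_counts, vocab

def classify(tokens, model):
    """Return the category maximizing log P(c) + sum_j log p(v_j | c)."""
    log_prior, word_counts, vocab = model
    best_c, best_score = None, -math.inf
    for c in log_prior:
        total = sum(word_counts[c].values())
        score = log_prior[c]
        for w in tokens:
            # Add-one smoothing (an assumption) keeps unseen words from zeroing the product.
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Toy training set; the labels are hypothetical and chosen only for illustration.
docs = [["computer", "infected", "viruses"], ["computer", "not", "working"],
        ["teachers", "not", "good"], ["manager", "not", "good"]]
labels = ["hardware", "hardware", "staff", "staff"]
model = train_naive_bayes(docs, labels)
print(classify(["computer", "viruses"], model))
```

Working with sums of logarithms instead of the raw product avoids numerical underflow when documents contain many words.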

Figure (2) Steps of Research Methodology.
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3. Merge the two most similar clusters, K and L, to form a new cluster KL
4. Compute the similarities between KL and each of the remaining clusters and update SIM[i,j]
5. Until there is a single cluster (or a specified number of clusters)
6. End process
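
A minimal Python sketch of this agglomerative procedure is given below. It is not the customized algorithm of this paper: Jaccard similarity between token sets and single-link merging are assumptions made for illustration, and the names `jaccard` and `hac` are hypothetical. Instead of updating SIM in place, the sketch recomputes cluster similarities from the document-level matrix at each iteration, which gives the same merges for single link.

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity of two token lists as |intersection| / |union| (an assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def hac(docs, num_clusters=1):
    """Merge the most similar clusters until num_clusters remain.

    Returns the final clusters (lists of document indices) and the merge
    history, from which a dendrogram could be drawn.
    """
    n = len(docs)
    # Step 1: document-level similarity matrix SIM[i][j]
    sim = [[jaccard(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]       # every document starts as its own cluster
    merges = []

    def cluster_sim(k, l):
        # Single link: similarity of the closest pair of members (assumption).
        return max(sim[i][j] for i in k for j in l)

    while len(clusters) > max(num_clusters, 1):
        # Steps 3 and 4: find and merge the two most similar clusters K and L.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return clusters, merges

docs = [["computer", "viruses"], ["computer", "working"],
        ["teachers", "good"], ["manager", "good"]]
clusters, merges = hac(docs, num_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

On this toy input the two computer-related comments end up in one cluster and the two staff-related comments in the other.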

Example ( 1 ):
Suppose we take the first comment: ***The computer is infected with strong viruses***
Apply the suggested algorithm to it (training phase):
1. Read the comment and apply the following algorithms.
2. Algorithm 2 (clean up): remove *** and ***.
3. Remove the, is, with.
4. The comment is unambiguous.
5. Convert the parse tree into graphs.
6. Return each word to its root: infected = infect.
7. The result contains the following features: D1
8. The final training database is as in Table (1).
9. Apply either classification or clustering according to the feature extraction.

IMPLEMENTATION
The first step in the proposal is to convert all e-mails and documents to a uniform file format with the extension .txt. This is done by opening all files in the proposed subprogram and clicking the save command to save them in a specific store with the predefined extension, see Figure (3).
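
The clean-up, word-removal and root-extraction steps of Example (1) can be sketched in Python as follows. This is only an illustration under stated assumptions: the stop-word list is a small hypothetical sample, `strip_suffix` is a crude stand-in for a real stemmer, and the ambiguity check and the parse-tree-to-graph conversion are not covered.

```python
import re

# Illustrative stop-word list (an assumption); a real system would use a fuller list.
STOP_WORDS = {"the", "is", "with", "a", "an", "of", "in"}

def strip_suffix(word):
    """Crude suffix stripping, standing in for a proper stemming algorithm."""
    for suffix in ("ed", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_features(comment):
    """Drop noisy characters and stop words, then reduce each word to a root form."""
    words = re.findall(r"[a-z]+", comment.lower())   # keeps letters only, so *** is removed
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]

print(extract_features("***The computer is infected with strong viruses***"))
# → ['computer', 'infect', 'strong', 'virus']
```

The suffix rules are deliberately simple and will mis-stem many English words; they are only meant to reproduce the infected = infect step of the example.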

Improving Laboratories Efficiency through the Website by Means of Text Analysis
Build the final training database of documents to text mining.