Finding the Relevance Degree between an English Text and its Title

Keywords are useful because they give a short summary of a document. They serve a variety of purposes, including summarizing, indexing, labeling, categorization, clustering, and searching; in this paper we use keywords to find the relevance degree between an English text and its title. The proposed system solves this problem through a simple statistical measure (term frequency) combined with linguistic approaches: it extracts the keywords of the title and the keywords of the text (with their frequencies in the text), and takes the average frequency of the title's keywords across the text as the required relevance degree. The system depends on a lexicon of a particular field (in this work, computer science). This lexicon is represented using two B+ trees, one for non-keywords and the other for candidate keywords; the keywords are stored in a manner that prevents redundancy of terms, and even of sub-terms, to provide efficient memory usage and to minimize search time. The proposed system was implemented using Visual Prolog 5.1, and testing showed it to be valuable for finding the degree of relevance between a text and its title in terms of accuracy and search time.


INTRODUCTION
Automatic keyword assignment is a research topic that has received less attention than it deserves. Keyword extraction is an important technique for document retrieval, Web page retrieval, document clustering, summarization, text mining, and so on. By extracting appropriate keywords, we can easily choose which document to read to learn the relationships among documents [1]. The focus of our work is on enabling an ordinary user to know the relevance degree between a text and its title. To make this possible, we separate the process into three phases: first, extract the keywords that describe the title; second, extract the keywords that describe the text, with their frequencies; and third, find the average frequency of the title's keywords across the text, which represents the required relevance degree.

KEYWORD EXTRACTION
Automatic keyword extraction is the task of identifying a small set of words, key phrases, keywords, or key segments from a document that can describe the meaning of the document [2]. It should be done systematically and with either minimal or no human intervention, depending on the model. The goal of automatic extraction is to apply the power and speed of computation to the problems of access and discoverability, adding value to information organization and retrieval without the significant costs and drawbacks associated with human indexers [3].

EXISTING APPROACHES
The manual extraction of keywords is slow, expensive, and error-prone. Therefore, many algorithms and systems have been proposed to help people perform automatic keyword extraction. Existing methods can be divided into four categories: simple statistics, linguistics, machine learning, and mixed approaches [3,4]. In our work we use a mix of some linguistic methods with common statistical measures, such as term frequency, to find the relevance degree between an English text and its title for a specific domain (for evaluation, we use the computer science field as our domain).

Simple Statistics Approaches
These methods are simple, have limited requirements, and do not need training data. They tend to focus on non-linguistic features of the text such as term frequency, inverse document frequency, and the position of a keyword. The statistical information of the words can be used to identify the keywords in the document. Some statistical methods use N-gram statistical information to index the document automatically [5]. Other statistical methods include word frequency, TF*IDF, word co-occurrences [1], etc. The benefit of purely statistical methods is their ease of use.
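As a minimal illustration of the statistical features named above, the following Python sketch computes raw term frequency and a TF*IDF weight over a toy collection. The function names, the whitespace tokenizer, and the logarithmic IDF variant are assumptions made for illustration; they are not taken from the cited systems.

```python
import math
from collections import Counter

def term_frequency(text):
    """Count how often each lower-cased, whitespace-separated token occurs."""
    return Counter(text.lower().split())

def tf_idf(term, doc, collection):
    """TF*IDF with idf = log(N / df); one common variant, assumed here."""
    tf = term_frequency(doc)[term]
    df = sum(1 for d in collection if term in term_frequency(d))
    return tf * math.log(len(collection) / df) if df else 0.0

# Toy collection (hypothetical data).
docs = ["keyword extraction uses term frequency",
        "term frequency counts each term",
        "clustering groups similar documents"]
print(term_frequency(docs[1])["term"])      # 2
print(tf_idf("clustering", docs[2], docs))  # log(3), about 1.10
```

A term that occurs in every document gets a weight of zero under this variant, which is why TF*IDF favors terms that are frequent in one document but rare in the collection.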

Linguistics Approaches
These approaches use the linguistic features of the words, sentences, and document. Methods that pay attention to linguistic features such as part-of-speech, syntactic structure, and semantic qualities tend to add value, functioning sometimes as filters for bad keywords. Plas et al. [6] use two lexical resources for evaluation: the EDR electronic dictionary and Princeton University's freely available WordNet. Both provide well-populated lexicons including semantic relationships and linking, such as IS-A and PART-OF relations and concept polysemy. During automatic keyword extraction from multiple-party dialogue episodes, the advantages of using the lexical resources are compared to a pure statistical method and relative frequency ratio. Hulth [2] examines a few different methods of incorporating linguistics into keyword extraction. Terms are vetted as keywords based on four features: term frequency within the document (TF), collection frequency (IDF), the relative position of the term's first occurrence in the document, and the term's part-of-speech tag. The results indicate that the use of linguistic features brings a remarkable improvement to automatic keyword extraction.

Machine Learning Approaches
Keyword extraction can be seen as supervised learning from examples. The machine learning mechanism works as follows. First, a set of training documents is provided to the system, each of which has a range of human-chosen keywords as well. Then the gained knowledge is applied to find keywords in new documents. The Keyphrase Extraction Algorithm (KEA) [7] uses machine learning techniques and the naive Bayes formula for domain-based extraction of technical keyphrases. Suzuki et al. [8] use spoken language processing techniques to extract keywords from radio news, using an encyclopedia and newspaper articles as a guide for relevance.

Mixed Approaches
Other approaches to keyword extraction mainly combine the methods mentioned above or use some heuristic knowledge in the task, such as the position, length, and layout features of the words, the HTML tags around the words, etc. [9,5]. The task of automatic keyword extraction using combined methods (AKWE) [10] is to extract keywords from the document abstract. This system has three stages. First, the entered document is pre-processed to remove noisy data, and word tagging and word stemming are applied. Second, to produce candidate keywords, three extraction approaches are used: an N-gram approach that extracts uni-grams, bi-grams, and tri-grams; a part-of-speech (POS) approach that extracts phrases matching a set of patterns; and NP-chunking, which extracts noun phrases.

MORPHOLOGY
In fact there are two types of morphology: one is used for analyzing a word and the other for generating a word. In this paper we deal with the morphology that analyses an English word. The morphology of the English language deals with the changes that may occur when affixes are added to English words (for example, "rot" becomes "rotting" and "give" becomes "giving" when "ing" is added).
English has two types of affixes: prefixes and suffixes. Table (1) shows some spelling rules for adding suffixes to English words; each row in the table shows a suffix and its spelling rule, which depends on the ending of the word. In this table, (C) means a non-vowel letter, (V) means a vowel letter, and the symbol (∅) means the last letter must be deleted [11].
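The analysis direction described above, recovering a stem by undoing a spelling rule, can be sketched in Python as follows. This is only an illustration under stated assumptions: it handles the single suffix "ing", covers just two spelling changes (consonant doubling and deletion of a final "e"), and validates candidates against a small hypothetical stem set; the function names are not taken from the proposed system.

```python
VOWELS = set("aeiou")

def candidate_stems(word, suffix="ing"):
    """Yield possible stems of `word` by undoing common spelling changes."""
    if not word.endswith(suffix) or len(word) <= len(suffix):
        return
    base = word[: -len(suffix)]
    yield base                      # plain removal: "processing" -> "process"
    if len(base) >= 2 and base[-1] == base[-2] and base[-1] not in VOWELS:
        yield base[:-1]             # doubled consonant: "rotting" -> "rot"
    yield base + "e"                # deleted final e: "giving" -> "give"

def stem(word, known_stems):
    """Return the first candidate stem found in `known_stems`, else None."""
    for candidate in candidate_stems(word.lower()):
        if candidate in known_stems:
            return candidate
    return None

stems = {"rot", "give", "use", "process"}   # hypothetical lexicon stems
print(stem("rotting", stems))  # rot
print(stem("giving", stems))   # give
print(stem("using", stems))    # use
```

Checking each candidate against known stems is what keeps the rules from producing non-words such as "rott" or "giv".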
Table (1): Some spelling rules for adding some suffixes to the end of English words.

B+ TREE [12,13]
A B+ tree is used as an index to a database: each record is stored in the database, while the reference number of that record is stored, together with its key, in the B+ tree. To reach a certain record, we use its key to get its reference number from the B+ tree; with the reference number we can then retrieve the required record directly. A B+ tree is an ordered and balanced tree (see figure 1), which is why it is so fast in retrieving the required data.
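The key, reference number, and record relationship described above can be sketched in Python. Note the substitution: the sketch does not implement a B+ tree; a sorted key list searched with binary search stands in for the ordered index, and the class name, helper functions, and record contents are all hypothetical.

```python
import bisect

class SortedIndex:
    """Ordered key -> reference-number index (a stand-in for a B+ tree)."""
    def __init__(self):
        self._keys = []   # kept sorted, like the leaf level of a B+ tree
        self._refs = []   # reference numbers, parallel to _keys

    def insert(self, key, ref):
        pos = bisect.bisect_left(self._keys, key)
        self._keys.insert(pos, key)
        self._refs.insert(pos, ref)

    def lookup(self, key):
        pos = bisect.bisect_left(self._keys, key)
        if pos < len(self._keys) and self._keys[pos] == key:
            return self._refs[pos]
        return None

database = []            # records, addressed by reference number
index = SortedIndex()

def store(key, record):
    """Append the record to the database and index its reference number."""
    database.append(record)
    index.insert(key, len(database) - 1)

def fetch(key):
    """Look up the reference number by key, then read the record directly."""
    ref = index.lookup(key)
    return None if ref is None else database[ref]

store("steganography", {"synonyms": ["information hiding"]})
store("natural", {"next_words": ["language", "processing"]})
print(fetch("natural"))   # {'next_words': ['language', 'processing']}
print(fetch("purpose"))   # None
```

A real B+ tree keeps the same ordered-key lookup but spreads the keys over balanced nodes, so insertion does not require shifting a whole array.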

DESCRIPTION FOR THE PROPOSED METHOD
Basically, the proposed method includes three stages: extracting the keywords of the title; extracting the keywords of the text (with interaction with the lexicon and the morphology); and then finding the average frequency of the title's keywords (after the filtering process) across the text, which is the relevance degree between the text and its title (see figure 2).
The input to the proposed system is a title and a text consisting of sentences (a sentence is considered to be a set of words separated by a stop mark ".", "?" or "!"); the sentence cutter is responsible for producing these sentences.
The user interface is responsible for the interaction between the proposed system and the user in an easy form (since we use a visual programming language). The user can also update the contents of the lexicon through the user interface by removing or adding an English keyword (or non-keyword) with its information (its suffix, prefix, synonyms, and abbreviation).
The tokenization part of the proposed system converts the title or a sentence of the text into a list of tokens.
The English morphology is responsible for extracting the stem of an English word by removing its suffix or prefix and undoing the changes that occurred when these affixes were added, according to the spelling rules of the English language.
The other parts of the proposed system are discussed in more detail in the following sections.
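The sentence cutter and the tokenization part described in this section could look roughly like the following Python sketch. The function names and the regular expressions are assumptions; the source only states that sentences end at ".", "?" or "!" and that a sentence is converted into a list of tokens.

```python
import re

def cut_sentences(text):
    """Split text into sentences at the stop marks '.', '?' and '!'."""
    parts = re.split(r"[.?!]", text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence):
    """Convert a sentence (or a title) into a list of lower-cased word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

text = "Steganography hides information. Is it detectable? Rarely!"
for sentence in cut_sentences(text):
    print(tokenize(sentence))
# ['steganography', 'hides', 'information']
# ['is', 'it', 'detectable']
# ['rarely']
```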

Lexicon of the Proposed Method
The lexicon is an important part of any linguistic system; it is responsible for providing the system with its required information. The lexicon of the proposed method is represented using one database (dbase1) with its index tree (Bt1) for keywords and another database (dbase2) with its index tree (Bt2) for non-keywords.
The non-keywords stored in the lexicon are: articles, conjunctions, demonstratives, prepositions, pronouns, qualities, main verbs, auxiliary verbs, and modals. The key of "Bt2" is the stem of the non-keyword.
The keywords stored in the lexicon are all candidate keywords in a particular domain (in our work, the computer science field). A keyword may be one word or a sequence of words, and with each keyword we store its synonyms and its abbreviation, if any. The first word of the keyword is the key for the index tree "Bt1". The keywords are stored in the lexicon in a manner that prevents redundancy, to provide efficient memory usage and to minimize search time. In general, if one keyword consists of [word1, word2, word3, word4] with logical term P1 (the first word being the key of Bt1) and another keyword consists of [word1, word2, word3] with logical term P2, then there is no need to store the second keyword again; only P2 needs to be added to dbase1 (see figure 3-a). If a further keyword consists of [word1, word2, word3, word5, word6] with logical term P3, which contains its required information, then only word5 and word6 are added to dbase1, together with P3 (see figure 3-b).
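The redundancy-free storage described above behaves like a prefix tree over words. The following Python sketch reproduces the P1, P2, P3 example under that assumption; the class and function names are hypothetical, and the actual layout of dbase1 in the Visual Prolog implementation may differ.

```python
class KeywordNode:
    """One word of a stored keyword; `term` holds the keyword's information."""
    def __init__(self):
        self.children = {}   # next word -> KeywordNode
        self.term = None     # logical term (e.g. synonyms, abbreviation), if a keyword ends here

def add_keyword(root, words, term):
    """Store `words`, creating nodes only for words not already present."""
    node, created = root, 0
    for word in words:
        if word not in node.children:
            node.children[word] = KeywordNode()
            created += 1
        node = node.children[word]
    node.term = term
    return created           # number of newly stored words

root = KeywordNode()
print(add_keyword(root, ["word1", "word2", "word3", "word4"], "P1"))           # 4
print(add_keyword(root, ["word1", "word2", "word3"], "P2"))                    # 0
print(add_keyword(root, ["word1", "word2", "word3", "word5", "word6"], "P3"))  # 2
```

The second call stores no new words and only attaches P2, and the third call stores only word5 and word6, matching the two cases in the text.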

Title processing of the proposed method
In this work, each word in the title will be one of the following (see figure 4):
• Non-keyword, which will be discarded.
• Keyword, which will be stored in a keyword list.
• Candidate keyword, which will be stored in a candidate keyword list.
A word is treated as a non-keyword if it is found in the index tree of non-keywords (Bt2), either directly or after the morphology has extracted its stem. The same check is applied, against the keyword index tree (Bt1), for a keyword that consists of one word. If the keyword consists of more than one word, we must find its first word in the index tree of keywords (Bt1) and then find the sequence of following words in the first term of the logical predicates of the keyword in dbase1. In other words, we search for an item whose length equals the length of the keyword, and if it is not found we go back and search for a keyword that is shorter by one word.
If the current word of the title is found neither in the keyword database nor in the non-keyword database, it is stored in a candidate keyword list, and its frequency in the text determines whether it is a keyword or not.
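The classification of title words, including the fall-back from a longer keyword to a shorter one, can be sketched in Python as below. This is a simplified illustration: the lexicon is modeled as plain Python containers rather than Bt1, Bt2 and dbase1, tokens are assumed to be already stemmed by the morphology, and the entry ("natural", "language") is a hypothetical shorter keyword added only to show the longest-match order.

```python
def classify_title(tokens, keywords, non_keywords):
    """Split title tokens into (keywords, candidates); non-keywords are dropped.

    `keywords` maps a first word to the word sequences that start with it.
    """
    found, candidates, i = [], [], 0
    while i < len(tokens):
        word = tokens[i]
        if word in non_keywords:
            i += 1
            continue
        match = None
        # Try the longest stored sequence first, then shorter ones.
        for seq in sorted(keywords.get(word, []), key=len, reverse=True):
            if tuple(tokens[i:i + len(seq)]) == seq:
                match = seq
                break
        if match:
            found.append(" ".join(match))
            i += len(match)
        else:
            candidates.append(word)
            i += 1
    return found, candidates

keywords = {
    "natural": [("natural", "language", "processing"), ("natural", "language")],
    "steganography": [("steganography",)],
}
non_keywords = {"use", "for"}   # stems; "using" is assumed already reduced to "use"
title = ["use", "natural", "language", "processing", "for", "steganography", "purpose"]
print(classify_title(title, keywords, non_keywords))
# (['natural language processing', 'steganography'], ['purpose'])
```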

Text processing of the proposed method
A word of the text is discarded unless it falls under one of the following cases (see figure 5):
• Keyword: its first word is found in Bt1 and its other words are found in the keyword database (before or after morphology processing).
• Candidate keyword found in the candidate keyword list of the title.
• Synonym or abbreviation stored with a keyword in the keyword list of the title.
• Synonym or abbreviation stored with a keyword in the keyword list of the text.
Only the frequencies of the above cases are computed and put in the list of keywords of the text.
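The cases above amount to counting every occurrence of a keyword, synonym, or abbreviation under its keyword and discarding everything else. A minimal Python sketch follows, assuming the lexicon has been flattened into a hypothetical mapping from each surface form to its keyword, and omitting morphology and candidate keywords for brevity.

```python
from collections import Counter

def count_keywords(tokens, surface_forms):
    """Count keyword occurrences; synonyms and abbreviations count for their keyword.

    `surface_forms` maps a word sequence (tuple) to its canonical keyword.
    """
    counts, i = Counter(), 0
    longest = max(len(form) for form in surface_forms)
    while i < len(tokens):
        # Prefer the longest form that starts at position i.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            form = tuple(tokens[i:i + n])
            if form in surface_forms:
                counts[surface_forms[form]] += 1
                i += n
                break
        else:
            i += 1   # not a keyword, synonym or abbreviation: discard the word
    return counts

forms = {
    ("natural", "language", "processing"): "natural language processing",
    ("nlp",): "natural language processing",
    ("steganography",): "steganography",
    ("information", "hiding"): "steganography",
}
tokens = "nlp is used as abbreviation for natural language processing".split()
print(count_keywords(tokens, forms))
# Counter({'natural language processing': 2})
```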

Filtering process of the proposed method
In this paper we use a filter function that removes un-useful terms from the candidate keywords (a filter for bad keywords) depending on their frequency in the text, as in algorithm 1.
Algorithm 1: "filtering"
Input:
List0: list of keywords.
List1: list of candidate keywords.
List2: list of keywords found in the text with their frequency.
Output:
List: list of keywords.
Process:
Begin
1. List = List0;
2. Find the average frequency for each keyword in List0 by using their frequency found in List2;
3. Get the minimum average frequency to be Min;

Compute the relevance degree
We find the average frequency of the title's keywords across the text and take it as the relevance degree between the text and its title, as in algorithm 2.
Algorithm 2: "find_relevance_degree"
Input: Title; Text.
Output: Relevance degree between the text and its title.

IMPLEMENTATION FOR THE PROPOSED METHOD
We take an example in order to describe our proposed method. Let the title be: "Using Natural language processing for Steganography purpose". Let the text be: "Steganography is the technique of hiding information within some format in a way that makes it difficult to detect by one who doesn't know it's there.
Steganography has become quite advanced and allows for information hiding in all types of data files. One important method of information hiding is constructing the context free grammar that may found in computation theory but this method is not suitable when the CFG is ambiguous. Thus in our proposed method we will use natural language processing instead of CFG to avoid the problem of ambiguous and unambiguous grammar. NLP is used as abbreviation for natural language processing".
We have three stages: processing the title to extract its keywords, processing the text to extract its keywords with their frequencies, and then finding the average frequency of the title's keywords (after the filtering process) across the text, which is the relevance degree between the text and its title.

Stage1: Title processing
• "using" is not found directly in the index tree of non-keywords, but after the morphology extracts its stem "use" and the suffix "ing", the system finds "use" in the non-keyword index tree (Bt2); after retrieving its logical term, the system finds the suffix "ing" among its affixes, so "using" is discarded as a non-keyword.
• "natural" is found in the keyword index tree (Bt1); after checking its stored information and following the reference to the next words, we obtain the keyword "natural language processing" and retrieve its synonym "linguistics" and its abbreviation "NLP".
• "for" is found directly in the index tree of non-keywords and is discarded.
• "steganography" is found in the index tree of keywords (Bt1), is treated as a keyword, and its synonym "information hiding" is retrieved.
• "purpose" is found neither in the index tree of non-keywords (Bt2) nor in the index tree of keywords (Bt1), and is regarded as a candidate keyword.

Stage2: Text processing
In this stage the proposed system uses the same process that was used with the title in order to find the keyword list of the text, with frequencies, and to discard the other words. The keywords of the text are:
Keyword                       Synonyms and abbreviations   Frequency
steganography                 information hiding           5
context free grammar          CFG                          3
computation theory            computer theory              1
ambiguous                     ambiguation                  3
natural language processing   Linguistic, NLP              3
Since the title's candidate keyword "purpose" does not appear in the text, the output of the filtering process is just the list of title keywords.

Stage3: compute the relevance degree:
• Compute the summation (Total) of the frequencies of all keywords of the text: Total = 15.
• Compute the summation (Sum) of the frequencies of the title keywords that appear in the text: Sum = 8.
• Compute the average Av = Sum/Total = 8/15 ≈ 0.53.
• Relevance degree = Av*100.
So the relevance degree between the text and its title is 53% (see figure 6).
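The computation in this stage can be checked with a short Python sketch that uses the frequencies from the worked example. The function name, the rounding to a whole percent, and the return value of 0 for a text without keywords are assumptions made for illustration.

```python
def relevance_degree(text_freq, title_keywords):
    """Return Sum / Total * 100, rounded to the nearest whole percent."""
    total = sum(text_freq.values())
    if total == 0:
        return 0   # assumption: a text with no keywords has relevance 0
    title_sum = sum(f for kw, f in text_freq.items() if kw in title_keywords)
    return round(title_sum / total * 100)

# Frequencies taken from Stage 2 of the worked example.
text_freq = {
    "steganography": 5,
    "context free grammar": 3,
    "computation theory": 1,
    "ambiguous": 3,
    "natural language processing": 3,
}
title_keywords = {"natural language processing", "steganography"}
print(relevance_degree(text_freq, title_keywords))  # 53
```

Here Total = 15 and Sum = 5 + 3 = 8, so the result agrees with the 53% reported above.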

DISCUSSION
For the purpose of evaluation, we compare our approach with frequency-based keyword extraction that can be used to find the relevance degree between a text and its title (see table 2).
Table (1): Some spelling rules for adding some suffixes to the end of English words.
B+ Tree [12,13]: a B+ tree is a structure of nodes linked by pointers that:
• is anchored by a special node called the root, and bounded by leaves;
• has a unique path to each leaf, and all paths are of equal length;
• stores keys only at leaves, and stores reference values in other, internal, nodes;
• guides key search, via the reference values, from the root to the leaves.

Figure (2): The architecture of the proposed method.
Figure (3): How redundancy is prevented in the lexicon of the proposed method.
Figure (4): Flowchart for title processing of the proposed method.
Figure (5): Flowchart for text processing of the proposed method.