Genetic Based Optimization Models for Enhancing Multi-Document Text Summarization

Extractive multi-document text summarization, which aims to remove redundant information from a document collection while preserving its salient sentences, has recently attracted considerable interest in automatic models. This paper proposes two models for extractive multi-document summarization based on a genetic algorithm (GA). First, the problem is described and modeled as a discrete optimization problem with two candidate expressions, and a specific fitness function is designed to cope with each candidate. Then, a binary-encoded representation together with a heuristic mutation and a local repair operator are proposed to characterize the adopted GA. The semantic roles of sentence-to-sentence similarity, sentence-to-collection-center similarity, and summary-center-to-collection-center similarity are exploited in the proposed model formulations. Experiments are conducted on ten clusters from the DUC2002 dataset (d061j through d070f) and compared with another state-of-the-art model. The results demonstrate the effectiveness of the proposed models. Moreover, injecting several levels of text similarity into the model formulation has a positive impact on the overall performance of the proposed GA.


INTRODUCTION
Identification of relevant information that meets user needs has become very difficult as a result of the exponential growth of the Internet and the availability of a huge amount of online information. This has triggered a race to develop automatic document summarization tools. This race matters not only to professionals who aim to find information in a short time but also to large search engines such as Google, Yahoo, AltaVista, and others.
The main goal of any text summarization technique is to present the common and most important information in a shorter version of the original text while preserving its main content and overall meaning, helping the user to quickly understand a large volume of information. The text summarization problem belongs to several disciplines, such as computer science, multimedia, statistics, and cognitive psychology. Thus, different dimensions can be used to classify document summarization. A summary can be either a generic summary or a query-relevant summary [1][2][3][4]. In a generic summary, an overall sense of the document content is presented without any prior knowledge; in a query-relevant summary, on the other hand, the information presented should be relevant to a given query or topic [5]. Text summarization methods can also be either extractive or abstractive. Extractive methods select a subset of existing words, phrases, or sentences found in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might write. Such a summary might contain novel words that are not explicitly present in the original text. Moreover, the summary can be created from either a single document or a multi-document collection, depending on the number of documents to be summarized [3,6]. Single-document summarization can only produce a shorter representation of one document, whereas multi-document summarization (MDS) can produce a summary of a set of documents.
The main contribution of this paper is to model the multi-document text summarization task as an optimization problem. The proposed model emphasizes the discovery of essential sentences that cover the main topic of the document collection while avoiding redundant sentences. A binary-encoded genetic algorithm together with heuristic mutation and local repair operators is proposed to handle the modeled optimization problem. The organization of this paper is as follows. Section 2 reviews the optimization-based works most related to the approach proposed here. Sections 3 and 4 introduce the details of the proposed mathematical formulation and modeling. The numerical experiments and results are presented in Section 5. Finally, conclusions and some possible extensions to the current work are given in Section 6.

Related Work
In the literature, multi-document summarization approaches vary in their essence. Various extraction-based techniques have been proposed for generic text summarization. One of the popular extractive summarization methods is the centroid-based method [7]. This paper briefly reviews only the optimization-based works most related to the approach proposed here. Text summarization can be categorized as a combinatorial optimization problem. The optimization-based text summarization algorithms reported in the literature are mainly classified as heuristic algorithms. Heuristic methods do not guarantee finding an optimal solution within a finite amount of time, but they can provide acceptable, near-optimal solutions in a fraction of the computation time.

In [8], a method using latent semantic analysis is proposed to identify semantically important sentences and to select highly ranked sentences that differ from each other for the summary. Other methods include Non-negative Matrix Factorization (NMF-based) topic specification [9,10,11] and Conditional Random Fields based (CRF-based) summarization [1]. In [9], a multi-document summarization framework based on sentence-level semantic analysis and symmetric Non-negative Matrix Factorization is proposed. The relationships between sentences are captured by sentence-level semantic analysis, and the similarity matrix is factorized by symmetric Non-negative Matrix Factorization to obtain groups of sentences that are meaningful for extraction. In [12], text summarization is modeled as a maximum coverage problem that aims at covering as many conceptual units as possible while avoiding redundancy in summarization and question answering. The problem is formalized by positing a textual unit space, a conceptual unit space, and a mapping between them. McDonald [13] models text summarization as a knapsack problem. Text summarization is represented as a maximum coverage problem with the knapsack constraint in [14]. In this work, three algorithms are studied for global inference in multi-document summarization. A dynamic programming algorithm based on solutions to the knapsack problem is found to perform best in terms of accuracy and scaling characteristics relative to both an exact algorithm and greedy algorithms. In addition, the compatibility of the knapsack and greedy algorithms with arbitrary scoring functions, which can greatly benefit performance, is noted. Shen et al. [1] present a framework based on Conditional Random Fields for generic document summarization that keeps the merits of supervised and unsupervised approaches while avoiding their disadvantages. This approach treats the text summarization task as a sequence labeling problem. A feature common to all these works is that they rank sentences based on classification models.

Multi-document generic summarization is modeled in [15] as a budgeted median problem. This model covers the entire relevant part of the document cluster through sentence assignment and incorporates asymmetric relations between sentences in a natural manner. The work in [10] proposes a Bayesian sentence-based topic model (BSTM) for multi-document summarization that makes use of both term-document and term-sentence associations. It models the probability distributions of selecting sentences given topics and provides a principled way to perform the summarization task. In [16], document summarization is formalized as a multi-objective optimization problem. In particular, four objective functions are involved, namely information coverage, significance, redundancy, and text coherence. These four objective functions measure the generated summaries according to the cluster of semantically or statistically related core terms. In [17], an optimization-based method for opinion summarization based on the p-median clustering problem from facility location theory is proposed, in which content selection is viewed as the selection of clusters of related information. A formulation of the widely used greedy maximum marginal relevance (MMR) algorithm as an integer linear program is introduced in [18].

In [19], sentence-extraction-based multi-document summarization is formalized as a discrete optimization problem and solved using an adaptive differential evolution algorithm. The approach addresses all three aspects of summarization: content coverage, redundancy, and length. In [20], text summarization is modeled as an integer linear programming problem. The proposed model demonstrates that the summarization result depends on the similarity measure: a combination of the NGD-based and cosine similarity measures leads to better results than using either separately. In [21], document summarization is modeled as a nonlinear 0-1 programming problem in which the objective function is defined as the Heronian mean of the objective functions for content coverage and redundancy minimization. The optimization problem is solved using a discrete particle swarm algorithm based on an estimation of distribution algorithm. The work in [22] formulates text summarization as a modified p-median problem that takes into consideration four objectives needed to generate good summaries: relevance, content coverage, redundancy minimization, and bounded length. A self-adaptive differential evolution algorithm is created to solve the proposed model. Multi-document summarization is modeled in [23] as a Quadratic Boolean Programming problem, which is a weighted combination of two objectives that are important for generating a good summary: content coverage and redundancy reduction. The optimization problem is solved using a modified differential evolution algorithm. In [24], text summarization is formulated as linear and nonlinear optimization models that aim to balance content coverage and redundancy reduction in the target summary simultaneously. A novel particle swarm optimization algorithm is developed to solve the optimization problem. The work in [25] proposes constraint-driven multi-document summarization models enforcing diversity and maximum coverage, which are modeled as a quadratic integer programming problem. The optimization problem is solved using a discrete Particle Swarm Optimization algorithm. The paper [26] proposes an optimization-based model for generic multi-document summarization. The proposed model describes content coverage and redundancy minimization in the target summary as sentence-to-document relations, summary-to-document relations, and relations between each pair of sentences in the document collection. An adaptive crossover, which adjusts the crossover rate according to the fitness of individuals, is used to improve the differential evolution algorithm that solves the optimization problem.

Problem Statement and Formulation Preliminaries
Several methodologies have been explored for text similarity; however, they fall into four major categories: word co-occurrence/vector-based methods, corpus-based methods, hybrid methods, and descriptive feature-based methods [27]. In text summarization, vector-based methods are commonly used [28]. Let $T = \{t_1, t_2, \ldots, t_m\}$ represent the distinct terms in a document collection $D$. Cosine similarity is the most popular measure for evaluating text similarity between any pair of sentences represented as vectors of terms. For the set $T$ of different terms composing the sentences of a document collection $D$, cosine similarity associates a weight $w_{ik}$ with term $t_k$ according to its magnitude in sentence $s_i$. According to the term-frequency inverse-sentence-frequency scheme ($tf\_isf$), this weight can be formulated as [28]:

$$w_{ik} = tf_{ik} \times \log(n / n_k) \qquad (1)$$

where $tf_{ik}$ is the measure of how frequently term $t_k$ occurs in sentence $s_i$, and $\log(n / n_k)$ is the measure of how few sentences contain the term $t_k$ (with $n$ the number of sentences and $n_k$ the number of sentences containing $t_k$). Intuitively, if a term $t_k$ does not exist in sentence $s_i$, $w_{ik}$ should be zero. Now, given two sentences $s_i = (w_{i1}, w_{i2}, \ldots, w_{im})$ and $s_j = (w_{j1}, w_{j2}, \ldots, w_{jm})$, the cosine similarity between these two sentences can be calculated as in Eq. 2:

$$sim(s_i, s_j) = \frac{\sum_{k=1}^{m} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{m} w_{ik}^2} \, \sqrt{\sum_{k=1}^{m} w_{jk}^2}} \qquad (2)$$

Quantitatively, the main content of a document collection $D$ represented in the $T = \{t_1, t_2, \ldots, t_m\}$ space can be reflected by the mean weights of the terms in $T$. Thus, for the $T = \{t_1, t_2, \ldots, t_m\}$ vector, a mean vector $O = (o_1, o_2, \ldots, o_m)$ can be computed. The $k$-th coordinate $o_k$ of the mean vector can be calculated as [6]:

$$o_k = \frac{1}{n} \sum_{i=1}^{n} w_{ik} \qquad (3)$$
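The three quantities above (tf-isf weights, cosine similarity, and the mean vector of the collection) can be illustrated with a minimal Python sketch. It is not the authors' implementation: whitespace tokenization, raw counts for the term frequency, and the function names `tf_isf_vectors`, `cosine`, and `mean_vector` are assumptions made only for this illustration.

```python
import math
from collections import Counter


def tf_isf_vectors(sentences):
    """Return (terms, weights) where weights[i][k] = tf_ik * log(n / n_k).

    Assumptions: whitespace tokenization and raw term counts as tf.
    """
    tokenized = [s.lower().split() for s in sentences]
    terms = sorted({t for tokens in tokenized for t in tokens})
    n = len(tokenized)
    # n_k: number of sentences that contain term k
    sentence_freq = Counter(t for tokens in tokenized for t in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # a missing term counts as 0, so its weight is 0
        weights.append([tf[t] * math.log(n / sentence_freq[t]) for t in terms])
    return terms, weights


def cosine(u, v):
    """Cosine similarity between two weight vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(weights):
    """Coordinate-wise mean of the sentence vectors (center of the collection)."""
    n = len(weights)
    return [sum(column) / n for column in zip(*weights)]
```

A term that occurs in every sentence receives weight zero under this scheme, since $\log(n / n_k) = 0$ when $n_k = n$.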

Problem statement and Formulation
The proposed text summarization problem is expressed here while considering three challenges:
• Content coverage: the main topic of the document collection should be covered by the generated summary.
• Redundancy reduction: similar sentences in the document collection should not be duplicated in the generated summary.
• Length: the summary should be of a bounded length.

Let $D$ be a document collection of $N$ documents, i.e. $D = \{d_1, \ldots, d_N\}$. In terms of sentences, $D$ can be written as $D = \{s_i \mid 1 \le i \le n\}$, where $n$ is the number of distinct sentences from the documents in $D$. The aim of this paper is to generate a summary $S \subset D$ that satisfies the above three criteria. The first attempt to model and formulate the text summarization problem is given in the following two definitions.

Definition 1 (Summary $S$). Let $s_i \in D$ be a sentence to be included in the summary $S$. Then the content coverage, expressed by the similarity $sim(s_i, O)$ between $s_i$ and the set of sentences in the document collection $D$ (represented by its mean vector $O$), should be maximized. On the other hand, the redundancy, or quantitatively the similarity $sim(s_i, s_j)$ between any two sentences $s_i$ and $s_j$ belonging to $S$, should be minimized. To formalize this suggestion, the text summarization problem is modeled using the following definition.

Definition 2 (text summarization problem $\Phi$). Let $x_i \in \{0, 1\}$ be a binary decision variable denoting the existence (1) or absence (0) of the sentence $s_i$ in $S$ (see Eq. 4). Also, let $y_{ij} \in \{0, 1\}$ be another binary decision variable relating to the existence of both sentences $s_i$ and $s_j$ in $S$ (see Eq. 5). Now, let $X = \{x_i \mid 1 \le i \le n\}$ be a vector of such decision variables corresponding to the $n$ sentences. Then, for the vector $X$, the text summarization problem (see Eq. 6 and Eq. 7) is a constrained maximization problem that combines maximizing the content coverage (numerator) and minimizing information redundancy (denominator).

In the second attempt, the text summarization problem is redefined by projecting the first criterion, i.e. content coverage, in the light of text similarity. The proposed model hypothesizes a possible decomposition of text similarity into three different levels of optimization. The first level aspires to global optimization: the candidate summary should cover the summary of the document collection. The second level attains a more or less global optimization: the sentences of the candidate summary should cover the summary of the document collection. The third level is content with local optimization, where the difference between the magnitude of the terms covered by the candidate summary and those of the document collection should be small. The summary $S$ and the text summarization problem $\Phi$ can then be formulated as in Definitions 3 and 4, respectively.
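Before moving to the second formulation, the ratio-style objective of Definition 2 can be sketched in Python. The exact forms of Eq. 6 and Eq. 7 are not reproduced here, so the sketch assumes the simplest reading: the summed similarity of the selected sentences to the center $O$ divided by the summed pairwise similarity among the selected sentences, with $y_{ij} = 1$ exactly when both $x_i$ and $x_j$ are 1. The small `eps` guard against a zero denominator and the names `coverage_redundancy_ratio` and `within_length` are hypothetical.

```python
def coverage_redundancy_ratio(x, sim_to_center, sim_pair, eps=1e-9):
    """Assumed fitness: coverage (numerator) over redundancy (denominator).

    x             -- list of 0/1 genes, one per sentence
    sim_to_center -- sim_to_center[i] = sim(s_i, O)
    sim_pair      -- sim_pair[i][j] = sim(s_i, s_j)
    eps           -- hypothetical guard so a one-sentence summary does not divide by zero
    """
    chosen = [i for i, bit in enumerate(x) if bit == 1]
    coverage = sum(sim_to_center[i] for i in chosen)
    # Each unordered pair of selected sentences is counted once (y_ij = x_i * x_j).
    redundancy = sum(
        sim_pair[i][j] for pos, i in enumerate(chosen) for j in chosen[pos + 1:]
    )
    return coverage / (redundancy + eps)


def within_length(x, lengths, max_len):
    """Length criterion: total length of the selected sentences must not exceed max_len."""
    return sum(length for bit, length in zip(x, lengths) if bit == 1) <= max_len
```

Under this reading, a candidate that selects central but mutually dissimilar sentences scores higher than one that selects near-duplicates.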

Definition 3 (Summary $S$). Let $s_i \in D$ be a sentence to be included in the generated summary $S$. Then three different semantics of coverage (summary level, sentence level, and term level) can cooperate to define the content coverage criterion. The summary level is expressed by the degree of similarity $sim(O^S, O)$ between the mean vector $O^S$ of a candidate summary and the center $O$ of the document collection. The sentence level is defined by the degree of similarity $sim(s_i, O)$ between sentence $s_i$ and the mean vector $O$ of the document collection $D$. The term level is defined by the degree of similarity between the mean weight $o_k^S$ of term $t_k$ in a candidate summary and its corresponding term $o_k$ in the center of the document collection. On the other hand, the redundancy, or quantitatively the similarity $sim(s_i, s_j)$ between any two sentences belonging to $S$, should be minimized.

Definition 4. The text summarization problem $\Phi$ can be expressed as a constrained optimization problem that combines maximizing the content coverage (numerator) and minimizing information redundancy (denominator). Content coverage is expressed by maximizing both $sim(O^S, O)$ and $sim(s_i, O)$ while simultaneously minimizing the term-level difference between $o_k^S$ and $o_k$.
As can be seen in Eq. 9, the magnitude of term $t_k$ in the candidate summary $S$ can be expressed by its impact, i.e. the average of the total weights of $t_k$ occurring in the sentences of $S$. Likewise, the magnitude of term $t_k$ in $D$ can be computed as the average of the total weights of $t_k$ occurring in the sentences of $D$. Intuitively, the difference between these two magnitudes should be small over all terms of $S$ and $D$. Moreover, the similarity at the summary level, i.e. $sim(O^S, O)$, is multiplied by 10 to unify its scale with the values of the other two similarity levels expressed in the numerator.
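The three coverage levels described above can be computed separately, as in the following Python sketch. How Eq. 8 and Eq. 9 combine them into a single fitness value is not reproduced here; the sketch only returns the three components. The factor of 10 on the summary level follows the text, while the use of a sum of absolute differences for the term level and the function name `coverage_levels` are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two weight vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(rows):
    """Coordinate-wise mean of a non-empty list of weight vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def coverage_levels(x, weights):
    """Return (summary_level, sentence_level, term_level) for candidate x.

    x       -- list of 0/1 genes, one per sentence
    weights -- weights[i] is the tf-isf vector of sentence s_i
    """
    center = mean_vector(weights)  # O, center of the collection
    chosen = [w for bit, w in zip(x, weights) if bit == 1]
    if not chosen:  # an empty summary covers nothing
        return 0.0, 0.0, sum(abs(c) for c in center)
    summary_center = mean_vector(chosen)  # O^S, center of the summary
    summary_level = 10 * cosine(summary_center, center)  # scaled by 10 as in the text
    sentence_level = sum(cosine(w, center) for w in chosen)
    # Assumed term-level measure: total absolute gap between term magnitudes.
    term_level = sum(abs(a - b) for a, b in zip(summary_center, center))
    return summary_level, sentence_level, term_level
```

The first two components are to be maximized and the third kept small, matching the roles given to them in Definition 4.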

The proposed Genetic Algorithm
A genetic algorithm (GA) is a population-based optimization algorithm that evolves a population of initial solutions toward progressively better ones by means of evolutionary operators. In the proposed GA, each genotype solution is represented by a fixed-length vector of size $n$, where each gene value indicates the presence or absence of the corresponding sentence. The whole search space of the proposed GA can then be computed as the Cartesian product of the presence/absence of all sentences, i.e.:

$$\prod_{i=1}^{n} \{0, 1\} = \{0, 1\}^n, \text{ which contains } 2^n \text{ candidate solutions} \qquad (10)$$

The proposed GA can be described as a process formulated as an iterative function $\Psi$ that maps the current population to the next one, $\mathbb{P}_{t+1} = \Psi(\mathbb{P}_t)$, where $\mathbb{P}_t$ is the population at iteration $t$. The process starts with an initial random population and continues until a maximum number of iterations is reached. The evolution function $\Psi$ in each iteration is composed of three main operators: selection, crossover, and heuristic mutation, each of which is controlled by its own control parameter. Formally, $\Psi$ is the composition of these three operators.

By applying the selection operator, bad chromosomes are eliminated, whereas the fittest chromosomes are copied to the next generation to improve the average quality of the population. Tournament selection has been adopted in this work. In tournament selection, several individuals are chosen at random and only the fittest of them is selected for the next generation. The number of randomly selected individuals, i.e. the tournament size, is determined by the control parameter $\Theta$ of the selection operator.
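The binary encoding and tournament selection described above can be sketched as follows in Python. This is a minimal illustration, not the authors' code: the function names, the use of `random.Random` for reproducibility, and sampling contenders without replacement are assumptions.

```python
import random


def random_population(pop_size, n_sentences, rng):
    """Initial population: pop_size binary chromosomes of length n_sentences."""
    return [[rng.randint(0, 1) for _ in range(n_sentences)] for _ in range(pop_size)]


def tournament_select(population, fitness, tournament_size, rng):
    """Pick tournament_size individuals at random and return a copy of the fittest.

    fitness[i] is the fitness of population[i]; contenders are drawn
    without replacement (an assumption of this sketch).
    """
    contenders = rng.sample(range(len(population)), tournament_size)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best][:]


def select_generation(population, fitness, tournament_size, rng):
    """Fill the next generation by running one tournament per slot."""
    return [
        tournament_select(population, fitness, tournament_size, rng)
        for _ in population
    ]
```

A larger tournament size increases selection pressure: when the tournament covers the whole population, the fittest individual always wins.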
Uniform crossover has been adopted. In this type of crossover, each gene of each child chromosome is created by randomly selecting the respective gene from one of the two parents. Both parents are given an equal chance to contribute to the chromosomes that are created from them [29]. The crossover rate is determined by the control parameter $\Theta$ of the crossover operator.
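A minimal Python sketch of uniform crossover on binary chromosomes is given below. It assumes that the crossover rate is the probability that a pair of parents is recombined at all, and that two complementary children are produced; both choices, and the name `uniform_crossover`, are assumptions of this illustration.

```python
import random


def uniform_crossover(parent_a, parent_b, crossover_rate, rng):
    """Return two children; each gene comes from either parent with equal chance.

    With probability (1 - crossover_rate) the parents are copied unchanged
    (assumed interpretation of the crossover rate).
    """
    if rng.random() >= crossover_rate:
        return parent_a[:], parent_b[:]
    child_a, child_b = [], []
    for gene_a, gene_b in zip(parent_a, parent_b):
        if rng.random() < 0.5:  # equal chance for both parents
            child_a.append(gene_a)
            child_b.append(gene_b)
        else:
            child_a.append(gene_b)
            child_b.append(gene_a)
    return child_a, child_b
```

Because the two children are complementary, every gene of both parents survives in exactly one child at each position.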
A heuristic mutation operator is proposed in this work. Here, the mutation operator is controlled by two parameters. The first parameter is the well-known mutation probability, $p_m$, which controls the probability of mutation on each gene. The second parameter is the mutation action, which controls the role of mutation on each mutated gene. The mutation action is governed by the following similarity condition. For a given gene $x_i$ and a uniform random variable $r \sim U(0, 1)$, if the mutation probability is satisfied (i.e., $r \le p_m$), then the similarity condition is checked. The condition checks whether the similarity between the sentence $s_i$ and the mean vector $O$ is above or below the average similarity of the sentences in the document collection $D$. If it is satisfied, the corresponding sentence $s_i$ is selected for the generated summary $S$. Otherwise, it is removed from the summary. Formally, this rule is applied for all $i \in \{1, \ldots, n\}$ with $r \le p_m$ (Eq. 13).

The best solution, $\mathbb{P}^*$, of the final generation of the GA is selected as the result of the maximization problem.
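The heuristic mutation described above can be sketched in Python as follows. The sketch assumes that a sentence whose similarity to the center equals the average counts as satisfying the condition, and it uses a strict comparison for the random draw so that a mutation probability of zero never mutates; the name `heuristic_mutation` is hypothetical.

```python
import random


def heuristic_mutation(chromosome, sim_to_center, mutation_prob, rng):
    """Mutate genes toward sentences that are more central than average.

    chromosome    -- list of 0/1 genes, one per sentence
    sim_to_center -- sim_to_center[i] = sim(s_i, O)
    mutation_prob -- per-gene mutation probability p_m
    """
    average = sum(sim_to_center) / len(sim_to_center)
    mutated = chromosome[:]
    for i in range(len(mutated)):
        if rng.random() < mutation_prob:  # gene i is chosen for mutation
            # Mutation action: include the sentence if it is at least as
            # similar to the center as the average sentence, else drop it.
            mutated[i] = 1 if sim_to_center[i] >= average else 0
    return mutated
```

Unlike a plain bit flip, this operator never removes a sentence that is more central than average, which biases the search toward summaries with good coverage.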
However, the phenotype of the best solution may still violate the length constraint. To this end, a local repair operator is proposed to handle a summary that contains more than the length constraint allows. Firstly, this repair operator removes from $\mathbb{P}^*$ those redundant sentences that have a high degree of similarity between them. Considering a similarity threshold of 0.9 and two sentences $s_i$ and $s_j$ in $\mathbb{P}^*$, one of them is excluded from the final generated summary if their similarity is greater than or equal to the threshold (see Eq. 17). Secondly, the operator selects only the sentences of high importance in $\mathbb{P}^*$. Each sentence $s_i$ belonging to $\mathbb{P}^*$ is ranked according to the formula in Eq. 18 to obtain a corresponding score. In that formula, $sim(O^S, O)$ refers to the similarity between the centre of the generated summary (including sentence $s_i$) and the centre of the document collection $D$. On the other hand, $sim(O^{S \setminus s_i}, O)$ denotes the similarity between the centre of the generated summary (excluding sentence $s_i$) and the centre of the document collection $D$. The right term of the formula is multiplied by 10 in order to unify the scale of the two terms. The basic idea behind the right term is to measure the impact of each of the sentences that exist in the best phenotype summary. The sentence with the highest score has a great impact on the summary and is of high importance, whereas the sentence with the lowest score has little impact on the final summary. The sentences are sorted in descending order of score, and the highest-scored sentences are selected for inclusion in the final summary until the required length is reached.
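The two steps of the repair operator can be sketched in Python as shown below. Several details are assumptions because Eq. 17 and Eq. 18 are not reproduced here: the sketch keeps the earlier sentence of a redundant pair, takes the per-sentence scores as a precomputed input instead of evaluating Eq. 18, and stops adding sentences as soon as the next one would exceed the length limit. The name `repair` is hypothetical.

```python
def repair(selected, sim_pair, score, lengths, max_len, threshold=0.9):
    """Repair the best solution so that it is non-redundant and within length.

    selected  -- indices of the sentences present in the best solution
    sim_pair  -- sim_pair[i][j] = sim(s_i, s_j)
    score     -- score[i], importance of sentence i (assumed precomputed)
    lengths   -- lengths[i], length of sentence i
    max_len   -- maximum allowed summary length
    threshold -- similarity at or above which two sentences are redundant
    """
    # Step 1: drop one sentence of every pair that is too similar
    # (assumption: the sentence encountered first is the one kept).
    kept = []
    for i in selected:
        if all(sim_pair[i][j] < threshold for j in kept):
            kept.append(i)

    # Step 2: take the highest-scored sentences until the length is reached.
    summary, total = [], 0
    for i in sorted(kept, key=lambda idx: score[idx], reverse=True):
        if total + lengths[i] > max_len:
            break
        summary.append(i)
        total += lengths[i]
    return sorted(summary)  # restore original sentence order
```

Returning the indices in their original order is a presentation choice of the sketch; the ranking itself only decides which sentences survive.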

Experiments
The two proposed models were evaluated quantitatively on the multi-document summarization datasets provided by the Document Understanding Conference (DUC), in particular the DUC2002 dataset [30]. Brief statistics of the dataset are given in Table-1.

Evaluation metrics
The proposed work is quantitatively measured using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) evaluation metric [32]. ROUGE is considered the official evaluation metric for text summarization by DUC. It includes measures that automatically determine the quality of a computer-generated summary by comparing it with human-generated summaries. The comparison is carried out by counting the number of overlapping units, such as n-grams, word sequences, and word pairs, between the summary generated by a machine and a set of reference summaries generated by humans.
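ROUGE-N is an n-gram recall counting the number of n-gram matches between two summaries, and it is calculated as follows [32]:

$$ROUGE\text{-}N = \frac{\sum_{S \in \{Reference\ Summaries\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{Reference\ Summaries\}} \sum_{gram_n \in S} Count(gram_n)}$$

where $n$ stands for the length of the n-gram, $Count_{match}(gram_n)$ is the maximum number of n-grams co-occurring in the candidate summary and the set of reference summaries, and $Count(gram_n)$ is the number of n-grams in the reference summaries.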
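The n-gram recall above can be illustrated with a short Python sketch. It is not the official ROUGE toolkit: whitespace tokenization, lowercasing, the absence of stemming or stop-word removal, and the function names `ngrams` and `rouge_n` are simplifying assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, references, n):
    """N-gram recall of a candidate summary against reference summaries."""
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = total = 0
    for reference in references:
        ref_counts = ngrams(reference.lower().split(), n)
        total += sum(ref_counts.values())
        # Clipped count: an n-gram matches at most as often as it
        # occurs in the candidate.
        matched += sum(min(count, cand_counts[g]) for g, count in ref_counts.items())
    return matched / total if total else 0.0
```

For example, against the single reference "the cat was on the mat", the candidate "the cat sat on the mat" shares three of the five reference bigrams, so its bigram recall is 0.6.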
The similarity between a reference summary sentence $X$ of length $m$ and a candidate summary sentence $Y$ of length $n$ is calculated using the longest common subsequence (LCS) based measure (also called ROUGE-L, denoted here by $R_{lcs}$). ROUGE-L evaluates the ratio between the length of the longest common subsequence of the two summaries, $LCS(X, Y)$, and the length of the reference summary as follows [32]:

$$R_{lcs} = \frac{LCS(X, Y)}{m}$$

If the definition of ROUGE-L is applied at the summary level, the union LCS matches between a reference summary sentence, $r_i$, and the sentences of the candidate summary $C$, denoted by $LCS_{\cup}(r_i, C)$, are taken. Given a reference summary of $u$ sentences containing a total of $m$ words and a candidate summary of $v$ sentences containing a total of $n$ words, the summary-level ROUGE-L is calculated as follows [32]:

$$R_{lcs} = \frac{\sum_{i=1}^{u} LCS_{\cup}(r_i, C)}{m}$$
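The sentence-level LCS ratio can be sketched in Python with a standard dynamic-programming LCS. The sketch covers only the sentence-level recall, not the summary-level union LCS, and assumes whitespace tokenization and lowercasing; the names `lcs_length` and `rouge_l_recall` are hypothetical.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(y) + 1)
    for token_x in x:
        current = [0]
        for j, token_y in enumerate(y, start=1):
            if token_x == token_y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_recall(reference, candidate):
    """LCS length divided by the length of the reference sentence."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens:
        return 0.0
    return lcs_length(ref_tokens, cand_tokens) / len(ref_tokens)
```

Unlike n-gram matching, the LCS rewards in-order matches that need not be contiguous: for the reference "police killed the gunman", the candidate "police kill the gunman" scores 0.75, whereas "the gunman kill police" scores 0.5.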

Results and Discussion
To evaluate the proposed models, a comparison with another related model should be performed. In this paper, the model proposed in [19] is used for comparison. This model formulates the content coverage and redundancy reduction issues as in Eq. 26. For fairness of comparison, the model in [19] has been solved using the GA proposed in this paper. The three models are compared using the ROUGE-2 and ROUGE-L evaluation metrics. These evaluation metrics were calculated by comparing the summaries generated by the three GA-based models against summaries generated by humans. The reference summaries generated by experts are supplied with the DUC2002 dataset (each topic in the DUC2002 dataset comes with two human reference summaries generated by two different experts). The ten topics used are d061j, d062j, d063j, d064j, d065j, d066j, d067f, d068f, d069f, and d070f. Table-2 presents some statistics that describe the documents of these topics in order to indicate the size of the search space for the problem. The ROUGE scores are reported in terms of their average and standard deviation over all topics. From the results reported in Tables 3-5, one can easily see that the two proposed models perform better than the model proposed in [19]. Moreover, splitting the content coverage objective into three distinct similarity sub-objectives, as suggested in the second model $\Phi$, improves the overall quality of the generated summary.

CONCLUSION
Effective multi-document summarization techniques that extract the important information from a document collection have become a necessity. A good summary should keep the key sentences representing the main topic of the document collection while simultaneously discarding irrelevant and redundant ones. Two optimization models are introduced in this paper to satisfy content coverage and diversity in the document collection. An improved performance is reported for the second model, in which text similarity is decoupled along three dimensions: sentence-to-sentence similarity, sentence-to-document-collection similarity, and summary-to-document-collection similarity. A genetic algorithm together with heuristic mutation and local repair operators has been proposed to solve the modeled problem. The performance of the proposed models shows improvement over the model proposed in [19]. The results reported in this paper encourage further investigation. The current interest is to take a further step towards capturing the essence of the text summarization problem. Given the implicitly contradictory nature of content coverage and content diversity, the text summarization problem can be modeled as a multi-objective optimization problem. Moreover, a multi-objective evolutionary algorithm will be adopted to handle the formulated multi-objective problem.

The proposed models and the model introduced in [19] have been run on ten clusters from the DUC2002 dataset.

Table 1. Description of the DUC2002 dataset.

Like all other related works, the documents in the DUC2002 dataset are first preprocessed as follows:

Table-3 and Table-4 present detailed average ROUGE scores in addition to the best and worst values for the 20 runs. In these tables, the best results obtained are shaded.