An Improved Distributed Association Rule Algorithm

Distributed association rule mining (DARM) algorithms based on the Apriori algorithm lack an efficient message-optimization technique: they exchange numerous messages during the mining process and require several distributed scans of the distributed warehouses or distributed databases to obtain support values. The performance of these DARM algorithms degrades as communication cost grows, especially as the number of distributed mining sites increases and the itemsets to be mined become larger. The aim of this work is to improve association rule mining in distributed data mining by proposing a new, efficient method that reduces the average size of transferred records, datasets, and messages, without requiring any distributed scan of the distributed data warehouses or distributed databases to retrieve the support values of these datasets. The results obtained show that the proposed method outperforms the existing algorithms: it reduces communication costs and central storage requirements, enhances performance, and achieves a high degree of scalability.

An association rule is an expression A ⇒ B, where A and B are itemsets. The rule's support (S) is the joint probability of a transaction containing both A and B, and is written S(A ⇒ B). The confidence of the rule is the conditional probability that a transaction contains B given that it contains A, and is given by S(A ∪ B)/S(A). A rule is frequent if its support exceeds a user-specified minimum support (min_sup) and strong if its confidence exceeds a user-specified minimum confidence (min_conf). Data mining involves generating all association rules in the database whose support is greater than min_sup (the rules are frequent) and whose confidence is greater than min_conf (the rules are strong) [6]. The two important measures for association rules, support (S) and confidence (C), are defined as follows. Definition 1: Support (S): Support(X, Y) = Pr(X ∪ Y) = count of (X ∪ Y) / total transactions [8].
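The two measures above can be computed directly from a transaction database. The following Python sketch illustrates the definitions; the toy database and all names are invented for illustration only:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Conditional probability of `consequent` given `antecedent`:
    S(X ∪ Y) / S(X)."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Toy database of four transactions.
db = [{"milk", "bread"}, {"milk", "bread", "eggs"},
      {"bread"}, {"milk", "eggs"}]

print(support(db, {"milk", "bread"}))       # 2 of 4 transactions -> 0.5
print(confidence(db, {"milk"}, {"bread"}))  # 2 of the 3 milk transactions
```

A rule milk ⇒ bread would thus be reported with support 50% and confidence about 67% on this toy data.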
The support (S) of an association rule is the ratio (in percent) of the records that contain (X ∪ Y) to the total number of records in the database. Thus, if the support of a rule is 5%, then 5% of the total records contain (X ∪ Y) [7][8]. The confidence (C) is the ratio (in percent) of the number of records that contain (X ∪ Y) to the number of records that contain X. Thus, if a rule has a confidence of 85%, then 85% of the records containing X also contain Y. The confidence of a rule indicates the degree of correlation between X and Y in the database and is a measure of the rule's strength. Mining consists of finding all rules that meet the user-specified support and confidence thresholds [7]. Since there are two thresholds, mining proceeds in two steps [8]. The first step finds the large itemsets: all itemsets (an itemset is a set of items) whose support exceeds the support threshold. The second step generates the rules from those large itemsets. Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong [9].

The Count Distribution (CD) algorithm is an adaptation of the Apriori algorithm to the parallel case. At each iteration, every site generates the candidate sets by applying the Apriori-gen function to the set of large itemsets found in the previous iteration. Every site then computes the local support counts of all these candidate sets and broadcasts them to all the other sites. All sites can then find the globally large itemsets for that iteration and proceed to the next one. This algorithm has a simple communication scheme for count exchange; however, it suffers from a large number of candidate sets and a large amount of communication overhead. In detail, CD divides the database evenly into horizontal partitions among all sites. Each site scans its local database partition to collect the local count of each item; all sites then exchange and sum the local counts to obtain the global counts of all items and find the frequent 1-itemsets. Next, all sites generate candidate k-itemsets from the frequent (k-1)-itemsets, each site scans its local partition to collect the local count of each candidate k-itemset, and all sites exchange and sum the local counts into global counts to find the frequent k-itemsets. The process repeats with k = k + 1 until no more frequent itemsets are found [19].
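The CD count-exchange loop can be simulated as in the sketch below. The candidate generation shown is a simplified join of frequent (k-1)-itemsets without the Apriori prune step, and all function and variable names are illustrative, not taken from any reference implementation:

```python
def local_counts(partition, candidates):
    """Each site scans only its own partition to count the candidates."""
    return {c: sum(1 for t in partition if set(c) <= t) for c in candidates}

def cd_frequent_itemsets(partitions, min_sup_count):
    """Simulate Count Distribution over horizontally partitioned data."""
    # Pass 1: candidate 1-itemsets are all distinct items.
    items = sorted({i for p in partitions for t in p for i in t})
    candidates = [(i,) for i in items]
    frequent, k = {}, 1
    while candidates:
        # "Broadcast" phase: sum the local counts from every site.
        totals = {c: 0 for c in candidates}
        for p in partitions:
            for c, n in local_counts(p, candidates).items():
                totals[c] += n
        level = {c: n for c, n in totals.items() if n >= min_sup_count}
        frequent.update(level)
        # Simplified join: candidate (k+1)-itemsets from frequent k-itemsets.
        prev = sorted(level)
        k += 1
        candidates = sorted({tuple(sorted(set(a) | set(b)))
                             for a in prev for b in prev
                             if len(set(a) | set(b)) == k})
    return frequent

# Two sites, each holding one horizontal partition.
p1 = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
p2 = [{"a", "b"}, {"b", "c"}]
print(cd_frequent_itemsets([p1, p2], 3))
```

Note that every iteration requires a full local scan at every site plus an all-to-all count exchange, which is exactly the communication overhead the proposed method aims to remove.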

2-4 The Proposed System
We present a new method that addresses the issue of discovering the most frequently occurring sets of items. Our method divides the database into partitions and discovers all large itemsets inside each partition. The following steps represent the proposed distributed association rule algorithm:
Step 1: For each partition (site) of the distributed data warehouse, find all unique itemsets with their frequency of occurrence and store them together in a table named "Local_House".
Step 8: Generate all subsets of each final global superset found in step seven and store the distinct values of these subsets in a table named "Global-Itemsets".

Step 9: Compute the support values for the subsets generated in step 8.
Step 10: Perform the second pruning operation by deleting all itemsets that contain any subset whose support < global_min_sup, where the support values of these subsets are computed from the Global_house table. Store the remaining itemsets with their support in the "Global-Itemsets" table (finding a subset's support does not require a remote scan of any site).
Step 11: Generate the association rules from the "Global-Itemsets" table and store the strong rules (those with confidence >= min_conf) in a table named "distributed-association-rules". Figure (2-1) shows the flowchart of the proposed method.

To show the effects of changing the minimum support and the number of partitions, we report results for two, four, and six partitions and for several minimum support thresholds, which typically differ by about one order of magnitude in execution time, as shown in figures (3-1) and (3-2). In our experiments we use an example based on a medical diagnosis information system for distributed health care, consisting of databases that contain patient data and knowledge mined from health care institutions. The file resembles transaction data: the first column is the patient's visit identification number, the second column the visit date, the third column a medical test identifier for the symptoms, the fourth column the patient's gender, and the fifth column the patient's age, as shown in Table (3-3). The number of tables used by our new method is fixed (only one table is required), while in the CD algorithm the number of tables increases linearly as the number of global itemset subsets grows; chart (3-3) shows the total number of storage tables required by our new method and by the CD algorithm for different global itemset subsets.
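The core of the proposed scheme, per-site summarization (step 1) followed by support computation entirely inside the mining server (steps 9 and 10), can be sketched as below. All names are illustrative; the global pruning of supersets and the rule-generation step are omitted for brevity:

```python
from itertools import chain, combinations

def local_house(partition):
    """Step 1 at each site: unique itemsets with their frequency."""
    counts = {}
    for t in partition:
        key = tuple(sorted(t))
        counts[key] = counts.get(key, 0) + 1
    return counts

def subsets(itemset):
    """All non-empty subsets of an itemset (step 8)."""
    s = sorted(itemset)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))

def global_support_count(houses, itemset):
    """Server-side count: summed from the shipped summaries,
    with no remote scan of any site (steps 9-10)."""
    want = set(itemset)
    return sum(n for house in houses
               for rec, n in house.items() if want <= set(rec))

def mine(partitions, global_min_sup):
    houses = [local_house(p) for p in partitions]  # shipped once to the server
    total = sum(sum(h.values()) for h in houses)
    cand = {s for h in houses for rec in h for s in subsets(rec)}
    result = {}
    for c in sorted(cand):
        sup = global_support_count(houses, c) / total
        if sup >= global_min_sup:
            result[c] = sup
    return result

p1 = [{"a", "b"}, {"a", "b"}, {"a"}]
p2 = [{"b"}, {"a", "b"}]
print(mine([p1, p2], 0.6))
```

Each site communicates exactly once, shipping its compact summary; every subsequent support lookup is a local computation at the server.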

4-Conclusions
The proposed method has the following characteristics:

Ø It reduces communication cost. Transferring huge data volumes over the network can take a very long time and incur an unbearable financial cost; the method avoids this. The algorithm also conserves network resources by minimizing message transfer among sites: it needs only O(n) messages to transfer the local supersets and unique itemsets with their frequency of occurrence to the Warehouse Mining Server (WMS), where n is the number of distributed data warehouse sites. Finding the support of all itemsets in the data mining server requires no remote scan (zero remote scans) and no message transfer to any of the distributed sites, because the support is computed locally inside the WMS.

Ø
It achieves high performance. The computational cost of mining a central data warehouse is much greater than the sum of the costs of analyzing smaller parts of the data warehouse, which can also be done in parallel. The efficiency of the algorithm increases as the number of distributed sites increases, although this increases the amount of main memory it uses. The pruning adds a further improvement to performance: only two global pruning operations are needed to narrow the search space for interesting association rules, which matters especially when the number of itemsets becomes very large, since all uninteresting itemsets that lead to weak rules are eliminated from the algorithm's search space. The distributed data warehouse can also be extended easily by adding an unlimited number of new sites, and the algorithm can be executed on the whole distributed warehouse, on a set of sites, or even on a single site. Another important feature of the proposed method is the ease of maintaining the supersets of the database partitions. When a database or data warehouse is updated, the supersets of the updated partitions must be updated too. When new partitions are appended to a database or data warehouse, the supersets of the new partitions must be computed, but none of the previously computed supersets need to be updated. Moreover, computing the positive border of a partition can be done quickly, because the whole partition is likely to fit in main memory.
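The incremental-maintenance property described above can be sketched as follows; the function names are illustrative assumptions, not part of the original system:

```python
def summarize(partition):
    """Per-partition summary: unique transactions with their frequencies."""
    counts = {}
    for t in partition:
        key = tuple(sorted(t))
        counts[key] = counts.get(key, 0) + 1
    return counts

def append_partition(summaries, new_partition):
    """Appending a partition only requires summarizing the new data;
    every previously computed summary is left untouched."""
    return summaries + [summarize(new_partition)]

# Existing summaries stay identical after a new partition is appended.
old = [summarize([{"a", "b"}])]
updated = append_partition(old, [{"b"}, {"a", "b"}])
print(updated)
```

Only the new partition is scanned, so maintenance cost is proportional to the size of the appended data rather than to the whole warehouse.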

Ø
It is scalable and flexible. It achieves high scalability compared to Apriori-based implementations.
Eng. & Tech. Journal, Vol. 28


Figure (2-1): Flowchart of the proposed approach.