Privacy Preserving for Data Mining Applications

The results of data mining (DM), such as association rules, classes, and clusters, are readily available to the working team. Mining can therefore penetrate the privacy of sensitive data and make theft of the resulting knowledge much easier. The main objective of the proposed system is to preserve privacy in data mining by developing algorithms for modifying, encrypting, and distributing the original data in the database to be mined. This ensures the privacy of the data (the original data in the database that will be mined) and the privacy of the knowledge (the association rules extracted from the mined database) even after the mining process has taken place. In this way, the problem that arises when confidential information can be derived from released data by unauthorized users can be solved.


Introduction.
With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for the analysis and perhaps interpretation of such data and for the extraction of interesting knowledge that could help in decision-making. In principle, data mining is not specific to one type of media or data. Data mining should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data. The web log in/out database has all the information about users logging in to and out of the specified web site, so this information must be secure, and if mined it gives a good pattern for detecting intrusion and web usage. So these two databases, the bank database and the web log database, must be protected and must have complete privacy preservation.

Modifying
Data modification is used to modify the original values of a selected attribute of a database that needs to be released to the public, and in this way to ensure high privacy protection. It is important that a data modification technique be in concert with the privacy policy adopted by the administrators. Here three types of modification are proposed:
• Replacement, which is accomplished by replacing an attribute value with a new value (e.g., changing a 1-value to a 0-value, or adding noise).
• Concatenation, which combines several values into a coarser category.
• Interchanging, which exchanges the values of an individual record with those of another, such as putting record 4 in the place of record 10 and so on.
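The three modification types can be illustrated with a minimal Python sketch. The function names, the two-way threshold used for concatenation, and the 0-based record positions are assumptions made for illustration; they are not taken from the proposed system.

```python
def replacement(values, mapping):
    """Replace each value that appears in `mapping` by its new value."""
    return [mapping.get(v, v) for v in values]


def concatenation(value, threshold):
    """Collapse a numeric value into one of two coarser categories."""
    if value < threshold:
        return f"less than {threshold}"
    return f"{threshold} or more"


def interchanging(records, i, j):
    """Return a copy of `records` with positions i and j swapped (0-based)."""
    swapped = list(records)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


if __name__ == "__main__":
    print(replacement([1, 0, 1], {1: 0}))            # 1-values become 0-values
    print(concatenation(1, 2))                       # coarser category label
    print(interchanging(["r1", "r2", "r3", "r4"], 0, 3))
```

Each function returns a new value or list instead of changing its input, so the original database stays available to the administrators.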

Encryption
Apply an encryption algorithm to all values of the selected attribute. This proposed research suggests encrypting the values with a symmetric encryption algorithm, since it deals with a static environment that has one party, the administrators of the database (here we select the Twofish encryption algorithm [7]).
Twofish is a 128-bit block cipher that accepts a variable-length key up to 256 bits. The cipher is a 16-round Feistel network with a bijective F function made up of four key-dependent 8-by-8-bit S-boxes, a fixed 4-by-4 maximum distance separable matrix over GF(2^8), a pseudo-Hadamard transform, bitwise rotations, and a carefully designed key schedule. Twofish can be implemented in hardware in 14000 gates.
1. The protection of the first attribute (the identifier) is done by the proposed replacement: the digits of the identifier are changed, and then one random digit is added at its start and two digits at its end. The changing schema is (0 to 3, 1 to 2, 2 to 4, 3 to 6, 4 to 8, 5 to 7, 6 to 0, 7 to 1, 8 to 9, 9 to 5). For example, the first identifier 500 (in figure 1) becomes 733 after changing and finally, after adding noise, becomes 473368.
... the first transaction is interchanged with the last one, the second transaction with the one before last, and so on until the middle (if the number of transactions is even, the two middle transactions are interchanged with each other; if the number is odd, the middle transaction stays in its place).
4. The protection of the fifth attribute (marriage state) is done by the proposed concatenation: if the state is married it takes class I, if divorced class II, if widowed class III, and if not married class IIII.
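The identifier replacement and the transaction interchanging described above can be sketched in Python as follows. The digit schema is the one given in the text; the function names, the use of Python's `random` module for the noise digits, and the recovery function are assumptions added for illustration.

```python
import random

# Changing schema from the text: 0 to 3, 1 to 2, 2 to 4, ..., 9 to 5.
DIGIT_MAP = {"0": "3", "1": "2", "2": "4", "3": "6", "4": "8",
             "5": "7", "6": "0", "7": "1", "8": "9", "9": "5"}
INVERSE_MAP = {new: old for old, new in DIGIT_MAP.items()}


def protect_identifier(identifier, rng=random):
    """Substitute every digit, then add 1 noise digit in front and 2 behind."""
    substituted = "".join(DIGIT_MAP[d] for d in str(identifier))
    prefix = str(rng.randrange(10))
    suffix = f"{rng.randrange(100):02d}"
    return prefix + substituted + suffix


def recover_identifier(protected):
    """Drop the noise digits and undo the substitution (hypothetical helper)."""
    core = protected[1:-2]
    return "".join(INVERSE_MAP[d] for d in core)


def mirror_interchange(transactions):
    """Swap first with last, second with one before last; an odd middle stays."""
    return list(reversed(transactions))


if __name__ == "__main__":
    protected = protect_identifier("500")
    print(protected, recover_identifier(protected))
    print(mirror_interchange(["t1", "t2", "t3", "t4", "t5"]))
```

Because the schema maps the ten digits onto ten distinct digits, the substitution can be inverted exactly, so the administrators can restore the original identifiers once the noise digits are removed.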

Mining results:
From applying the association rule data mining algorithm on the protected bank database, see fig.

Horizontally:
In a horizontally distributed database, the transactions are distributed among n sites. The global support count of an itemset is the sum of all the local support counts. An itemset X is globally supported if the global support count of X is bigger than the given support threshold s% of the total transaction database size. A k-itemset is called a globally large k-itemset if it is globally supported.
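The global support test can be sketched in Python as below. It assumes that each site reports only its local support count and its number of transactions to a coordinator; the function names and the report format are hypothetical and serve only to illustrate the definition.

```python
def site_report(transactions, itemset):
    """Computed at one site: (local support count, number of transactions)."""
    local_count = sum(1 for t in transactions if itemset <= t)
    return local_count, len(transactions)


def is_globally_supported(reports, s_percent):
    """True if the summed local counts exceed s% of the total database size."""
    global_count = sum(count for count, _ in reports)
    total_size = sum(size for _, size in reports)
    # Integer comparison avoids rounding problems with percentages.
    return global_count * 100 > s_percent * total_size


if __name__ == "__main__":
    site1 = [{"A", "B"}, {"A"}, {"B", "C"}]
    site2 = [{"A", "B"}, {"A", "B", "C"}]
    itemset = {"A", "B"}
    reports = [site_report(site1, itemset), site_report(site2, itemset)]
    print(reports, is_globally_supported(reports, 50))
```

Only the counts leave each site in this sketch, so the coordinator never sees the transactions themselves.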

Mining results:
Applying the association rule data mining algorithm to the horizontally distributed web log in/out database, see figure (5 a and b), gives the following: from site (1) the miner group knows only the frequent itemsets of site 1, and the same holds for site (2), so the privacy of the universal database is protected. In figure (5 a and b) the frequent itemsets are extracted by the administrators as in the previous section, and the following association rule is an example: If (A=122.22.3.18) and (B=80) and (E=tcp) and (G=2:30) Then (C=33.56.233.77)

Vertically:
Private association rules are mined from vertically partitioned data, where the items are distributed and each itemset is split between sites, by finding the support count of an itemset. If the support count of such an itemset can be securely computed, then one can check whether the support is greater than the threshold and decide whether the itemset is frequent. The key element for computing the support count of an itemset is to compute the scalar product of the vectors representing the sub-itemsets in the parties. Thus, if the scalar product can be securely computed, the support count can also be computed. The algorithm that computes the scalar product is an algebraic solution that hides true values by placing them in equations masked with random values. The security of the scalar product protocol is based on the inability of either side to solve k equations in more than k unknowns. Some of the unknowns are randomly chosen and can safely be assumed to be private.
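The relation between the support count and the scalar product can be shown with a short Python sketch. This is only the plain, unprotected computation: it does not implement the random masking of the secure protocol, and the function names and the sample data are assumptions made for illustration.

```python
def presence_vector(site_rows, sub_itemset):
    """1 where a site's part of a transaction holds its whole sub-itemset."""
    return [1 if sub_itemset <= row else 0 for row in site_rows]


def support_count(x, y):
    """Scalar product of two 0/1 vectors; NOT secure, values are in the clear."""
    if len(x) != len(y):
        raise ValueError("both sites must describe the same transactions")
    return sum(a * b for a, b in zip(x, y))


def is_frequent(x, y, threshold):
    """The itemset is frequent if its support count exceeds the threshold."""
    return support_count(x, y) > threshold


if __name__ == "__main__":
    # Row i at both sites belongs to the same transaction i.
    site1_rows = [{"A", "B"}, {"A"}, {"A", "B"}, {"B"}]
    site2_rows = [{"E"}, {"E"}, {"G"}, {"E"}]
    x = presence_vector(site1_rows, {"A", "B"})
    y = presence_vector(site2_rows, {"E"})
    print(x, y, support_count(x, y))
```

A transaction supports the full itemset only when both sites hold a 1 for it, which is why the product of the two entries counts exactly those transactions.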

Mining results:
Applying the association rule data mining algorithm to the vertically distributed web log in/out database, see figure (6 a and b), gives the following: from site (1) the miner group knows only the frequent itemsets of site 1, and the same holds for site (2), so the privacy of the universal database is protected. In figure (6 a and b) the frequent itemsets are extracted by the administrators as in the previous section, and the following association rule is an example: .22.3.18) and (B=80) and (E=tcp) and (G=2:30) Then (C=33.56.233.77)
Eng. & Tech. Vol. 26, No. 5, 2008

3-The Implementation:
To demonstrate the idea of this research more clearly, we present the real implementation of the system as follows. The first part of the implementation concentrates on privacy preserving using the modification and encryption approaches, which were explained in detail in section (2.1), see figure (7).
In figure (7), the upper part shows the bank database with the original data. Below it there is a command button called (modification and encryption); when it is pressed, the program displays the bank database with modified and encrypted data. The encrypted secure database is then sent to the mining group. After the secure database arrives, the miners mine it and record and display the resulting encrypted association rules, see figure (8).
After the encrypted association rules are recorded (saved in a file), they are returned to the bank database administrators. The administrators decrypt the association rules and begin to analyze them to extract the novel knowledge that supports the performance of the applications related to their huge databases, see figure (9).
The second part of the implementation concentrates on privacy preserving using the distributing (fragmentation) approach, which was explained in detail in section (3.2), see figure (10).
Figure (10) displays the original web log database. The first command, called (fragmentation), displays a small interface that lets the administrators select one of the two types of distribution (horizontal and vertical), see figure (11). After selecting, for example, vertical fragmentation, the program splits the original database into two databases, since the number of attributes is small. After these two parts are sent to two groups of miners, two files are returned. Each file consists of the frequent itemsets of its part. The original database administrators then implement an algorithm to combine the two files of frequent itemsets and extract the final frequent itemsets of the original complete database. Finally, the association rule algorithm is applied to the final frequent itemsets to extract the novel knowledge, also see figure (10).

Conclusions.
In context with the results of the present study, it can be concluded that:
• The work presented here indicates the importance of researchers' interest in the area of securing sensitive data and knowledge from malicious users.
• The results of DM, such as association rules, classes, clusters, etc., will not be readily available to the whole work team if care is taken to support the DM algorithm with data protection that keeps the privacy of sensitive data.
Fig. (2) shows a database for web log in/out

Fig. (3) shows the databases after applying the three proposed modification techniques and the encryption algorithm to all the attributes of the bank database. The changes made in the protected databases are explained in the following steps: 1. The protection of the first attribute (the identifier) is done by the proposed replacement: the digits of the identifier are changed, and one random digit is added at its start and two digits at its end; the changing schema is (0 to 3, 1 to ...

... (3), will obtain the following rules:
If (E=I) and (F= less than 2) and (D= Ofecr4VLSvMOm1zVpR) then (G= more than uv31WsfzM7U) and (I=more than FfNU7VGnk8E=)
So the above association rule (extracted knowledge) is private, because it cannot be understood by the mining group or other users.
... the extracted knowledge by decrypting the encrypted fields and returning the modified values to their original values. The previous knowledge means the following:
If (the person is married) and (has fewer than 2 children) and (lives in Baghdad/Rusafa) then (has more than 67$ as income) and (the income is more than 4555$)

2.2. The Distributing Algorithm.
The proposed research suggests distributing the data as horizontal data distribution and vertical data distribution. Horizontal distribution refers to those cases where different database records reside in different places, see fig. (5), while vertical data distribution refers to the cases where all the values for different attributes reside in different places, see fig. (6).

Figure (7): The implementation of mining the secure database by association rule algorithm.