Using a parser for steganography purpose

Steganography, or information hiding, is concerned with hiding the existence of data in other cover media. Usually a single hiding method is used to hide secret data in a cover text, which reduces security. In this paper we therefore use more than one hiding method to hide secret data, chosen according to the syntactic grammar of each cover-text sentence as produced by a special parser. We represent the Arabic language grammar as logical terms, use a B+ tree for storage, and use an Augmented Transition Network (ATN) for parsing. The proposed technique was implemented in Visual Prolog, and testing according to the established measures used in this field shows that the proposed approach provides good security (transparency) for steganography.


INTRODUCTION
Steganography is the art and science of hiding a message inside another message without arousing suspicion, so that the message can be detected only by its intended recipient [1]. Steganography conceals the existence of a message [2]. When information hiding is used, even if an eavesdropper snoops on the transmitted object, he cannot suspect the communication, since it is carried out in a concealed way. Steganography overcomes the limitation of the unintelligible nature of the text by hiding the message in an innocent-looking object called the cover. Figure (1) gives a detailed hierarchical classification of steganography [3]. Our paper focuses on Arabic text steganography.
The aim of this research is to improve security by applying different information hiding methods to the same cover text according to the structure (grammar) of each sentence, as produced by a special parser.
The rest of the paper is organized as follows: section 2 discusses text steganography, section 3 discusses natural language processing, section 4 presents some related work, section 5 describes the proposed method and gives an example, section 6 presents a test and comparisons, and the last section concludes this work.

TEXT STEGANOGRAPHY
Steganography can be classified into image, text, audio and video steganography, depending on the cover media used to embed the secret data. Text steganography can involve anything from changing the formatting of an existing text, to changing words within a text, to generating random character sequences or using context-free grammars to generate readable texts [4]. Text steganography is believed to be the trickiest because text lacks the redundant information that is present in image, audio or video files. The structure of a text document is identical to what we observe, while in other types of documents, such as pictures, the structure of the document differs from what we observe. Therefore, in such documents we can hide information by introducing changes in the structure of the document without making a notable change in the output [5]. Text files require less memory to store and are faster and easier to communicate, which makes text preferable to other types of steganographic media [6].

Figure (1) The classification tree of steganography
Text steganography can be broadly classified into three types (see Figure 2): format-based methods, random and statistical generation, and linguistic methods. This paper focuses on linguistic steganography. Linguistic steganography specifically considers the linguistic properties of generated and modified text, and in many cases uses linguistic structure as the space in which messages are hidden [4].

Natural Language Processing
Natural language processing (NLP), sometimes called computational linguistics, is the area of knowledge in which the text of a natural language such as Arabic is digitized and computed. It includes natural language understanding and generation [7]. Figure (3) illustrates the main components of natural language processing [13].

Figure(3) The main components of Natural Language Processing
As shown in the figure, there are several levels of analysis in natural language processing:
1-Phonology examines the sounds that are combined to form language. This branch of linguistics is important for computerized speech recognition and generation.
2-Morphology is concerned with the components (morphemes) that make up words. These include the rules governing the formation of words, such as the effect of prefixes (un-, non-, anti-, etc.) and suffixes (-ing, -ly, etc.) that modify the meaning of root words. Morphological analysis is important in determining the role of a word in a sentence, including its tense, number, and part of speech.
3-Syntax studies the rules for combining words into legal phrases and sentences, and the use of those rules to parse and generate sentences. This is the best formalized and thus the most successfully automated component of linguistic analysis.
4-Semantics considers the meaning of words, phrases, and sentences and the ways in which meaning is conveyed in natural language expressions.
5-Pragmatics is the study of the ways in which language is used and its effects on the listener. [8]

RELATED WORKS
The main issue in text steganography is the lack of redundancy in the data. However, the orthographic characteristics that can be exploited differ from language to language. Recent text steganography methods have therefore been designed specifically to exploit the characteristics of the target language. There have been several successful attempts to design text steganography based on language-specific features, for example for English, Japanese, Korean, Chinese and Arabic [9].

Steganography Using Letter Points and Extensions [10]
In this method the authors use pointed letters with an extension to hold the secret bit 'one' and unpointed letters with an extension to hold the secret bit 'zero'. Note that the letter extension has no effect on the written content. It has a standard hexadecimal character code, 0640, in the Unicode system. In electronic typing, this Arabic extension character is considered a redundant character used only for arrangement and formatting purposes.
The only drawback of using the extension is that not all letters can be extended with this character, because of their position in words and the nature of Arabic writing.
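The bit-to-letter rule of this method can be illustrated with a minimal Python sketch. The letter sets, the `can_extend` rule and the function names are simplifying assumptions made for illustration, not the implementation from [10]. In particular, real Arabic joining rules are richer than the `NON_JOINING` set used here, and the cover text is assumed to contain no extension characters to begin with.

```python
KASHIDA = "\u0640"  # the Arabic extension character (Unicode 0640)

# Simplified letter classes (an assumption for this sketch).
POINTED = set("بتثجخذزشضظغفقني")
UNPOINTED = set("احدرسصطعكلمهو")
LETTERS = POINTED | UNPOINTED
# Letters that do not connect to the following letter, so no extension fits after them.
NON_JOINING = set("ادذرزوةىءآأإؤ")


def can_extend(text, i):
    """True if an extension may follow text[i]: a joining letter followed by a letter."""
    ch = text[i]
    return (ch in LETTERS and ch not in NON_JOINING
            and i + 1 < len(text) and text[i + 1] in LETTERS)


def hide_with_kashida(cover, bits):
    """Extend a pointed letter for '1' and an unpointed letter for '0'."""
    out = []
    pos = 0  # index of the next bit to hide
    for i, ch in enumerate(cover):
        out.append(ch)
        if pos < len(bits) and can_extend(cover, i):
            want_pointed = bits[pos] == "1"
            if (ch in POINTED) == want_pointed:
                out.append(KASHIDA)
                pos += 1
    if pos < len(bits):
        raise ValueError("cover text is too short for the secret bits")
    return "".join(out)


def extract_from_kashida(stego):
    """Read one bit per extension, based on the letter that precedes it."""
    bits = []
    for i, ch in enumerate(stego):
        if ch == KASHIDA and i > 0:
            prev = stego[i - 1]
            if prev in POINTED:
                bits.append("1")
            elif prev in UNPOINTED:
                bits.append("0")
    return "".join(bits)
```

Removing every extension character from the output of `hide_with_kashida` gives back the cover text, which is why the written content is unaffected.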

Steganography Using Arabic Diacritics[11]
In this method, a diacritized Arabic text is used for the hidden exchange of information. There are eight diacritics in Arabic text. The most frequent diacritic is "Fatha", and the probability of its occurrence is equal to the combined occurrence probability of the other seven diacritics.
In this method, the cover text is first assumed to be fully diacritized. To hide a bit "1", a "Fatha" is kept; to hide a bit "0", a non-"Fatha" diacritic is kept; the other diacritics are removed. In the stego text, each "Fatha" therefore represents a "1" and each non-"Fatha" diacritic represents a "0". The main advantage of this method is its high capacity. Its main disadvantage is that it attracts the attention of the reader. The method also needs a fully diacritized text, but most Arabic texts have no diacritics.
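The keep-or-remove rule of this method can be sketched in Python as follows. This is a minimal illustration, not the implementation from [11]. It assumes that the eight diacritics are the Unicode code points 064B to 0652, and that any diacritics left after the last secret bit are removed so that the receiver does not read extra bits. The function names are hypothetical.

```python
FATHA = "\u064E"
# Assumed set of eight diacritics: fathatan, dammatan, kasratan,
# fatha, damma, kasra, shadda, sukun (U+064B .. U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def hide_with_diacritics(cover, bits):
    """Keep a Fatha for '1' and a non-Fatha diacritic for '0'; drop the rest."""
    out = []
    pos = 0  # index of the next bit to hide
    for ch in cover:
        if ch not in DIACRITICS:
            out.append(ch)  # base letters, spaces and punctuation are untouched
            continue
        if pos < len(bits) and (ch == FATHA) == (bits[pos] == "1"):
            out.append(ch)  # this diacritic carries the current bit
            pos += 1
        # any other diacritic is removed
    if pos < len(bits):
        raise ValueError("cover text has too few usable diacritics")
    return "".join(out)


def extract_from_diacritics(stego):
    """Each remaining Fatha is a '1'; each other diacritic is a '0'."""
    return "".join("1" if ch == FATHA else "0"
                   for ch in stego if ch in DIACRITICS)
```

Because roughly half of the diacritics in a text are "Fatha", on average about every second diacritic can carry a bit, which is the source of the high capacity mentioned above.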

Particular Characters in Words[6]
Hiding information can be performed by selecting characters in certain words. This method can range from simple to very complicated, depending on the specifications. "In the simplest form, for example, the first words of each paragraph are selected in a manner that by placing the first characters of these words side by side, the hidden information is extracted". A more advanced example is to select the first letter of the first word, the second letter of the second word, the third letter of the third word, and so on, to hide the information in.
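Both extraction rules can be sketched in a few lines of Python. The function names are hypothetical, paragraphs are assumed to be separated by a blank line, and an English example is used only for readability.

```python
def first_letters_of_paragraphs(text):
    """Simplest form: join the first character of the first word of each paragraph."""
    letters = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if words:
            letters.append(words[0][0])
    return "".join(letters)


def diagonal_letters(words):
    """Advanced form: first letter of word 1, second letter of word 2, and so on."""
    letters = []
    for i, word in enumerate(words):
        if i >= len(word):
            raise ValueError(f"word {word!r} is too short for position {i + 1}")
        letters.append(word[i])
    return "".join(letters)


cover = "Help arrived late.\n\nIt was raining.\n\nDoors were shut."
print(first_letters_of_paragraphs(cover))           # HID
print(diagonal_letters(["cold", "rain", "outside"]))  # cat
```

The sender's task is the harder one: the cover text has to be written or chosen so that the selected positions spell the secret.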

Sharp-Edges Method [9]
The main architecture of Sharp-Edges Stego consists of two main modules, the hiding module and the retrieving module. The hiding module requires a stego-key to randomly position the secret bit. The retrieving module requires the same stego-key as the sender. Once the key is recognized, the method retrieves the secret message based on the hard-coded reference table.

Letter Frequency Manipulation[14]
The basic principle of this method is to divide the text into equal blocks so that each block may hide one binary digit of the data. One letter of the alphabet is chosen as the secret letter of the system. On the sender side, the frequencies of the secret letter in some of the text blocks are manipulated in accordance with the binary digits of the data. On the receiver side, the frequency of the secret letter is counted in each block and compared with two preset values in order to determine whether the hidden binary digit is "one", "zero" or none. If the secret letter frequency is equal to the "one's preset value", the hidden binary digit is "one"; otherwise, if the secret letter frequency is equal to the "zero's preset value", the hidden binary digit is "zero".
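The receiver side of this scheme can be sketched in Python as follows. The function and parameter names are hypothetical, blocks are assumed to be measured in characters, and a trailing partial block is assumed to be ignored. The sender side is not shown, since it depends on how the text is rewritten to change the letter counts.

```python
def decode_blocks(text, block_size, secret_letter, one_value, zero_value):
    """Count the secret letter in each block and map the count to a bit or to nothing."""
    bits = []
    for start in range(0, len(text) - block_size + 1, block_size):
        block = text[start:start + block_size]
        count = block.count(secret_letter)
        if count == one_value:
            bits.append("1")
        elif count == zero_value:
            bits.append("0")
        # any other count means this block hides no digit
    return "".join(bits)


# Three blocks of six characters: counts of "a" are 3, 1 and 0.
print(decode_blocks("aaabbb" + "abbbbb" + "bbbbbb", 6, "a", 3, 1))  # 10
```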

Proposed method
Basically, the proposed method includes two stages, the sending stage and the receiving stage. The following sections discuss these stages in some detail.

Sending Stage
In this stage the secret text is embedded in the cover text in order to generate the stego-text that will be sent to the receiver. Figure (4) shows the main components of this stage.

Figure (4) The architecture of proposed sending stage
The user interface is responsible for interaction between the proposed system and the user in an easy form (since we use a visual programming language). The user can also update the contents of the lexicon through the user interface by removing or adding Arabic words together with their information (type, specific type, number, gender, etc.).
The first input to the proposed system in this stage is the secret text, which consists of sentences (a sentence is considered to be a set of words terminated by a stop mark such as '.', ',', '?', etc.); the sentence cutter is responsible for producing these sentences. The coding part of the proposed system converts each secret sentence to a unique code using the lexicon (as in [12]). The other input to the proposed system is the Arabic cover text, which is also converted to sentences using the sentence cutter. Each sentence of the cover text is input to the Arabic morphology component, which is responsible for extracting the stem of each Arabic word by removing its suffix and prefix and undoing the changes that occur when these affixes are added (according to the Alaalal and Alabdal rules of the Arabic language). The other parts of the proposed system are discussed in more detail below.
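A sentence cutter of the kind described here can be sketched in Python. This is an illustration only (the system itself was implemented in Visual Prolog); the exact set of stop marks is an assumption, extended here with the Arabic comma, semicolon and question mark, and the function name is hypothetical.

```python
import re

# Assumed stop marks: the ones named in the text plus common Latin and Arabic ones.
STOP_MARKS = ".,?!;:،؛؟"


def cut_sentences(text, stop_marks=STOP_MARKS):
    """Split text at every stop mark and return the non-empty, trimmed pieces."""
    pattern = "[" + re.escape(stop_marks) + "]"
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]


print(cut_sentences("Go home. Are you sure? Yes, I am."))
# ['Go home', 'Are you sure', 'Yes', 'I am']
```

Note that this sketch discards the stop marks themselves; a cutter used for the cover text would have to remember them so that the stego text can be reassembled.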

Lexicon
The lexicon is an important part of any linguistic system; it is responsible for providing the system with the information it requires. The lexicon of the proposed method is represented using one database for Arabic words, in which the stem is the key of the index tree, and another database for Arabic language grammars. There are also two other databases that are used by the coder (as in [12]).

Parser
The output of the morphology component (the analysis of each word) is the input to the parser, which finds the correct structure (grammar) of each sentence. We represent the Arabic language grammar as logical terms, use a B+ tree for storage, and use an Augmented Transition Network (ATN) for parsing (as in [12]). We modify this parser so that each Arabic grammar rule is augmented with a code that represents the stego-method to be used for hiding the secret code within the cover sentence.

Determiner
The output of the parser (the Arabic sentence grammar) is the input to the determiner. According to this input, the determiner decides which stego-method will be used for hiding, also taking other features into account (for example, if the cover text has no harakat, it will not choose the harakat method for steganography).
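The decision made by the determiner can be sketched in Python as a lookup followed by a feature check. Everything named here is hypothetical: the grammar patterns, the method names, the idea of an ordered list of fallback methods per grammar, and the default method are assumptions made for illustration and are not taken from the implemented system.

```python
# The eight harakat code points U+064B .. U+0652 (assumed set).
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}

# Hypothetical grammar table: each grammar pattern lists its stego-methods in
# order of preference. The real system stores one method code per grammar rule.
GRAMMAR_METHODS = {
    ("verb", "noun"): ["harakat", "extension"],
    ("noun", "noun"): ["extension", "spaces"],
}


def method_applicable(method, sentence):
    """Feature check: the harakat method needs a sentence that contains harakat."""
    if method == "harakat":
        return any(ch in HARAKAT for ch in sentence)
    return True


def determine_method(grammar, sentence, default="spaces"):
    """Return the first applicable method for this grammar, or the default."""
    for method in GRAMMAR_METHODS.get(grammar, []):
        if method_applicable(method, sentence):
            return method
    return default
```

With this table, a verb-noun sentence written without harakat falls back to the extension method, which mirrors the example given in the text.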

Information Hiding
The actual information hiding is done here. In this part the secret codes (generated by the coder) are embedded in the cover sentences according to the stego method chosen by the determiner, so that each code is embedded in a cover Arabic sentence using a different hiding method. In our system we use four different hiding methods (described in section 4).

Algorithm of the proposed method(SENDING STAGE)
* Δ means space.

INPUT:
Secret Text : ST
Cover Text : CT
OUTPUT:
Stego Text : T
START
1-T=""
2-Convert ST to a list of sentences SL using sentence cutter;
3-Convert SL to a list of codes CL using coder (as in [12]);
4-Cut a sentence S from CT;
5-If S ≠ "Δ" * then begin
5.1 Convert S to a list of tokens L;
5.2 Analyze each word in L using morphology;
5.3 Parse S to get its grammar G using parser (as in [12]
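The listing breaks off after step 5.3, so the following Python sketch follows the component descriptions in the preceding subsections rather than the missing steps. It is an illustration under stated assumptions, not the authors' algorithm: the components are passed in as hypothetical callables, one secret code is embedded per cover sentence, and cover sentences left over after the last code are passed through unchanged.

```python
def send(secret_sentences, cover_sentences, encode, analyze, parse, determine, hide):
    """Embed one secret code per cover sentence, choosing the method per sentence.

    encode(sentence) -> code                  (the coder)
    analyze(sentence) -> word analyses        (the morphology component)
    parse(analyses) -> grammar                (the parser)
    determine(grammar, sentence) -> method    (the determiner)
    hide(sentence, code, method) -> sentence  (the information hiding part)
    """
    codes = [encode(s) for s in secret_sentences]
    if len(codes) > len(cover_sentences):
        raise ValueError("not enough cover sentences for the secret text")
    stego = []
    for i, sentence in enumerate(cover_sentences):
        if i >= len(codes):
            stego.append(sentence)  # nothing left to hide in this sentence
            continue
        grammar = parse(analyze(sentence))
        method = determine(grammar, sentence)
        stego.append(hide(sentence, codes[i], method))
    return stego
```

Passing the components as arguments keeps the sketch independent of any particular coder, parser or hiding method.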

Receiving Stage
In this stage the input (the stego-text) is processed in order to extract the secret message embedded in it. Figure (5) shows the main components of this stage. The sentence cutter, morphology, parser and determiner parts of this stage perform the same tasks as in the sending stage, but the reverse information hiding part extracts the secret codes from the stego text according to the stego method determined by the determiner. These codes are the input to the decoder, which retrieves the secret text by using the lexicon (as in [12]).
the text issues, and when we use spaces they do not show, because they are used at the ends of lines, where they are not visible to the reader, and so on for the other methods.
Third, the use of five steganography methods on the same text (possibly on the same paragraph, or on five consecutive sentences), without an eavesdropper knowing which grammar has been used for each information hiding method, together with the similarity between the cover text and the stego text, makes eavesdropping more difficult and makes our proposed method more secure than other methods, which use a single hiding method on the whole text.
Table (5) shows a comparison between our proposed method, which uses more than one hiding method for the same cover text, and other methods that use only one.

Conclusion
1-Using more than one hiding method for the same text (depending on its grammar) increases security, because it makes the hidden data difficult to notice and makes the stego-text less suspicious than when one method is used.
2-Using one code for a whole secret sentence provides high compression and good embedding capacity.
3-Using the grammar of the cover text is a useful way to determine the hiding method that will be used.

Table (5): A comparison between the proposed method and other methods that use one hiding method.

Proposed method (using more than one hiding method) | Other methods (using one hiding method)
1-Uses more than one information hiding method on the same cover text. | 1-Use one information hiding method.
2-Increases security, since it is hard for an unauthorized user to find out all the information hiding methods when more than one is used on the same cover text. | 2-Use one hiding method, which makes it easier for an unauthorized user to find the secret text once he knows this method.
3-Increases the embedding capacity, since it uses one code for a secret sentence. | 3-Embedding capacity is lower than in the proposed method, since they use a code for a word or sometimes even a bit.
4-Does not arouse suspicion, since the stego text does not follow a single pattern, which makes it hard for an unauthorized user to guess the right pattern for the whole stego text. | 4-The stego text follows one pattern, so if an unauthorized user guesses it, he will have the whole secret text.