Tuesday, June 4, 2019
Improving the Accuracy of Arabic DC System
The main goal of this research is to investigate and develop appropriate text collections, tools, and procedures for Arabic document classification (DC). The following specific objectives have been set to achieve the main goal:

To investigate the impact of preprocessing tasks, including normalization, stop word removal, and stemming, on improving the accuracy of an Arabic DC system.
To introduce a novel technique for Arabic stemming in order to improve the accuracy of the document classification system. The new stemming algorithm tries to overcome the deficiencies of state-of-the-art Arabic stemming techniques by dealing with multiword expressions (MWEs) and foreign Arabized words, and by handling the majority of broken plural forms, reducing them to their singular form.
To use Arabic text summarization as a feature reduction technique that eliminates noise in the documents and selects the most salient sentences to represent the original documents.
To explore the impact of different feature selection techniques on the accuracy of Arabic document classification, and to propose and implement new variants of the Term Frequency Inverse Document Frequency (TFIDF) weighting method that take into account the importance of the first appearance of a word and the compactness of the word, both of which can be taken as factors that determine the important features in a document.
To implement various classifiers and compare their performance.

1.1. Problem Statement

Despite the achievements in document classification, the performance of document classification systems is far from satisfactory. Document classification tasks deal with natural language. This means DC is closely related to natural language processing (NLP), which requires knowledge of its subject matter. In general, natural language exhibits many syntactic and semantic ambiguities in addition to its complexity [45].
In the context of DC, a researcher tries to address various problems arising from the characteristics of documents during feature extraction and feature representation, or problems emanating from the classification algorithms. The following sections outline the research problems.

1.1.1. Preprocessing Text Problem

The preprocessing stage is a challenge and can affect the performance of any DC system either positively or negatively. Therefore, improving the preprocessing stage for a highly inflected language such as Arabic will enhance the efficiency and accuracy of the Arabic DC system. Despite the lack of standard Arabic morphological analysis tools, most previous studies on Arabic DC have proposed the use of preprocessing tasks to reduce the dimensionality of feature vectors without comprehensively examining their contribution to the effectiveness of the DC system. One of the challenges facing researchers in Arabic document classification is the absence of a strong and effective stemming algorithm. Arabic is a morphologically complex language [46]; it uses both inflectional and derivational morphology. As a result, a single word may yield hundreds or even thousands of variant forms [47]. The importance of stemming in document classification lies in the fact that it makes the process less dependent on particular word forms and reduces the high dimensionality of the feature space, which in turn enhances the performance of the classification system. Despite the rapid progress of research on other languages, Arabic still suffers from a shortage of research and development. State-of-the-art Arabic stemmers suffer from high stemming error rates due to understemming errors, overstemming errors, and their failure to handle multiword expressions (MWEs), broken plural forms, and Arabized words.
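The preprocessing tasks discussed in this section (normalization, stop word removal, and stemming) can be illustrated with a minimal Python sketch. It is not the stemmer proposed in this thesis, nor any published stemmer: the stop word list and the affix lists are small, hypothetical samples chosen only to show the mechanics of affix-stripping ("light") stemming, and the names `normalize`, `light_stem`, and `preprocess` are assumptions made for this example.

```python
import re

# Tashkeel marks (U+064B..U+0652) and tatweel (U+0640) are stripped.
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda or hamza (U+0622, U+0623, U+0625).
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize(text):
    """Strip diacritics and unify a few common letter variants."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")    # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")    # ta marbuta -> ha
    return text


# Hypothetical sample lists; real systems use far larger ones.
STOP_WORDS = {normalize(w) for w in ("في", "من", "على", "إلى", "عن")}
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")  # longest first
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ه", "ي")


def light_stem(word, min_len=3):
    """Remove at most one prefix and one suffix, keeping >= min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def preprocess(text):
    """Normalize, split on whitespace, drop stop words, then stem."""
    tokens = normalize(text).split()
    return [light_stem(t) for t in tokens if t not in STOP_WORDS]
```

A scheme of this kind only removes affixes, so it cannot reduce a broken plural to its singular form, because that change is internal to the word. It also treats each token in isolation, so multiword expressions are split apart. These are the kinds of gaps described above.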
Therefore, the limitations of current Arabic stemming methods motivated the author to investigate a novel technique for Arabic stemming, to be used to extract the roots of Arabic words in order to improve the accuracy of the document classification system (Chapter 5).

1.1.2. High Dimensionality of the Feature Space

Extremely high-dimensional feature spaces and large volumes of data are problems that occur in automatic document classification. High dimensionality problems arise because the dimensionality of the feature vectors grows with the number of features used in the classification process [13, 15, 48, 49]. Practical examples show that the number of features making up the dimensionality can amount to thousands.

A large number of features are irrelevant to the classification task and can be removed without affecting classification accuracy. There are several reasons to remove them. First, the performance of some classification algorithms is negatively affected when dealing with a high dimensionality of features. Second, overfitting may occur when the classification algorithm is trained on all features. Finally, some features are common and occur in all or most of the categories [50].

To solve this problem, the dimensionality of the feature vector must be reduced without degrading classification performance. It is important to extract the features with high discriminating power using various techniques. Text summarization, feature selection, and feature weighting are common techniques used in document classification to reduce the high dimensionality of the feature space and to improve the efficiency and accuracy of the classification system.
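As an illustration of feature selection, the following Python sketch scores terms with the standard chi-square statistic over a 2x2 term/category contingency table and keeps the top k. It is a sketch under stated assumptions, not the selection procedure used in this thesis: the function names, the one-category-versus-rest setup, and the alphabetical tie-breaking are choices made for this example. Tokens are shown in English for readability; in practice they would be preprocessed Arabic terms.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square score for one term/category pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category without the term
    d: documents outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate; no evidence either way
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, category, k):
    """Return the k terms with the highest chi-square score for `category`."""
    n_in = sum(1 for y in labels if y == category)
    n_out = len(labels) - n_in
    df_in, df_out = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count each term once per document (document frequency).
        (df_in if y == category else df_out).update(set(tokens))
    scores = {}
    for term in set(df_in) | set(df_out):
        a, b = df_in[term], df_out[term]
        scores[term] = chi_square(a, b, n_in - a, n_out - b)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Note that the statistic is symmetric: a term that is strongly associated with the absence of a category scores as high as one associated with its presence. A term spread evenly across categories scores zero and is the first to be dropped.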
Term frequency (TF) weighted by inverse document frequency (IDF), abbreviated as TFIDF, can partially solve the problem of variation in content and length across documents, but it cannot solve the problem of the distribution of the important words within a document. In general, a document is written in an organized manner to describe its main topic(s). For example, the main topic of a news article may be mentioned in the title and the first part of the document to draw the reader's attention. Therefore, depending on their location, the parts of a document may contribute to different degrees to the document's main topic(s) [51]. In this thesis, we propose new feature weighting methods that address the problem of the distribution of the important words within the document (Chapter 6).

To satisfy the objectives stated in this research, the research questions of this study can be summarized as follows:

What is the impact of text preprocessing techniques such as normalization, stop word removal, and stemming on improving the performance of an Arabic DC system? What Arabic text preprocessing methods are available to be implemented in this research? What are their advantages and disadvantages? How can their performance be compared and improved in order to improve the accuracy of the Arabic document classification system?
What is the impact of feature reduction techniques on Arabic document classification? How can the problem of the high dimensionality of the feature space, and the difficulty of selecting the important features for understanding the document, be overcome?
Which classification algorithms perform best when used on different representations of an Arabic dataset?

1.2. Research Contribution

This research focuses on exploring different preprocessing and dimensionality reduction techniques and investigating their effect on Arabic document classification performance.
More specifically, the main contributions of this thesis are as follows:

We demonstrate that using preprocessing tasks such as normalization, stop word removal, and stemming on Arabic datasets has a significant impact on classification accuracy, especially given the complicated morphological structure of the Arabic language. Furthermore, we demonstrate that choosing appropriate combinations of preprocessing tasks provides a significant improvement in the accuracy of document classification, depending on the feature size and classification technique.
We propose a novel stemmer for Arabic document classification. The proposed stemmer attempts to overcome the weaknesses of the root-based and light stemming techniques, in addition to dealing with the majority of broken plural forms, MWEs, and foreign Arabized words. We compare the proposed stemmer with well-known Arabic stemmers, including root-based stemming (Khoja stemmer) and light stemming (Larkey stemmer), to study its contribution to improving the classification system.
The comparison is carried out across different datasets, classification techniques, and performance measures.

We demonstrate that document summarization helps to improve the efficiency of Arabic document classification by reducing the high dimensionality of the feature space without affecting the value or content of the documents, thereby saving storage space and execution time in the classification process.
We investigate the impact of different feature selection techniques, namely Information Gain (IG), the Ng, Goh and Low (NGL) coefficient, Chi-square testing (CHI), and the Galavotti-Sebastiani-Simi (GSS) coefficient, which have a significant impact on reducing the dimensionality of the feature space and thus improve the performance of the Arabic document classification system.
We investigate the impact of feature representation schemes on the accuracy of Arabic document classification. A document usually consists of several parts, and the important features that are most closely associated with its topic appear in the first parts or are repeated in several parts of the document. Therefore, the proposed weighting methods take into account the importance of the first appearance of a word and the compactness of the word, which can be taken as factors that determine the important features in the document.
Unfortunately, there is no free benchmark dataset for Arabic document classification. One aim of this research is to compile a dataset for Arabic document classification that covers different text genres. It is used in this research and can be used in the future as a benchmark for computational linguistics research, including text mining and information retrieval. The dataset was collected from several published papers on Arabic document classification and by scanning well-known and reputable Arabic websites.
Compiling freely and publicly available corpora is a step forward for the field of Arabic document classification.
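To make the weighting contribution listed above more concrete, the sketch below computes plain TFIDF and then applies a position-based boost that favors words whose first appearance is early in the document. The exact formulas of the proposed weighting methods are not given in this text, so the boost factor `1 + alpha * (1 - first_position / length)` and the parameter `alpha` are purely hypothetical, and the compactness factor is not modeled at all.

```python
import math
from collections import Counter


def idf(docs):
    """Inverse document frequency, log(N / df), over tokenized documents."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log(n / df[t]) for t in df}


def tfidf(tokens, idf_table):
    """Raw term frequency times IDF for one tokenized document."""
    tf = Counter(tokens)
    return {t: tf[t] * idf_table.get(t, 0.0) for t in tf}


def position_boosted_tfidf(tokens, idf_table, alpha=0.5):
    """TFIDF scaled by a hypothetical first-appearance factor.

    A word first seen at the very start is multiplied by (1 + alpha);
    the factor decays linearly toward 1 for words first seen at the end.
    """
    first = {}
    for i, t in enumerate(tokens):
        first.setdefault(t, i)  # remember only the earliest position
    n = len(tokens)
    base = tfidf(tokens, idf_table)
    return {t: w * (1 + alpha * (1 - first[t] / n)) for t, w in base.items()}
```

With `alpha=0` the boosted weights reduce to plain TFIDF, which makes it easy to compare the two representations on the same classifier.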