Itemset Representation and Mining the Rules for Huntington’s Dataset

Association rule mining is not restricted to market basket analysis; it is also employed in many other domains such as health, industry and networking. In this paper, an association rule mining algorithm is applied to the health management domain. It supports decision making by producing rules for the early detection of disease. By examining the personal details and symptoms of a patient, association rule mining helps in predicting and diagnosing the disease at an early stage. The dataset used in this experiment is the Huntington's disease (HD) dataset; HD is a rare disease. The dataset needs to be stored in memory for the computation and generation of rules. Each item takes 4 bytes of memory if an array data structure is used. Furthermore, if the dataset is very large, storing every detail in memory becomes impractical: it is not cost-effective and consumes a lot of resources. One solution is to represent the itemsets in a way that keeps memory consumption low. Here the items are represented using the set representation, which takes less time and memory than the traditional methods. The dataset is mined using the Apriori algorithm, which produces only those itemsets that are frequent, that is, have a high probability of occurrence. The algorithm provides prior knowledge of the frequent itemsets. The rules are then generated from these frequent itemsets. The memory and time consumption of the set representation is compared with that of the array representation of itemsets.

dominantly prevalent for a person aged 20 years, with 5-10 cases per 100,000 in the Caucasian population. Some patients have learning and behavioural difficulties at school. The principal sign of Huntington's disease is chorea. The disease first affects the muscles and then the psychomotor processes, which become severely retarded. Patients also experience psychiatric symptoms [1, 2].

1-1-History
The first case of Huntington's disease was reported by Waters in 1842, but it was not until 1872 that George Huntington gave the description of the disease, which later became known as Huntington's chorea. It is a neurodegenerative disorder that is passed from generation to generation within families, with features ranging from dementia to psychiatric disturbances [1]. It was not until the nineteen-eighties, when the extensive non-motor signs were recognised, that it became known as Huntington's disease. In 1983, a link to chromosome 4 was discovered, and in 1993 the gene for the disease was found [3]. This was the first time the diagnosis could be made. The disease involves a repeat of CAG (cytosine (C), adenine (A) and guanine (G)). CAG is a trinucleotide, a sequence of three of the building blocks of DNA. It is also a codon for the amino acid glutamine [1-3].

1-2-Symptoms
The symptoms of Huntington's disease are motor, psychiatric and cognitive disturbances. Other prevalent symptoms include loss of sleep and weight, along with autonomic nervous system dysfunction. The disease is most common in the age range of 17-20 years. Death is mostly due to pneumonia, followed by suicide. The main symptoms are as follows:

Motor Signs
The motor changes consist of unwanted, involuntary movements, which occur in the fingers, toes and other extremities, as well as in the face. They affect the daily life of the person and are visible to bystanders. The person is nervous. Talking and walking are unstable, which makes the person look drunk. Swallowing is another problem and causes choking in some patients. Patients also experience dystonia, which is characterized by slower movements and abnormal posture of the limbs or trunk. Hyperkinesia and hypokinesia are also seen in patients and result in abnormal walking and standing. This causes an ataxic gait, which may lead to frequent falls [1,3].

Behavioural and Psychiatric Symptoms
These symptoms are common in the early stage of the disease. They affect the daily life of the patient and have a highly negative impact on family life. The signs are anxiety and low self-esteem, which lead to depression. Suicide is prevalent at the onset of the disease. Obsessions can occur, which may lead to frustration in patients. Psychosis may appear in the later stage [3,4].

Dementia
Cognitive decline is also one of the signs of Huntington's disease. Patients cannot separate activities that need attention from ones that should be ignored. They are not able to organize the tasks of their day-to-day life. They are able neither to make adjustments nor to have peace of mind. These misjudgements may lead to complicated situations in which the patient cannot make the decisions expected of him or her in that particular environment. Memory and language become severely impaired [5].

Weight Loss
Huntington's disease also causes weight loss. It has been found that the length of the repeated CAG chain is linked with the weight loss [6]. Over time, the patient becomes cachectic. It has also been found that patients who take supplements or drugs such as neuroleptics and antidepressants show a striking pattern. Patients at the hypokinetic stage tend to display more weight loss. Weight loss often leads to weakening, which raises the risk of developing co-morbidity. This contributes to the decline in the patient's quality of life [7].

Sleep Complications
It has been found that 90% of people diagnosed with HD have sleep problems. It has also been reported that 48-80% of patients experience nocturnal awakenings. During the night, patients show more activity, such as acceleration movements, than healthy people [1].

2-How the Disease Developed?
The disease starts with one parent who has Huntington's disease. Accordingly, Huntington's disease can be divided into three stages: at risk, pre-clinical and clinical. The first stage concerns whether the patient carries the repeated CAG chain on chromosome 4; it ends when the patient is confirmed to carry the chain, after which the other stages begin. The clinical stage is shown in Table 1. Under stress, whether psychological or physical, the signs start to manifest. The signs subside once the person returns to a normal state. In the past, the first signs noted were motor signs, and the diagnosis was suggested depending on the family history and the doctor's experience. Over the last 20 years, other signs, such as psychiatric and cognitive changes, have also been taken into consideration. Patients experience burnout at work or depression [1].

Table 1 (surviving fragment). Clinical stage 3:
Motor disturbances are more severe.
Nearly complete physical dependence.
The patient needs complete care.
Death.

3-Treatment
In clinical practice, antipsychotic drugs are used, but their use is limited by complications such as parkinsonism, difficulty in swallowing and impaired balance [8].

3-1-Treatment for Depression
Depression is the most frequent symptom at the onset of Huntington's disease [9]. Lower metabolic activity in the basal ganglia is evident in depressed HD patients [10]. Drugs used in HD, such as clozapine, have been used for the treatment of psychotic depression [11]. Positive results have also been reported for fluoxetine [12], amitriptyline [13], mirtazapine [14], isocarboxazid [15], phenelzine [15] and amoxapine [16].

3-2-Treatment for Psychotic Symptoms
Psychosis is common in HD patients [17,18]. Risperidone has shown some improvement in treating these symptoms.

3-3-Behavioural Disorder
Lack of control, aggression, emotional dyscontrol and irritability are some of the symptoms of the behavioural disorder. These behaviours disturb the patient's family. They may also lead to more criminal offences, especially among male patients. Haloperidol has been used to treat patients with irritability, depression and emotional outbursts. Olanzapine has shown improvement in the treatment of anxiety, irritability and obsessions.

3-4-Dementia
Dementia is more prevalent in the clinical stage. Unsaturated fatty acids [19] and minocycline [20] have provided benefits in trials. There is no treatment for dementia at clinical stage 1.

3-5-Other Psychotic Symptoms
Other symptoms, such as compulsive or obsessive behaviour, show deterioration on neuropsychological tests. Olanzapine has also been used for the treatment of obsession in HD [21,22]. In one reported case, anxiety was treated using diazepam and amitriptyline [23]. For hypomania in HD patients, propranolol is used [24].

4-1-Dataset: Transcription Profiling by Array of Human Lymphocytes from Moderate Stage Huntington's disease Patients
In the Huntington's disease (HD) dataset, a transcriptomic test was explored and the mRNA in peripheral blood cells was measured. The performance was analysed, and the gene immediate early response 3 (IER3) showed a predominant increase of 32% in HD samples compared to controls. The dataset is publicly accessible at http://biogps.org/#goto=genereport&id=1017&show_dataset=E-GEOD-8762. The experiment consists of samples from 12 HD patients at moderate stage (8 females and 4 males) and 10 matched controls (5 females and 5 males).
In this paper, the dataset is tested using the set and array representations for association rule mining. At clinical stage 1, patients typically do not test positive for HD. However, there are instances where patients showing symptoms at clinical stage 1 are confirmed to have HD. Working manually, the physician can sometimes miss the symptoms and prescribe the wrong medication. Patients are very loyal and put full faith in the doctor; they rarely question the doctor about the medication. It is therefore common for patients to take the wrong medication, which worsens their health. Such cases can lead to the death of the patient. The total number of individuals taken for the experiment is 22. The personal details of the individuals are taken and their CAG repeat counts are analysed. The essential sign of HD is the CAG repeat, so checking the CAG repeat count helps to determine whether or not the patient has HD.

4-2-Dataset: Transcription Profiling of Human Blood from Huntington's disease Patients
In the second dataset, the transcription profile of human blood is analysed and studied. The dataset has 31 samples of human blood containing the Affymetrix U133A expression levels for 17 Huntington's disease patients, of whom 5 are presymptomatic and 12 symptomatic, versus 14 healthy controls. The dataset is available at http://biogps.org/dataset/E-GEOD-1751/transcription-profiling-of-human-blood-from-huntin/. It contains changes in human blood mRNA expression that clearly distinguish HD patients from the controls. These alterations in mRNA expression are associated with how the disease progresses in the experiment, and they may predict the onset of the disease in clinical trials.
Although not much research has been done in this field, the disease is life threatening. HD is very rare, and finding samples for this disease is difficult. Analysing this dataset with association rule mining algorithms helps in decision making. Diagnosing hundreds of patients manually is difficult, and doctors may sometimes miss cases or make errors. Using association rule mining, finding the patients who have HD is much easier and faster than manual work. Hence, association rule mining is more efficient and effective.
Furthermore, if the datasets are very large, storing them consumes a lot of space. In the medical domain, the population is very large and keeps growing. Storing all the patient details and symptoms in the conventional way is neither feasible nor cost-effective.
There will be many attributes for storing both the personal and the medical details of each patient. In memory, each element needs to be stored using a data structure. Using the array representation, each element consumes 4 bytes. Thus, for a dataset with 10 attributes and 1000 individuals, each individual needs 10×4 = 40 bytes, and the memory to be reserved is 40×1000 = 40,000 bytes, or about 40 KB. Using the set representation, however, each element consumes only 1 bit. Each individual then needs 10×1 = 10 bits, and for 1000 individuals the memory consumption is 10×1000 = 10,000 bits, which is 1,250 bytes, or about 1.25 KB.
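The arithmetic can be checked with a short Python sketch. It takes the per-item costs stated in the text at face value (4 bytes per array item, 1 bit per set item) and also shows one possible way to pack a record into a bitmask. The helper names `encode` and `contains` and the item indices are hypothetical and are not taken from the paper.

```python
# Minimal sketch; the sizes mirror the example in the text, and the helper
# names (encode, contains) are hypothetical.
N_ATTRIBUTES = 10
N_INDIVIDUALS = 1000
BYTES_PER_ARRAY_ITEM = 4   # one 4-byte integer per stored item
BITS_PER_SET_ITEM = 1      # one presence bit per item

array_bytes = N_ATTRIBUTES * BYTES_PER_ARRAY_ITEM * N_INDIVIDUALS
set_bits = N_ATTRIBUTES * BITS_PER_SET_ITEM * N_INDIVIDUALS
set_bytes = set_bits // 8


def encode(present_indices):
    """Pack the indices of the items present in one record into a bitmask."""
    mask = 0
    for i in present_indices:
        mask |= 1 << i
    return mask


def contains(record_mask, itemset_mask):
    """True if every item of the itemset is present in the record."""
    return (record_mask & itemset_mask) == itemset_mask


record = encode([0, 3, 9])  # items 0, 3 and 9 are present in this record
print(array_bytes, set_bits, set_bytes)
print(contains(record, encode([0, 9])), contains(record, encode([1])))
```

Python integers carry object overhead, so these byte counts describe the theoretical cost of a packed layout rather than what the script itself allocates.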

5-Association Rule Mining Algorithm
Association rule mining mines the dataset through candidate generation and rule production. One of the most popular association rule mining algorithms is the Apriori algorithm. The Apriori algorithm scans the dataset and finds the itemsets that occur frequently. The frequent itemsets, those occurring most often in the entire dataset, are called large itemsets, $L_k$. Candidate itemsets, $C_k$, whose support count reaches the threshold are kept as large itemsets, and those whose support count is below the threshold are pruned. The large itemsets $L_k$ are self-joined to find the next candidate itemsets, $C_{k+1}$. This process continues until no further large itemsets are found [25].
Rules are generated from these itemsets depending on the confidence value. The confidence metric validates the rule. A rule A→B implies that whenever itemset A occurs, itemset B also occurs. A rule is generated from a frequent itemset only if its confidence value is above the threshold value. In the medical domain, association rules are used for diagnosing a disease. If a person shows a few symptoms occurring together at a particular time, then by analysing those symptoms we can conclude that the person is suffering from a certain disease [29,30].
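The level-wise loop described here (count supports, prune by threshold, self-join the surviving itemsets) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `apriori`, the symptom labels and the toy transactions are invented for the example, and an itemset is assumed to be large when its support count is at least `min_support`.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # First pass: count single items and keep the large 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= min_support}

    frequent = dict(large)
    k = 1
    while large:
        # Self-join the large k-itemsets into (k+1)-item candidates.
        candidates = set()
        for a, b in combinations(list(large), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune candidates that have a k-subset which is not large.
            if all(frozenset(sub) in large for sub in combinations(union, k)):
                candidates.add(union)
        # Scan the dataset to count each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        large = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(large)
        k += 1
    return frequent


# Hypothetical toy records, not taken from the HD datasets.
toy = [
    {"chorea", "depression", "weight_loss"},
    {"chorea", "depression"},
    {"chorea", "weight_loss"},
    {"depression"},
]
frequent = apriori(toy, min_support=2)
```

In this toy run the pair of depression and weight loss occurs only once, so it is pruned, and the three-item candidate is dropped without a further scan because one of its subsets is not large.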
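Rule generation can be illustrated with a short Python sketch. It uses the standard definition confidence(A→B) = support(A∪B) / support(A); the function name `generate_rules`, the symptom labels and the support counts are hypothetical and do not come from the HD datasets.

```python
from itertools import combinations


def generate_rules(frequent, min_confidence):
    """Build rules A -> B from frequent itemsets.

    frequent maps each frequent itemset (a frozenset) to its support count and
    is assumed to contain every subset of each itemset it lists.
    """
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                confidence = count / frequent[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, confidence))
    return rules


# Hypothetical support counts for illustration only.
fs = frozenset
frequent = {
    fs({"chorea"}): 3,
    fs({"depression"}): 3,
    fs({"weight_loss"}): 2,
    fs({"chorea", "depression"}): 2,
    fs({"chorea", "weight_loss"}): 2,
}
rules = generate_rules(frequent, min_confidence=0.8)
```

With these counts only the rule from weight loss to chorea passes the 0.8 threshold, because both records containing weight loss also contain chorea; the other three candidate rules have confidence 2/3.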

7-Research Methodology
The research methodology can be explained with the help of the flowchart given in Figure 5.

8-Results and Discussion
The set and array representations are tested on the first Huntington's dataset. The dataset is mined using the Apriori algorithm. The performance of the two representations is tested over a range of support and confidence values and compared in terms of time and memory consumption. Table 2 reports the time and memory consumption of each representation with confidence = 1% and support values of 1, 2.5 and 5%. Table 3 shows the performance with confidence = 2.5% and the same support values, and Table 4 with confidence = 5%. Figures 6 to 8 plot the comparison between the array and set representations. From Tables 2 to 4 and Figures 6 to 8, it is seen that the set representation performs better for the Apriori algorithm in both time and memory consumption. It is also observed that the time and memory consumption change as the support and confidence values vary.
For the second Huntington's dataset, the itemsets are again represented using both the array and the set representation. The Apriori algorithm is then used to mine the dataset with varying values of support and confidence. The performance is measured in terms of memory and time consumption, and the results are given in Tables 5 to 7. The graphs comparing the array and set representations are given in Figures 9 to 11. From Tables 5 to 7 and Figures 9 to 11, it is observed that the array representation consumes more time and memory than the set representation. The memory consumption of the set representation is almost half that of the array representation, and its time consumption is also lower. It is also evident that as the confidence and support values increase, the memory and time consumption of both itemset representations decrease.
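One way such a timing comparison could be set up is sketched below in Python. It is not the authors' test harness: the random records, the sizes and the function names are assumptions made for illustration, and no claim is made about which variant is faster in Python. The sketch only shows that support counting over index arrays and over bitmasks gives the same count, so the two can be timed against each other.

```python
import random
import time


def to_mask(indices):
    """Pack item indices into an integer bitmask."""
    mask = 0
    for i in indices:
        mask |= 1 << i
    return mask


def support_array(records, itemset):
    """Count records (lists of item indices) that contain every item of itemset."""
    return sum(1 for r in records if all(i in r for i in itemset))


def support_bitset(masks, itemset_mask):
    """Count records (bitmasks) that contain every item, using one bitwise AND each."""
    return sum(1 for m in masks if (m & itemset_mask) == itemset_mask)


def timed(fn, *args):
    """Run fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical synthetic data: 1000 records over 10 items.
random.seed(0)
N_ITEMS, N_RECORDS = 10, 1000
records = [[i for i in range(N_ITEMS) if random.random() < 0.5]
           for _ in range(N_RECORDS)]
masks = [to_mask(r) for r in records]

itemset = [1, 4]
count_array, time_array = timed(support_array, records, itemset)
count_bitset, time_bitset = timed(support_bitset, masks, to_mask(itemset))
print(count_array == count_bitset, time_array, time_bitset)
```

A single run is noisy, so a real comparison would repeat each measurement many times and would also need a language-appropriate way to measure the memory held by each representation.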

9-Rules Generation
The dataset is mined using Agrawal's Apriori algorithm. With varying values of confidence and support, the rules generated show that there are cases where some common patient symptoms can indicate Huntington's disease. However, most of the cases show that patients with the repeated CAG chain are prone to have Huntington's disease. The personal details of a patient do not indicate whether the patient will have the disease. According to the dataset, the crucial sign is the repeated CAG chain, which signals that the patient has Huntington's disease. The symptoms of Huntington's disease are mostly mental and usually not easily detectable at an early stage. They are mostly invisible, unlike physical signs that a doctor can easily spot. The doctor needs time to monitor and observe the behaviour of the patient. By using the rule generation algorithm, the disease can be diagnosed at an early stage.

10-Conclusion
Diagnosing the disease at an early stage helps in its treatment. In the past, the doctor diagnosed the disease from the symptoms of the patient. Since Huntington's disease is very rare and mostly relates to mental illness, it is usually not easily detectable. Moreover, diagnosing the disease manually takes time if the dataset is large, and it is not feasible for the doctor to check every patient manually. There is also a chance of manual error during diagnosis, where the doctor may miss some crucial symptoms. The association rule mining algorithm mines the dataset to support decision making with better accuracy. Using an association rule mining algorithm such as Agrawal's Apriori algorithm, the time for diagnosing the disease is also reduced. Computation with the set representation saves time because it uses bitwise operations. Most importantly, the patient's life can be saved when the disease is diagnosed at an earlier stage. An item in the set representation consumes only 1 bit, whereas in the array representation it takes about 4 bytes. This reduces the memory needed for storage and computation. It also saves cost and resources for the diagnosis and detection of disease, especially when the dataset is very large. Hence, for early detection of the disease and better accuracy of disease detection, the set representation performs better than the array representation in terms of time and memory consumption when mining the dataset.

11-1-Author Contributions
Conceptualization, C.K. and B.N.; writing-original draft preparation, C.K. and B.N.; writing-review and editing, C.K. and B.N. All authors have read and agreed to the published version of the manuscript.

11-3-Funding
The authors received no financial support for the research, authorship, and/or publication of this article.

11-4-Conflicts of Interest
The authors declare that there is no conflict of interests regarding the publication of this manuscript. In addition, the ethical issues, including plagiarism, informed consent, misconduct, data fabrication and/or falsification, double publication and/or submission, and redundancies have been completely observed by the authors.