Cluster Data Analysis with a Fuzzy Equivalence Relation to Substantiate a Medical Diagnosis

This study aims to develop a methodology for justifying medical diagnostic decisions based on the clustering of large volumes of statistical information stored in decision support systems. This aim is relevant because the analyzed medical data are often incomplete and inaccurate, which negatively affects the correctness of medical diagnosis and the subsequent choice of the most effective treatment. Clustering is an effective mathematical tool for selecting useful information under conditions of initial data uncertainty. The analysis showed that the most appropriate algorithm for this problem is based on fuzzy clustering and a fuzzy equivalence relation. The methods of the present study use this algorithm to form a technique for analyzing large volumes of medical data in order to prepare a rationale for making medical diagnostic decisions. The proposed methodology involves the sequential implementation of the following procedures: preliminary data preparation, selection of the purpose of cluster data analysis, determination of the form of results presentation, data normalization, selection of criteria for assessing the quality of the solution, application of fuzzy data clustering, evaluation of the sample results, and their use in further work. Fuzzy clustering quality evaluation criteria include the partition coefficient, the entropy partition criterion, the partition efficiency coefficient, and the cluster power criterion. The novelty of the results lies in the fact that the proposed methodology makes it possible to work with clusters of arbitrary shape and missing centers, which is impossible when using universal algorithms.

these sources may be complex, and their formalization is sometimes very difficult. For example, the information obtained from patients is often characterized by vagueness: "severe pain" or "not severe pain," "weak" or "strong," "recently" or "some time ago." Patients are often unable to recall exactly when their symptoms first appeared [1-3].
Thus, medical decision-making often involves the need to analyze a large amount of statistical information, which a priori is not always complete, explicit, or accurate. That is, there is considerable uncertainty in the initial medical data.
Medical Information Systems (MIS) [4][5][6] are invaluable resources that enable physicians to automatically obtain the comprehensive information necessary for their professional activities (including establishing diagnoses, describing the problem, and prescribing treatment or courses of rehabilitation) [7][8][9]. With the help of the core component of the MIS, the Medical Decision Support System (MDSS), it is possible to collect, structure, store, systematize, analyze, and provide significant amounts of diverse information on a wide range of processes and problems [10][11][12].
If the most relevant information identified through MDSS is used in a timely and reasonable manner, a wide variety of medical challenges can be addressed effectively. An effective means of detecting such information in the large datasets accumulated by MDSS is data mining, which aims to identify patterns and trends in the data. Mathematical tools to achieve this goal include a wide range of algorithms for classification, regression, clustering, prediction, and detection of sequences and associations [13][14][15].
Clustering algorithms are well suited to dividing data into separate groups with certain attributes and to drawing specific conclusions and assumptions about each group. Fuzzy clustering algorithms, in which each data object belongs to different clusters with certain values of the fuzzy membership function, are particularly useful here. With the results of such cluster analysis, it is possible to review large amounts of medical data (including fuzzy data) and reduce them purposefully, separating significant information from unnecessary information and simplifying its further processing [16][17][18].
Thus, the adequacy of medical decision-making depends not only on the systematic accumulation of significant volumes of diverse (including semi-structured or poorly formalized) medical statistical information for all types of processes and problems, but also on its proper analysis and processing, aimed at the reasoned selection of data sets. This selection makes it possible to determine the tools needed to describe a specific medical problem, establish a diagnosis, and prescribe treatment [19][20][21].
In healthcare facilities, the efficiency of using periodically accumulated statistical information in decision support tasks determines the theoretical value of improved methods and algorithms designed to increase the objectivity and reduce the influence of human factors on the decision-making process, especially concerning ambiguity, incompleteness, and uncertainty associated with the initial information. Thus, it is relevant to solve a set of practical problems aimed at MDSS implementation. These include formalizing the preparation of a rationale for selecting the most appropriate medical diagnostic decision (MDD) from the list of recommended options. This issue is the primary motivation for the present work: to develop a methodology for preparing the abovementioned rationale through the clustering of a large volume of statistical information stored in MDSS.

2-Literature Review
An essential role of automated MDSS is to prepare a rationale for selecting the most appropriate MDD for the patient from the list of recommended options [12,22]. The mandatory first step in performing this task is to analyze the initial set of diverse statistical data and select only the information that is important or desirable for a particular medical purpose. Notably, the degree of confidence in the selected information and in the results of its further targeted use by the decision-maker largely depends on how logically and mathematically correctly this analysis is performed [23]. Achieving this goal requires an algorithmic apparatus appropriate to the task and its efficient use in applied medicine.
The most effective mathematical tool for analyzing a large volume of statistical information is clustering, especially given the uncertainty associated with initial medical data. There are many explicit and fuzzy clustering algorithms, each with its distinct advantages, disadvantages, and specific implementation details. There are also hierarchical and genetic versions based on fuzzy clustering, among others [24][25][26].
Explicit clustering algorithms subdivide the initial set of objects X into several disjoint subsets. In this case, any object from X belongs to only one cluster. Fuzzy clustering algorithms allow the same object to belong to several (or even all) clusters simultaneously, though with varying degrees. Fuzzy clustering is more natural than explicit clustering in real situations because objects that correspond precisely to one or another category or class are rarely found. A particular object may have some of the attributes of a class while lacking others. Thus, the membership of such an object in any class turns out to be fuzzy. Formulas for setting the membership functions of fuzzy variables in the general case [27] take the form (1). Modal values of membership functions coincide with the centers of clusters, as shown in Figure 1.
Then the set of candidate rules Ri is formed, constructed from all possible combinations of the input and output fuzzy sets A1i, A2j, Bk. These fuzzy rules, according to which the clustering is performed, have the following form:

R1: IF (x1=A11) AND (x2=A21) THEN (y=B1),

Thus, the set contains 27 fuzzy rules. For each rule, confidence coefficients corresponding to specific elements of the sample are calculated, and then the maximum values of the confidence coefficient are determined.
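The construction of the candidate rule set can be illustrated with a short sketch. It assumes three linguistic terms for each of x1, x2, and y (the term names A12, A13, A22, A23, B2, and B3 are hypothetical), which is one combination that yields exactly 27 rules; the calculation of confidence coefficients is not shown.

```python
from itertools import product

# Assumption: three fuzzy terms per variable, so 3 * 3 * 3 = 27 candidate rules.
TERMS_X1 = ["A11", "A12", "A13"]
TERMS_X2 = ["A21", "A22", "A23"]
TERMS_Y = ["B1", "B2", "B3"]


def candidate_rules():
    """Enumerate an IF-THEN rule for every combination of input and output terms."""
    return [
        f"IF (x1={a1}) AND (x2={a2}) THEN (y={b})"
        for a1, a2, b in product(TERMS_X1, TERMS_X2, TERMS_Y)
    ]


rules = candidate_rules()
print(len(rules))
print(rules[0])
```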
Methods of statistical information clustering in MDSS have already been widely used, and some examples of relevant research are presented below. Thong et al. [28] developed a hybrid model that combines fuzzy clustering of images and intuitionistic fuzzy recommendation systems for medical diagnosis. The authors focused on improving the quality of medical diagnosis, and the accuracy of their hybrid model was better than that of other relevant algorithms. This accuracy was experimentally verified on the UCI machine learning reference dataset. The disadvantage of the proposed hybrid model is its limited application area, which is restricted to image processing. Masulli and Schenone [29] developed a similar segmentation system based on fuzzy clustering to support diagnosis in medical imaging. Noise introduces uncertainty into medical images. In particular, the boundaries between tissues are not precisely defined, and membership in boundary regions is fuzzy. Thus, unsupervised fuzzy clustering methods prove to be particularly suitable for supporting decisions on the segmentation of multimodal medical images. The authors applied the widely used c-means algorithm as the basis for neural network-based clustering. The resulting solution is designed to work with images, which defines its area of use. Poczeta et al. [30] consider the task of processing multivariate medical data related to Parkinson's disease, for which they use fuzzy cognitive maps and k-means clustering. They used the k-means method to group the data and then constructed a separate fuzzy cognitive map for each cluster to improve the accuracy of predictions.
The range of fuzzy clustering algorithms is broad: the fuzzy k-means algorithm, the fuzzy c-means (FCM) algorithm, fuzzy decision trees, fuzzy Petri nets, fuzzy associative memory, fuzzy self-organizing maps, and others [31][32][33]. The k-means algorithm is fundamental, since it is the basis of the more advanced fuzzy c-means clustering method [34,35]. These two algorithms became the basis for many others in this class, and they have many software implementations, for example, the FCM algorithm built into MATLAB.
The k-means method works well when clusters are compact, well-separated clouds. It is effective for processing large amounts of data, but it is not applicable for detecting clusters of nonconvex shape or of very different sizes. The fuzzy c-means clustering method can be seen as an improved k-means method: for each element in the considered set, the degree of its membership in each cluster is calculated. The fuzzy c-means clustering method has limited application due to a significant disadvantage: it cannot partition data correctly into clusters that have different variance along different dimensions (axes) of the elements (for example, if the cluster is elliptical). The FCM algorithm is an unsupervised fuzzy clustering method that does not require human intervention during its execution. For the FCM algorithm, "c" plays the same role as "k" in k-means and denotes the number of clusters, while "F" (fuzzy) refers to the degree of membership. Another disadvantage of the algorithm is that some initial parameters must be set, and a poor initial choice of parameters may affect the correctness of the clustering results. When the data sample and the number of features are large, the real-time performance of the algorithm is low.
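To make the description of FCM concrete, the following is a minimal sketch of the standard fuzzy c-means iteration in plain Python. The fuzzifier m = 2, the random initialization, the stopping tolerance, and the toy data are illustrative assumptions, not values taken from the cited works.

```python
import math
import random


def fcm(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Minimal fuzzy c-means. Returns (centers, u), where u[i][j] is the
    membership degree of point i in cluster j."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Cluster centers are means of the points weighted by u ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append([sum(w[i] * points[i][k] for i in range(n)) / w_sum
                            for k in range(dim)])
        # Memberships are recomputed from relative distances to the centers.
        new_u = []
        for p in points:
            d = [max(math.dist(p, ctr), 1e-12) for ctr in centers]
            new_u.append([
                1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                for j in range(c)
            ])
        shift = max(abs(new_u[i][j] - u[i][j]) for i in range(n) for j in range(c))
        u = new_u
        if shift < tol:
            break
    return centers, u


data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, u = fcm(data, c=2)
labels = [max(range(2), key=lambda j: row[j]) for row in u]
print(labels)
```

The dependence on the initial memberships and on the choice of c is visible directly in the signature: both must be supplied before the iteration starts, which is the parameter-sensitivity issue noted above.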
Based on the above information-analytical review, the following hypotheses were formulated to achieve the aim of the study:

2-1-Hypothesis 1 (H1)
Simplicity, a high implementation speed, and the effectiveness of initial partitioning into clusters are the advantages of fuzzy clustering algorithms in solving many practical problems. However, when they are applied to problems that require analyzing large amounts of semi-structured medical information, they lead in many cases to unreasonable decisions. This is because the insufficiently versatile tools of these algorithms fail to account for the fact that clusters can usually take any form and that cluster centres may be absent or unidentified. Thus, the procedures for partitioning objects into clusters are based only on identifying the interrelation between objects and cluster centres, not on the dependence of data objects on each other.

2-2-Hypothesis 2 (H2)
For the analysis of semi-structured medical information, it seems promising to use an algorithm developed through the fuzzy clustering method, based on the fuzzy equivalence relation, and generated by the properties of the data under study [36]. This algorithm, in which the attribute relationships of the data under study are treated as fuzzy relationships between objects, makes it possible to identify clusters of arbitrary shape. The best solution to the fuzzy clustering problem is selected without using additional information about the clusters.

2-3-Hypothesis 3 (H3)
When the fuzzy clustering method based on the fuzzy equivalence relation is used, it must be adjusted and adapted for each specific type of medical diagnostic task. Furthermore, it may need to be supplemented with other algorithms. Therefore, it is of interest to create a generalized methodology for preparing a rationale for making an appropriate MDD based on the clustering of a large volume of statistical information stored in MDSS.

3-Research Methodology
The workability and efficiency of the fuzzy clustering algorithm based on the fuzzy equivalence relation make it possible to use it in the hardware implementation of MDSS in many areas of medicine [37]. The following procedure is aimed at ensuring efficiency when this method is used to analyze the statistical data required for making decisions in applied medicine. A flowchart explaining the methodology is shown in Figure 2. The proposed approach is based on the clustering of initial statistical data using a fuzzy equivalence relation and includes a mandatory sequence of steps:

3-1-Preliminary Data Preparation
The preparatory process involves selecting the set of objects for analysis and selecting their attributes. It is essential that the attributes clearly and fully reflect the considered set. During this stage, the medical technologies to be applied and the procedures involved are formalized.

3-2-Establishment of Goals of Data Cluster Analysis
Possible goals include:
- Determining the number of clusters and identifying their composition for the data under study;
- Identifying the elements of the object set that are not part of the clusters (the deviations found indicate a pathology in the ongoing process);
- Preparing data, based on cluster analysis results, to solve the problem of classifying and processing results.

3-3-Defining the Representation form of Results
The results of fuzzy clustering data analysis, depending on the type of data, can be represented as:
- Simple enumeration (a universal method of representation where each cluster is identified by its elements);
- Tables (the most appropriate way to represent the results of fuzzy clustering: the rows of the table correspond to data objects, the columns indicate the clusters, and the values in the table cells are the values of the membership function).

3-4-Data Normalization
Data normalization is the conversion of ordinal and categorical data into numerical values. Numerical data are normalized to the range from 0 to 1 so that all attributes carry equal weight when the data are compared. Consequently, when the attribute weights differ, a single variable needs to be used to process the data. Data normalization is usually carried out based on expert assessments.
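A minimal sketch of this stage is given below. It assumes min-max scaling for numerical attributes and an evenly spaced encoding for ordinal labels; the ordering of the labels and the sample values are hypothetical and would in practice come from expert input.

```python
def min_max(values):
    """Scale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute carries no information
    return [(v - lo) / (hi - lo) for v in values]


def encode_ordinal(values, ordered_levels):
    """Map ordinal labels to evenly spaced numbers in [0, 1] (assumed spacing)."""
    step = 1.0 / (len(ordered_levels) - 1)
    rank = {label: i * step for i, label in enumerate(ordered_levels)}
    return [rank[v] for v in values]


# Hypothetical sample values for two attributes of four patients.
temperatures = [36.6, 37.2, 38.4, 39.0]
pain = ["weak", "strong", "moderate", "weak"]

temps_norm = min_max(temperatures)
pain_norm = encode_ordinal(pain, ["weak", "moderate", "strong"])
print(temps_norm)
print(pain_norm)
```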

3-5-Criteria Selection for Assessing the Quality of Decisions
The aim is to assess the quality of fuzzy clustering results so that effective medically related decisions can be made. Therefore, partition coefficients, entropy partition criteria, partition efficiency coefficients, and cluster power criteria should be used.

3-6-Application of Data Fuzzy Clustering
A cluster analysis method based on fuzzy equivalence relation will be applied to medical statistics.

3-7-Analysis of the Results and Recommendations for Their Utilization in Further Work
A brute-force search is carried out over the values in a given range of the number of clusters, and the criteria chosen for the analysis are calculated for each value. Then the best partitioning is selected by analyzing the extrema of the set of criteria. The next operation, depending on the goals, is either measuring deviations or preparing the results for classification.

4-Results
Fuzzy clustering of medical data based on fuzzy equivalence relations under the proposed algorithm is carried out consistently according to the steps outlined below.

4-1-Step 1
Determination of the normal similarity measure by distance for each attribute pi ∈ P of the set of all attributes P. The measure is calculated from d(pi, pj), the distance between attributes pi and pj. Thus, in the process of calculating the normal similarity measure by distance for each attribute pi ∈ P, fuzzy subsets of attributes similar to it are formed.
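As an illustration of Step 1, the sketch below assumes one common form of a distance-based similarity measure, 1 - d(pi, pj) / max_k d(pi, pk), with the Euclidean distance. Both choices are assumptions made for the example; the exact formula used by the authors may differ.

```python
import math


def normal_similarity(objects):
    """mu[i][j]: similarity of element j to element i, assumed to be
    1 - d(i, j) / max_k d(i, k), so that mu[i][i] == 1."""
    n = len(objects)
    mu = []
    for i in range(n):
        dists = [math.dist(objects[i], objects[j]) for j in range(n)]
        d_max = max(dists) or 1.0  # guard for the case of identical elements
        mu.append([1.0 - d / d_max for d in dists])
    return mu


# Hypothetical elements of P, already normalized to [0, 1].
objects = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
mu = normal_similarity(objects)
for row in mu:
    print([round(v, 3) for v in row])
```

Each row mu[i] can be read as the fuzzy subset of elements similar to element i.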

4-2-Step 2
Determination of the relative similarity measure of a pair of attributes pi, pj ∈ P with respect to a third attribute pk ∈ P of the set of all attributes P. The measure is calculated from the normal similarity measures of pi and pj obtained in Step 1.
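A sketch of Step 2 under stated assumptions: the relative similarity of pi and pj with respect to pk is taken as 1 - |mu_k(pi) - mu_k(pj)|, and the pairwise fuzzy relation is then obtained as the minimum over all pk. Both forms, and the sample matrix mu, are assumptions for illustration rather than the authors' confirmed formulas.

```python
def relative_similarity(mu, i, j, k):
    """Similarity of elements i and j as seen from element k
    (assumed form: 1 - |mu_k(i) - mu_k(j)|)."""
    return 1.0 - abs(mu[k][i] - mu[k][j])


def fuzzy_relation(mu):
    """Assumed aggregation: minimum relative similarity over all reference elements k."""
    n = len(mu)
    return [[min(relative_similarity(mu, i, j, k) for k in range(n))
             for j in range(n)]
            for i in range(n)]


# Hypothetical normal similarity measures for three elements.
mu = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.1, 1.0],
]
rel = fuzzy_relation(mu)
for row in rel:
    print([round(v, 3) for v in row])
```

The resulting matrix is reflexive and symmetric by construction, which is what the transitive closure step requires as input.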

4-4-Step 4
Determination of the fuzzy equivalence relation based on the results of calculating, in a cycle, the transitive closure of the fuzzy relation.
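The transitive closure of a reflexive, symmetric fuzzy relation can be computed by repeated max-min composition until the matrix stops changing. The sketch below shows this standard procedure; the input matrix is a hypothetical example.

```python
def max_min_compose(a, b):
    """Max-min composition of two square fuzzy relation matrices."""
    n = len(a)
    return [[max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(rel):
    """Repeat R := max(R, R o R) element-wise until no entry changes."""
    current = [row[:] for row in rel]
    while True:
        composed = max_min_compose(current, current)
        nxt = [[max(x, y) for x, y in zip(r1, r2)]
               for r1, r2 in zip(current, composed)]
        if nxt == current:
            return current
        current = nxt


# Hypothetical reflexive and symmetric fuzzy relation.
rel = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
closure = transitive_closure(rel)
for row in closure:
    print(row)
```

In the example, the direct similarity 0.2 between the first and third elements is raised to 0.6 because they are linked through the second element. The loop terminates because entries can only increase and are always drawn from the finite set of values in the input matrix.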
Gradation of the fuzzy equivalence relation creates many equivalence relations, each of which partitions the initial family into equivalence classes. The granularity of the partitioning of the initial set P depends directly on the level of the relation: a higher level of the relation corresponds to a more detailed partitioning of the set P.
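The gradation can be sketched as a sequence of level cuts of the fuzzy equivalence matrix: at level alpha, two elements fall into the same class when their relation value is at least alpha. The matrix below is a hypothetical fuzzy equivalence relation (reflexive, symmetric, and max-min transitive).

```python
def alpha_cut_classes(equiv, alpha):
    """Equivalence classes at level alpha: i and j share a class
    whenever equiv[i][j] >= alpha."""
    n = len(equiv)
    classes, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        members = [j for j in range(n) if equiv[i][j] >= alpha]
        classes.append(members)
        assigned.update(members)
    return classes


def gradation(equiv):
    """Partition for every distinct level that occurs in the matrix."""
    levels = sorted({v for row in equiv for v in row})
    return {alpha: alpha_cut_classes(equiv, alpha) for alpha in levels}


equiv = [
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.6],
    [0.6, 0.6, 1.0],
]
partitions = gradation(equiv)
for alpha, classes in partitions.items():
    print(alpha, classes)
```

The output moves from a single class at the lowest level to singleton classes at level 1, which matches the statement that higher levels give more detailed partitions.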

4-6-Step 6: Selection of the Level of Fuzzy Equivalence Relation Li for Partitioning the Initial Set Into Clusters
Partitioning into clusters depends on the selected level of fuzzy relation Li; in this case, the number and composition of clusters change. According to the presented algorithm of fuzzy clustering using fuzzy equivalence relation, the best partitioning into clusters should be considered the result that meets the quality criteria of fuzzy clustering. To assess the quality of fuzzy clustering based on fuzzy equivalence relation, the following criteria and some of their modifications are most effective.

4-6-1-Partition Coefficient Kpc
The partition coefficient Kpc is calculated from the following quantities: P, the initial set of attributes; CL, the set of clusters; and rij, an element of the fuzzy equivalence relation matrix. The coefficient reaches its maximum value Kpc=1 for a crisp partitioning; its minimum value, 1/|CL|, indicates the maximum uncertainty, and the corresponding partitioning is considered the worst.
It is also worth noting that when there are too few clusters, the obtained value of the partition coefficient is not adequate for its range of values. In this case, it is reasonable to use the modified partition coefficient Kmpc calculated by Equation 9. The essence of this modification is that it only shifts the range of values.
The modification removes the dependence on the number of clusters that arises from the lower end of the partition coefficient's range of values.
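For reference, the sketch below uses the widely used Bezdek partition coefficient and its rescaled (modified) form, which maps the range [1/c, 1] onto [0, 1]. It is assumed here that Kpc and Kmpc correspond to these standard definitions; r is a membership matrix with one row per element of P and one column per cluster.

```python
def partition_coefficient(r):
    """Standard partition coefficient: mean of the squared memberships per element."""
    n = len(r)
    return sum(v * v for row in r for v in row) / n


def modified_partition_coefficient(r):
    """Rescaled coefficient with range [0, 1] regardless of the number of clusters c."""
    c = len(r[0])
    return 1.0 - c / (c - 1) * (1.0 - partition_coefficient(r))


crisp = [[1.0, 0.0], [0.0, 1.0]]       # every element fully in one cluster
fuzziest = [[0.5, 0.5], [0.5, 0.5]]    # memberships spread evenly

print(partition_coefficient(crisp), modified_partition_coefficient(crisp))
print(partition_coefficient(fuzziest), modified_partition_coefficient(fuzziest))
```

Under these definitions, the lower bound of the unmodified coefficient is 1/c, which is why values obtained for different numbers of clusters are not directly comparable without the rescaling.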

4-6-2-Entropy Partition Criterion Kep
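The entropy partition criterion Kep is calculated from the elements rij of the fuzzy equivalence relation matrix. Its modified form Kmep is not linked to the number of clusters, so it can be used to compare the results of different clustering methods even when the numbers of clusters differ.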
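A sketch of the entropy criterion, assuming the standard partition entropy and a normalization by ln(c) as the modified form (so that values lie in [0, 1] for any number of clusters c). The normalization is an assumption chosen to match the stated property of Kmep.

```python
import math


def partition_entropy(r):
    """Standard partition entropy: 0 for a crisp partition, ln(c) at most."""
    n = len(r)
    return -sum(v * math.log(v) for row in r for v in row if v > 0.0) / n


def modified_partition_entropy(r):
    """Entropy divided by ln(c), an assumed normalization to the range [0, 1]."""
    c = len(r[0])
    return partition_entropy(r) / math.log(c)


crisp = [[1.0, 0.0], [0.0, 1.0]]
fuzziest = [[0.5, 0.5], [0.5, 0.5]]

print(partition_entropy(crisp), modified_partition_entropy(fuzziest))
```

Lower entropy corresponds to a less ambiguous partition, so this criterion is minimized where the partition coefficient is maximized.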

4-6-3-Partition Efficiency Coefficient
The partition efficiency coefficient Kpe is determined by the difference between the coefficient of intra-cluster differences Kpei and the coefficient of cross-cluster differences Kpec by Equation 12, where: P is the initial set of attributes; pi is the i-th component of the set P; p̄ is the average value of the pi components; CL is the set of clusters; cj is the center of the j-th cluster, cj ∈ CL; rij is the element of the fuzzy equivalence relation matrix; and d(pi, cj) is the distance between the two objects pi and cj.
A higher value of the coefficient Kpec corresponds to a higher-quality partitioning; that is, at the optimal number of clusters, the value of Kpec tends to its maximum. The modified partition coefficient and the entropy criterion have no link to the number of clusters. Therefore, they can be used to assess the quality of clustering for both large and small numbers of clusters, with the assessments falling in the range [0, 1].
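The structure of this criterion can be sketched as follows. The weighting by rij squared and the use of squared Euclidean distances are assumptions in the spirit of the Fukuyama-Sugeno index; the source only states that Kpe is the difference between Kpei and Kpec, so the exact form may differ.

```python
import math


def mean_point(points):
    """Coordinate-wise average of all elements (the average value of the pi)."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def partition_efficiency(points, centers, r, m=2.0):
    """Return (k_pei, k_pec, k_pe) under the assumed weighting r[i][j] ** m."""
    p_bar = mean_point(points)
    pairs = [(i, j) for i in range(len(points)) for j in range(len(centers))]
    # Intra-cluster differences: weighted distances of elements to cluster centers.
    k_pei = sum(r[i][j] ** m * math.dist(points[i], centers[j]) ** 2 for i, j in pairs)
    # Cross-cluster differences: weighted distances of centers to the overall mean.
    k_pec = sum(r[i][j] ** m * math.dist(centers[j], p_bar) ** 2 for i, j in pairs)
    return k_pei, k_pec, k_pei - k_pec


# Hypothetical data: two compact, well-separated clusters with crisp memberships.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
r = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

k_pei, k_pec, k_pe = partition_efficiency(points, centers, r)
print(k_pei, k_pec, k_pe)
```

For compact, well-separated clusters the cross-cluster term dominates the intra-cluster term, which is consistent with a large Kpec indicating a good partitioning.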

4-6-4-Cluster Power Criterion
The cluster power criterion is based on the concept of a powerful cluster, understood as a cluster that is of practical use at a given level of significance of its equivalence class in the gradation of the fuzzy equivalence relation. The quality assessment algorithm that uses this criterion is based on the concepts of the equivalence relation level of powerful clusters and the intermediate coefficient.
The above criteria make it possible to handle ambiguous clustering problems effectively. For example (Figure 3), two clusters that are clearly separated in the two-dimensional attribute space X×Y overlap when projected onto the x-axis, so a one-dimensional analysis leads to the conclusion that only one cluster exists. The result is a single cluster A1 whose center a1 does not correspond to either of the centers of the two-dimensional clusters. A similar case of complete or partial overlapping of clusters may arise for the y-axis, making it impossible to determine the number of clusters and the coordinates of their centers correctly without using the criteria of clustering quality assessment.

5-1-Main Findings of the Present Study
The discussion focuses on the results of the data analysis used to prepare a rationale for MDD selection. The main result of using the proposed methodological approach of fuzzy clustering of medical statistical data based on fuzzy equivalence relations is the partitioning into clusters of a large volume of statistical information stored in MDSS, in a way that corresponds to a particular clustering goal under conditions of uncertain initial medical data. Such a goal could be, for example, preparing a rationale for MDD selection. To solve this problem, the methodological approach can be formalized more specifically as follows.
The medical statistical information used in MDSS for clustering is a health card (HC) for each patient, displaying his or her health status (e.g., the patient's body temperature and the results of blood and urine tests). Each HC is characterized by classification attributes (measure, value, patient characteristics, importance, and norms). The system must store the HC for each patient in a normalized form [38,39]. For fuzzy clustering, an initial set P is formed based on HC values for each patient.
The fuzzy clustering algorithm partitions the HC set by classifying attributes into clusters representing subsets of the initial set P. Based on the results of fuzzy clustering (the set CL of HC clusters and the membership matrix rij), it is possible to conduct MDD selection. MDSS must store MDD templates for each diagnosis. These templates are compiled by a panel based on the results of analysis of medical statistics. For each template, a set of HCs with certain values of weight coefficients should be stored. If a patient's diagnosis is defined, the MDD corresponds to the template stored in the system. Otherwise, it is possible to get an MDD from a general list without linking it to a diagnosis. In this case, the MDD list will be much larger, but the accuracy of the proposed decision will be lower. In this regard, it is necessary to compare the values of the patient's HC weight coefficients rij, obtained by fuzzy clustering, with the weight coefficients of the templates. Then, based on this comparison, a ranked list of possible MDDs is generated from the assessment of the similarity measure between the patient's HC and the HCs of the templates.
The score ϕ(Sk), where S is the set of templates, specifies the similarity measure between an MDD template and the current patient's health status (PHS). It is determined by a formula in which N is the number of concepts belonging to the PHS model; M is the number of concepts belonging to template Sk; rj is the importance of the concept in the patient's situation, with a corresponding coefficient for the importance of the concept in the template; and μ(Cj, Ci) is the similarity of the i-th and j-th concepts. A higher value of ϕ(Sk) corresponds to a template that is closer to the patient's situation and that has greater importance in the set of actions for the patient. The number ϕ(Sk), belonging to template Sk, is called a criterion score, and the generated scale is a criterion scale.
Thus, the desired template will be a set of smaller templates that meet the condition max ϕ(Sk) over Sk ∈ S.
The templates with the highest criterion scores among those in the medical area are included in the list of selected templates, ranked from the highest to the lowest value. The doctor selects the most appropriate option from the recommended list according to the patient's situation. Then, the patient's treatment method is formed according to the selected MDD.
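The ranking procedure can be illustrated with a small sketch. The score used here, an importance-weighted sum of concept similarities divided by N * M, is an assumed form chosen only to show the mechanics; the concept names, weights, template names, and the exact-match similarity function are all hypothetical.

```python
def similarity(concept_patient, concept_template):
    """Toy concept similarity: 1 for identical concepts, 0 otherwise (assumption)."""
    return 1.0 if concept_patient == concept_template else 0.0


def score(patient, template):
    """Assumed criterion score: importance-weighted similarity over all
    concept pairs, divided by N * M."""
    total = sum(r_j * w_i * similarity(c_j, c_i)
                for c_j, r_j in patient.items()
                for c_i, w_i in template.items())
    return total / (len(patient) * len(template))


def rank_templates(patient, templates):
    """Template names ordered from the highest to the lowest criterion score."""
    return sorted(templates, key=lambda name: score(patient, templates[name]),
                  reverse=True)


# Hypothetical patient health status and MDD templates: concept -> importance.
patient = {"fever": 0.9, "cough": 0.6}
templates = {
    "template_A": {"fever": 0.8, "cough": 0.7},
    "template_B": {"rash": 0.9, "fever": 0.2},
}

ranking = rank_templates(patient, templates)
print(ranking)
```

The first entry of the ranking plays the role of the template with the maximum criterion score, and the full list is what would be shown to the doctor for the final choice.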
The main result of using the proposed methodological approach to the analysis of initial medical data is the appropriate clustering of a large volume of statistical information stored in MDSS, including in the context of the uncertainty of the initial medical data.

5-2-Comparison with Other Studies
The formalized methodology of motivational base preparation proposed in this paper should provide a uniform choice of the most appropriate options of MDD based on the clustering of a large volume of statistical information stored in MDSS. The opportunity to work with clusters of arbitrary shape and missing centers provides an advantage over known universal algorithms.

6-Conclusion, Recommendation, and Future Direction
Computer technology is becoming an integral part of all areas of medicine and health care [40]. Decision support systems, which accumulate significant volumes of statistical information of a medical nature, make it possible to obtain in automated mode only the information that is required to provide a motivational basis for selecting the most appropriate MDD option for a particular patient from a recommended list.
The most effective mathematical tool for selecting (from the entire array of accumulated data) information suitable for a specialist, especially in the uncertainty of the initial medical data, is clustering. Among the wide range of known algorithms of explicit and fuzzy clustering, the most suitable to solve practical problems related to the need for analysis of poorly formalized and semi-structured information is the algorithm developed through the fuzzy clustering method and based on the fuzzy relation of equivalence generated by the properties of the data under study. It has proven its effectiveness in solving many practical problems.
The mathematical apparatus implemented within this fuzzy clustering algorithm, based on fuzzy equivalence relation, forms the basis of the proposed methodological approach to the initial statistical data analysis necessary to make medical decisions. This approach consistently implements the following procedures: preliminary data preparation, goal selection of data cluster analysis, definition of the resulting representation form, data normalization, selection of decision quality evaluation criteria, application of fuzzy clustering of data, assessment of the sample results, and their use in further work.
The formalized methodology of motivational base preparation proposed in this paper should provide a uniform choice of the most appropriate options of MDD based on the clustering of a large volume of statistical information stored in MDSS. The application of the proposed methodology and algorithms is limited by the following disadvantage: the inability to partition correctly into clusters when they have significantly different variance in different dimensions. Thus, eliminating this disadvantage determines the prospects for further research to improve the approaches outlined in this paper.

6-1-Strengths and Limitations
It should be noted that, when implementing and using the proposed methodology to process medical data, the following should be taken into account:
- Statistical medical data for analysis should be preliminarily checked for outliers and incorrect elements by experts in the field of data engineering;
- The variance across different dimensions (axes) in medical data clusters should not differ significantly (approximately no more than 25%).

7-2-Data Availability Statement
The data presented in this study are available in the article.

7-3-Funding
Selected findings of this work were obtained under the Grant Agreement in the form of subsidies from the federal budget of the Russian Federation for state support for the establishment and development of world-class scientific centers performing R&D on scientific and technological development priorities (internal number 00600/2020/56890) dated November 13, 2020, No. 075-15-2020-929.

7-4-Acknowledgements
The authors are grateful to Professor L. Chervyakov for a careful discussion of this paper.

7-5-Ethical Approval
The article follows the guidelines of the Committee on Publication Ethics (COPE) and involves no studies on human or animal subjects. Consent to participate is not applicable, since the research does not involve studies on humans.

7-6-Conflicts of Interest
The authors declare that there is no conflict of interests regarding the publication of this manuscript. In addition, the ethical issues, including plagiarism, informed consent, misconduct, data fabrication and/or falsification, double publication and/or submission, and redundancies have been completely observed by the authors.