Abstract
This paper proposes a new fuzzy classifier based on reinforcement learning. A fuzzy rule based classification system is a special type of fuzzy modeling where its output is a discrete crisp value. The main challenging issue in designing fuzzy classifiers is constructing fuzzy rule base. Here, each fuzzy rule is considered as an agent who has to select the suitable class between candidate classes. It is considered a weight for each candidate class in each rule. These weights are adjusted using the proposed reinforcement learning algorithm. For each sample of training data, if the final result is true, the winner rule (agent) is rewarded and some other rules are punished based on the criteria which are defined in this paper. If the result is false, the winner rule is punished and the rules with high firing strength that have selected correct class are rewarded. Moreover, the input membership functions of rules are adjusted regarding the defined criteria which depend on punishment frequency of rules. The proposed approach is assessed on some UCI datasets. We compare our ideas in comparison with conventional reward and punishment scheme and multi-layer perceptron network. The experimental results show that our proposed approach outperforms both mentioned approaches in the terms of quality of classification and precision.
Introduction
In recent two decades, fuzzy rule-based systems have been extensively applied to pattern classification area since they can achieve good trade-offs between the accuracy of results and rule interpretability [18]. Basically, the design of these systems consists of finding a compact set of fuzzy if-then classification rules which have the capability of modeling input–output behavior of the system. This behavior is typically available as a set of input-output sample pairs. Since the construction of a rule base in a specific pattern classification problem using expert knowledge is difficult; many various approaches have been tried to extract fuzzy rules automatically from numerical data. This includes neuro-fuzzy techniques [9–11], clustering methods[13, 15], genetic algorithms [3, 5], data mining techniques [2, 6], and reward and punishment based learning algorithms [7, 12].
The reward and punishment based learning algorithm utilizes the concepts inspired by reinforcement learning which is used in various issues of machine learning like feature selection [4], learning classifier [17], and web pages ranking algorithm [16].
There are two main steps to construct fuzzy rules from numerical data: 1) identifying the structure of fuzzy classifier which includes partitioning the feature space to fuzzy subspaces, determining input membership functions and selecting antecedent and consequence parts of each rule for an initial rule base, and 2) adjusting the parameters of the classifier.
The grid partitioning can be used over each dimension of input patterns to generate primary fuzzy rules [6, 18]. Alternatively, multi-dimensional fuzzy sets may be generated by applying a clustering algorithm to feature space [1]. Antecedent fuzzy sets may have pre-specified linguistic values with fixed homogeneous membership functions for each axis of the feature space or may be defined by domain experts [3].
Generating a set of candidate rules for each class of a classification problem is the first step of constructing an initial rule-base for that problem. The second step is selecting a given number of rules from each class according to a specific metric [8].
Having this initial rule base, it is usual to improve the performance of the classifier by adjusting input membership functions in various methods; or determining rule weights as a simple mechanism. It is also useful to apply a rule pruning procedure to enhance the efficiency and comprehensibility of the classifier [12, 18]. The learning of rule weights can partially replaced by modification of membership functions of antecedent fuzzy sets; but, adjusting rule weights as a single parameter per rule is much easier and faster [18].
A reward and punishment algorithm was presented in [12]. In this algorithm (we called it as basic algorithm), the fuzzy rules compete among themselves in terms of fuzzy reasoning to be a decision maker. The winner rules will be encouraged by reward if their decision for each sample of train dataset is correct, and they will be punished if they classify the sample as the member of a wrong class.
In this method [12], for each sample the reward or punishment is used to modify the weight of the winner rules. It means this algorithm adjusts only one rule in each step. We are motivated to adjust the parameters of some other rules after receiving each reward or punishment, too. For example, the loser rules in rules competition with correct class label when the sample is classified wrong, or some rules with very high firing strength that lost correctly when the classification result is true, can be adjusted too. In this way, we can adjust these rules before they visit their own samples, which their class label is as same as the consequence of these rules. These accelerate learning.
On the other hand, sometimes adjusting rule weights in reward and punishment algorithm cannot enhance the results well and since the basic reward and punishment method does not adjust the antecedent parameters of rules; it causes rule base to swing in its decisions. These conditions motivate us to modify fuzzy input subspaces to cope the mentioned challenge.
The main contributions of the paper are as follows: 1) Adjusting the parameters of some loser rules which can indirectly affect the classification performance, 2) identifying rules which swing in their decisions and improving their behavior through modifying their antecedent part, 3) Defining three criteria in adjusting antecedents’ parameters of rules.
In this paper, we propose a new fuzzy classifier based on reward and punishment (reinforcement learning) algorithm to adjust the parameters of antecedent membership functions as well as fuzzy rule weights. Some new criteria are defined to criticize the behavior [12] of each fuzzy rule in training phase.
The rest of this paper is organized as follows. First in Section 2, the structure of a fuzzy rule based classification system, fuzzy reasoning techniques, and the basic reward and punishment algorithm for weight adjusting are explained. In Section 3, the proposed algorithm is proposed. Section 4 includes the experiments and comparison results. Finally, concluding remarks are given in Section 5.
Fuzzy rule based classification systems
Generation of a fuzzy rule based classification system consists of two main procedures: 1) a fuzzy rule generation procedure which determines data base, and constructs rule base, 2) a classification procedure with a reasoning method [8]. The data base describes the meaning of fuzzy sets associated to each linguistic label. Each rule in rule base specifies one of fuzzy subspaces using defined fuzzy sets in its antecedent part. The reasoning method provides the manner to classify patterns. In the following, we describe the structure and function of these procedures and introduce the basic algorithm which is improved in this paper.
The structure of rule base and fuzzy reasoning methods
A widely used type of fuzzy rule for ann-dimensional classification problem is as follows:
where X = [x1, x2, ⋯ , x n ] is an input feature vector, C ∈ {C1, C2, ⋯ , C M } is the label of consequent category, A jk is the fuzzy set associated to x k , CF j is the weight of rule R j , and N is the number of fuzzy rules in the rule base.
In order to classify a sample like X
p
= [xp1, xp2, ⋯ , x
pn
] belongs to the dataset, first, the degree of compatibility of the pattern with each rule (α
A
jk
) is calculated. Then, the firing strength of the rule (μ
j
) is specified using the given T-norm operator as an “and” operator. If the product is used as the T-norm operator, μ
j
will be calculated as Equation (2).
To determine the category of the input sample, there are two different reasoning methods with respect to the strength of fired rules and their weights: 1) single winner and 2) weighted vote [9, 14].
In the case of single winner, the input sample is classified as a member of the class which appears in the consequence of winner rule (R
w
). This consequence named C
w
and determined as Equation (3):
There are two particular conditions in this method: 1) if a unique rule cannot be identified as a winner for an input pattern by using Equation (3), the fuzzy system does not classify that pattern at all. 2) The system classifies those input patterns which could stimulate at least one of fuzzy rules.
In the case of weighted vote reasoning method, all the fired rules participate to classify the input pattern. The voting portion of each rule defines as the product of its firing strength and its weight. At the end, the input sample is classified in the class which the total strength of the vote for that is higher than others. The recent strength can be calculated as Equation (4) where RB is the fuzzy rule base and Class (R
j
) is the label of the class appears in consequence of rule R
j
.
Usage of reward and punishment concepts to adjust fuzzy rule weights has been offered in [12]. This learning algorithm can be applied in both of single winner and weighted vote reasoning methods. Here, the first method is used.
After classifying the trained sample X
p
in single winner method, the decision maker rule will be rewarded if the selected class is true; otherwise it will be punished. This reward or punishment affects the weight of winner rule directly by increasing or decreasing it. To reward or punish the winner rule, the Equations (5) and (6) are used respectively:
where η1 and η2 are the positive learning rates which are used to increase or to decrease the weight of rule (R w ).
In the single winner reasoning method, the weight of winner rule is the only parameter that is adjusted according to classification results. However, if the result of classification is wrong, not only the decision-maker (winner) rule should be punished but also other rules (called as group ‘A’) which could determine the true class, should be punished either. In fact, the disability of these rules is caused that the wrong rule be the winner. So, this group of rules should be reinforced till they could be winner for the input samples of their class.
In the other hand, when the classification of an input sample has been done correctly, another group of rules with high firing strength (called as group ‘B’) should receive a little punishment so that we have a safe margin for winner rule in this situation. In fact, this punishment causes difference between score of winner rule (μ w (X p ) . CF w ) and score of rules in group ‘B’, is increased. A criterion is needed to identify these groups of rules. The members are rules that their firing strengths are higher than the defined threshold, β (e.g. a given percent of the winner rule’s firing strength). This threshold should be between 70% and 100% in order to create groups ‘A’ and ‘B’ with appropriate rules; otherwise the parameters for lots of rules will be changed and the initial boundaries computed for classes in initial rule base will be completely distorted.
The Equations (7) and (8) are used to encourage (increase weights) and to punish (decrease weights) the group ‘A’ and the group ‘B’, respectively:
Where η3 and η4 are the learning rate, R a and R b is a member of group ‘A’ and group ‘B’, respectively.
The following example clarifies this grouping method. Consider the classification problem of Fig. 1 which has a 2-dimensional feature space and 4 class labels. Suppose the following rules are the portion of an initial rule base which has been created by the training samples and the weight of all rules initialized by 1 at the beginning of learning:
R1: If x1 is A1 and x2 is B1 then C1
R2: If x1 is A1 and x2 is B2 then C3
R3: If x1 is A2 and x2 is B1 then C2
R4: If x1 is A3 and x2 is B1 then C4
R5: If x1 is A2 and x2 is B2 then C2
Consider the selected sample shown in Fig. 1, the rule R1 with wrong consequence C1 will be the winner and it classifies this sample incorrectly! In this condition, based on our idea R1 and rules in Group ‘A’ should be reinforced, where group ‘A’ can be included rules R3 and R5 depending the amount of threshold. The higher the threshold value, the fewer rules can be included in group ‘A’ and vice versa. So, if a high value is used, Group ‘A’ will only include rule R3 and it should be reinforced till it could be winner for this input sample.
Generally, adjusting only rule weights cannot enhance the accuracy of classification to the desired accuracy. Indeed, if the default subspaces utilized in rules cannot cover the training data very well, to reach high accuracy modifying the antecedent part of rules is suggested. For this matter, we define two metrics for each rule (S . T 1 ): i) the number of punishments of the rule, ii) the fluctuation number of its classification result.
To obtain the number of punishments, each rule will have a counter. It will be increment if the associated rule becomes a member of group ‘B’ or it makes a wrong classification decision as a winner rule. If the value of this counter exceeds a predefined threshold, the process of learning will switch to antecedent modification phase. This counter updates as follows:
where pc R k is the punishment counter of rule R k , k = 1, …, N. The algorithm utilized Equation (5) till Equation (8) for update weights of rules and the above defined punishment criteria for adjusting the antecedent of rules, called Reward and Punishment Classifier with Punishment Criterion (RPCwPC).
In the continuous, second defined metric is described. Sometimes, the decision of the rule swings between correct and wrong result; in another word, this rule sometimes wins the competition correctly (it leads to correct class) and sometimes its conquest lead to classify the input pattern in a wrong class. The more swinging in result makes it more necessary to modify the antecedent part. It is possible to use a distinct counter to determine the fluctuation of each rule. This counter is defines as follows:
where sc R k is the swing counter of rule R k , k = 1, …, N. The algorithm utilized Equation (5) till Equation (8) for update weights of rules and the above defined swing criteria for adjusting the antecedent of rules, called Reward and Punishment Classifier with Swing Criterion (RPCwSC).
It is possible that the fluctuation of a rule decreases after modifying its antecedent part in several times. So, it may be useful to utilize a forgetting factor in order to reduce the effect of initial fluctuation. In this way, each rule can change the status of its operation between weight adjusting and antecedent modification according to visited input patterns. The Reward and Punishment Classifier with Forgetting Swing Criterion (RPCwFSC) is the name used for a classifier with this defined criterion. The forgetting swing counter for rule R k , k = 1, …, N (fsc R k ) is defined as Equation (11) where 0 < γ ≤ 1 is the forgetting factor.
Respecting three defined criteria, 3 proposed algorithms switch to antecedent modification phase for some rules. In this phase, all the membership functions (except “don’t care” ones) of these fuzzy rules are shifted in an allowed interval to decrease or increase their firing strength. If the training input sample belongs to the class of rule, the shifting is done so that the rule’s firing strength is increased. Vice versa, if the training input sample does not belong to the class of rule, shifting is done to decrease the rule’s firing strength.
The shifting depends on the percentage of the distance between the value of the sample and the membership function center in each dimension as Equation (12) where mfcR k ,i is the centre of membership function in dimension i of rule R k , k = 1, …, N which changes of its value make the corresponding membership function shift along its own feature axis and 0% < λ ≤ 100 % is the shift rate.
All notations used in this algorithm are summarized in Table 1. The flowchart of general proposed algorithm is seen in Fig. 2. This flowchart is independent of the switch criteria.
Similar to the work done in [6–8, 18], 14 different membership functions for each dimension of feature space (Fig. 3) have been considered. Whereas various combinations of these fuzzy sets create plenty of rules (14 n distinct rules where n is the number of dimensions of feature space); a “don’t care” fuzzy set is introduced which its value is equal to 1 in whole interval of input dimensions. Then, all possible rules are created so that they do not have a “don’t care” membership function in at most two ones of n input dimensions.
Assume that m labeled patterns X p from M classes are given. So, the consequence of each rule is determined considering the highest value of the summation of its firing strength for the samples with the same class label in the training set:
where A
j
⇒ ClassC
j
is the expression of rule R
j
as an association rule from the field of data mining [18], the confidence of the rule is computed as Equation (14).
Lastly, to construct the initial rule base, the generated candidate rules with same consequent classes are sorted in descending order according to Equation (15) and a predefined number of them are selected for each class. In our experiments, we constructed a rule base by selecting 100 candidate rules from each class. Equation (15) ranks the candidate rules of each class based on difference of confidence levels.
In order to assess the performance of the proposed algorithm, we used three datasets available from UCI ML repository 2 . The statistical information of these datasets is seen in Table 2.
The samples in each dataset are randomly divided into training and testing sets with a ratio of 2 to 1. The attributes of the problem are normalized into a real number in the interval [0,1]. The determined amounts of the parameters of our proposed algorithm are available in Table 3 for each dataset. The proposed algorithm is not very sensitive in respect of its individual parameters. We inspired the values used in [12] to initialize the learning rate parameters. A designer can find a suitable combination of these parameters just by experiencing their different combinations a few times.
In rule generation phase, the number of created rules is 252 for Glass dataset, 300 for Iris, TAE, and Seeds datasets, and 200 for Sonar dataset. The training process is repeated 500 times per experiment. To implement MLP neural network we used MLP as it is a well-known neural network.
In our primary experiments for MLP with a single hidden layer, besides increasing the number of neurons, the results were not acceptable. Therefore, a MLP with two hidden layers were suggested. So, we create a three layered structure with 30 and 10 hidden neurons in each hidden layer for Glass, Iris, Sonar, and Seeds datasets and a two layered network with 5 hidden neurons for TAE dataset. The evaluation of the experiments is done by the average of evaluation accuracy over 10 independent runs. Table 4 shows the comparison results.
As seen in Table 4, in all experiments, the results of our proposed algorithm are better than the basic reward and punishment method. This improvement occurs because the proposed algorithm uses the reinforcement signal to adjust some rules parameters whereas the basis algorithm only adjusts one rule for each sample.
Moreover, our algorithm prevents the swing cases in the training. It modifies these rule weights and monitors their results during training phase to adjusting their antecedent if needed.
The methods RPCwPC and RPCwSC achieve more favorable result for Glass dataset. The RPCwPC increases the accuracy about 1.5% and the RPCwSC enhances it about 10% in comparison to basic reward and punishment method. The RPCwFSC has the same result as the basic method. In fact, because of the value of forgetting factor set to 0.9, this approach never arrive to antecedent modification phase and its operation is close to the basic algorithm. In comparison to MLP, RPCwSC has better accuracyabout 8% .
In Iris dataset, RPCwSC and RPCwPC can improve the accuracy of classification and increase it about 8% more than the basic algorithm, but in general, MLP presents better result.
Applying RPCwSC on TAE dataset resulted higher accuracy about 4% in comparison to basic reward and punishment method and MLP. The effect of forgetting factor in RPCwFSC is observable in TAE and Sonar experiments where the accuracy is enhanced about 2% more than the basic algorithm and MLP, respectively. In Seeds dataset, RPCwSC can dominate others in increasing the accuracy of classification and make it about 4% more than the basic algorithm.
Figure 4 shows how the input membership functions have been changed after running the RPCwSC algorithm in Glass dataset. The more number of membership functions for a specific dimension of feature space indicates the more complex discriminate boundary between classes in the dimension.
In this paper, a reward and punishment based learning algorithm was proposed for adjusting the parameters of a fuzzy rule based classification system. The single winner reasoning method was improved. In the proposed algorithm, the parameters of another group of rules are adjusted more than the rule which affects the classification result directly. The members of this group participate to classify input samples indirectly.
The RPCwPC, RPCwSC, and RPCwFSC are the versions of the proposed algorithm. They choose one of both, weight adjusting and antecedent modification phases. This choice is based on numbers of punishments, fluctuation of each rule in training samples classification and its discounted type. The result of applying this algorithm on some UCI datasets shows its particular version RPCwSC can increase the accuracy of results significantly.
However, the proposed approach works quite well, but of course, there are more sophisticated options which can be tested in future developments. The future work can apply this algorithm to improve weighted vote reasoning technique. It can utilize reward and punishment formulas to modify antecedent parts, and can adapt some pruning techniques to compact the final fuzzy rule base.
