Rival penalized competitive learning

From Scholarpedia

Lei Xu (2007), Scholarpedia, 2(8):1810. revision #37114 [link to/cite this article]

Curator: Dr. Lei Xu, Dept. Computer Science & Engineering, Chinese University of Hong Kong

Rival penalized competitive learning (RPCL) is a development of competitive learning in help of an appropriate balance between participating and leaving mechanisms, such that an appropriate number of agents or learners will be allocated to learn multiple structures underlying observations. Moreover, RPCL has been further extended into a general divide-conquer learning framework for multi-agents to divide and conquer a complicated task.

Contents

Classical competitive learning: from WTA to FSCL

Figure 1: Competitive learning for clustering analysis and dead unit problem.
Enlarge
Figure 1: Competitive learning for clustering analysis and dead unit problem.

Proposed decades ago, classical competitive learning (CCL) is usually used for making data clustering by a number of units or agents. Each agent is featured by a parametric point \theta_j=m_j that moves among the observation samples and seeks to be located at the center of a cluster of samples, such that several points collectively perform clustering analysis. As a sample x_t comes, each m_j evaluates a degree of its suitability on representing x_t by an individual value or criterion \varepsilon_t(\theta_j)=\Vert x_t- m_j \Vert^2, and competes to be allocated to represent x_t. The competition is guided globally by the winner-take-all (WTA) policy

(1)
p_{ j, t}= \begin{cases} 1, & \mbox{if}\quad j=c, \\  0, & \mbox{otherwise;}\end{cases} \qquad c=arg \ min_{ j} \varepsilon_t(\theta_j).

Then, each m_j is updated to further reduce \varepsilon_t(\theta_j) by

(2)
m_j^{new}= m_j^{old}+ \eta p_{ j, t} (x_t- m_j^{old}).

When m_1, m_2 are initialized as in Fig.1(a), as a sample x comes, m_1 becomes the winner and then moves a little bit towards x. Next, as another sample x comes in Fig.1(b), m_2 becomes the winner and then moves similarly. Finally, m_1 and m_2 each converge to one clusters' center as in Fig.1(c).

However, if m_1, m_2 are initialized as in the top part of Fig.1(d), m_2 always becomes the winner and finally moves the middle point between two clusters, while m_1 never wins and becomes dead. This is known as the under-utilized or dead unit problem, featured by a WTA based competitive learning (Rumelhart et al, 1985; Grossberg, 1978; Hecht-Nielsen, 1987), which makes the task of clustering analysis badly performed. This problem comes from unequal chances for the dead agents to participate competition due to their unfavorable initial locations. Thus, it is desired that every agent should be initialized to have an equal chance to join competition, which is actually difficult to implement without a priori knowledge on the entire clustering structure, while such a knowledge is exactly the goal of this competitive learning.

One way to tackle this difficulty is a so called frequency sensitive competitive learning (FSCL) (Ahalt, et al, 1990), in help of modifying \varepsilon_t(\theta_j) into

Figure 2:   Samples of four clusters.
Enlarge
Figure 2: Samples of four clusters.
Figure 3:    Three dead units by CCL (double click it to see animation).
Enlarge
Figure 3: Three dead units by CCL (double click it to see animation).
(3)
\varepsilon_t(\theta_j)=\alpha_j \Vert x_t- m_j \Vert^2,

where \alpha_j is the frequency that the j-th agent won in past. Illustrated in Fig.2 is an example of four clusters, the CCL resulted in three dead units in Fig.3, while the FSCL makes all the four units placed to the cluster's centers instead of becoming dead, as shown in Fig.4.

Figure 4:   FSCL makes all the four units placed (double click it to see animation).
Enlarge
Figure 4: FSCL makes all the four units placed (double click it to see animation).
Rival penalized competitive learning
Enlarge
Figure 5: When k is pre-assigned to 5. the frequency sensitive mechanism also brings the extra one into data to disturb the correct locations of others (double click it to see animation).

Such a sprit of reducing the winning rates of frequent winners can also be implemented in alternative ways, e.g., conscience (Desieno, 1988), leaky learning (Rumelhart et al, 1985; Grossberg, 1978), convex bridge (Hecht-Nielsen, 1987), and neighbor sharing in Kohonen network (Kohonen,1982, 1995).

Rival penalized competitive learning

Figure 6:  RPCL and automatic allocation.
Enlarge
Figure 6: RPCL and automatic allocation.

The way of reducing the winning rates works when the number k of clusters is known. Unfortunately, it is usually unknown. Not only it is no doubt that both CCL and FSCL will result in bad performances when k is given a smaller number, but also FSCL fails when k is pre-assigned to a value larger than an appropriate one. As illustrated in Fig.5, the frequency sensitive mechanism will bring the extra agents into data to disturb the correct locations of other agents. Thus, how to determine k is a critical difficulty too.

One solution is provided under the name of Rival Penalized Competitive Learning (RPCL), proposed in the early 1990's for clustering analysis and detecting lines in an image (Xu, Krzyzak, & Oja, 1992 & 93). As illustrated in Fig.6, not only the winner is learned but also its rival (i.e., the second winner) is repelled a little bit from the sample to reduce a duplicated information allocation. As a sample x comes, m_1 is the winner and m_2 is its rival in Fig.6(a), m_1 is moved towards x by a small step size, while m_2 is repelled away from x by a much smaller step size. Similarly in Fig.6(b), m_3 is the winner and moved towards x, while m_2 is its rival and repelled away from x.

Figure 7:  Rival penalized mechanism makes extra agents driven far away(double click it to see animation).
Enlarge
Figure 7: Rival penalized mechanism makes extra agents driven far away(double click it to see animation).

More precisely, the RPCL differs from FSCL in that eq.(2) is implemented with eq.(1) replaced by

(4)
p_{ j, t}= \begin{cases} 1, & \mbox{if} \, j=c, \\  -\gamma, & \mbox{if}\,  j= r, \\ 0, & \mbox{otherwise}, \end{cases} \begin{cases} c=arg \ min_{ j} \varepsilon_t(\theta_j), \\  r=arg \ min_{ j \ne c } \varepsilon_t(\theta_j),\end{cases}

where \gamma approximately takes a number between 0.05 and 0.1 for controlling the penalizing strength. This is an empirical heuristics, while there still lacks a theoretical guide on how to determine an appropriate \gamma, which will be further discussed at the end of this paper.

As illustrated in Fig.6(c), m_1 and m_3 finally converge to clusters' centers, while m_2 is driven far away from samples. Shown in Fig.7 is an experimental result that improves its FSCL counterpart in Fig.5, with a disturbing agent driven away from samples. Adding this rival penalized mechanism makes extra agents driven far away from data as long as the number of agents is initially given to be larger than the number of clusters to be detected. In other words, RPCL can allocate an appropriate number of learning agents to detect a correct number of clusters or objects automatically during learning. This automatic allocation is a favorable feature that both CCL and FSCL as well as the conventional clustering algorithms (e.g., k-means) do not have. In the past decade, a number of applications of RPCL have been made (Acciani, et al, 2003; Nair, 2003; Nakkrasae & Sophatsathit, 2004).

A general framework for divide-conquer learning

CCL, FSCL, and RPCL can be investigated systemically from the perspective of a general framework for learning based problem solving. That is, multi-agents coordinately divide a complicated task into a number of subtasks and each subtask is mainly conquered by an individual agent such that the entire task can be well solved collectively by these multi-agents (Xu, 2007). In the following, we introduce the four basic ingredients of this general framework.

  • Agent's structure and initialization Each agent needs a hardware S_j that is able to handle typical problem solving tasks. This S_j is usually featured by a pre-designed structure with a set \theta_j of parameters to be determined via learning. The simplest example is that S_j is given by a point \theta_j=m_j. Another example is its probabilistic extension that S_j is given by a Gaussian distribution G(x|m_j,\Sigma_j) with a priori probability \alpha_j (Xu, 1998). Even further, G(x|m_j,\Sigma_j) can be extended to any parametric distribution p(x| \theta_j). At the beginning, it is desired that every \theta_j is initialized such that every agent has an equal chance to enter learning. However, it is usually difficult to do so without an enough piece of a priori knowledge.
  • Individual criterion As an extension of the previous \varepsilon_t(\theta_j)= \Vert x_t- m_j \Vert^2, each agent has an individual value \varepsilon_t(\theta_j) to evaluate its performance per sample x_t observed. Typically, this performance is measured by considering an error
(5)
\! e_t(\theta_j)=x_t- x_t(\theta_j),

where \!x_t(\theta_j) is a reconstruction or a representation of x_t by S_j. One typical instance of such a measure is simply the least square error \varepsilon_t(\theta_j) =\Vert e_t(\theta_j) \Vert^2. For a simple case \!\theta_j=m_j, we have \!x_t(\theta_j)=m_j and \varepsilon_t(\theta_j)=\Vert x_t- m_j \Vert^2. Another type of instances is given in the probabilistic forms (Xu, 1998)

(6)
\varepsilon_t(\theta_j)= -\begin{cases} \ln{[\alpha_j p(x| \theta_j)]},  & (a)\, Type\, I, \\  \ln{ p(x| \theta_j)^{\alpha_j}},  & (b)\, Type\, II, \end{cases}

where \!p(x| \theta_j) includes \! p(e_t|\psi_j) or equivalently \!p(x_t|x_t(\theta_j), \psi_j) as a special case. Type II can be regarded as a further extension of eq.(3) since it returns equivalently to eq.(3) when \!p(x| \theta_j)=G(x|m_j,I). Given an equal priori \alpha_j=1/k, both Type I and Type II degenerate to be equivalent to a negative likelihood \!\varepsilon_t(\theta_j)=-\ln{p(x| \theta_j)}. Other criteria are referred to (Xu, 2007).

  • Global principle Extending eq.(1), agents are coordinated via a global principle that combines individual criteria \{ \varepsilon_t(\theta_j)\}_{j=1}^k. The global principle can be observed in two forms. One is a criterion for evaluating the overall performance jointly by all the agents, e.g., a simple one is the following weighted sum
(7)
\varepsilon(\theta)=\sum_{t=1}^N \varepsilon_t(\theta),\varepsilon_t(\theta) =\sum_{j=1}^k p_{ j, t} \varepsilon_t(\theta_j).

The other form is an allocation scheme on a per sample basis

(8)
p_{ j, t}= ALOC[\{{\varepsilon_t(\theta_j) }\}_{j=1}^k]

according to which x_t is assigned to the j-th agent with a weight p_{ j, t}.

  • Updating rule As a further extension of eq.(2), the activated agents are modified to adapt x_t according to the allocation by p_{ j, t}, e.g.,
(9)
\theta_j^{new}= \theta_j^{old}+ \eta p_{ j, t}\delta \theta_j, \ \delta \theta_j=Reduce(\varepsilon_t(\theta_j^{old})),  \  \theta_j^{old}, \theta_j^{new} \in D(\theta_j),

where \eta >0 is a learning step size, and D(\theta_j) is the domain of \theta_j, within which \theta_j may need to satisfy certain constraints. Moreover, Reduce(\varepsilon_t(\theta_j)) denotes an operator that acts on \theta_j^{old} and x_t to yield a direction along which \varepsilon_t(\theta_j) reduces within a neighborhood region of \theta_j^{old}. When \varepsilon_t(\theta_j) is differentiable, Reduce(\varepsilon_t(\theta_j)) can be given by -\nabla_{  \theta_j}\varepsilon_t(\theta_j) or -{\mathcal P}\nabla_{  \theta_j}\varepsilon_t(\theta_j) on D(\theta_j), where \mathcal P is a positive definite matrix. Different types of Reduce(\varepsilon_t(\theta_j)) and different ways of controlling \eta >0 will jointly lead to updating rules with difference convergence performances.

In this framework, a conquering action is performed via a updating rule by eq.(9), while a dividing action is performed via an allocation scheme by eq.(8). In addition to the problem of parameter learning, i.e., estimating unknown parameters in all the agents, it also needs to select an appropriate k of agents to be allocated, such that not only each agent performs best according to its individual criterion but also a best overall performance is achieved according to its global principle. In the literatures of statistics and machine learning, the task of determining the number k belongs to what is usually called model selection.

In past decades, several typical model selection theories have been proposed for model selection, such as AIC and extensions (Akaike, 1974; Bozdogan & Ramirez, 1988; Cavanaugh, 1997), Bayesian approach related criteria under the names of BIC (Schwarz, 1978), MML (Wallace, 1999), and MDL (Rissanen, 1986), the cross validation based criteria (Stone, 1978; Rivals & Personnaz, 1999). In implementation, model selection has to be made in a two-phase procedure. At the first phase, a number of candidate models are enumerated, and the unknown parameters in each candidate are estimated by the maximum likelihood (ML) principle. At the second phase, a criterion is obtained from one of the above theories and used on every candidate with estimated parameters to get the best candidate. Unfortunately, a two-phase implementation is very expensive to compute. Moreover, these theories can only provide a rough estimation or even a wrong estimate especially on a small size of samples.

Automatic model selection via coordinated mechanisms

As illustrated in Fig.8(a), in addition to allocating agents and conquering each subtask individually via determining all unknown parameters in every agent, an appropriate combination of a global principle and individual criteria actually results in automatic model selection during parameter learning, due to a coordination of two opposite mechanisms, namely one for each agent to participate and the other for each agent to leave.

The WTA allocation by eq.(1) is obtained via minimizing the global criterion by eq.(7) with respect to \!p_{ j, t} under the constraints \sum_j p_{ j, t} =1, \ p_{ j, t} \ge 0. It leads to the previous CCL with a local criterion \varepsilon_t(\theta_j)=\Vert x_t- m_j \Vert^2. As mentioned early, it has the dead unit problem because agents do not have an equal chance to participate. This problem can be solved via a participating mechanism that lets each agent to have an equal initial chance to participate, through either or both of its local criterion and the global criterion.

For the FSCL that bases still on the WTA allocation by eq.(1), its local criterion by eq.(3) provides a participating mechanism that penalizes those frequent winners and compensates those constant losers. Alternatively, we can still use \varepsilon_t(\theta_j)=\Vert x_t- m_j \Vert^2 but minimize the global criterion by eq.(7) under the following constraint

(10)
\sum_j p_{ j, t} =1, \ p_{ j, t} \ge 0, \ and \ p_{ j, t} \propto e^{- {\varepsilon_t(\theta_j)  }}, \ j\in \mathcal{C}_{\kappa},

which leads to the following soft-allocation:

(11)
\begin{array}{l} p_{ j, t}= \begin{cases} \frac{e^{- {\varepsilon_t(\theta_j)  }}} {\sum_{i \in \mathcal{C}_{\kappa} }  e^{-{\varepsilon_t(\theta_i) }  }}, &  \mbox{if }j\in \mathcal{C}_{\kappa}, \\ 0, & otherwise; \end{cases}  \\ \mathcal{C}_{\kappa}= \{ j :  \ \varepsilon_t(\theta_j)  \mbox{ among the } \kappa \mbox{ smallest ones of } \{ \varepsilon_t(\theta_j) \}_{j=1}^k \}, \end{array}

where every agent in \mathcal{C}_{\kappa} gets a chance p_{ j, t}>0 to participate per sample coming.

However, all the above cases are lack of a mechanism for an incapable or extra agent to leave. As a result, they will still fail when the number of agents in consideration is larger than an appropriate one. As previously introduced, RPCL improve FSCL via penalizing a rival by eq.(4), which provides a mechanism to drive an incapable or extra agent away. An appropriate balance between this leaving mechanism and the participating mechanism, as illustrated in Fig.8(a), leads to a favorable RPCL nature of automatic model selection, with the balance controlled by the de-learning rate \gamma in eq.( 4).

Instead of providing a participating mechanism via a local criterion by eq.(3), a balance between the mechanisms for participating and leaving can also be obtained by the following generalized RPCL allocating scheme:

(12)
p_{ j, t}=q_{ j, t}- \gamma \pi_{ j, t}, \ q_{ j, t} \ge 0, \sum_{j}  q_{ j, t} =1 \ and \ \pi_{ j, t} \ge 0, \sum_{j}  \pi_{j, t} =1,

where q_{ j, t} provides a sharing allocation similar to that by eq.(11), while \pi_{ j, t} provides a penalizing scheme that distributes a penalizing strength \gamma over agents. One extreme case is eq.(4) with q_{ j, t} by eq.(1) and \pi_{ j, t}=1 at j=r and \pi_{ j, t}=0 for j\ne r. One another extreme case \pi_{ j, t}=1/k.

Figure 8:   (a) coordinated mechanisms, (b)  rewarding vs penalizing via variances.
Enlarge
Figure 8: (a) coordinated mechanisms, (b) rewarding vs penalizing via variances.

Alternatively, we can get a leaving mechanism via an appropriate local criterion too, e.g., by eq.(6) instead of using a global allocating policy such as either of eq.(4) and eq.(12). Without losing generality, still using the WTA by eq.(1), we consider Type I in eq.(6) with a special Gaussian p(x| \theta_j)=G(x|m_j, \sigma_j^2I). It follows from that \varepsilon_t(\theta_j)=-\ln{\alpha_j }+ 0.5 [{\Vert x_t-m_j\Vert^2 / \sigma_j^{2}}+d\ln{(2\pi \sigma_j^2)}] varies with \sigma_j^{2} as shown in Fig.8(b), where d is the dimension of x. If the j-th agent wins a lot of samples, \sigma_j^{2} will become larger than a threshold, not only \varepsilon_t(\theta_j) will increase but also the gap between different agents reduces. In other words, the competing ability of agents that won too much in past is self-penalized and the competing ability of weaker agents is compensated for a relative boosting, which thus provides a participating mechanism similar to \alpha_j in eq.(3). This similarity can also be observed from the term 0.5  {\Vert x_t-m_j\Vert^2 / \sigma_j^{2}} that \sigma_j^{-2}\propto \alpha_j takes a role similar to \alpha_j in eq.(3).

Also, when the j-th agent won too fewer samples with its \sigma_j^{2} dropping below certain threshold, not only \varepsilon_t(\theta_j) will increase but also the corresponding gap between different agents increases. As a result, an incapable/extra agent is self-penalized to fade out, which also provides a leaving mechanism that is different from the RPCL one by eq.(4). Moreover, as the competing ability of the j-th agent decreases, \alpha_j decreases and the other \alpha_i will relatively increase. That is, the term -\ln{\alpha_j } enhances this self-penalizing featured leaving mechanism. As a result, the local criterion intentionally seeks an appropriate balance between its participating mechanism and leaving mechanism, with an automatic model selection nature.

For Type II in eq.(6), we have \varepsilon_t(\theta_j)=0.5 \alpha_j[{\Vert x_t-m_j\Vert^2 /\sigma_j^{2}} +d\ln{(2\pi \sigma_j^2)}] that varies with \sigma_j^{2} in a way similar to Fig.8(b). Thus, the above discussion applies similarly. The difference is that \alpha_j takes a role similar to that in eq.(3), which enhances a participating mechanism instead of a leaving mechanism.

Categorization on Gaussian mixture

Since the coordinating nature between a participating mechanism and a leaving mechanism is actually featured via a combination of a specific individual criterion \varepsilon_t(\theta_j) and a specific instance of global allocating scheme by eq.(8), here we categorize different combinations to get an overlook on algorithms that fall in the proposed divide-conquer learning framework. Several typical examples on Gaussian mixture are listed in Fig.9.

It follows from the first row with p_{ j, t} by eq.(1) that varying \varepsilon_t(\theta_j) leads us to CCL, FSCL, and an extension from CCL to a special case of Bayesian Ying Yang Learning or shortly BYY-CL. Then, we go to the third row with p_{ j, t} by eq.(11) at \kappa =k. The block (3,1) is a typical example of Type I in eq.(6), with p_{ j, t} by eq.( 11) becoming the Bayes posterior allocation:

(13)
p_{ j, t}= { \alpha_jp(x| \theta_j) / p(x|\Theta)}, \  p(x|\Theta)=\sum_{i=1}^k \alpha_ip(x| \theta_i).
Figure 9:    Categorization: allocating policy versus individual criteria.
Enlarge
Figure 9: Categorization: allocating policy versus individual criteria.

From \nabla_{\theta_j} \ln{p(x|\Theta)}= - p_{ j, t} \nabla_{\theta_j} p(x| \theta_j), we observe that the updating by eq.(9) is equivalent to making an ascent updating of the likelihood on p(x|\Theta). That is, the block (3,3) leads to an adaptive version of the well known EM algorithm for the maximum likelihood (ML) learning on Gaussian mixture with p(x| \theta_j)=G(x|m_j, \Sigma_j) (Redner & Walker, 1984; Xu & Jordan, 1996). Solving the previously mentioned dead unit problem due to p_{ j, t} by eq.(1), p_{ j, t} by eq.(13) enhances the participating mechanism of each agent, which is along a direction similar to that of FSCL. Still, k needs to be pre-specified since it is well known that the ML learning is not good at model selection, especially on a small size of samples. Actually, the participating mechanism of each agent may be over-enhanced such that the leaving mechanism discussed in Fig.8(b) fails to balance. Moreover, the block (3,1) is equivalent to its degenerated case at \Sigma_j= \sigma^2I and \alpha_j=1/k, where the leaving mechanism in Fig.8(b) is switched off completely. Alternatively, the block (3,2) is equivalent to a degenerated case of Type II in eq.(6) at \Sigma_j=I, which modifies FSCL with a further enhancement on each agent's participating mechanism, and thus may result in a further imbalance over leaving mechanism.

Lessening the chance of introducing an excess enhancement on each agent's participating mechanism, the second row enhances the participating mechanism by eq.(11) on a subset of agents per sample comes. Being \!\kappa maximum posteriors, the sharing allocation by eq.(11) with \!\kappa < k is thus referred as \kappa-map sharing shortly. As a result, the second row behaves between those of the first row and the third row. For the last two blocks or particularly the block (2,3), with an appropriate \kappa value the leaving mechanism in Fig.8(b) and the participating mechanism via p_{ j, t} by eq.(11) may reach a good balance with an automatic selection on k.

On the other hand, the 4th row also introduces a leaving mechanism via penalizing the rival via p_{ j, t} by eq.(4). The block (4,2) is exactly the previously discussed RPCL, while the block (4, 1) switches off the frequent sensitive mechanism and the block (4, 3) is added with the combined mechanisms in Fig.8(b). Moreover, the 4th row is further extended into the 5th row via the generalized RPCL scheme by eq.(12).

Finally, it is noted that not only the block (1,3) and block (2,3) but also the RPCL one by eq.(4) and its extension by eq.(12) (i.e., the block (3,3) and block (4,3)) can be obtained from the Bayesian Ying Yang Learning. Moreover, the criterion by eq.(7) can also be derived from one of its special cases (Xu, 2006).

Mixture of experts, RBF networks, and object detection

RPCL is applicable to tasks that are featured by learning multiple structures underlying samples or detecting multiple objects among observations. Typical examples include not only a finite mixture (particularly Gaussian mixture) based density estimation and clustering analysis but also those mixture-of-experts or RBF nets based supervised learning, as well as learning based detection of multiple objects on an image, which are briefly introduced as follows.

  • Mixture-of-experts (ME). It targets at learning an input-output map y=f(x|\theta_j^f) or its probabilistic version p(y |x, \theta_j) from a set of pairs \{x_t,y_t\}_{t=1}^N, which is usually in a form p(y |x,\theta_j)=G(y|f(x|\theta_j^f), \rho^{2}_jI) that represents a regression function f(x|\theta_j^f) under a Gaussian noise with a variance \rho^{2}_j. A gating net p(j|x, \phi) will coordinate experts to perform function approximation type tasks (e.g., prediction, control, ..., etc) via
(14)
\begin{matrix} &&  p(y |x, \theta)=\sum_{j=1}^k p(j|x, \phi)G(y|f(x|\theta_j^f), \rho^{2}_jI), \\ \\  &\mbox{or}& E[y|x]=\sum_{j=1} p(j|x, \phi) f(x|\theta_j^f), \end{matrix}

which has been widely studied and applied. The original ME model (Jacobs, et al, 1991; Jordan & Xu, 1995) is featured by the maximum likelihood (ML) learning on p(y |x, \theta), but its gating model p(j|x, \phi) makes the ML learning difficult to be implemented by the EM algorithm. An alternative ME model (Xu, Jordan, & Hinton, 1995) has been proposed with the ML learning easily implemented by an EM algorithm, in help of the following gating model:

(15)
p(j|x, \phi)={ \alpha_j p(x|\phi_j)/ \sum_{i=1}^k \alpha_i p(x|\phi_i)}.

Also, RPCL can be applied to this alternative ME further with the number of experts determined automatically during learning, simply in help of implementing eq.(9) with the following measure:

(16)
\varepsilon_t(\theta_j)= -\begin{cases} \alpha_j \ln{[p(x|\phi_j)G(y|f(x|\theta_j^f), \rho^{2}_jI)]}, & \mbox{Type I}, \\ \ln{[\alpha_j p(x|\phi_j)G(y|f(x|\theta_j^f), \rho^{2}_jI)]}, & \mbox{Type II}.\end{cases}
  • RBF nets. In particular, when p(x|\phi_j)=G(x|\mu_j,\Sigma_j), the alternative ME actually becomes an extension of a typical family of RBF nets, which is simplified into an extended normalized RBF net at the following special case
(17)
f(x|\phi_j)=w_j^T x +c_j, \ \alpha_j={|\Sigma_j|/\sum_{i=1}^k |\Sigma_i|},

which further degenerates into a usual normalized RBF net (Moody, & Darken, 1989), simply by setting w_j=0. In other words, an adaptive EM algorithm for the ML learning can be directly applied to replace the conventional but suboptimal two stage approach for RBF net learning. Moreover, with the constraint \alpha_j={|\Sigma_j|}/{\sum_{i=1}^k |\Sigma_i|} removed, we can implement RPCL by eq.(9) and eq.(14), further with the number of basis functions determined automatically during learning.

  • Object detection. This is a classic task in the field of pattern recognition and computer vision. Shown in Fig.6 is a comparative experiment on detecting eleven elliptic blood cells on a real image. After edge detection, the classical Hough transform (HT) (Hough, 1962), the Randomized Hough transform (RHT) (Xu, Oja, & Kultanen, 1991; Xu & Oja, 1993), and RPCL based object detection are used on the resulted edge image to detect ellipses. It can be seen that the RPCL approach outperforms RHT significantly though the newly developed RHT has significantly outperformed HT already (Liu, Qiao, & Xu, 2006). The approaches have also been extended to detect objects in a shape without a parametric expression but via a template in a pre-designed specific shape. The details for implementing RPCL based object detection are referred to (Xu, 2007).


Figure 10:   A comparative experiment on a real image.
Enlarge
Figure 10: A comparative experiment on a real image.

Open problems

The RPCL learning was heuristically proposed on a bottom level (i.e., a level of learning dynamics or updating mechanisms), while Bayesian Ying-Yang learning is proposed and developed on a global level of guiding parameter learning and model selection by a learning principle or theory in a top-down manner. Both approaches are featured by automatic model selection ability during parameter learning. Recently, certain links between the two approaches have been built up (Xu, 2001 &02; Ma, Wang, & Xu, 2004), with convergence behavior of RPCL is interpreted qualitatively in a top-down way by the BYY harmony learning. However, challenges still remain at least in two aspects.

One is on theoretical analysis of RPCL itself. A classic competing process is already a highly nonlinear process. How to balance the participating mechanism and leaving mechanism such that an appropriate k can be selected is a key challenge, with two open problems as follows:

  • Though many empirical studies have been made on RPCL, little progress has been achieved on its mathematical convergence analysis. Letting the learning step size \eta in eq.(9) to gradually reduce to zero in some way, e.g., according to (Robbins & Monro, 1951), the learning process will eventually freeze. However, there lacks theoretical analysis on its correct selection ability (i.e., whether a correct number of agents can be allocated) and the convergence (i.e., whether each agent will converge to a good representation of an individual substructure or object), as well as the convergence rate, ..., etc.
  • Actually, the above problems relate to closely the problem of how to control the penalizing strength \gamma in eq.(4). It has no penalizing effect if \gamma is too small, while the learning will oscillate or diverge if \gamma is too larger. Currently, \gamma is selected heuristically by experiments. It may be related to the dimension of data space, the number of substructures, the distributions of data samples, etc, but little theoretical results are available yet.

The other is on choices listed in Fig.9. Two interesting topics are as follows:

  • Further comparative studies on features of eac item, especially RPCL, BYY-RPCL, BYY-CL, as well as those in the second row.
  • How the \kappa affect the performances ? how to select an appropriate value for \kappa ?

Moreover, it also deserves to make a further comparative study on Type I and Type I learning in eqn.(6).

External Links

Author's website (http://www.cse.cuhk.edu.hk/~lxu/)


References

  • Acciani, G., et al (2003), "A feature extraction unsupervised neural network for an environmental data set", Neural Networks 16: 427-436.
  • Ahalt, SC., et al (1990), "Competitive learning algorithms for vector quantization", Neural Networks 277-291.
  • Akaike, H (1974), "A new look at the statistical model identification", "IEEE Tr. Automatic Control" 19: 714-723.
  • Bozdogan, H & Ramirez, DE (1988), "FACAIC: Model selection algorithm for the orthogonal factor model using AIC and FACAIC", "Psychometrika" { 53} (3): 407-415.
  • Cavanaugh, JE (1997), "Unifying the derivations for the Akaike and corrected Akaike information criteria", "Statistics & Probability Letters ", {33}: 201-208.
  • Desieno, D (1988), "Adding a conscience to competitive learning", Proc. of IEEE int. conf. on Neural Networks, Vol.I, pp.117-124.
  • Grossberg, S (1987), "Competitive learning: from iterative activation to adaptive resonance", Cognitive Science, 11: 23-63.
  • Hecht-Nielsen, R (1987), "Counterpropagation networks", Applied Optics 26: 4979-4984.
  • Hough, PVC "Method and means for recognizing complex patterns", U.S. Patent 3069654, Dec.18, 1962.
  • Jacobs, RA., et al (1991), "Adaptive mixtures of local experts", Neural Computation 3: 79-87.
  • Jordan, MI & Xu, L (1995), "Convergence results for the EM approach to mixtures of experts", Neural Networks 8: 1409-1431.
  • Kohonen, T (1982), "Self-organized formation of topologically correct feature maps", Biological Cybernetics 43: 59-69.
  • Liu, ZY, Qiao, H, & Xu, L (2006), "Multisets Mixture learning based Ellipse Detection", Pattern Recognition 39: 731-735.
  • Ma, J, Wang, T, & Xu, L (2004), "A gradient BYY harmony learning rule on Gaussian mixture with automated model selection", Neurocomputing 56: 481-487.
  • Moody, J & Darken, J (1989) "Fast learning in networks of locally-tuned processing units", Neural Computation 1: 281-294.
  • Nair, TM., et al, (2003) "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data", Computational Biology and Chemistry 27: 565-574.
  • Nakkrasae, S, & Sophatsathit, P (2004), "An rpcl-based indexing approach for software component classification" International J. of Software Engineering and Knowledge Engineering 14: 497-518.
  • Redner, RA & Walker, HF (1984), "Mixture densities, maximum likelihood, and the EM algorithm", SIAM Review 26: 195-239.
  • Rissanen, J (1986), "Stochastic complexity and modeling"," Annals of Statistics", 14 (3): 1080-1100.
  • Rivals, I & Personnaz, L (1999) "On Cross Validation for Model Selection", " Neural Computation" 11: 863-870.
  • Robbins, H., & Monro, S., (1951), "A stochastic approximation method", Ann. Math. Statist., 22, 400-407.
  • Rumelhart, DE & Zipser, D (1985), "Feature discovery by competitive learning", Cognitive Science 9: 75-112.
  • Schwarz, G (1978), "Estimating the dimension of a model", "Annals of Statistics" 6: 461-464.
  • Stone, M (1978), "Cross-validation: A review", "Math. Operat. Statist. " 9: 127-140.
  • Xu, L (2007), "A Unified Perspective and New Results on RHT Computing, Mixture Based Learning, and Multi-learner Based Problem Solving", (in press), " Pattern Recognition", 40: 2129-2153.
  • Xu, L (2007), "A Trend on Regularization and Model Selection in Statistical Learning: A Perspective from Bayesian Ying Yang Learning", an invited chapter in "Studies in Computational Intelligence 63: Challenges to Computational Intelligence", Duch, W. and Mandziuk, J., 365-405.
  • Xu, L (2002), "BYY Harmony Learning, Structural RPCL, and Topological Self-Organizing on Unsupervised and Supervised Mixture Models", Neural Networks 15: 1125-1151.
  • Xu, L (2001), "Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, ME-RBF Models and Three-Layer Nets", International J. of Neural Systems 11: 3-69.
  • Xu, L (1998), "Rival Penalized Competitive Learning, Finite Mixture, and Multisets Clustering", Proc. of IJCNN98, Anchorage, Vol.II, pp.2525-2530.
  • Xu, L & Jordan, MI (1996), "On convergence properties of the EM algorithm for Gaussian mixtures", Neural Computation 8(1): 129-151.
  • Xu, L, Jordan, MI, & Hinton, GE (1995), "An Alternative Model for Mixtures of Experts", in Cowan, Tesauro, and Alspector, eds., Advances in Neural Information Processing Systems 7, MIT Press, 633-640.
  • Xu, L, Krzyzak, A, & Oja, E (1993), "Rival Penalized Competitive Learning for Clustering Analysis, RBF net and Curve Detection", IEEE Tr. on Neural Networks 4: 636-649.
  • Xu, L, & Oja, E (1993), "Randomized Hough Transform (RHT): Basic Mechanisms, Algorithms and Complexities", Computer Vision, Graphics, and Image Processing: Image Understanding 57: 131-154.
  • Xu, L, Krzyzak, A & Oja, E (1992), "Unsupervised and Supervised Classifications by Rival Penalized Competitive Learning", Proc. of 11th Intl. Conf. on Pattern Recognition, Hauge, Netherlands, Aug.30-Sept.3, 1992, Vol.I, pp.672-675.
  • Xu, L, Oja, E & Kultanen, P (1990), "A New Curve Detection Method: Randomized Hough transform (RHT)", Pattern Recognition Letters 11: 331-338.

Internal references

See also


Lei Xu (2007) Rival penalized competitive learning. Scholarpedia, 2(8):1810, (go to the first approved version)
Created: 31 July 2006, reviewed: 17 August 2007, accepted: 31 August 2007
Action editor: Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia
For authors