Restricted Boltzmann machine: Recent advances and mean-field theory
Aurélien Decelle1,2, Cyril Furtlehner2
1Departamento de Física Teórica I, Universidad Complutense, 28040 Madrid, Spain; 2TAU team, INRIA Saclay & LISN, Université Paris Saclay, Orsay 91405, France
This review deals with the restricted Boltzmann machine (RBM) in the light of statistical physics. The RBM is a classical family of machine learning (ML) models which played a central role in the development of deep learning. Viewing it as a spin glass model and exhibiting its various links with other models of statistical physics, we gather recent results dealing with mean-field theory in this context. First, the functioning of the RBM can be analyzed via the phase diagrams obtained for various statistical ensembles of RBM, leading in particular to the identification of a compositional phase, where a small number of features or modes are combined to form complex patterns. Then we discuss recent works that either devise mean-field based learning algorithms, or reproduce generic aspects of the learning process from ensemble dynamics equations and/or linear stability arguments.
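For concreteness, the RBM discussed throughout is a bipartite spin-glass measure over visible units v (carrying the data) and hidden units h; the following is a textbook formulation, and the review's own sign and unit conventions may differ slightly:

E(\mathbf{v},\mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j,
\qquad
p(\mathbf{v},\mathbf{h}) = \frac{\mathrm{e}^{-E(\mathbf{v},\mathbf{h})}}{Z},

where a and b are the local fields (biases), w is the coupling (weight) matrix, and Z is the partition function.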
Received: 30 September 2020
Accepted manuscript online:
Fund: AD was supported by the Comunidad de Madrid and the Complutense University of Madrid (Spain) through the Atracción de Talento program (Ref. 2019-T1/TIC-13298).
Cite this article:
Aurélien Decelle, Cyril Furtlehner. Restricted Boltzmann machine: Recent advances and mean-field theory. Chin. Phys. B, 2021, 30: 040202.
Fig. 1.
Bipartite structure of the RBM.
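The bipartite structure of Fig. 1 is what makes the RBM tractable: with no intra-layer couplings, each layer is conditionally independent given the other, so a full layer can be resampled in one vectorized step. A minimal block Gibbs sampling sketch, assuming binary {0,1} units and our own variable names (w of shape (Nv, Nh), biases a and b):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, w, a, b, rng):
    # Hidden units are conditionally independent given the visible layer.
    p_h = sigmoid(b + v @ w)                        # shape (Nh,)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Visible units are conditionally independent given the hidden layer.
    p_v = sigmoid(a + w @ h)                        # shape (Nv,)
    v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h

rng = np.random.default_rng(0)
Nv, Nh = 100, 50
w = rng.normal(0.0, 0.01, (Nv, Nh))
a, b = np.zeros(Nv), np.zeros(Nh)
v = rng.integers(0, 2, Nv).astype(float)
for _ in range(100):                                # alternate the two half-steps
    v, h = gibbs_step(v, w, a, b, rng)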
Fig. 2.
Angle between the reference basis given by the data and the moving one given by the RBM, shown in the top left panel. The equivalence with the motion of a pendulum is indicated in the bottom left panel. Solution of Eqs. (18)–(20) for two coupled modes in the linear RBM (right panel).
Fig. 3.
From Ref. [63]. Overlap with different patterns when varying the dilution factor p (named d in the figure) at low temperature. Left: a case with 3 patterns, where at small dilution only one pattern is fully retrieved, while the second and third ones appear at larger dilution. Right: a case with 6 patterns, zoomed in on the high-dilution region where the branching phenomenon occurs and all the overlaps converge toward the same value.
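The quantity plotted is the usual pattern overlap of the Hopfield literature, m^mu = (1/N) sum_i xi_i^mu s_i; a minimal sketch with ±1 patterns (all names and sizes here are illustrative):

import numpy as np

def overlaps(s, xi):
    # Overlap of a spin configuration s in {-1,+1}^N with each
    # stored pattern xi (shape (P, N)): m_mu = (1/N) sum_i xi^mu_i s_i.
    return xi @ s / s.size

rng = np.random.default_rng(1)
N, P = 1000, 3
xi = rng.choice([-1, 1], size=(P, N))   # three random patterns
s = xi[0].copy()                        # configuration close to pattern 0
s[rng.random(N) < 0.1] *= -1            # flip 10% of the spins
print(overlaps(s, xi))                  # ~0.8 for pattern 0, ~0 for the others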
Fig. 4.
Left: the phase diagram of the model. The y-axis corresponds to the variance of the noise matrix, the x-axis to the value of the strongest mode of w. The ferromagnetic phase is characterized by strong mode eigenvalues; in this phase, the system can either recall one eigenmode of w or compose many modes together (compositional phase). For the sake of completeness, we indicate the AT region where the replica-symmetric solution is unstable, although for practical purposes we are not interested in this phase. Right: an example of a learning trajectory on the MNIST dataset (in red) and on a synthetic dataset (in blue). Starting from the paramagnetic phase, the learning dynamics brings the system into the ferromagnetic phase by learning a few strong modes.
Fig. 5.
On this artificial dataset, we observe that the eigenvalues are learned one after another, each reaching the threshold indicated by Eq. (15). In the inset, the alignment of the first four principal directions of the matrix u_α of the SVD of w with those of the dataset. In red, we observe that the likelihood function increases each time a new mode emerges.
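The alignment shown in the inset can be measured as the cosine between matching singular directions of the weight matrix and principal directions of the data; a minimal sketch, assuming w of shape (Nv, Nh) and data rows of dimension Nv (function and variable names are ours):

import numpy as np

def alignment(w, data, k=4):
    # |u_alpha . e_alpha| between the first k left singular vectors of w
    # and the first k principal directions of the centered dataset.
    u = np.linalg.svd(w, full_matrices=False)[0]    # columns: u_alpha
    x = data - data.mean(axis=0)
    e = np.linalg.svd(x, full_matrices=False)[2]    # rows: principal directions
    return np.abs(np.sum(u[:, :k].T * e[:k], axis=1))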
Fig. 6.
Left: the learning curves of the modes w_α for an RBM with (Nv, Nh) = (100, 100) trained on a synthetic dataset distributed in the neighborhood of a 20d ellipsoid embedded in a 100d space. Here the modes interact: the weaker modes push the stronger ones higher, and they all accumulate at the top of the spectrum, as explained in Subsection 3.2. Right: a scatter plot of the training data (blue) and of data sampled from the learned RBM (red), projected on the first two SVD modes, for a problem of dimension Nv = 50 with two condensed modes. The learned matrix u captures the relevant directions, and the RBM generates data very similar to those of the training set.
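The projection used for the right panel is simply onto the first two left singular vectors of w; a one-function sketch under the same shape assumptions as above:

import numpy as np

def project_on_svd_modes(w, x, k=2):
    # Coordinates of data rows x (shape (n, Nv)) on the first k
    # left singular vectors of the weight matrix w (shape (Nv, Nh)).
    u = np.linalg.svd(w, full_matrices=False)[0]
    return x @ u[:, :k]    # (n, k) scatter-plot coordinates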
Fig. 7.
Left: figure from Ref. [69], the value of w_i at each visible site of an RBM with 3 hidden nodes trained on data from the 1D homogeneous Ising model with periodic boundary conditions. We see three similarly peak-shaped potentials whose magnitudes decrease slightly but remain of the same order. Each peak reproduces the correlation pattern around one central node, and therefore cannot by itself reproduce the translational symmetry of the problem. Right: figure from Ref. [69], the positions of the three peaks as a function of the number of training epochs. The peaks diffuse while repelling each other: the diffusion reproduces the correlation patterns of the translationally invariant problem on average, while the repulsive interaction ensures that two peaks never overlap.
Fig. 8.
A subset of the MNIST dataset.
Fig. 9.
Left: the first 10 modes of the MNIST dataset (top) and of the RBM (bottom) at the beginning of the learning. The similarity between most of them is clearly visible. Right: 100 random features of the RBM at the same moment of the learning. Comparing with the top left panel, most features correspond to a mode of the dataset.
Fig. 10.
The columns represent, respectively, (i) the first hundred learned features, (ii) the histogram of distances between the binarized features W_{±1} = sign(W), and (iii) 100 samples generated from the learned RBM. The first row corresponds to the beginning of the learning, when only one feature has been learned: most of the features have a high mutual overlap, and the MC samples are all similar to the learned feature. In the second row, the RBM has learned many features, so the histogram is wider but still centered at zero; the MC sampling, however, is only capable of reproducing one of the learned features. In the last row, the learning is much more advanced: the features tend to be very localized, and the samples now correspond to digits.
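The histogram in column (ii) can be built from the pairwise overlaps of the sign-binarized features; a minimal sketch, assuming the features-as-columns convention:

import numpy as np

def feature_overlaps(w):
    # Pairwise overlaps q_jk = (1/Nv) sum_i sign(w_ij) sign(w_ik)
    # between binarized features, as histogrammed in Fig. 10.
    s = np.sign(w)                            # shape (Nv, Nh)
    q = s.T @ s / w.shape[0]                  # (Nh, Nh) overlap matrix
    return q[np.triu_indices_from(q, k=1)]    # distinct pairs only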
Fig. 11.
(a) Singular value distribution of the initial random matrix, compared to the Marchenko–Pastur law. (b) As the training proceeds, we observe singular values passing above the threshold set by the Marchenko–Pastur law. (c) Distribution of the singular values after a long training: the Marchenko–Pastur distribution has disappeared, replaced by a fat-tailed distribution spreading mainly above threshold, together with a peak of below-threshold singular values near zero. This distribution does not come close to any standard random-matrix ensemble spectrum.
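The threshold in question is the upper edge of the Marchenko–Pastur support: for an Nv × Nh matrix with i.i.d. entries of variance sigma^2/Nv, the singular values concentrate below sigma (1 + sqrt(Nh/Nv)). An illustrative check (the sizes and sigma here are our own choices, not the paper's):

import numpy as np

rng = np.random.default_rng(2)
Nv, Nh, sigma = 1000, 500, 1.0
w = rng.normal(0.0, sigma / np.sqrt(Nv), (Nv, Nh))   # i.i.d. random init
s = np.linalg.svd(w, compute_uv=False)
edge = sigma * (1.0 + np.sqrt(Nh / Nv))              # upper MP edge
print(s.max(), edge)   # s.max() sits near the edge; trained modes escape above it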
Fig. 12.
Top: figure taken from Ref. [73], samples from the permanent chain at the end of the training of the RBM. The first two rows correspond to samples generated using PCD, the next two to samples obtained using the P-nMF approximation, and the last two to samples obtained using P-TAP. Bottom: 100 features obtained after the training; they are qualitatively very similar to the ones obtained when training the RBM with P-TAP.
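At the core of such mean-field training schemes is a self-consistency iteration for the layer magnetizations; a naive mean-field (nMF) sketch with binary {0,1} units and our own names (the P-TAP scheme of Ref. [73] additionally includes an Onsager reaction term, omitted here):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nmf_fixed_point(w, a, b, m_v, n_iter=200, damping=0.5):
    # Iterate m_h = sigmoid(b + m_v.w), m_v = sigmoid(a + w.m_h)
    # toward a fixed point; the damped update stabilizes convergence.
    for _ in range(n_iter):
        m_h = sigmoid(b + m_v @ w)
        m_v = damping * m_v + (1.0 - damping) * sigmoid(a + w @ m_h)
    return m_v, m_h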
Fig. 13.
Top panel: results for an RBM of size (Nv, Nh) = (1000, 500) trained on a synthetic dataset of 10^4 samples with 20 clusters randomly located in a sub-manifold of dimension d = 15. The learning curves of the eigenmodes w_α (left) and the associated likelihood function (right, in red), together with the number of fixed points obtained at each epoch. Before the first eigenvalue is learned there is a single fixed point; then, as modes are learned, the number of fixed points increases. Bottom panel: results for an RBM of size (Nv, Nh) = (100, 50) trained on a synthetic dataset of 10^4 samples with 11 clusters randomly located in a sub-manifold of dimension d = 5. On the left, the scatter plot of the training data together with the positions of the fixed points, projected on the first two directions of the SVD of w. On the right, the projection along the third and fourth directions. The results are shown after learning 5 modes, where 16 fixed points are found (in fact, more than the number of clusters).
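Fixed-point counts such as those of Fig. 13 can be estimated by running a mean-field iteration like the nmf_fixed_point sketch above from many random initial magnetizations and de-duplicating the end points; a sketch where the tolerance and number of restarts are arbitrary choices of ours:

import numpy as np

def count_fixed_points(w, a, b, n_starts=100, tol=1e-3, seed=0):
    # Run the mean-field iteration from random starts and count
    # the distinct visible magnetization profiles reached.
    rng = np.random.default_rng(seed)
    found = []
    for _ in range(n_starts):
        m_v, _ = nmf_fixed_point(w, a, b, rng.random(w.shape[0]))
        if not any(np.max(np.abs(m_v - f)) < tol for f in found):
            found.append(m_v)
    return len(found)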