Chin. Phys. B, 2021, Vol. 30(4): 040202    DOI: 10.1088/1674-1056/abd160
Special Issue: SPECIAL TOPIC — Machine learning in statistical physics
GENERAL

Restricted Boltzmann machine: Recent advances and mean-field theory

Aurélien Decelle1,2, Cyril Furtlehner2
1Departamento de Física Teórica I, Universidad Complutense, 28040 Madrid, Spain
2TAU team, INRIA Saclay & LISN, Université Paris Saclay, Orsay 91405, France
Abstract  

This review deals with the restricted Boltzmann machine (RBM) in the light of statistical physics. The RBM is a classical family of machine learning (ML) models which played a central role in the development of deep learning. Viewing it as a spin glass model and exhibiting various links with other models of statistical physics, we gather recent results dealing with mean-field theory in this context. First, the functioning of the RBM can be analyzed via the phase diagrams obtained for various statistical ensembles of RBM, leading in particular to the identification of a compositional phase where a small number of features or modes are combined to form complex patterns. We then discuss recent works that either devise mean-field based learning algorithms, or reproduce generic aspects of the learning process from ensemble dynamics equations and/or linear stability arguments.

Keywords:  restricted Boltzmann machine (RBM)      machine learning      statistical physics  
Received:  30 September 2020      Accepted manuscript online: 
Fund: *AD was supported by the Comunidad de Madrid and the Complutense University of Madrid (Spain) through the Atracción de Talento program (Ref. 2019-T1/TIC-13298).

Cite this article: 

Aurélien Decelle, Cyril Furtlehner. Restricted Boltzmann machine: Recent advances and mean-field theory. Chin. Phys. B, 2021, 30: 040202.

Fig. 1.  

Bipartite structure of the RBM.
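To make the bipartite structure concrete: there are no couplings within a layer, so the conditional distributions of each layer factorize over units. Below is a purely illustrative sketch (not the authors' code; layer sizes and initialization are placeholders) of a binary {0,1} RBM with energy E(v,h) = -a·v - b·h - v·W h, sampled by alternating the two conditionals.

    import numpy as np

    rng = np.random.default_rng(0)
    Nv, Nh = 100, 50                              # placeholder layer sizes
    W = 0.01 * rng.standard_normal((Nv, Nh))      # placeholder weight matrix
    a, b = np.zeros(Nv), np.zeros(Nh)             # visible and hidden biases

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_step(v):
        """One block-Gibbs sweep: p(h|v) and p(v|h) factorize over units."""
        p_h = sigmoid(b + v @ W)                  # hidden units given the visible layer
        h = (rng.random(Nh) < p_h).astype(float)
        p_v = sigmoid(a + W @ h)                  # visible units given the hidden layer
        v = (rng.random(Nv) < p_v).astype(float)
        return v, h

    v = (rng.random(Nv) < 0.5).astype(float)      # random initial visible configuration
    for _ in range(100):
        v, h = gibbs_step(v)

Alternating these two block updates is the Gibbs sampler used, in one form or another, by CD/PCD-style training schemes discussed in the review.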

Fig. 2.  

Angle between the reference basis given by the data and the moving one given by the RBM, shown on the upper left panel. Equivalence with the motion of a pendulum is indicated on the bottom left panel. Solution of Eqs. (18)-(20) for two coupled modes in the linear RBM (right panel).

Fig. 3.  

From Ref. [63]. Overlap with different patterns when varying the dilution factor p (denoted d in the figure) at low temperature. Left: a case with 3 patterns, where we observe that at small dilution only one pattern is fully retrieved, while the second and third ones appear at larger dilution. Right: a case with 6 patterns, zoomed into the high-dilution region where the branching phenomenon occurs and all the overlaps converge toward the same value.

Fig. 4.  

Left: the phase diagram of the model. The y-axis corresponds to the variance of the noise matrix, the x-axis to the value of the strongest mode of w. We see that the ferromagnetic phase is characterized by strong mode eigenvalues. In this phase, the system can behave either by recalling one eigenmode of w or by composing many modes together (compositional phase). For the sake of completeness, we indicate the AT region where the replica symmetric solution is unstable, but for practical purposes we are not interested in this phase. Right: an example of a learning trajectory on the MNIST dataset (in red) and on a synthetic dataset (in blue). It shows that, starting from the paramagnetic phase, the learning dynamics brings the system toward the ferromagnetic phase by learning a few strong modes.

Fig. 5.  

On this artificial dataset, we observe that eigenvalues satisfying ⟨sα⟩² > σv² are learned and reach the threshold indicated by Eq. (15). The inset shows the alignment of the first four principal directions of the matrix uα of the SVD of w with those of the dataset. In red, we observe that the likelihood function increases each time a new mode emerges.

Fig. 6.  

Left: the learning curves for the modes wα using an RBM with (Nv,Nh) = (100,100) learned on a synthetic dataset distributed in the neighborhood of a 20-dimensional ellipsoid embedded in a 100-dimensional space. Here the modes interact: the weaker modes push the stronger ones higher, and they all accumulate at the top of the spectrum, as explained in Subsection 3.2. Right: a scatter plot, projected on the first two SVD modes, of the training data (blue) and of data sampled from the learned RBM (red) for a problem in dimension Nv = 50 with two condensed modes. We can see that the learned matrix u captures the relevant directions and that the RBM generates data closely matching the training set.
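As a rough illustration of the diagnostic in the right panel (a sketch only; w, X_train, and X_gen below are random stand-ins for the learned weights, the training data, and the RBM samples), one can project both datasets onto the top left singular vectors of w:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    Nv = 50
    w = rng.standard_normal((Nv, 20))             # stand-in for the learned weight matrix
    X_train = rng.standard_normal((1000, Nv))     # stand-in for the training samples (rows)
    X_gen = rng.standard_normal((1000, Nv))       # stand-in for RBM-generated samples

    u, s, vT = np.linalg.svd(w, full_matrices=False)

    # Project centered data onto the first two left singular vectors of w,
    # as in the scatter plot of the right panel.
    proj_train = (X_train - X_train.mean(0)) @ u[:, :2]
    proj_gen = (X_gen - X_gen.mean(0)) @ u[:, :2]

    plt.scatter(proj_train[:, 0], proj_train[:, 1], s=5, c="blue", label="training data")
    plt.scatter(proj_gen[:, 0], proj_gen[:, 1], s=5, c="red", label="RBM samples")
    plt.xlabel("SVD mode 1")
    plt.ylabel("SVD mode 2")
    plt.legend()
    plt.show()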

Fig. 7.  

Left: figure from Ref. [69], the value of wi for each visible site of an RBM with 3 hidden nodes trained on the 1D homogeneous Ising model with periodic boundary conditions. We see three similar peak-shaped potentials, with decreasing magnitudes that remain of the same order. Each peak attempts to reproduce the correlation pattern around a central node, and therefore cannot by itself reproduce the translational symmetry of the problem. Right: figure from Ref. [69], the position of the three peaks as a function of the number of training epochs. We observe that the peaks diffuse while repelling each other. The diffusion aims at reproducing the correlation patterns of the translational symmetry, while the repelling interaction ensures that two peaks do not overlap.

Fig. 8.  

A subset of the MNIST dataset.

Fig. 9.  

Left: the first 10 modes of the MNIST dataset (top) and of the RBM (bottom) at the beginning of the learning. The similarity between most of them is clearly visible. Right: 100 random features of the RBM at the same stage of the learning. We can see that most features correspond to a mode of the dataset when compared with the top left panel.

Fig. 10.  

The columns represent, respectively, (i) the first hundred learned features, (ii) the histogram of distances between the binarized features W±1 = sign(W), and (iii) 100 samples generated from the learned RBM. The first row corresponds to the beginning of the learning, when only one feature is learned. Looking at the histogram, we see that most of the features have a high overlap. Also, the MC samples are all similar to the learned features. On the second row, the RBM has learned many features, and therefore the histogram is wider but still centered at zero. The MC sampling, however, is only capable of reproducing one of the learned features. On the last row, the learning is much more advanced. The features tend to be very localized and the samples now correspond to digits.
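A hedged sketch of the middle-column diagnostic follows (the caption speaks of distances; the closely related normalized overlap between pairs of binarized features is used here, with column j of W taken as feature j and a random W as a stand-in for the learned weights):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    Nv, Nh = 784, 500
    W = rng.standard_normal((Nv, Nh))             # stand-in for the learned weight matrix

    Wb = np.sign(W)                               # binarized features, entries in {-1, +1}
    q = (Wb.T @ Wb) / Nv                          # normalized overlap between every pair of features
    q_pairs = q[np.triu_indices(Nh, k=1)]         # keep each distinct pair once

    plt.hist(q_pairs, bins=50)
    plt.xlabel("overlap between binarized features")
    plt.ylabel("count")
    plt.show()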

Fig. 11.  

(a) Singular value distribution of the initial random matrix compared to the Marchenko–Pastur law. (b) As the training proceeds, we observe singular values passing above the threshold set by the Marchenko–Pastur law. (c) Distribution of the singular values after a long training: the Marchenko–Pastur distribution has disappeared and has been replaced by a fat-tailed distribution mainly spreading above the threshold, together with a peak of below-threshold singular values near zero. The distribution does not get close to the spectrum of any standard random matrix ensemble.
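For reference, a small sketch of the panel (a) comparison, assuming an initial weight matrix with i.i.d. Gaussian entries of variance σ²/Nv (one common initialization convention, not necessarily the one used in the review): the singular values of such a matrix fall, asymptotically, between the Marchenko–Pastur edges σ(1 ± sqrt(Nh/Nv)), and modes pushed above the upper edge during training are the ones carrying learned structure.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    Nv, Nh, sigma = 1000, 500, 1.0
    W0 = sigma / np.sqrt(Nv) * rng.standard_normal((Nv, Nh))   # i.i.d. N(0, sigma^2/Nv) entries
    s = np.linalg.svd(W0, compute_uv=False)

    ratio = Nh / Nv
    edge_lo = sigma * (1 - np.sqrt(ratio))        # lower Marchenko-Pastur edge (singular values)
    edge_hi = sigma * (1 + np.sqrt(ratio))        # upper edge: values above it signal structure

    plt.hist(s, bins=60, density=True)
    plt.axvline(edge_lo, color="k", ls="--")
    plt.axvline(edge_hi, color="k", ls="--")
    plt.xlabel("singular value of W")
    plt.ylabel("density")
    plt.show()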

Fig. 12.  

Top: figure taken from Ref. [73], samples taken from the permanent chain at the end of the training of the RBM. The first two rows correspond to samples generated using PCD, the next two rows to samples obtained using the P-nMF approximation, and the last two to samples obtained using P-TAP. Bottom: 100 features obtained after the training; we can see that they are qualitatively very similar to the ones obtained when training the RBM with P-TAP.
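P-nMF and P-TAP replace the Monte Carlo estimation of the negative term of the gradient by solving self-consistent equations for the layer magnetizations. Below is a minimal sketch of the naive mean-field (nMF) fixed-point iteration for a {0,1} RBM (the TAP version adds Onsager reaction terms not shown here); sizes, damping, and initialization are placeholders, not the settings of Ref. [73].

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def nmf_fixed_point(W, a, b, m_v0, max_iter=500, damping=0.5, tol=1e-8):
        """Iterate the naive mean-field equations
           m_h = sigmoid(b + W^T m_v),  m_v = sigmoid(a + W m_h)
        until the visible magnetizations stop moving."""
        m_v = m_v0.copy()
        for _ in range(max_iter):
            m_h = sigmoid(b + W.T @ m_v)
            m_v_new = sigmoid(a + W @ m_h)
            m_v_next = damping * m_v + (1 - damping) * m_v_new   # damped update for stability
            if np.max(np.abs(m_v_next - m_v)) < tol:
                return m_v_next, m_h
            m_v = m_v_next
        return m_v, m_h

    rng = np.random.default_rng(4)
    Nv, Nh = 100, 50
    W = 0.1 * rng.standard_normal((Nv, Nh))       # stand-in for the learned weights
    a, b = np.zeros(Nv), np.zeros(Nh)
    m_v, m_h = nmf_fixed_point(W, a, b, rng.random(Nv))

Running such an iteration from many random initial magnetizations and counting the distinct fixed points reached is essentially the diagnostic reported in Fig. 13 below.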

Fig. 13.  

Top panel: results for an RBM of size (Nv,Nh) = (1000,500) learned on a synthetic dataset of 10^4 samples having 20 clusters randomly located in a sub-manifold of dimension d = 15. The learning curves for the eigenmodes wα (left) and the associated likelihood function (right, in red), together with the number of fixed points obtained at each epoch. We can see that, before the first eigenvalue is learned, there is a single fixed point; then, as modes are learned, the number of fixed points increases. Bottom panel: results for an RBM of size (Nv,Nh) = (100,50) learned on a synthetic dataset of 10^4 samples having 11 clusters randomly located in a sub-manifold of dimension d = 5. On the left, the scatter plot of the training data together with the position of the fixed points projected on the first two directions of the SVD of w. On the right, the projection along the third and fourth axes. The results are shown after learning 5 modes, where 16 fixed points are found (in fact, more than the number of hidden clusters).

[1] Goodfellow I, Bengio Y, Courville A 2016 Deep Learning 1 Cambridge MIT Press
[2] Mehta P, Bukov M, Wang C H, Day A G R, Richardson C, Fisher C K, Schwab D J 2019 Physics Reports 810 1
[3] Ronneberger O, Fischer P, Brox T 2015 In International Conference on Medical image computing and computer-assisted intervention 234 241 Springer
[4] Carrasquilla J, Melko R G 2017 Nat. Phys. 13 431
[5] Smolensky P 1986 In Parallel Distributed Processing 1 Rumelhart D, McLelland J 194 281 MIT Press
[6] Hinton G E 2002 Neural Computation 14 1771
[7] Ackley D H, Hinton G E, Sejnowski T J 1985 Cognitive Science 9 147
[8] LeCun Y, Bottou L, Bengio Y, Haffner P 1998 Proc. IEEE 86 2278
[9] Le Roux N, Bengio Y 2008 Neural Computation 20 1631
[10] Montúfar G 2016 Restricted Boltzmann machines: Introduction and review. In Information Geometry and Its Applications IV 75 115 Springer
[11] Salakhutdinov R, Hinton G 2009 Deep Boltzmann machines. In Artificial intelligence and statistics 448 455
[12] Krizhevsky A, Hinton G et al. 2009 Learning multiple layers of features from tiny images. Technical report Citeseer

[13] Yasuda M, Tanaka K 2009 Neural Computation 21 3130
[14] Cho K, Ilin A, Raiko T 2011 Improved learning of Gaussian-Bernoulli restricted Boltzmann machines. In International conference on artificial neural networks 10 17 Springer
[15] Yamashita T, Tanaka M, Yoshida E, Yamauchi Y, Fujiyoshii H 2014 To be Bernoulli or to be Gaussian, for a restricted Boltzmann machine. In 2014 22nd International Conference on Pattern Recognition 1520 1525 IEEE
[16] Hjelm R D, Calhoun V D, Salakhutdinov R, Allen E A, Adali T, Plis S M 2014 NeuroImage 96 245
[17] Hu X, Huang H, Peng B, Han J, Liu N, Lv J, Guo L, Guo C, Liu T 2018 Human brain mapping 39 2368
[18] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y 2014 Generative adversarial nets. In Advances in neural information processing systems 2672 2680
[19] Yelmen B, Decelle A, Ongaro L, Marnetto D, Tallec C, Montinaro F, Furtlehner C, Pagani L, Jay F 2021 PLoS genetics 17 e1009303
[20] Zhang N, Ding S F, Zhang J, Xue Y 2018 Neurocomputing 275 1186
[21] Cho K, Raiko T, Ilin A 2011 Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In ICML
[22] Tang Y C, Sutskever I 2011 Data normalization in the learning of restricted Boltzmann machines Department of Computer Science University of Toronto Technical Report UTML-TR-11-2
[23] Hopfield J J 1982 Proc. Natl. Acad. Sci. 79 2554
[24] Amit D J, Gutfreund H, Sompolinsky H 1985 Phys. Rev. A 32 1007
[25] Amit D J, Gutfreund H, Sompolinsky H 1985 Phys. Rev. Lett. 55 1530
[26] Amit D J, Gutfreund H, Sompolinsky H 1987 Annals of Physics 173 30
[27] Rosenblatt F 1958 Psychological Review 65 386
[28] Gardner E 1988 J. Phys. A: Math. Gen. 21 257
[29] Gardner E, Derrida B 1988 J. Phys. A: Math. Gen. 21 271
[30] Mézard M, Parisi G, Virasoro M 1987 Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications 9 World Scientific Publishing Company
[31] Carreira-Perpinan M A, Hinton G E 2005 On contrastive divergence learning In Aistats 10 33 40 Citeseer
[32] Tieleman T 2008 Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning 1064 1071
[33] Fischer A, Igel C 2014 Pattern Recognition 47 25
[34] Karakida R, Okada M, Amari S I 2014 Analyzing feature extraction by contrastive divergence learning in rbms. In Deep learning and representation learning workshop: NIPS
[35] Karakida R, Okada M, Amari S I 2016 Neural Networks 79 78
[36] Decelle A, Fissore G, Furtlehner C 2018 J. Stat. Phys. 172 1576
[37] Decelle A, Fissore G, Furtlehner C 2017 Europhys. Lett. 119 60001
[38] Berlin T H, Kac M 1952 Phys. Rev. 86 821
[39] Stanley H E 1968 Phys. Rev. 176 718
[40] Decelle A, Furtlehner C 2020 J. Phys. A: Math. Theor. 53 184002
[41] Genovese G, Tantari D 2020 J. Phys. A: Math. Theor. 53 094001
[42] Nijman M J, Kappen H J 1997 International Journal of Neural Systems 8 301
[43] MacKay D J C, David J C 2003 Information theory, inference and learning algorithms Cambridge university press
[44] Bishop C M 2006 Pattern recognition and machine learning Springer
[45] Rose K, Gurewitz E, Fox G C 1990 Phys. Rev. Lett. 65 945
[46] Kloppenburg M, Tavan P 1997 Phys. Rev. E 55 2089
[47] Akaho S, Kappen H J 2000 Neural Computation 12 1411
[48] Barra A, Bernacchia A, Santucci E, Contucci P 2012 Neural Networks 34 1
[49] Mézard M 2017 Phys. Rev. E 95 022117
[50] Shimagaki K, Weigt M 2019 Phys. Rev. E 100 032128
[51] Decelle A, Hwang S, Rocchi J, Tantari D 2019 arXiv:1906.11988
[52] Hyvärinen A, Oja E 2000 Neural Networks 13 411
[53] Yokoyama Y, Katsumata T, Yasuda M 2019 The Review of Socionetwork Strategies 13 253
[54] Hahnloser R H R, Sarpeshkar R, Mahowald M A, Douglas R J, Seung H S 2000 Nature 405 947
[55] Teh Y W, Hinton G E 2001 Rate-coded restricted Boltzmann machines for face recognition. In Advances in neural information processing systems 908 914
[56] Nair V, Hinton G E 2010 Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10) 807 814
[57] Barra A, Genovese G, Sollich P, Tantari D 2018 Phys. Rev. E 97 022310
[58] Tubiana J, Monasson R 2017 Phys. Rev. Lett. 118 138301
[59] Huang H P 2017 J. Stat. Mech.: Theor. Exper. 2017 053302
[60] Tubiana J 2018 Restricted Boltzmann machines: from compositional representations to protein sequence analysis. PhD thesis, ENS, supervised by Rémi Monasson and Simona Cocco, Physics, Paris Sciences et Lettres
[61] Agliari E, Barra A, Tirozzi B 2019 J. Stat. Mech.: Theor. Exper. 2019 033301
[62] Hartnett G S, Parker E, Geist E 2018 Phys. Rev. E 98 022116
[63] Agliari E, Barra A, Galluzzi A, Guerra F, Moauro F 2012 Phys. Rev. Lett. 109 268101
[64] Agliari E, Barra A, Galluzzi A, Isopi M 2014 Neural Networks 49 19
[65] Wemmenhove B, Coolen A C C 2003 J. Phys. A: Math. Gen. 36 9617
[66] Huang H P 2018 J. Phys. A: Math. Theor. 51 08LT01
[67] Kirkpatrick S, Sherrington D 1978 Phys. Rev. B 17 4384
[68] Amari S I 1977 Biol. Cybern. 26 175
[69] Harsh M, Tubiana J, Cocco S, Monasson R 2020 J. Phys. A: Math. Theor. 53 174002
[70] Hukushima K, Nemoto K 1996 J. Phys. Soc. Jpn. 65 1604
[71] Desjardins G, Courville A, Bengio Y, Vincent P, Delalleau O 2010 Parallel tempering for training of restricted Boltzmann machines. In Proceedings of the thirteenth international conference on artificial intelligence and statistics 145 152 Cambridge MIT Press
[72] Chako T, Muneki Y 2016 J. Phys. Soc. Jpn. 85 034001
[73] Gabrié M, Tramel E W, Krzakala F 2015 Training restricted Boltzmann machine via the Thouless-Anderson-Palmer free energy. In Advances in neural information processing systems 640 648
[74] Tramel E W, Gabrié M, Manoel A, Caltagirone F, Krzakala F 2018 Phys. Rev. X 8 041006
[75] Thouless D J, Anderson P W, Palmer R G 1977 Philosophical Magazine 35 593
[76] Plefka T 1982 J. Phys. A: Math. Gen. 15 1971
[77] Georges A, Yedidia J S 1991 J. Phys. A: Math. Gen. 24 2173
[78] Maillard A, Foini L, Castellanos A L, Krzakala F, Mézard M, Zdeborová L 2019 J. Stat. Mech.: Theor. Exp. 2019 113301
[79] Tramel E W, Manoel A, Caltagirone F, Gabrié M, Krzakala F 2016 Inferring sparsity: Compressed sensing using generalized restricted Boltzmann machines. In 2016 IEEE Information Theory Workshop (ITW) 265 269
[80] Fissore G, Decelle A, Furtlehner C, Han Y F 2019 arXiv:1912.09382
[81] Huang H P, Toyoizumi T 2015 Phys. Rev. E 91 050101
[82] Lage-Castellanos A, Mulet R, Ricci-Tersenghi F, Rizzo T 2013 J. Phys. A: Math. Theor. 46 135001
[83] Ricci-Tersenghi F 2012 J. Stat. Mech.: Theor. Exp. 2012 P08015
[84] Nguyen H C, Berg J 2012 J. Stat. Mech.: Theor. Exp. 2012 P03004
[85] Huang H P, Toyoizumi T 2016 Phys. Rev. E 94 062310
[86] Huang H P 2020 Phys. Rev. E 102 030301
[87] Salakhutdinov R, Murray I 2008 On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning 872 879
[88] Krause O, Fischer A, Igel C 2020 Artificial Intelligence 278 103195
[89] Yale A, Dash S, Dutta R, Guyon I, Pavao A, Bennett K P 2020 Generation and evaluation of privacy preserving synthetic health data Neurocomputing