Training Neural Networks to Accurately Determine Energies of Structures Outside of the Training Set Using Agglomerative Clustering
Carlos A. Barragan and Michael N. Groves
Volume 12, Issue 1 (January 2021), pp. 32–38
https://doi.org/10.22369/issn.2153-4136/12/1/5
BibTeX
@article{jocse-12-1-5,
  author={Carlos A. Barragan and Michael N. Groves},
  title={Training Neural Networks to Accurately Determine Energies of Structures Outside of the Training Set Using Agglomerative Clustering},
  journal={The Journal of Computational Science Education},
  year=2021,
  month=jan,
  volume=12,
  issue=1,
  pages={32--38},
  doi={https://doi.org/10.22369/issn.2153-4136/12/1/5}
}
Machine learning can process large volumes of data efficiently, and one application is as a molecular calculator that replaces more expensive ab initio techniques. Neural networks (NNs) are most predictive for new cases that resemble examples in their training sets; however, it is sometimes necessary for a NN to accurately evaluate structures that are not in its training set. In this project, we quantify how clustering a training set into groups with similar geometric motifs can be used to train a NN so that it can accurately determine the energies of structures outside the training set. To do this, we generated over 800 C8H7N structures, relaxed them using DFTB+, and grouped them using agglomerative clustering. Some of these groups were assigned to the training set and used to train a NN with the existing Atomistic Machine-learning Package (AMP). The structures in the remaining groups were evaluated with the trained NN, and the predicted energies were compared to the DFTB+ energies. The two sets of energies were plotted against each other and fitted to a straight line, where a higher R² value corresponds to the NN more accurately predicting the energies of structures not in its training set. This process was repeated systematically with different numbers of nodes and hidden layers. For limited NN architectures, the NN did a poor job of predicting the energies of structures outside its training set. Predictions improved when hidden layers and nodes were added and when the size of the training set was increased.
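The cluster-then-hold-out workflow described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' code. The two-dimensional descriptors, the average-linkage criterion, the function names `average_linkage_clusters` and `r_squared`, and the placeholder energies are all assumptions made for illustration; the abstract does not state which descriptor or linkage was used. In the study itself the energies came from DFTB+ and from a NN trained with AMP.

```python
import math
import random


def average_linkage_clusters(points, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two closest clusters.

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def r_squared(x, y):
    """R^2 of the least-squares straight line through (x, y).

    For a simple linear fit this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy * sxy / (sxx * syy)


# Toy stand-in for structural descriptors: three well-separated "motifs".
random.seed(0)
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
descriptors = [
    (cx + random.gauss(0, 0.3), cy + random.gauss(0, 0.3))
    for cx, cy in centers
    for _ in range(5)
]

clusters = average_linkage_clusters(descriptors, n_clusters=3)

# Whole clusters go to either the training set or the held-out set,
# so held-out structures belong to motifs the NN never saw.
train_idx = sorted(i for c in clusters[:2] for i in c)
held_out_idx = sorted(clusters[2])

# Placeholder energies (made-up numbers, not results from the paper).
reference_energies = [-1.00, -1.20, -1.35, -1.50, -1.80]
predicted_energies = [-0.95, -1.25, -1.30, -1.55, -1.70]
print("training structures:", train_idx)
print("held-out structures:", held_out_idx)
print("R^2 on held-out set:", r_squared(reference_energies, predicted_energies))
```

The key design choice in this sketch is that the split happens at the level of whole clusters rather than individual structures, which is what makes the held-out R² a measure of extrapolation to unseen motifs rather than interpolation within familiar ones.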