Steroid identification via deep learning retention time predictions and two-dimensional gas chromatography-high resolution mass spectrometry
Untargeted steroid identification remains a major analytical challenge, even with sophisticated technology such as two-dimensional gas chromatography coupled to high-resolution mass spectrometry (GC × GC-HRMS), because of the chemical similarity of the analytes. Moreover, when analytical standards and mass spectral and retention index databases are not available, compound annotation is cumbersome. Hence, retention time prediction models are needed to explore new annotation approaches. In this work, we evaluated several in silico methods for retention time prediction in multidimensional gas chromatography. We used three classical machine learning (CML) algorithms (Partial Least Squares (PLS), Support Vector Regression (SVR) and Random Forest Regression (RFR)) and two deep learning approaches (a dense neural network (DNN) and a three-dimensional convolutional neural network (CNN)). Whereas molecular descriptors were used as input for the CML and DNN algorithms, a three-dimensional molecular representation based on the electrostatic potential (ESP) was studied as direct input data for the CNN. All the developed models showed similar performance, with Q² values over 0.9. However, the CNN performed best, with average retention time prediction errors of 2% and 6% for the first and second separation dimensions, respectively. Additionally, only the three-dimensional ESP representation coupled with the CNN was able to extract the stereochemical information crucial for the separation of diastereomers. The combination of retention time prediction and high-resolution mass spectral data, applied to clinical samples, enabled the untargeted annotation of 12 steroid metabolites in the urine of newborns.
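The two figures of merit quoted above (Q² and the average percentage prediction error) can be illustrated with a minimal sketch. It assumes the common definitions Q² = 1 − PRESS/TSS, computed on cross-validated predictions, and a mean absolute relative error; the abstract does not state the exact formulas used, and the function names and retention times below are hypothetical, made-up values for illustration only.

```python
from statistics import mean


def q2(observed, predicted_cv):
    """Cross-validated Q2, assumed here to be 1 - PRESS / TSS.

    predicted_cv must hold predictions made while the corresponding
    compound was left out of model training (e.g. k-fold or leave-one-out).
    """
    y_bar = mean(observed)
    # PRESS: sum of squared errors of the held-out predictions.
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted_cv))
    # TSS: total sum of squares around the mean observed value.
    tss = sum((o - y_bar) ** 2 for o in observed)
    return 1.0 - press / tss


def mean_relative_error_pct(observed, predicted):
    """Average absolute prediction error as a percentage of the observed value."""
    return 100.0 * mean(abs(p - o) / o for o, p in zip(observed, predicted))


# Hypothetical first-dimension retention times in seconds (not data from the study).
rt_observed = [1200.0, 1500.0, 1800.0, 2100.0]
rt_predicted = [1230.0, 1470.0, 1830.0, 2070.0]

q2_value = q2(rt_observed, rt_predicted)
error_pct = mean_relative_error_pct(rt_observed, rt_predicted)
print(f"Q2 = {q2_value:.3f}, mean relative error = {error_pct:.2f}%")
```

In a two-dimensional separation, the same two functions would be applied separately to the first- and second-dimension retention times, which is how two distinct error percentages can arise for a single model.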