2021, Vol.50, No.11

1887-1891

A set of electronic-structure informatics (ESI) descriptors was applied to classify modes of toxic action (MoA) of 245 phenol derivatives to the growth of Tetrahymena pyriformis. The extreme gradient boosting classification (XGBC) method was performed for classification where the synthetic minority oversampling technique (SMOTE) was employed to overcome the imbalance in the MoA classification data available in the literature. Performance evaluation of the machine learning procedure shows that it is able to classify the molecular set into the five MoA classes. The feature importance analysis suggests electronic factors determining characteristics for each MoA class.

In silico prediction of toxicity of molecules is one of the most challenging technologies for drug development.1 Potentially demanded tasks are to make an a priori prediction of potential toxic effects and their endpoints.2 For quantitative prediction of the latter, it is desired to have information of possible modes of toxic actions (MoA), which is defined as a set of physiological and behavioral reactions that characterizes a type of adverse biological response.3,4 This classification is different from mechanism for toxicity, which is defined as the biochemical process underlying a given MOA.4

Recently, we have reported regression modeling of toxicity exhibited by phenol derivatives.5 This type of molecule is one of the most widely used chemicals in commercial products, but should be carefully handled because of its hazardous characteristics.6 For example, phenolic additives used for preservatives, disinfectants, sunscreens, and light stabilizers are known to have estrogenic activity causing endocrine disruption.7,8 The widespread exposure to chlorophenols has led to development of an immunoassay: 2,4,5-trichlorophenol is one of such congeners with significant toxic effects.9 In our previous study,5 we applied a set of descriptors that can be numerically evaluated by using electronic-structure calculations. Since structurally similar molecules were expected to exhibit toxicity in a similar MoA, the toxicity of 33 alkylphenols to Tetrahymena pyriformis were investigated using the multiple linear regression model.5

In order to extend such an in silico study, it seems indispensable to develop a computational method to classify MOA for given toxicity since different factors are expected important and regression modeling should be done separately for each MoA. For phenol derivatives, it has been known that they exhibit toxicity via several MoA.10

Schultz et al. tried to classify the phenol derivatives into five MoA classes on the basis of their molecular structures.11 Their suggested criteria for classification are summarized in Table S1: the functional group in a phenolic ring may result in five MoA, which are called polar narcotic (PN), weak acid respiratory uncoupler (WARU), pro-electrophile (PE), soft electrophile (SE), and pro-redox cycler (PRC) classes.

There are some methods of classifying compounds according to MoA.12 The most frequently used and conventional one is the rules-based method.13 Several quantitative structure-activity relationships (QSAR) classification studies for MoA prediction have also been performed so far.10,13,14 To our knowledge, no in silico studies have been successful in the MoA classification.10,13,14 This would be partly due to the fact that the number of molecules (reference data) is not uniform among the five MoA classes, i.e. the reference data is seriously imbalanced. In addition, the endpoint of toxicity may or may not be a linear function of descriptors (explanatory variables), suggesting that an extended machine learning modeling should be applied.

In this paper, we apply the extreme gradient boosting (XGB) method to classification of MoA,15 hereafter referred to as XGBC, with the descriptor set which has been recently suggested by the present authors and is called “electronic-structure informatics” (ESI) descriptor.5,16,17 The XGB algorithm is a supervised machine-learning method based on ensemble learning and decision tree algorithms. It has been known to be powerful with some advantages in prediction speed and performance in comparison with other machine learning methods for classification and regression.15,18

In order to overcome the imbalance problem in MoA classification, we herein apply synthetic minority oversampling technique (SMOTE).19 Although several methods have been reported for QSAR studies,20,21 SMOTE can be considered as one of the most successful methods in the literature. In this paper, the dataset after the SMOTE procedure is called “a balanced dataset”.

The phenol derivatives referred to in this paper are those in Cronin et al.22 Originally 250 compounds were listed, but we excluded 4 iodo compounds because their electronic structure would be appreciably influenced by spin-orbit coupling, which is a typical relativistic effect, due to iodine. The relativistic effect would appreciably influence descriptors related to electronic energy levels. 4-Methylcyanophenol was also eliminated because of ambiguity of its compound name.22,23 In addition, we reassigned the MoA class of tetrachlorocatechol from PE class to PRC class by referring to the MoA classification scheme originally suggested by Schultz et al.11 Thus, the dataset in this present study contains 245 molecules among which 169, 26, 27, 18, and 5 phenol derivatives are assigned to the PN, PE, SE, WARU, and PRC classes, respectively.

In this work, the ESI descriptors were evaluated through the electronic-structure calculations using the Gaussian 16 program.24 The definition of the descriptors is summarized in Table S2, and the calculated values are tabulated in Tables S3 and S4 in Supporting Information. The density functional theory (DFT) method with the M06-2X functional and the 6-31G(d,p) basis set was employed to optimize geometries of the molecules and to calculate their properties. The vertical ionization potential (IPv), vertical electron affinity (EAv), and vertical transition energy to the spin triplet state (ΔSTv) were calculated by using the ΔSCF method. The reorganization energies (λS0→C, λC→S0, λS0→A, λA→S0, λS0→T1, λT1→S0) were evaluated by optimizing the geometries in all the electronic states including the ground (S0), ionized (C), electron attached (A), and lowest spin-triplet (T1) states. The electrostatic solvation energy in aqueous solution (ΔEsolv) was evaluated with the polarizable continuum model (PCM) method. As a related molecular property, dipole moment was also taken into account.

The molecular volume (Vmol) was calculated through Monte Carlo sampling implemented in Gaussian 16 where the isodensity surface was referred to for the definition of the molecular surface. Since the molecular volume fluctuated depending on the run, we averaged the results after repeating the evaluation ten times. The box size containing a molecule was determined by referring to the isodensity surface of atoms. The plane of the box was defined by using the Mathematica 11 software.25 The size of the box is represented using the three lengths (B1, B2, B3) of the edges of the box in the descending order. We also included the molecular mass (M) in the descriptor because it was considered significant in the previous studies.5

The descriptors for electronic similarity in the UV/vis spectra and (energy) density of electronically excited states were evaluated as spectral overlaps where the reference molecule was 2,3,4,5-tetrachlorophenol whose pIGC50 (= −log IGC50 where IGC50 is the 50% inhibition growth concentration) to Tetrahymena pyriformis was the largest (2.710) in Cronin’s paper.22 The excited-state calculation was performed using the time-dependent DFT (TDDFT) method with the same basis set as the ground state where the low-lying 30 excited states were calculated. The spectral overlap was calculated through convolution by Gaussian functions on each peak. Since the number of the calculated excited states are limited to 30, we also evaluated spectral similarity of density of molecular orbitals, which is known as “density of states (DOS)” in solid state physics. The descriptors representing vibrational similarities were evaluated for the infrared (IR) spectra and the density of vibrational states.

The XGBC calculations were implemented with the in-house code developed with the scikit-learn package in Python 3.7.1. Before classification, its hyperparameters15,26 were optimized for “a balanced dataset” with the Bayesian optimization method. The optimized values are summarized in Table 1.

Table
Table 1. Summary of the optimized hyperparameters.15,26
Table 1. Summary of the optimized hyperparameters.15,26
Hyperparameter Range Optimal value
eta [0,1] 0.87
gamma [0,1] 1.62
max_depth [1,10] 6
min_child_weight [0,10] 1
subsample [0,1] 0.85
colsample_bytree [0,1] 0.15

Chawla et al. proposed the SMOTE method and successfully used it to overcome imbalance in a dataset.19 It synthesizes (i.e. generates) new descriptor values from its neighboring data points in order to increase the number of the data in the minority class until it becomes equal to that in the majority class.27 As an example to see evolution of data points in the SMOTE procedure, we show the data points for the 245 molecules and those with the synthesized data in Figures 1a and 1b, respectively, where IPv and EAv were chosen as the descriptors defining the chemical space. Figure 1b indicates all the data points including both the original and synthesized ones after applying the SMOTE algorithm.

In applying XGBC, the MOA classification data were split into the training and test sets. The former was used to construct a classification model, while the latter was to evaluate its performance. In this study, we randomly separated the data into training and test sets with the ratio of 4:1. For reference, the numbers of molecules with this ratio are summarized in Table 2.

Table
Table 2. The number of molecules (NOM) in the training and test sets for each MOA when the data splitting is done with the ratio of 4:1. In the implementation, each NOM was rounded to an integer. The values in the parentheses indicate the NOM in the best XGBC model in the 100 trials.
Table 2. The number of molecules (NOM) in the training and test sets for each MOA when the data splitting is done with the ratio of 4:1. In the implementation, each NOM was rounded to an integer. The values in the parentheses indicate the NOM in the best XGBC model in the 100 trials.
MOA Training set Test set Total
PN 135.20(136) 33.80(33) 169
PE 20.95(20) 5.05(6) 26
SE 21.53(21) 5.47(6) 27
WARU 14.27(16) 3.73(2) 18
PRC 4.05(3) 0.95(2) 5
Total 196(196) 49(49) 245

Since arbitrary data splitting may generate the data sets defining different applicability domain for modeling, the procedure was repeated 100 times, and the XGBC was performed to each divided dataset. The predictive performance of XGBC model was assessed by the following four parameters defined by eqs (1)–(4), respectively: overall prediction accuracy, precision (positive predictive value), recall (sensitivity), and the F1-score (weighted average of precision and recall value).28

\begin{equation} \textit{accuracy} = (\textit{TP} + \textit{TN})/(\textit{TP} + \textit{TN} + \textit{FP} + \textit{FN}) \end{equation}
(1)
\begin{equation} \textit{precision} = \textit{TP}/(\textit{TP} + \textit{FP}) \end{equation}
(2)
\begin{equation} \textit{recall} = \textit{TP}/(\textit{TP} + \textit{FN}) \end{equation}
(3)
\begin{equation} \textit{F1-score} = 2(\textit{precision})(\textit{recall})/(\textit{precision} + \textit{recall}) \end{equation}
(4)
where TP, TN, FP, and FN indicate true positive, true negative, false positive, and false negatives, respectively.

The average numbers of molecules for each MOA class in the test set are summarized in a confusion matrix shown in Table 3, while the assessment parameters for the performance, which were evaluated using Table 3, are tabulated in Table 4. The corresponding data for the training set are available in Tables S5 and S6 in Supporting Information. Table 4 indicates that the present computational approach is able to accurately classify the phenol derivatives into the five MOA classes.

Table
Table 3. The confusion matrix for the average number of molecules of the test set classified into each of the experimentally assigned and computationally predicted MOAs. A value in a parenthesis indicates the number of molecules experimentally assigned to a given MOA class.
Table 3. The confusion matrix for the average number of molecules of the test set classified into each of the experimentally assigned and computationally predicted MOAs. A value in a parenthesis indicates the number of molecules experimentally assigned to a given MOA class.
Table
Table 4. Assessment parameters for XGBC evaluated from the confusion matrix shown in Table 3 showing the averaged result over 100-time data divisions into the training and test sets.
Table 4. Assessment parameters for XGBC evaluated from the confusion matrix shown in Table 3 showing the averaged result over 100-time data divisions into the training and test sets.
MOA accuracy precision recall F1-score
PN 0.94 0.99 0.94 0.97
PE 0.96 0.79 0.96 0.86
SE 0.97 0.89 0.97 0.93
WARU 0.99 0.97 0.99 0.98
PRC 0.97 1.00 0.97 0.98

Among the five MOA classes, the PRC class has only five molecular members. It may reduce the accuracy, precision, and sensitivity of the multiple classification models so that some previous studies have commonly excluded this minority class.2931 In this study, this minority class was also included in modeling together with the four major classes. Tables 3 and 4 show that the molecules in the PRC class were successfully classified.

In order to establish a classification model, the best model was chosen after the 100 trials in data division. Table 5 shows the confusion matrix for the best model. The number of molecules corresponding to the true negatives and the true positives are small.

Table
Table 5. The confusion matrix for the number of molecules classified into each of the structurally assigned and computationally predicted MOAs. The best classification model was chosen after the 100 trials, and the model was applied to classify all (245) molecules. A value in a parenthesis indicates the number of molecules experimentally assigned to a given MOA class. A value in an angular bracket is a result obtained without using the SMOTE algorithm.
Table 5. The confusion matrix for the number of molecules classified into each of the structurally assigned and computationally predicted MOAs. The best classification model was chosen after the 100 trials, and the model was applied to classify all (245) molecules. A value in a parenthesis indicates the number of molecules experimentally assigned to a given MOA class. A value in an angular bracket is a result obtained without using the SMOTE algorithm.

The calculated assessment parameters derived from Table 5, which are summarized in Table 6, indicate that all the scores are better than the average over the 100 trials (Table 4) and sufficiently high for accurate classification. When the SMOTE algorithm was not employed, we failed to have molecules assigned to the PRC class in prediction as shown in Tables 5 and 6.

Table
Table 6. Assessment parameters for XGBC evaluated from the confusion matrix shown in Table 5. A value in an angular bracket is a result obtained without using the SMOTE algorithm. The precision and F1-score values are not available for PRC because TP = 0 and FP = 0.
Table 6. Assessment parameters for XGBC evaluated from the confusion matrix shown in Table 5. A value in an angular bracket is a result obtained without using the SMOTE algorithm. The precision and F1-score values are not available for PRC because TP = 0 and FP = 0.
MoA accuracy precision recall F1-score
PN 0.96
〈0.99〉
0.99
〈0.99〉
0.96
〈0.99〉
0.98
〈0.99〉
PE 0.92
〈0.98〉
0.92
〈0.92〉
0.92
〈0.92〉
0.92
〈0.92〉
SE 1.00
〈0.99〉
0.87
〈0.96〉
1.00
〈0.96〉
0.93
〈0.96〉
WARU 1.00
〈0.98〉
1.00
〈0.94〉
1.00
〈0.83〉
1.00
〈0.88〉
PRC 1.00
〈0.98〉
1.00
〈N.A.〉
1.00
〈0.00〉
1.00
〈N.A.〉

It is noted that the classification criteria by Schultz et al. were based on the structural features characterized through experimental observations. In contrast, the ESI descriptor employed herein essentially represents electronic features of molecules including the size parameters that can be related to potential energy surfaces for intermolecular interactions. The present success in classification indicates that numerical description of structural characteristics of molecules would be possible even with the ESI descriptor.

The feature important analysis was performed to know the determining factors for classification. This was done after preliminary selection of descriptors on the basis of the correlation matrix (Table S10): the redundant descriptors were eliminated by referring to pair correlation larger than 0.5. Then, the contribution of the remaining ESI descriptors to F1-score was numerically evaluated to XGBC model.

The result of feature important analysis is summarized in Figure 2. It indicates that the most important descriptor is EAv, while the second one is ΔEsolv. These descriptors are essentially similar to those suggested by Cronin et al:22 the lowest unoccupied molecular orbital (εLUMO) and octanol-water partition coefficient (log P) were shown to be important. Figure 2 also suggests some other parameters related to structural softness (λS0→A, λS0→C) and size parameters (B2, Vmol) would play important roles as classification factors.

In summary, we have performed in silico classification of the 245 phenol derivatives using the XGBC method with the ESI descriptors. Most of the 21 ESI descriptors were numerically evaluated using the DFT-level of electronic-structure calculations. The imbalanced dataset of the reference data was overcome by using the SMOTE technique. The obtained classification model was successful to accurately classify the 245 phenol derivatives into five MoA classes suggested by Schultz et al.11 Although computationally demanding, the smaller size and clear physical meanings of the ESI descriptor set would be the advantages over other descriptor sets.

We are grateful to Takafumi Inoue and Toshihiro Ideo for his assistance in a part of the descriptor calculation. This work was partially supported by Grant-in-Aid for Scientific Research on Innovative Areas (2601): π-System Figuration (26102015 to MS) from MEXT, Japan.

Supporting Information is available on https://doi.org/10.1246/cl.210453.