It was readily apparent from earlier computational models for estrogens that developing robust models to predict receptor binding, applicable over a range of structure and activity, would require a greatly improved dataset for training and testing [Perkins, 2001 #1332; Perkins, 2001 #1720]. Specifically, existing models for ER-RBA, while yielding a statistically robust cross-validated q², have not yet been validated for their ability to screen or predict the broad diversity of chemical structures that may be active. The data needed for such a validation are presently inadequate. Our chemical structure diversity models revealed that more data, and more consistent data, spanning a broader range of chemical structures were needed. Accordingly, our current modeling efforts center on the selection of properly designed training sets. To accomplish this, the modeling efforts are supported by experimentation aimed at providing data for training and validation of the computational models. Two criteria are essential to ensuring that data are appropriate for training and validation: data quality and structural diversity. More specifically:
Validation of the data used for calibrating and testing models is essential. Our models are designed to predict an outcome for a specific mechanism or a specific set of mechanisms. A model can be corrupted when its training set contains chemicals that act by mechanisms different from those of the rest of the training set. Including data for chemicals that contain active impurities, data for weakly active chemicals that fall below an assay's sensitivity range, or simply erroneous assay data points can similarly corrupt models. We have found that cross-laboratory, cross-assay, and cross-species correlations are a fast and effective way to cull data that would otherwise numerically corrupt the training set [Perkins, 2001 #1332; Perkins, 2001 #1720]. Data validation through the data mining process just described is, however, time-consuming and arduous unless the relevant data are unified and readily available in a relational database.
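The cross-laboratory check described above can be illustrated with a minimal sketch. It is not the procedure used in the cited work. It assumes that each laboratory reports log10 RBA values keyed by chemical name, and that a chemical is flagged for review when the two laboratories disagree by more than a chosen threshold. The function names, the `max_diff` threshold of one log unit, and all data values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def flag_discordant(lab_a, lab_b, max_diff=1.0):
    """Compare two labs' log10 RBA values for the chemicals they share.

    Returns the overall correlation and the chemicals whose values differ
    by more than max_diff log units (an assumed, adjustable threshold).
    """
    shared = sorted(set(lab_a) & set(lab_b))
    xs = [lab_a[c] for c in shared]
    ys = [lab_b[c] for c in shared]
    r = pearson_r(xs, ys)
    flagged = [c for c in shared if abs(lab_a[c] - lab_b[c]) > max_diff]
    return r, flagged


# Hypothetical log10 RBA values, invented purely for illustration.
lab_a = {"chem_1": 2.0, "chem_2": 1.0, "chem_3": 0.0, "chem_4": -1.0, "chem_5": -2.0}
lab_b = {"chem_1": 1.9, "chem_2": 1.1, "chem_3": -0.2, "chem_4": 0.8, "chem_5": -2.1,
         "chem_6": 0.5}  # chem_6 was measured by only one lab and is ignored

r, flagged = flag_discordant(lab_a, lab_b)
print(f"r = {r:.2f}; flagged for review: {flagged}")
```

In this sketch a flagged chemical is only a candidate for closer inspection. Whether it is removed from the training set would still depend on the reason for the disagreement, such as an active impurity or an activity below the sensitivity range of one assay.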