Impact of Benchmark Dataset Topology on VS Validation Results
We introduced Refined Nearest Neighbor Analysis methods for chemical datasets in the paper:
Rohrer, S.G.; Baumann, K.
Impact of Benchmark Data Set Topology on the Validation of Virtual Screening Methods: Exploration and Quantification by Spatial Statistics.
J. Chem. Inf. Model., 2008, 48, 704-71
doi: 10.1021/ci700099u (Open Access)
It was shown that datasets with a "clumpy" topology in descriptor
space bias validations of virtual screening methods towards
over-optimistic results. This led to the rationale that datasets
without such bias, i.e. Maximum Unbiased Validation (MUV) datasets, should
have a non-clumpy, spatially random topology.
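The notion of a "clumpy" topology can be illustrated with a small sketch. The Python code below is not the procedure from the paper; it is a simplified illustration that assumes a 2D Euclidean toy descriptor space and compares two cumulative nearest neighbor distance distributions: G(t), the fraction of actives whose nearest other active lies within distance t, and F(t), the fraction of decoys whose nearest active lies within t. The function names (nearest_distance, clumpiness_gap), the threshold grid, and the toy data are all hypothetical choices made for this example.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest neighbor in `others`."""
    return min(math.dist(point, other) for other in others)


def clumpiness_gap(actives, decoys, thresholds):
    """Sum of F(t) - G(t) over all thresholds (illustrative statistic).

    G(t): fraction of actives whose nearest *other* active lies within t.
    F(t): fraction of decoys whose nearest active lies within t.
    """
    # Nearest active-active distance, excluding the point itself.
    active_nn = [
        nearest_distance(a, actives[:i] + actives[i + 1:])
        for i, a in enumerate(actives)
    ]
    # Nearest decoy-active distance.
    decoy_nn = [nearest_distance(d, actives) for d in decoys]

    gap = 0.0
    for t in thresholds:
        g = sum(d <= t for d in active_nn) / len(active_nn)
        f = sum(d <= t for d in decoy_nn) / len(decoy_nn)
        gap += f - g
    return gap


# Toy data (assumed): decoys fill the unit square uniformly.
rng = random.Random(0)
decoys = [(rng.random(), rng.random()) for _ in range(200)]

# "Clumpy" actives: one tight cluster around the center of the square.
clumped = [(rng.gauss(0.5, 0.02), rng.gauss(0.5, 0.02)) for _ in range(30)]

# Spatially random actives: drawn like the decoys.
spread = [(rng.random(), rng.random()) for _ in range(30)]

thresholds = [0.01 * k for k in range(1, 51)]

clumped_gap = clumpiness_gap(clumped, decoys, thresholds)
spread_gap = clumpiness_gap(spread, decoys, thresholds)
print("clumped actives:", clumped_gap)
print("spread actives: ", spread_gap)
```

Under these assumptions, a strongly negative gap indicates that actives sit much closer to each other than decoys do to actives, i.e. a clumpy dataset, while a value near zero is what a spatially random arrangement would be expected to give.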
Both the original
findings about the impact of dataset topology and the rationale behind
MUV dataset design are summarized concisely in a talk Knut
Baumann gave at EuroQSAR 2008 in Uppsala: Slides.
Please contact muv@tu-bs.de if you have any questions or suggestions.