Maximum Unbiased Validation (MUV) Datasets for Virtual Screening

Many benchmark experiments of virtual screening techniques are biased by the composition of the benchmark datasets. Such benchmark dataset bias usually has two main causes:

1. The actives are too similar to each other ("analogue bias" [1]) regarding low-dimensional molecular properties.
2. The actives are too dissimilar from the decoys ("artificial enrichment" [2]) also regarding low-dimensional molecular properties.

Research from our lab recently introduced two tools for the statistical analysis of such benchmark dataset bias in chemical datasets:

1. "Simple Descriptors" are a vectorized form of a compound's sum formula, combined with the respective counts of H-bond acceptors, H-bond donors, chiral centers and ring systems, and the compound's logP. These descriptors capture all low-dimensional molecular properties associated with analogue bias and artificial enrichment.[3]

2. Refined Nearest Neighbor analysis based on a simple descriptor representation of the respective datasets can be utilized to characterize the topology of the datasets, i.e. their clustering behavior and their relative position in chemical space.[3]
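
As a rough illustration of the first tool, the following Python sketch assembles a simple-descriptor-style vector from a sum formula and precomputed property counts. The element order in `ELEMENTS`, the function names and the example property values are assumptions made for this sketch; they are not taken from [3], and the property counts would normally come from a cheminformatics toolkit.

```python
import re
from collections import Counter

# Hypothetical element order for the count part of the vector
# (an assumption for this sketch, not the element set defined in [3]).
ELEMENTS = ("C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I")


def parse_sum_formula(formula):
    """Count atoms per element in a flat sum formula such as 'C9H8O4'."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def simple_descriptor(formula, hba, hbd, chiral_centers, ring_systems, logp):
    """Concatenate element counts with the precomputed property values."""
    counts = parse_sum_formula(formula)
    vector = [float(counts.get(element, 0)) for element in ELEMENTS]
    vector += [float(hba), float(hbd), float(chiral_centers),
               float(ring_systems), float(logp)]
    return vector


# Illustrative inputs for aspirin (C9H8O4); the property values are
# supplied by hand here and serve only as an example.
aspirin = simple_descriptor("C9H8O4", hba=4, hbd=1, chiral_centers=0,
                            ring_systems=1, logp=1.2)
```

Each compound then becomes a point in a low-dimensional descriptor space, which is the representation the nearest neighbor analysis operates on.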

It was shown that benchmark dataset bias is associated with "clumpy" topologies, i.e. datasets whose actives form clusters in descriptor space and are separated from the decoys. Refined Nearest Neighbor analysis also provides tools for quantifying the "clumpiness" of datasets: the scalar ΣS is smaller than 0 whenever a dataset is clumpy and larger than 0 if it is dispersed. Values of ΣS near 0 indicate the special state of spatial randomness, i.e. a spatially random distribution of actives and decoys in simple descriptor space.
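
The sign convention of ΣS can be illustrated with a small Python sketch. It assumes the following definitions, which this page does not spell out and which should be checked against [3]: G(t) is the fraction of actives whose nearest other active lies closer than t, F(t) is the fraction of decoys whose nearest active lies closer than t, and ΣS is the sum of F(t) − G(t) over a set of distance thresholds. The function names, the Euclidean metric and the toy coordinates are assumptions of this sketch.

```python
import math


def nearest_distance(point, others):
    """Euclidean distance from point to its closest neighbor in others."""
    return min(math.dist(point, other) for other in others)


def sigma_s(actives, decoys, thresholds):
    """Sum of F(t) - G(t) over the given thresholds (assumed definition)."""
    # Distance from each active to its nearest *other* active.
    nn_active = [nearest_distance(a, actives[:i] + actives[i + 1:])
                 for i, a in enumerate(actives)]
    # Distance from each decoy to its nearest active.
    nn_decoy = [nearest_distance(d, actives) for d in decoys]
    total = 0.0
    for t in thresholds:
        g = sum(dist < t for dist in nn_active) / len(nn_active)
        f = sum(dist < t for dist in nn_decoy) / len(nn_decoy)
        total += f - g
    return total


# Toy 2D example: tightly clustered actives, far away from all decoys.
clumpy = sigma_s(
    actives=[(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    decoys=[(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)],
    thresholds=[0.5, 1.0, 2.0],
)

# Toy 2D example: widely spread actives, each with a decoy close by.
dispersed = sigma_s(
    actives=[(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)],
    decoys=[(0.5, 0.0), (10.0, 0.5), (0.5, 10.0)],
    thresholds=[1.0, 2.0, 5.0],
)
```

In the first toy dataset the actives are closer to each other than any decoy is to an active, so the sum is negative; in the second the situation is reversed and the sum is positive, matching the interpretation given above.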

A collection of benchmark datasets in which all datasets exhibit spatial randomness in simple descriptor space would therefore be especially advantageous for validating virtual screening (VS) methods, for two reasons:

1. A random distribution of actives and decoys regarding simple descriptors prevents low-dimensional molecular properties from distorting validation results.
2. Datasets that do not favor ligand-based virtual screening methods through trivial features of the benchmark molecules greatly facilitate the comparison of ligand-based VS methods and docking applications.

Building on these findings, the MUV collection of benchmark datasets was designed from PubChem bioactivity data.[4] PubChem has several advantages for the design of VS benchmark datasets:

1. All data, including molecular structures and bio-assay readouts, are publicly available.
2. Compounds tested inactive in the bio-assays are available as experimentally validated decoys.
3. The compound sets tested in the NIH roadmap effort are very diverse.
4. The majority of tested compounds is drug-like.

However, PubChem also has downsides as a data source: because it originates from HTS, the bioactivity data in PubChem is notoriously noisy. Therefore, the MUV design workflow first maximizes confidence in the data by purging from all bioactivity datasets those compounds whose assigned bioactivity is open to doubt. The remaining datasets are then optimized spatially to generate the final MUV datasets.

MUV Workflow

1. Pairs of bio-assays were extracted from PubChem. The requirement was that the bioactivity against the same protein target was first determined in a high-throughput primary screen and then in a follow-up, low-throughput confirmation screen. Inactives from the primary screens were used as "Potential Decoys" (PD), actives from the confirmation screens as "Potential Actives" (PA). Potential actives were further required to provide associated dose-response data and EC50 values.

2. From the datasets of potential actives, all compounds with doubtful bioactivities were purged by an array of automatic filters. These removed compounds with suspicious Hill slopes, compounds hitting in an unusually high number of assays in PubChem (i.e. frequent hitters) and compounds that are known to interfere with optical detection methods.

3. Actives not adequately embedded in the available decoys were removed from the datasets. This is essential because no set of decoys can be designed that protects such actives from artificial enrichment.

4. For each dataset, a subset of k=30 actives was extracted such that all datasets share a common spread, as measured by the Refined Nearest Neighbor analysis figure ΣG.

5. For each dataset, a subset of d=15000 decoys was extracted such that all datasets share a common separation from the actives, as measured by the Refined Nearest Neighbor analysis figure ΣF.
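
The filtering stage of the workflow (step 2) might be sketched in Python as follows. This is only an illustration of the idea: the record fields, the Hill slope range and the hit fraction cutoff are hypothetical placeholders and not the criteria used for MUV, which are described in [4].

```python
from dataclasses import dataclass


@dataclass
class PotentialActive:
    """Hypothetical record for one potential active (fields are assumptions)."""
    cid: int
    hill_slope: float
    assays_tested: int
    assays_active: int
    optical_interference: bool


# Placeholder thresholds chosen for illustration only.
HILL_SLOPE_RANGE = (0.5, 2.0)
MAX_HIT_FRACTION = 0.25


def passes_filters(compound):
    """Return True if the compound survives all three example filters."""
    low, high = HILL_SLOPE_RANGE
    # Filter 1: suspicious Hill slope of the dose-response curve.
    if not low <= compound.hill_slope <= high:
        return False
    # Filter 2: frequent hitters, i.e. active in too many assays.
    if compound.assays_tested > 0:
        hit_fraction = compound.assays_active / compound.assays_tested
        if hit_fraction > MAX_HIT_FRACTION:
            return False
    # Filter 3: known interference with optical detection methods.
    return not compound.optical_interference


candidates = [
    PotentialActive(1, 1.0, 40, 2, False),   # passes all filters
    PotentialActive(2, 4.5, 40, 2, False),   # Hill slope out of range
    PotentialActive(3, 1.1, 40, 30, False),  # frequent hitter
    PotentialActive(4, 0.9, 40, 1, True),    # optical interference
]
kept = [c.cid for c in candidates if passes_filters(c)]
```

Only compounds that survive such filters would go on to the spatial steps 3 to 5, where embedding, spread and separation are assessed in simple descriptor space.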

In the first version, MUV contains 17 datasets of actives and corresponding decoys with the following properties:

| Target | Mode of Interaction | Target Class | Prim. Assay (AID) | Confirm. Assay (AID) | Assay Type | Actives (original dataset) | Decoys (original dataset) | Actives (MUV) | Decoys (MUV) | Scaffolds (MUV) |
|---|---|---|---|---|---|---|---|---|---|---|
| S1P1 rec. | Agonists | GPCR | 449 | 466 | Reporter Gene | 223 | 55395 | 30 | 15000 | 28 |
| PKA | Inhibitors | Kinase | 524 | 548 | Enzyme | 62 | 64814 | 30 | 15000 | 27 |
| SF1 | Inhibitors | Nuclear Receptor | 525 | 600 | Reporter Gene | 213 | 64550 | 30 | 15000 | 24 |
| Rho-Kinase2 | Inhibitors | Kinase | 604 | 644 | Enzyme | 67 | 59576 | 30 | 15000 | 27 |
| HIV RT-RNase | Inhibitors | RNase | 565 | 652 | Enzyme | 370 | 63969 | 30 | 15000 | 27 |
| Eph rec. A4 | Inhibitors | Rec. Tyr. Kinase | 689 | 689 | Enzyme | 80 | 61480 | 30 | 15000 | 29 |
| SF1 | Agonists | Nuclear Receptor | 522 | 692 | Reporter Gene | 75 | 63683 | 30 | 15000 | 30 |
| HSP 90 | Inhibitors | Chaperone | 429 | 712 | Enzyme | 91 | 63481 | 30 | 15000 | 27 |
| ER-α-Coact. Bind. | Inhibitors | PPI | 629 | 713 | Enzyme | 221 | 84656 | 30 | 15000 | 26 |
| ER-β-Coact. Bind. | Inhibitors | PPI | 633 | 733 | Enzyme | 194 | 84984 | 30 | 15000 | 28 |
| ER-α-Coact. Bind. | Potentiators | PPI | 639 | 737 | Enzyme | 64 | 84947 | 30 | 15000 | 28 |
| FAK | Inhibitors | Kinase | 727 | 810 | Enzyme | 110 | 96070 | 30 | 15000 | 28 |
| Cathepsin G | Inhibitors | Protease | 581 | 832 | Enzyme | 65 | 62007 | 30 | 15000 | 24 |
| FXIa | Inhibitors | Protease | 798 | 846 | Enzyme | 70 | 218421 | 30 | 15000 | 21 |
| FXIIa | Inhibitors | Protease | 800 | 852 | Enzyme | 99 | 216795 | 30 | 15000 | 24 |
| D1 rec. | Allosteric Modulators | GPCR | 641 | 858 | Reporter Gene | 226 | 54292 | 30 | 15000 | 24 |
| M1 rec. | Allosteric Inhibitors | GPCR | 628 | 859 | Reporter Gene | 231 | 61477 | 30 | 15000 | 29 |

Assay IDs refer to bio-assays in PubChem, which were used for the assignment of bioactivities. 

Because of their spatial properties in simple descriptor space, MUV datasets are a tool for the unbiased validation of virtual screening techniques. The datasets are available for download here.

You can also generate your own MUV datasets based on your in-house bioactivity data. The necessary tools can be found here.


[1] Good, A. & Oprea, T.
Optimization of CAMD techniques 3. Virtual screening enrichment studies: a help or hindrance in tool selection?
J. Comput.-Aided Mol. Des., 2008, 22, 169-178
doi: 10.1007/s10822-007-9167-2
[2] Verdonk, M. L.; Berdini, V.; Hartshorn, M. J.; Mooij, W. T. M.; Murray, C. W.; Taylor, R. D. & Watson, P.
Virtual screening using protein-ligand docking: avoiding artificial enrichment.
J. Chem. Inf. Comput. Sci., 2004, 44, 793-806
doi: 10.1021/ci034289q
[3] Rohrer, S.G.; Baumann, K.
Impact of Benchmark Data Set Topology on the Validation of Virtual Screening Methods: Exploration and Quantification by Spatial Statistics.
J. Chem. Inf. Model., 2008, 48, 704-71
doi: 10.1021/ci700099u (Open Access)
[4] Rohrer, S.G.; Baumann, K.
Maximum Unbiased Validation (MUV) Datasets for Virtual Screening Based on PubChem Bioactivity Data
J. Chem. Inf. Model., in press