MUV

Frequently Asked Questions about MUV

1. Q. When generating MUV datasets from in-house data, why do you need two datasets (actives and decoys) for each target?
A. MUV is based on HTS data, so actives and decoys were determined specifically for each target. Most VS benchmark datasets use a separate dataset of actives for each target but a common pool of decoys for all targets. You can do the same with MUV datasets: make one copy of your common decoy file per target, name the copies 'target1_decoys.sdf', 'target2_decoys.sdf', etc., and place them in the same directory as your potential actives. The MUV workflow will then select, for each dataset of actives, a subset of decoys from your common pool that fulfills the MUV design criteria.
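
The copying step can be scripted. The following is a minimal sketch, not part of the MUV workflow itself: the directory, the name 'common_decoys.sdf', and the target list are hypothetical placeholders, and the empty placeholder file is created only so that the sketch runs on its own.

```matlab
% Hypothetical locations and names: adjust them to your own data.
dataDir = fullfile(tempdir, 'muv_demo');               % directory of the potential actives
if ~exist(dataDir, 'dir')
    mkdir(dataDir);
end
commonDecoys = fullfile(dataDir, 'common_decoys.sdf'); % assumed name of the common pool
targets = {'target1', 'target2', 'target3'};

% Create an empty placeholder file so the sketch is runnable; skip with real data.
if ~exist(commonDecoys, 'file')
    fid = fopen(commonDecoys, 'w');
    fclose(fid);
end

% One copy of the common pool per target, named <target>_decoys.sdf.
for k = 1:numel(targets)
    copyfile(commonDecoys, fullfile(dataDir, [targets{k} '_decoys.sdf']));
end
```

Copying rather than renaming keeps the original pool file intact, so further targets can be added later.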
2. Q. When generating MUV datasets from in-house data, what happens if you set options.targetG to a very low value, e.g. 0?
A. Very low values of Sigma(G) correspond to very dispersed, diverse datasets of actives. The row exchange algorithm employed in MUV dataset generation will try its best to select a subset of actives from your data that comes as close as possible to this level of dispersion. If your original dataset simply does not have enough spread, i.e. no subset of it can fulfill the target value of Sigma(G), the algorithm will choose the subset with the maximum level of dispersion. There might be applications where maximally diverse subsets are actually suited for benchmarking VS methods.
However, there are two issues with maximally dispersed subsets:
- They can be generated much more efficiently using the well-established Kennard-Stone algorithm. (Implementation: spst_kennardstone)
- Depending on your original datasets, the DIFFERENCES in topology between maximally diverse datasets might be even greater than between the original datasets. Topology then has even more impact on your validation than before!
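
To illustrate the idea behind Kennard-Stone selection, here is a minimal sketch on a toy descriptor matrix. It is not the code of spst_kennardstone, whose interface is not described here; the data, the variable names, and the choice of Euclidean distance are assumptions made for the example.

```matlab
% Toy descriptor matrix: one row per compound (made-up values).
X = [0 0; 1 0; 0 1; 5 5; 5 6; 10 0];
k = 3;                                   % number of compounds to select

% Pairwise Euclidean distances (assumed metric), without toolbox functions.
n  = size(X, 1);
sq = sum(X.^2, 2);
D  = sqrt(max(bsxfun(@plus, sq, sq') - 2 * (X * X'), 0));

% Start with the two compounds that are farthest apart.
[~, idx] = max(D(:));
[i, j]   = ind2sub([n n], idx);
selected = [i j];

% Repeatedly add the compound whose distance to its nearest already
% selected compound is largest (maximin criterion).
while numel(selected) < k
    dmin = min(D(:, selected), [], 2);   % distance to the nearest selected compound
    dmin(selected) = -Inf;               % never pick a compound twice
    [~, next] = max(dmin);
    selected(end + 1) = next;
end
```

Each step only needs the precomputed distance matrix, which is why this greedy selection is cheap compared to an iterative exchange optimization.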
3. Q. Are refined nearest neighbor analysis results (Sigma(G), Sigma(F), Sigma(S)) from datasets with very different distributions of compounds comparable?
A. Yes! That's exactly what the whole thing is all about!
4. Q. How many actives and decoys do I need in my original datasets for MUV design to succeed?
A. Short answer: as many as possible.
But seriously: the sizes of benchmarking datasets vary according to the tested applications. Datasets for docking are usually smaller than those for ligand-based approaches because of the higher computational cost of docking programs.
The optimization algorithms in the MUV design workflow need some "maneuvering space", that is, you need to supply more compounds than are later selected for the MUV datasets. As a rule of thumb, the ratio between the size of the pool of potential actives and the number of selected actives should be at least 5/3. For decoys, we have had good experience with a ratio of roughly 4/1.
The MUV datasets were designed with original datasets of the following sizes:
Pool of potential actives: >50 compounds  --> Selected actives: 30 compounds
Pool of potential decoys: >60000 compounds  --> Selected decoys: 15000 compounds
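
The rule-of-thumb ratios translate into minimum pool sizes as follows. This is a small sketch of the arithmetic only; the variable names are made up for the example.

```matlab
% Desired sizes of the final MUV datasets (the values used for MUV itself).
nActives = 30;
nDecoys  = 15000;

% Rule-of-thumb ratios of pool size to selected size, as stated above.
ratioActives = 5 / 3;
ratioDecoys  = 4;

minActivePool = ceil(nActives * ratioActives);   % smallest recommended pool of potential actives
minDecoyPool  = ceil(nDecoys * ratioDecoys);     % smallest recommended pool of potential decoys

fprintf('Potential actives needed: at least %d\n', minActivePool);
fprintf('Potential decoys needed:  at least %d\n', minDecoyPool);
```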

Be aware that larger numbers of actives and decoys improve the statistical stability of your experiments. For a remarkably concise analysis of the sources of variance in VS validation experiments, see:

Truchon J.F., Bayly C.I.
Evaluating virtual screening methods: good and bad metrics for the "early recognition" problem.
J. Chem. Inf. Model. 2007, 47, 488-508.
doi: 10.1021/ci600426e
5. Q. While running spst_muv, I get the error:

Error using ==> spst_GA at 75
spst_GA: number of candidates too small!
A. This happens when too few of your potential decoys fulfill the similarity criterion for decoys in MUV design. The genetic algorithm spst_GA, which is called for decoy selection, then cannot select as many decoys as necessary.
The similarity criterion (options.r) is used to speed up the decoy selection process. Decoys that are too dissimilar to the actives, i.e. whose distance to the nearest active is larger than options.r, will cause artificial enrichment in validation experiments. It is therefore not reasonable to consider them in MUV design.
There are two measures you can take if you encounter the above error:
1. Increase your pool of potential decoys.
2. Decrease options.r in small steps, until the design succeeds. Check the curves of F(t) and G(t) for the resulting datasets for abnormalities.
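
To see how the size of the candidate pool depends on the similarity criterion, you can count the decoys that lie within a given distance of their nearest active. The following is a minimal sketch on toy data, not the code used inside spst_muv; the matrices, the variable names, and the use of Euclidean distance are assumptions for illustration.

```matlab
% Toy descriptor matrices (made-up values): one row per compound.
A   = [0 0; 1 1];                         % actives
Dec = [0 1; 2 2; 5 5; 9 9; 1 0];          % potential decoys

% Distance from every decoy to every active (Euclidean, an assumption),
% then the distance to the nearest active for each decoy.
sqA  = sum(A.^2, 2);
sqD  = sum(Dec.^2, 2);
dist = sqrt(max(bsxfun(@plus, sqD, sqA') - 2 * (Dec * A'), 0));
nearest = min(dist, [], 2);

% Number of decoys that would remain candidates for several values of r.
rValues = [1 2 6 12];
nCandidates = arrayfun(@(r) sum(nearest <= r), rValues);

for k = 1:numel(rValues)
    fprintf('r = %g: %d candidate decoys\n', rValues(k), nCandidates(k));
end
```

Comparing these counts with the number of decoys you want to select shows quickly whether the pool or the criterion is the limiting factor.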
6. Q. The output of MUV design is a strange MATLAB struct variable R, with only 0 and 1 at its lowest level. Where are the MUV datasets?
A. To save computing time and memory, the MUV datasets are not compiled as actual descriptor matrices but as pointers (MATLAB lingo: logical indices) into the original datasets. Note that the variable has two fields at the top level: 'R.act' and 'R.dec'. These fields designate actives and decoys, respectively. On the next level, the struct has a field for each activity class of your original data, e.g. R.act.ACE, R.act.Thrombin and R.dec.ACE, R.dec.Thrombin. These fields are column vectors that have exactly as many rows as the respective original dataset. For instance, if the original dataset of ACE actives had 360 rows (= compounds), the vector R.act.ACE will also have 360 rows. R.act.ACE has the value 'true' for each row of the original dataset that is included in the MUV design and 'false' for all other rows. The same is true for the decoys.
If you generated the descriptor matrices for actives and decoys using muv_simple_descriptors, you can recover the descriptor matrices of the MUV actives and decoys as:

ACE_ACTIVES = act.ACE.dsc(R.act.ACE,:);

ACE_DECOYS = dec.ACE.dsc(R.dec.ACE,:);

You can get the IDs of the selected compounds by:

ACE_ACTIVES_ID = act.ACE.ids(R.act.ACE,:);
ACE_DECOYS_ID = dec.ACE.ids(R.dec.ACE,:);
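
If logical indexing is unfamiliar, the following toy example shows what these statements do. It is only a sketch: the values are made up, and storing the IDs as a cell array of strings is an assumption about the layout of act.ACE.ids.

```matlab
% Toy stand-ins for the structs described above (all values are made up).
act.ACE.dsc = [1 2; 3 4; 5 6; 7 8];          % 4 compounds x 2 descriptors
act.ACE.ids = {'c1'; 'c2'; 'c3'; 'c4'};      % assumed: IDs stored as a cell array
R.act.ACE   = logical([1; 0; 1; 0]);         % true = compound is part of the MUV dataset

% Logical indexing keeps exactly the rows marked true.
ACE_ACTIVES    = act.ACE.dsc(R.act.ACE, :);
ACE_ACTIVES_ID = act.ACE.ids(R.act.ACE, :);

% The same selection expressed as plain row numbers, and its size.
rowNumbers = find(R.act.ACE);
nSelected  = nnz(R.act.ACE);
```

Here rowNumbers gives the positions of the selected compounds in the original dataset, which can be handy when cross-referencing other per-compound data.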

You can export the MUV datasets as SD files with:

spst_muv_extractSD(act, dec, R, classes, '/directory/of/original/sdfiles', '/directory/to/save/MUV/sdfiles', suffix, options)

For details see the HowTo-MUV tutorial.