Docking Scoring Functions Compared: A 2025 Guide to Performance, Pitfalls, and AI Advances

Nathan Hughes · Dec 03, 2025

Abstract

Accurate scoring functions are critical for the success of molecular docking in structure-based drug design. This article provides a comprehensive, up-to-date comparison of docking scoring function performance for researchers and drug development professionals. We explore the foundational principles of classical and machine-learning-based functions, detail methodological approaches for their application in virtual screening, and offer practical strategies for troubleshooting and optimization. Finally, we present a rigorous validation framework, comparing the performance of various functions across different targets and highlighting how emerging artificial intelligence (AI) methods are reshaping the field. The insights synthesized here aim to guide the selection and application of scoring functions to improve the efficiency and success rate of virtual screening campaigns.

The Building Blocks of Prediction: Understanding Scoring Function Types and Mechanisms

Scoring functions are the computational engines of molecular docking, tasked with predicting the binding mode and affinity of a ligand to its biological target. They achieve this by approximating the interaction energy between the molecules, serving as a critical filter to identify the most likely binding poses from millions of possibilities and to rank compounds in virtual screening campaigns [1] [2]. Their performance directly impacts the success rate of structure-based drug design, influencing the accuracy of predicted protein-ligand complexes and the efficient identification of promising hit compounds [3].

The field is characterized by a diversity of approaches, each with distinct strengths and weaknesses. The core task of a scoring function can be broken down into several key capabilities, or "powers," that are used to benchmark their performance.

  • Docking Power: The ability to identify the correct binding pose of a ligand, typically defined as a pose with a Root Mean Square Deviation (RMSD) of less than 2 Å from the experimental structure [2].
  • Scoring Power: The capability to compute a binding score that correlates linearly with experimentally measured binding affinities [2].
  • Ranking Power: The proficiency to correctly rank a series of ligands bound to the same protein based on their binding affinity [2].
  • Screening Power: The effectiveness in distinguishing true binders from non-binders (decoys) in a virtual screen, often measured by enrichment factors [2].
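To make these metrics concrete, the sketch below computes three of them from hypothetical score and affinity arrays using NumPy and SciPy. The numbers and the simple top-fraction enrichment factor are illustrative, not a standard implementation.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

experimental = np.array([8.5, 7.0, 6.4, 8.9, 5.5])  # measured -logKd/Ki (made up)
predicted = np.array([8.1, 6.5, 6.9, 9.2, 5.1])     # scoring-function output (made up)

scoring_power, _ = pearsonr(predicted, experimental)   # linear correlation
ranking_power, _ = spearmanr(predicted, experimental)  # rank correlation

def enrichment_factor(is_active, scores, top_frac=0.01):
    """Screening-power proxy: hit rate among the top fraction of the ranked
    library divided by the hit rate of the whole library (arrays as np.ndarray)."""
    order = np.argsort(scores)[::-1]             # best-scored compounds first
    n_top = max(1, int(len(scores) * top_frac))
    return is_active[order][:n_top].mean() / is_active.mean()
```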

A Comparative Taxonomy of Scoring Function Methodologies

Scoring functions are traditionally categorized by their underlying theoretical foundations. The table below outlines the main types, their core principles, and representative examples.

Table 1: Classification and comparison of classical and machine learning-based scoring functions.

| Category | Core Principle | Representative Examples | Key Characteristics |
| --- | --- | --- | --- |
| Physics-Based | Calculates energy using classical force fields summing Van der Waals, electrostatic, and sometimes solvation terms [3]. | GROMOS96 [4], AMBER | High computational cost; explicit physical representation [3]. |
| Empirical-Based | Fits weighted energy terms (e.g., H-bonding, hydrophobic) to experimental binding affinity data using linear regression [3] [5]. | MOE's London dG, Alpha HB [5], Glide XP, AutoDock Vina's function [6] | Faster computation; performance depends on training data [3]. |
| Knowledge-Based | Derives potentials from statistical analysis of atom-pair frequencies in known protein-ligand structures (Boltzmann inversion) [3]. | AP-PISA, SIPPER [3] | Good balance of speed and accuracy [3]. |
| Hybrid | Combines elements from the above categories into a single scoring scheme [3]. | PyDock, HADDOCK [3] | Aims to leverage the strengths of multiple approaches. |
| Machine/Deep Learning (ML/DL) | Learns complex, non-linear relationships between protein-ligand structural features and binding affinities or native poses [3] [2]. | Various 3D-CNNs, Graph Neural Networks [3] [2] | No predetermined functional form; requires large training datasets [2]. |

Performance Comparison: Classical vs. Deep Learning and Across Software

Independent benchmarking studies reveal that no single scoring function excels universally across all tasks. The following table summarizes quantitative performance data from recent comparative assessments.

Table 2: Experimental performance comparison of selected scoring functions across different benchmarks.

| Scoring Function | Type | Docking Power (Pose Selection) | Key Comparative Findings |
| --- | --- | --- | --- |
| MOE (London dG & Alpha HB) | Empirical | N/A | Showed the highest pairwise comparability and performance in a 2025 InterCriteria Analysis (ICrA) on the CASF-2013 benchmark [1] [5]. |
| AutoDock Vina | Empirical | Used as a common baseline in DL studies [2]. | A 2024 review noted that DL-based pose selectors frequently outperform classical SFs like Vina in identifying near-native poses [2]. |
| Deep Learning Pose Selectors | Deep Learning | Superior to classical SFs like PLANTS ChemPLP, Glide XP, and AutoDock Vina [2]. | Designed specifically for pose selection, overcoming limitations of affinity-based SFs; performance depends on training data [2]. |
| NMRScore | Experimental Data-Based | Outperformed 8 docking program SFs (AutoDock, Dock, Glide, MOE, etc.) in ranking native-like poses for FKBP [7]. | Uses NMR chemical shift perturbations as a scoring metric, showing excellent correlation with correct poses [7]. |

Beyond pose prediction, the screening power of scoring functions is critical for drug discovery. A 2025 study on large-scale docking benchmarks highlighted that machine learning models trained on docking scores can effectively prioritize molecules for testing. For example, models trained on just 1% of a massive docking library could identify a significant fraction of the top 0.01% scoring compounds, demonstrating the potential for ML to augment traditional scoring in virtual screening [8].

Experimental Protocols for Benchmarking Scoring Functions

To ensure fair and reproducible comparisons, the community relies on standardized benchmarks and protocols. The most cited protocol involves using the CASF (Comparative Assessment of Scoring Functions) benchmark.

The CASF Benchmark Methodology

The CASF benchmark, particularly the CASF-2013 and CASF-2016 versions, provides a high-quality dataset of protein-ligand complexes from the PDBbind database for a head-to-head evaluation of scoring functions [1] [5]. A typical workflow for assessing docking power is as follows:

  • Dataset Curation: A set of protein-ligand complexes with high-resolution crystal structures and reliable binding affinity data is selected (e.g., the 195 complexes in CASF-2013) [5].
  • Re-docking: For each complex, the native ligand is extracted and then re-docked into its protein's binding site using the docking program of interest. This generates multiple candidate poses for each ligand [5].
  • Pose Scoring and Selection: The scoring function under evaluation is used to score all generated poses. The top-ranked pose (the one with the "best" score, e.g., most negative) is selected as the prediction.
  • Accuracy Calculation: The Root Mean Square Deviation (RMSD) between the top-ranked predicted pose and the experimentally determined co-crystallized ligand structure is calculated.
  • Performance Metric: The docking power is reported as the percentage of test cases where the top-ranked pose has an RMSD below a defined threshold (commonly 2 Å), indicating a successful, near-native prediction [2].

This process was employed in a 2025 study comparing MOE's scoring functions, which also analyzed other outputs like the best docking score and the score of the pose with the lowest RMSD to provide a multi-faceted performance assessment [5].
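A minimal sketch of the scoring, selection, and success-rate steps of this workflow is shown below, assuming each complex provides its generated poses as coordinate arrays with consistent atom ordering; real evaluations typically also apply symmetry-corrected RMSD.

```python
import numpy as np

def pose_rmsd(pred, ref):
    """Heavy-atom RMSD (Å) between a docked pose and the crystal ligand,
    assuming identical atom ordering and no symmetry correction."""
    return np.sqrt(((pred - ref) ** 2).sum(axis=1).mean())

def docking_power(cases, threshold=2.0):
    """cases: list of (poses, scores, crystal) tuples, one per complex;
    poses is a list of (n_atoms, 3) arrays, scores the matching SF values.
    Returns the fraction of complexes whose top-scored pose is near-native."""
    hits = 0
    for poses, scores, crystal in cases:
        top_pose = poses[int(np.argmin(scores))]  # most negative score wins
        if pose_rmsd(top_pose, crystal) < threshold:
            hits += 1
    return hits / len(cases)
```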

Protocol for Novel Scoring Methods like NMRScore

Alternative methods like NMRScore employ a different, experimentally grounded protocol [7]:

  • Pose Generation: Multiple docking poses for a protein-ligand complex are generated using one or several docking programs.
  • Chemical Shift Calculation: NMR chemical shifts are calculated for each of the generated docking poses.
  • Experimental Comparison: The calculated chemical shifts for each pose are compared to the actual experimental NMR chemical shifts.
  • Scoring and Ranking: The NMRScore is defined as the RMSD between the calculated and experimental chemical shifts. A lower NMRScore indicates a closer match to the true native structure, allowing poses to be ranked accordingly [7].
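A minimal sketch of this scoring step, assuming the chemical shifts have already been computed for each pose (the `shifts_for` helper in the comment is hypothetical):

```python
import numpy as np

def nmr_score(calc_shifts, exp_shifts):
    """RMSD between chemical shifts calculated for a docked pose and the
    experimentally measured shifts (ppm); lower = closer to the native pose."""
    calc, exp = np.asarray(calc_shifts), np.asarray(exp_shifts)
    return np.sqrt(np.mean((calc - exp) ** 2))

# Poses are then ranked by ascending NMRScore, e.g.:
# best_pose = min(poses, key=lambda p: nmr_score(shifts_for(p), exp_shifts))
```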

[Figure: Scoring function evaluation workflow. PDBbind/CASF database → ligand re-docking and pose generation → score poses with the scoring function → compare to native structure → calculate performance metric: docking power (success rate at 2 Å).]

For researchers conducting comparative studies on scoring functions, several key resources and tools are indispensable.

Table 3: Key research reagents and resources for benchmarking scoring functions.

| Resource Name | Type | Function in Research |
| --- | --- | --- |
| PDBbind Database | Curated Database | Provides a comprehensive collection of protein-ligand complexes with experimentally measured binding affinity data, serving as the foundation for benchmarks like CASF [5]. |
| CASF Benchmark | Standardized Benchmark | Offers a ready-to-use subset of PDBbind for the fair and standardized comparison of scoring functions' docking, scoring, ranking, and screening powers [5]. |
| CCharPPI Server | Computational Server | Allows for the evaluation of scoring functions independent of the docking process, enabling isolated assessment of the scoring step [3]. |
| Large-Scale Docking (LSD) Database | Benchmarking Database | Provides access to docking scores and results for billions of molecules across multiple targets, useful for training ML models and benchmarking screening power [8]. |
| Smina | Docking Software | A fork of AutoDock Vina that offers enhanced control over scoring terms and command-line usability, facilitating customized docking and scoring experiments [9]. |

In summary, scoring functions are indispensable tools in computational drug discovery, but their performance is highly variable and context-dependent. Empirical functions like those in MOE and Vina are widely used, but deep learning methods are emerging as powerful alternatives, particularly for the critical task of pose selection. Rigorous benchmarking using standardized protocols and databases like CASF is essential for selecting the appropriate scoring function for a specific research goal. The ongoing integration of machine learning and novel data sources like NMR chemical shifts promises to further enhance the accuracy and reliability of these computational tools.

In structure-based drug discovery, molecular docking is a pivotal technique for predicting how a small molecule (ligand) binds to a target protein. The reliability of this process depends critically on the scoring function, a mathematical model that approximates the binding affinity between the ligand and protein by calculating their interaction energy [1] [10]. Scoring functions are employed to determine the binding mode and site of a ligand, predict binding affinity, and identify potential drug leads for a given protein target [10]. Despite intensive research, accurate and rapid prediction of protein-ligand interactions remains a central challenge in molecular docking, driving continuous development and refinement of scoring methodologies [10] [11].

These functions can be conceptually categorized into four main types: the three classical approaches (physics-based, empirical, and knowledge-based) plus modern machine learning-based approaches, with hybrid methods combining elements from multiple categories [10] [3]. This guide provides a comparative analysis of these scoring function paradigms, examining their theoretical foundations, performance characteristics, and practical applications in drug development workflows.

Theoretical Foundations and Classification

Each of the main scoring function families employs a distinct theoretical approach to quantify molecular interactions.

Physics-Based Scoring Functions

Physics-based scoring functions use classical force fields to calculate binding energy through fundamental physical interactions. They typically sum Van der Waals and electrostatic interactions between the protein and ligand, sometimes incorporating solvent effects, polarization, and charge features for improved accuracy [3]. These functions are often designed for use in molecular dynamics simulations and may require explicit treatment of water or an implicit solvent model [11]. The GBVI/WSA dG function in MOE (Molecular Operating Environment) represents an example of a force-field based scoring function [5]. While physically rigorous, these methods generally incur high computational costs [3].

Empirical Scoring Functions

Empirical scoring functions estimate binding affinity by summing a series of weighted energy terms parameterized to reproduce experimental binding affinities or binding poses [11]. They incorporate physically meaningful terms similar to force-field functions but may also include more complex, heuristic terms for hydrophobic and desolvation interactions not easily addressed by purely physical models [11]. The weights for different terms are typically determined using linear regression or other fitting techniques against training datasets of known protein-ligand complexes [10] [11]. Examples include London dG, ASE, Affinity dG, and Alpha HB in MOE, and the default scoring functions in AutoDock Vina and smina [11] [5]. These functions are typically less prone to overfitting due to constraints imposed by physical terms [11].
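The sketch below illustrates the underlying idea with a toy dataset: hypothetical interaction-term values for a few training complexes are regressed against measured affinities to obtain term weights. It is a conceptual illustration, not the actual parameterization of any named function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per training complex; columns are hypothetical interaction terms:
# [H-bond count, hydrophobic contact area, rotatable-bond penalty].
X = np.array([[4.0, 120.5, 6.0],
              [2.0,  88.3, 3.0],
              [5.5,  40.1, 8.0],
              [1.0, 150.0, 2.0]])
y = np.array([7.8, 6.2, 7.1, 6.9])  # measured affinities (-logKd), made up

model = LinearRegression().fit(X, y)
weights, intercept = model.coef_, model.intercept_

# Scoring a new pose is then just the weighted sum of its computed terms.
predicted_affinity = model.predict(X[:1])
```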

Knowledge-Based Scoring Functions

Knowledge-based (statistical-potential) scoring functions derive simplified potentials directly from structural databases using Boltzmann inversion of the pairwise distance distributions between atoms or residues in the interacting molecules [3]. This approach seeks to approximate complex physical interactions using large numbers of simple terms learned from existing protein-ligand complex structures [11]. However, the resulting scoring function may lack immediate physical interpretation, and the numerous terms increase overfitting risk, necessitating rigorous validation protocols [11]. Methods such as AP-PISA and CP-PIE fall into this category [3]. These functions generally offer a good balance between accuracy and speed [3].
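The following sketch shows Boltzmann inversion for a single atom-pair type, with randomly generated stand-in distances and a crude uniform reference state; production knowledge-based potentials use carefully constructed reference states and distances harvested from real structural databases.

```python
import numpy as np

kT = 0.593  # kcal/mol at ~298 K

# Stand-in for observed distances of one atom-pair type (e.g., ligand C to
# protein N) that would normally be harvested from known complexes.
observed_r = np.random.default_rng(0).normal(3.5, 0.6, 10_000)

bins = np.linspace(2.0, 8.0, 61)
g_obs, _ = np.histogram(observed_r, bins=bins, density=True)
g_ref = np.full_like(g_obs, 1.0 / (bins[-1] - bins[0]))  # crude uniform reference

with np.errstate(divide="ignore"):
    pair_potential = -kT * np.log(g_obs / g_ref)  # u(r) = -kT ln[g_obs(r)/g_ref(r)]
```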

Machine Learning-Based Scoring Functions

Machine learning (ML) and deep learning (DL) approaches represent a modern evolution beyond classical functions. These methods learn complex functions that map combinations of interface features, energy terms, and accessible surface areas to predicted binding scores [3]. Unlike traditional empirical functions with fixed parametric forms, ML-based functions can capture non-linear relationships between structural features and binding affinity, often demonstrating superior performance when sufficient training data is available [12]. These include random forest models and neural networks trained on structural and interaction fingerprints [12].

Hybrid Scoring Functions

Hybrid approaches combine elements from multiple scoring function categories to leverage their complementary strengths. For instance, HADDOCK incorporates terms for Van der Waals forces, electrostatic interactions, desolvation energy, and experimental data restraints [3]. PyDock balances electrostatic and desolvation energies [3]. These methods aim to overcome limitations of individual approaches through strategic combination of different scoring methodologies.

Table 1: Fundamental Characteristics of Scoring Function Types

| Function Type | Theoretical Basis | Parametrization Method | Key Advantages | Inherent Limitations |
| --- | --- | --- | --- | --- |
| Physics-Based | Classical molecular mechanics | First principles | Strong physical interpretability | High computational cost |
| Empirical | Multi-parameter regression | Linear regression on experimental data | Computational efficiency; physical terms | Dependent on training data quality |
| Knowledge-Based | Statistical mechanics | Boltzmann inversion on structural databases | Balanced accuracy/speed | Potential overfitting; less physical interpretation |
| Machine Learning | Pattern recognition | Model training on diverse features | Handles non-linear relationships; high accuracy with sufficient data | Black-box nature; data hunger |
| Hybrid | Combined principles | Multiple approaches | Leverages complementary strengths | Increased complexity |

[Diagram: scoring functions divide into classical methods and machine learning. Classical methods comprise physics-based/force-field (e.g., GBVI/WSA dG in MOE, SQM2.20), empirical (e.g., London dG and Alpha HB in MOE; AutoDock Vina, smina), knowledge-based/statistical (e.g., AP-PISA, CP-PIE), and hybrid (e.g., HADDOCK, PyDock); ML/DL models include random forests and 3D-CNNs.]

Figure 1: Classification Framework for Molecular Docking Scoring Functions

Performance Comparison and Benchmarking

Comparative Assessment Metrics

Scoring functions are typically evaluated using multiple performance metrics that reflect their capabilities in different docking scenarios:

  • Pose Prediction Accuracy: Measured by the root mean square deviation (RMSD) between predicted poses and co-crystallized ligand structures, with lower RMSD values indicating better performance [1] [5].
  • Binding Affinity Prediction: Assessed through squared correlation coefficients (R²) between predicted and experimental binding affinities [13].
  • Virtual Screening Enrichment: Evaluated using metrics like Area Under the Curve (AUC) and Enrichment Factor (EF) that measure the ability to prioritize active compounds over decoys in database screening [14] [15].
  • Consistency and Robustness: Performance stability across diverse protein targets and ligand chemotypes [3] [13].
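As a concrete illustration of the screening metrics, the snippet below computes the ROC AUC for a toy set of actives and decoys with scikit-learn (the enrichment factor can be computed as in the earlier sketch); the labels and scores are made up.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])  # 1 = active, 0 = decoy
scores = np.array([9.1, 4.2, 5.0, 8.7, 3.9, 6.1, 4.4, 7.9, 5.2, 4.8])

auc = roc_auc_score(labels, scores)  # 0.5 = random ranking, 1.0 = perfect
```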

Performance Across Scoring Function Types

Recent benchmarking studies reveal distinct performance patterns across scoring function categories:

Table 2: Comparative Performance of Scoring Function Types Across Benchmarks

| Function Type | Pose Prediction | Affinity Prediction | Virtual Screening | Computational Speed | Consistency Across Targets |
| --- | --- | --- | --- | --- | --- |
| Physics-Based | Variable [5] | Moderate (R² ~0.3-0.5) [13] | Moderate | Slow | Variable |
| Empirical | Good (BestRMSD) [5] | Moderate (R² ~0.3-0.5) [11] | Good (AUC ~0.8) [15] | Fast | Moderate |
| Knowledge-Based | Good [3] | Moderate | Good | Fast | Moderate |
| Machine Learning | Good [12] | Good (R² ~0.69) [13] | Excellent | Fast (after training) | Good |
| Hybrid | Good [3] | Moderate to Good | Good | Moderate | Good |

A pairwise comparison of five MOE scoring functions using InterCriteria Analysis revealed that London dG and Alpha HB showed the highest comparability, while the lowest RMSD was identified as the best-performing docking output metric [1] [5]. In virtual screening contexts, Glide's empirical scoring function demonstrated strong performance with an average AUC of 0.80 across 39 target systems, recovering 34% of known actives in the top 2% of screened compounds [15].

Advanced quantum-mechanical approaches like SQM2.20 show particularly strong binding affinity prediction, achieving an average R² of 0.69 across ten diverse protein targets in the PL-REX benchmark dataset, reaching accuracy similar to much more expensive density functional theory (DFT) calculations but in minutes rather than days [13].

Consensus Scoring Approaches

Consensus scoring combines multiple scoring functions to improve reliability. However, research indicates that simple consensus methods using freely available programs like AutoDock Vina, smina, and idock perform no better than, and sometimes worse than, the best individual program (smina in this case) [14]. This contrasts with studies using more diverse commercial programs where consensus approaches showed benefits, suggesting that consensus scoring works best when combining fundamentally different scoring methodologies rather than similar ones [14].
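A minimal sketch of one common consensus scheme, rank averaging across scoring functions, is shown below; it illustrates the general idea rather than the protocol of the cited study.

```python
import numpy as np

def consensus_order(score_matrix):
    """score_matrix: shape (n_functions, n_compounds), higher = better.
    Ranks compounds under each function, then orders them by mean rank."""
    ranks = np.argsort(np.argsort(-score_matrix, axis=1), axis=1)  # 0 = best
    return np.argsort(ranks.mean(axis=0))  # compound indices, best first
```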

Experimental Protocols and Methodologies

Standard Benchmarking Protocols

Rigorous assessment of scoring functions requires standardized benchmarking datasets and methodologies:

  • Dataset Curation: High-quality datasets like CASF-2013 (195 protein-ligand complexes), CSAR-NRC HiQ (343 curated structures), and PL-REX (10 diverse protein targets) provide consistent benchmarking frameworks [11] [5] [13]. These datasets encompass diverse protein families, ligand chemotypes, and binding affinities.
  • Evaluation Workflow: Standard protocols involve re-docking ligands into protein structures, generating multiple poses, scoring them with different functions, and comparing results to experimental reference data using metrics like RMSD for pose prediction and correlation coefficients for affinity prediction [1] [5].
  • Cross-Validation: Proper validation requires clustered cross-validation to assess model generalizability and avoid overfitting to specific targets or ligand types [11].
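The clustered cross-validation idea can be implemented with scikit-learn's GroupKFold, as in this sketch with stand-in data, where complexes sharing a protein target never straddle the train/test boundary:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(12, 5)   # stand-in features for 12 complexes
y = np.random.rand(12)      # stand-in affinities
targets = np.repeat(["kinase_A", "protease_B", "GPCR_C", "NR_D"], 3)

for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=targets):
    # No protein target ever appears on both sides of the split.
    assert set(targets[train_idx]).isdisjoint(targets[test_idx])
```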

[Diagram: standard benchmarking workflow. Dataset preparation: select protein-ligand complexes from PDBbind/CASF/CSAR, prepare structures (protonation states, partial charges, solvent removal), yielding a final benchmark dataset (195-343 complexes). Docking & scoring: re-dock ligands to generate multiple poses, score poses with multiple scoring functions, extract docking outputs (best docking score, best RMSD, RMSD of best-score pose). Performance evaluation: pose prediction (RMSD to crystal structure), affinity prediction (correlation with experimental data), and virtual screening (AUC and enrichment factor analysis), feeding a comparative performance analysis.]

Figure 2: Standard Benchmarking Workflow for Scoring Function Evaluation

Case Study: Empirical Function Development

The development of custom empirical scoring functions demonstrates a systematic methodology:

  • Term Selection: Starting with a diverse set of interaction terms (Gaussian, repulsion, hydrogen bonding, hydrophobic, electrostatic, desolvation) [11].
  • Parameter Optimization: Using linear regression or similar techniques to fit term weights to experimental binding affinity data [11].
  • Training and Validation: Employing cross-validation on high-quality datasets like CSAR-NRC HiQ 2010 to prevent overfitting and ensure generalizability [11].
  • Implementation: Integrating optimized functions into docking software like smina for practical application [11].

This approach yielded a custom scoring function that improved sampling of low RMSD poses compared to the default AutoDock Vina scoring function [11].

Case Study: Target-Specific Machine Learning Function

For target-specific applications, specialized machine learning workflows have demonstrated success:

  • Data Collection: Retrieving experimental data from databases like BindingDB for the specific target of interest [12].
  • Feature Engineering: Generating interaction fingerprints (IFP, SIFP) and chemical descriptors (ECFP4, ECFP6, MACCS) from protein-ligand complexes [12].
  • Model Training: Applying random forest classifiers and regressors to learn relationships between features and binding affinities [12].
  • Validation: Assessing performance through enrichment factor analysis and molecular dynamics simulations to verify binding stability [12].

This approach for SARS-CoV-2 3CLpro inhibitors achieved an area under the precision-recall curve of 0.80, outperforming generic scoring functions [12].
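The sketch below mirrors the shape of this recipe with randomly generated stand-in fingerprints; it is illustrative only and evaluates on the training data purely to show the metric call.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1024))  # stand-in ECFP-like bit vectors
y = rng.integers(0, 2, size=200)          # 1 = binder, 0 = non-binder

clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
probs = clf.predict_proba(X)[:, 1]
auprc = average_precision_score(y, probs)  # area under precision-recall curve
```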

The Scientist's Toolkit

Successful scoring function development and application relies on several key resources:

Table 3: Essential Research Tools for Scoring Function Development and Application

| Tool/Resource | Type | Primary Function | Application Context |
| --- | --- | --- | --- |
| MOE (Molecular Operating Environment) | Commercial Software | Provides multiple scoring functions (London dG, Alpha HB, etc.) | Drug discovery platform with diverse scoring capabilities [5] |
| smina | Open Source Software | AutoDock Vina fork optimized for scoring and custom functions | Academic research, custom scoring function development [11] |
| CSAR-NRC HiQ Dataset | Benchmark Data | 343 curated protein-ligand structures with reliable affinities | Training and validation of scoring functions [11] |
| CASF-2013 | Benchmark Data | 195 protein-ligand complexes from PDBbind database | Comparative assessment of scoring functions [5] |
| PL-REX Dataset | Benchmark Data | High-quality structures and affinities for 10 diverse targets | Rigorous validation of scoring accuracy [13] |
| CCharPPI Server | Web Service | Assessment of scoring functions independent of docking | Isolated evaluation of scoring components [3] |
| BindingDB | Database | Experimental binding affinities for target-specific applications | Training target-specific scoring functions [12] |

The comparative analysis of scoring function paradigms reveals that each approach offers distinct advantages and limitations. Empirical functions provide an effective balance of accuracy and speed for routine virtual screening, while physics-based methods offer stronger physical foundations at higher computational cost. Knowledge-based approaches deliver reasonable performance across multiple applications, and machine learning methods show promising results, particularly for target-specific applications.

The development of SQM2.20 demonstrates how semiempirical quantum-mechanical methods can bridge the gap between fast approximate functions and computationally intensive quantum calculations, achieving DFT-level accuracy in minutes rather than days [13]. Meanwhile, target-specific machine learning functions show how leveraging experimental data for particular protein targets can yield superior performance compared to generic functions [12].

Future directions in scoring function development will likely focus on hybrid approaches that combine the strengths of multiple methodologies, increased incorporation of quantum-mechanical calculations as computational resources grow, wider application of machine learning techniques to capture complex relationships, and development of improved benchmark datasets with high-quality experimental data across diverse target classes. As these methodologies evolve, scoring functions will continue to enhance their critical role in structure-based drug discovery, providing increasingly reliable predictions of molecular interactions to accelerate therapeutic development.

Molecular docking is a cornerstone of computational drug discovery, used to predict how small molecule ligands interact with protein targets. The heart of any docking protocol is its scoring function (SF), which approximates binding affinity by calculating the interaction energy between a ligand and a biomacromolecule. For decades, classical scoring functions—categorized as physics-based, empirical, or knowledge-based—have dominated the field. However, these traditional approaches often rely on simplified physical models or linear regression techniques, which have plateaued in performance for critical tasks like virtual screening (VS) and binding affinity prediction [16] [3].

The influx of large-scale structural and binding data, coupled with advances in computational power, has fueled a paradigm shift toward machine learning (ML) and deep learning (DL) scoring functions. Unlike classical functions, ML/DL SFs can learn complex, non-linear relationships directly from data, bypassing the need for pre-defined mathematical formulas or explicit physical approximations [16] [17]. This article provides a comparative performance analysis of this new paradigm, objectively evaluating ML/DL scoring functions against classical alternatives and within their own burgeoning categories. We synthesize findings from recent benchmark studies to offer drug discovery researchers a clear guide to the capabilities, optimal applications, and practical implementation of these powerful new tools.

Performance Benchmarking: Quantitative Comparisons

The superiority of ML/DL scoring functions is consistently demonstrated across multiple benchmarks, particularly in virtual screening and affinity prediction.

Virtual Screening and Binding Affinity Prediction

Table 1: Virtual Screening Performance on the DUD-E Benchmark

| Scoring Function | Type | Hit Rate (Top 1%) | Hit Rate (Top 0.1%) | Notes |
| --- | --- | --- | --- | --- |
| RF-Score-VS [16] | Machine Learning (Random Forest) | 55.6% | 88.6% | Trained on 15,426 active and 893,897 inactive molecules from DUD-E. |
| CNN-Score [18] | Deep Learning (Convolutional Neural Network) | ~3x Vina's rate [18] | - | Significant improvement over classical SFs. |
| AutoDock Vina [16] | Classical (Empirical) | 16.2% | 27.5% | Baseline for comparison. |

Table 2: Performance Against Resistant Malaria Target (PfDHFR)

| Method | Variant | Best Enrichment (EF 1%) | Key Finding |
| --- | --- | --- | --- |
| PLANTS + CNN-Score [18] | Wild-Type (WT) | 28 | Best performance for WT PfDHFR. |
| FRED + CNN-Score [18] | Quadruple-Mutant (Q) | 31 | Best performance for the resistant variant. |
| AutoDock Vina [18] | Both | Worse-than-random (WT) | Performance improved to better-than-random with ML re-scoring. |

The data shows that ML SFs dramatically enhance the early enrichment crucial for practical drug discovery. RF-Score-VS achieves a hit rate at the top 0.1% that is more than three times higher than Vina, demonstrating an exceptional ability to prioritize the most promising candidates [16]. Furthermore, ML re-scoring can salvage the performance of weaker docking tools, as seen with Vina, transforming their output from worse-than-random to statistically useful for screening [18]. This capability is especially valuable for challenging targets like drug-resistant enzymes.

Pose Prediction and Physical Plausibility

While ML/DL methods excel in scoring, their performance in generating physically plausible binding poses is more nuanced. A comprehensive 2025 evaluation of docking methods across the Astex diverse set, PoseBusters benchmark, and DockGen dataset reveals a critical performance hierarchy [17].

Table 3: Docking Pose Accuracy and Physical Validity (Combined Success Rate: RMSD ≤ 2 Å & Physically Valid)

| Method Category | Example Methods | Astex Diverse Set | PoseBusters (Unseen) | DockGen (Novel Pockets) |
| --- | --- | --- | --- | --- |
| Traditional Methods | Glide SP | ~61% (est.) | High | Maintains >94% physical validity [17] |
| Hybrid Methods | Interformer | ~55% (est.) | Moderate | Better balance than pure DL [17] |
| Generative Diffusion | SurfDock, DiffBindFR | ~61% (SurfDock) | ~39% (SurfDock) | ~33% (SurfDock) |
| Regression-Based DL | KarmaDock, GAABind | Lowest | Lowest | Lowest |

This analysis reveals that generative diffusion models like SurfDock achieve superior pose accuracy (e.g., >75% RMSD ≤ 2 Å across benchmarks), but often produce poses with steric clashes or incorrect hydrogen bonding, leading to low physical validity [17]. In contrast, traditional methods and hybrid approaches (AI scoring with traditional conformational search) offer the best balance between accurate and physically plausible pose generation. Regression-based DL models, which directly predict ligand coordinates, frequently fail to produce chemically valid structures [17].

Experimental Protocols and Methodologies

Benchmarking studies follow rigorous protocols to ensure fair and generalizable comparisons. Understanding these methodologies is key to interpreting the data.

Common Benchmarking Datasets

  • DUD-E (Directory of Useful Decoys - Enhanced): A standard benchmark for virtual screening. It contains 102 targets, each with a set of confirmed active molecules and "decoys"—physically similar but topologically distinct molecules presumed to be inactive [16]. This tests a SF's ability to discriminate true binders from non-binders.
  • CASF (Comparative Assessment of Scoring Functions): A core set from the PDBbind database, often used to evaluate scoring power (binding affinity prediction), ranking power (relative ranking of ligands), docking power (pose prediction), and screening power [1] [19].
  • DEKOIS: Another virtual screening benchmark set, used for cross-validation and testing generalizability [16] [18].

Validation Strategies

To prevent overfitting and ensure model generalizability, especially for ML/DL methods, studies employ strict cross-validation strategies [16]:

  • Per-Target: A separate ML model is trained and tested exclusively on data (actives and decoys) for a single protein target.
  • Horizontal Split: Training and test sets contain data from all targets, simulating a scenario where known ligands exist for the targets being screened.
  • Vertical Split: Training and test sets contain data from completely different targets. This represents the most challenging scenario of predicting binding for a protein with no known ligands and tests the model's ability to generalize to novel targets.
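A minimal sketch contrasting these regimes, assuming a hypothetical list of (target, features, label) records; a vertical split simply holds out entire targets:

```python
def vertical_split(records, held_out_targets):
    """records: list of (target, features, label) tuples (hypothetical layout).
    Vertical split: held-out targets contribute no training data at all,
    unlike a horizontal split, which only holds out ligands per target."""
    train = [r for r in records if r[0] not in held_out_targets]
    test = [r for r in records if r[0] in held_out_targets]
    return train, test
```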

The Scientist's Toolkit: Essential Research Reagents

Table 4: Key Software and Resources for Scoring Function R&D

| Resource Name | Type/Function | Brief Description |
| --- | --- | --- |
| MolScore [20] | Benchmarking & Evaluation Framework | An open-source Python framework that unifies scoring, evaluation, and benchmarking for generative models and de novo drug design. |
| CCharPPI [3] | Scoring Evaluation Server | A web server that allows for the assessment of scoring functions independent of their native docking programs. |
| RF-Score-VS [16] | Machine Learning SF | A ready-to-use random forest-based scoring function optimized for virtual screening performance. |
| CNN-Score [18] | Deep Learning SF | A convolutional neural network-based scoring function that shows consistent improvement in virtual screening enrichment. |
| PoseBusters [17] | Validation Toolkit | A toolkit to systematically evaluate docking predictions for physical and chemical plausibility, complementing RMSD metrics. |
| DOCKSTRING [20] | Benchmarking Suite | A benchmark suite that includes docking tasks against specific protein targets for evaluating generative models. |

Integrated Workflows and Logical Relationships

The application of ML/DL scoring functions often involves a multi-step workflow, from data preparation to final evaluation. The diagram below outlines this process and the role of different scoring function types.

Discussion and Future Directions

The evidence unequivocally positions ML and DL scoring functions as the superior choice for virtual screening and binding affinity prediction, offering substantial performance gains over classical methods [16] [18]. However, the paradigm is not without its challenges. A significant hurdle is generalization. ML/DL models can struggle when encountering proteins or binding pockets that are underrepresented in their training data, limiting their application in novel target discovery [17]. Furthermore, as the pose prediction analysis shows, physical plausibility remains a concern for many pure DL methods, particularly regression-based models [17].

The future of scoring functions lies in addressing these limitations. Promising directions include:

  • Hybrid Methods: Combining the robust pose generation of traditional search algorithms with the superior scoring power of AI, as seen in methods like Interformer, offers a balanced and practical solution [17].
  • Physics-Informed ML: Integrating physical principles and energy terms into ML models, as explored in frameworks like DockBind, could enhance the physical realism and generalizability of predictions [21].
  • Focus on Data Quality and Diversity: Improving the size, quality, and chemical/structural diversity of training datasets is paramount for building more robust models that perform reliably across the proteome.

In conclusion, while classical scoring functions still hold value for certain tasks like generating physically sound initial poses, the new paradigm of ML/DL scoring is here to stay. For researchers aiming to maximize the success of their virtual screening campaigns, leveraging ML/DL functions for re-scoring docking poses is no longer an advanced tactic but a necessary standard.

Molecular docking is a cornerstone of modern computational drug discovery, enabling researchers to predict how small molecules (ligands) interact with biological targets (proteins) [5]. The accuracy of these predictions hinges critically on scoring functions, which are mathematical algorithms used to predict the binding affinity and orientation of a ligand within a protein's binding site [3]. These functions approximate the complex energetics of molecular interactions, and their performance directly impacts the success of virtual screening and structure-based drug design [5].

Scoring functions can be broadly classified into four main categories, each with a distinct theoretical foundation for assessing protein-ligand complexes. Physics-based functions use classical force fields to calculate interactions, while empirical functions sum weighted energy terms derived from experimental data [5] [3]. Knowledge-based methods employ statistical potentials from databases of known structures, and the emerging Machine Learning/Deep Learning (ML/DL) approaches learn complex relationships directly from data [3]. The following diagram illustrates the logical relationships and classification of these primary scoring function types.

[Diagram: scoring functions branch into physics-based, empirical-based, knowledge-based, and machine learning/deep learning approaches; the first three constitute the classical methods.]

Selecting an appropriate scoring function is a significant challenge for researchers. This guide provides an objective, data-driven comparison of the major scoring approaches, detailing their inherent strengths and weaknesses to inform method selection in drug discovery projects.

Comparative Analysis of Scoring Function Categories

The table below summarizes the core characteristics, strengths, and weaknesses of the four primary scoring function categories.

Table 1: Comparative overview of scoring function categories

| Category | Theoretical Basis | Key Strengths | Inherent Weaknesses |
| --- | --- | --- | --- |
| Physics-Based [3] | Classical force fields (van der Waals, electrostatics) | Strong theoretical foundation; good transferability | High computational cost; limited by implicit solvation models |
| Empirical-Based [5] [3] | Linear regression to experimental binding affinities | Fast calculation; optimized for binding affinity prediction | Risk of overfitting; limited to the chemical space represented in the training set |
| Knowledge-Based [3] | Statistical potentials from structural databases (e.g., PDB) | Good balance of speed and accuracy; no need for parameter fitting | Dependence on database quality and size; limited by data completeness |
| Machine Learning/Deep Learning [3] | Complex non-linear models trained on structural and energy data | High potential accuracy; ability to capture complex patterns | Large data requirements; "black box" nature; potential poor generalization |

In-Depth Category Performance

Empirical-Based Functions

These functions, such as London dG and Alpha HB in MOE software, calculate binding affinity by summing up a series of weighted energy terms describing hydrogen bonding, hydrophobic interactions, and entropy loss [5]. A pairwise comparison study using InterCriteria Analysis (ICrA) on the CASF-2013 benchmark revealed that Alpha HB and London dG showed the highest comparability, suggesting consistent performance across a diverse set of protein-ligand complexes [5] [1]. Their primary strength is computational efficiency, making them suitable for high-throughput virtual screening. However, their performance can degrade when applied to protein complexes or ligand chemotypes not well-represented in their training data [3].

Knowledge-Based Functions

Knowledge-based scoring functions offer a favorable balance between accuracy and computational speed [3]. Methods like AP-PISA and SIPPER leverage the growing repository of protein structures in the Protein Data Bank (PDB) to derive statistical potentials [3]. They operate on the principle that frequently observed atomic interactions in experimental structures are likely to be energetically favorable. A key advantage is that they do not require explicit parameter fitting for different energy terms. Their main limitation is their dependency on the completeness and quality of the underlying structural database, which can lead to biases against novel protein complexes or rare interaction types [3].

Physics-Based Functions

Physics-based functions, such as GBVI/WSA dG in MOE, use explicit physical energy terms like van der Waals forces and electrostatics to calculate interaction energies [5] [3]. These methods have a strong theoretical foundation and are generally more transferable across different systems. However, they suffer from high computational costs and often rely on simplified approximations for solvation effects and entropy, which can limit their predictive accuracy [3]. They are often used for detailed analysis of a limited number of candidate complexes rather than initial high-throughput screening.

Machine/Deep Learning-Based Functions

ML/DL approaches represent the cutting edge, using algorithms to learn complex scoring functions directly from data. These models can integrate a wide variety of features, including interface characteristics, energy terms, and solvent-accessible surface areas [3]. Their key strength is the potential to capture complex, non-linear relationships that classical functions might miss, leading to higher accuracy. For instance, some 3D convolutional neural network (3D-CNN) models have been successfully validated on the CASF-2013 benchmark [5]. The drawbacks include their "black box" nature, which makes interpretation difficult, and a high risk of poor generalization if the model is applied to data outside its training distribution [3].

Quantitative Performance Data from Benchmark Studies

Robust benchmarking is essential for an objective comparison. The CASF (Comparative Assessment of Scoring Functions) benchmark is a widely accepted standard for evaluating scoring functions [5]. The table below summarizes key quantitative findings from recent benchmark studies, including the CASF-2013 dataset and larger-scale surveys.

Table 2: Key performance metrics of selected scoring functions from benchmark studies

| Scoring Function | Category | Key Performance Metric | Result | Context & Dataset |
| --- | --- | --- | --- | --- |
| FMS (DOCK) [22] | Hybrid (Pharmacophore + Energy) | Pose Reproduction Success | 93.5% (20% increase vs. SGE) | SB2012 database (1,043 complexes) |
| FMS + SGE (DOCK) [22] | Hybrid | Pose Reproduction Success | 98.3% | SB2012 database (1,043 complexes) |
| Alpha HB & London dG (MOE) [5] [1] | Empirical | Pairwise Comparability (ICrA) | Highest | CASF-2013 dataset (195 complexes) |
| BestRMSD (MOE) [5] | N/A (Docking Output) | Docking Output Performance | Best-performing | CASF-2013 dataset (195 complexes) |
| Classical Methods (e.g., ZRANK2, FireDock) [3] | Empirical / Knowledge-Based | Runtime | Fast | Large-scale docking applications |
| DL-based Methods [3] | Machine/Deep Learning | Runtime | Variable (can be high) | Large-scale docking applications |

Insights from Specialized Scoring Approaches

The Pharmacophore Matching Similarity (FMS) scoring function in DOCK demonstrates the power of combining geometric and chemical feature matching with traditional energy scoring. When used alone, FMS dramatically improved pose reproduction success by approximately 20% compared to the standard grid energy (SGE) score. When combined with SGE, the success rate reached 98.3% across 1,043 protein-ligand complexes [22]. This highlights a major strength: the ability to leverage known inhibitor geometries to guide docking. Its weakness may lie in its dependency on a well-defined reference pharmacophore, which might not be available for all targets.

Experimental Protocols for Benchmarking

To ensure the reproducibility and reliability of scoring function evaluations, standardized experimental protocols are used. The following diagram outlines a typical workflow for a comparative assessment study.

Detailed Methodology

The typical benchmark study involves several critical stages:

  • Dataset Curation: A high-quality, diverse set of protein-ligand complexes with known 3D structures and binding affinity data is essential. The CASF-2013 benchmark subset of the PDBbind database, containing 195 carefully selected protein-ligand complexes, is a prime example [5] [1]. This diversity ensures that scoring functions are tested across various protein families and ligand chemotypes.

  • Molecular Docking and Pose Generation: For each complex in the dataset, the native ligand is re-docked into its protein's binding site. Studies often save numerous candidate poses (e.g., 30) per ligand to test the scoring function's ability to identify the correct conformation [5].

  • Performance Metrics and Outputs: The evaluation typically uses multiple docking outputs to assess different capabilities:

    • Pose Reproduction (BestRMSD): The lowest Root-Mean-Square Deviation (RMSD) between any predicted pose and the co-crystallized experimental structure. This measures the sampling capability and the scoring function's ability to identify a correct geometry. This metric was identified as the best-performing docking output in the MOE study [5].
    • Scoring Power: The correlation between the best docking score (BestDS) and the experimentally measured binding affinity (e.g., -logKd/Ki). This assesses the function's ability to predict binding strength [5].
    • Ranking Power: The ability to correctly rank-order multiple ligands based on their predicted affinity against a single target.
  • Data Analysis: Advanced analysis techniques, such as InterCriteria Analysis (ICrA), can be applied to perform pairwise comparisons of scoring functions and reveal complex relationships not always captured by simple correlation analysis [5].

Successful docking studies rely on a suite of software tools, datasets, and computational resources. The table below details key components of the modern computational scientist's toolkit.

Table 3: Essential resources for docking and scoring function research

| Resource Name | Type | Primary Function in Research | Relevance to Scoring |
| --- | --- | --- | --- |
| PDBbind Database [5] | Database | Comprehensive collection of protein-ligand complexes with binding affinity data. | Provides curated data for training empirical and knowledge-based functions and for benchmark tests. |
| CASF Benchmark [5] [1] | Benchmark Set | Standardized subset of PDBbind for comparative assessment of scoring functions. | Enables objective, head-to-head performance comparison of different scoring methods. |
| Molecular Operating Environment (MOE) [5] | Software Suite | Integrated drug discovery platform with multiple embedded scoring functions. | Contains five scoring functions (London dG, ASE, Affinity dG, Alpha HB, GBVI/WSA dG) for direct comparison. |
| DOCK [22] | Docking Software | Structure-based design program supporting various scoring functions, including FMS. | Allows for pharmacophore-based and energy-based scoring, including hybrid approaches. |
| CCharPPI Server [3] | Web Server | Community server for computational scoring of protein-protein complexes. | Enables the evaluation of scoring functions independent of the docking process. |
| InterCriteria Analysis (ICrA) [5] | Analysis Method | Multi-criterion decision-making approach for pairwise comparison. | Helps reveal nuanced relations and comparability between different scoring functions. |

From Theory to Practice: Implementing Docking and Scoring in Virtual Screening

This guide provides an objective comparison of molecular docking workflows, focusing on the critical steps of protein preparation and pose generation. We synthesize data from recent benchmarking studies to help you select the most effective protocols and tools for your drug discovery projects.

Workflow Components and Scoring Function Fundamentals

A robust molecular docking workflow is essential for accurate prediction of how small molecule ligands interact with protein targets. This process typically involves protein preparation, ligand preparation, docking simulation, and pose scoring. The scoring function, which approximates the binding affinity by calculating the interaction energy between a ligand and a protein, is a key element determining the success of docking protocols [5].

Scoring functions are generally categorized into four main types [23]:

  • Physics-based functions calculate binding energy using classical force fields, summing Van der Waals and electrostatic interactions.
  • Empirical functions estimate binding affinity by summing weighted energy terms parameterized against experimental data.
  • Knowledge-based functions use statistical potentials derived from pairwise atom distances in known structures.
  • Machine Learning (ML)-based functions learn complex relationships between structural features and binding affinities from large datasets.

Each category offers distinct trade-offs between computational speed, accuracy, and physical interpretability. The choice of scoring function directly impacts the reliability of virtual screening and binding mode prediction [23].

Quantitative Comparison of Docking Performance

Benchmarking studies provide critical data on the performance of various docking programs and scoring functions. The tables below summarize key metrics from recent comprehensive evaluations.

Table 1: Docking program performance on COX-1 and COX-2 enzymes for pose prediction [24]

| Docking Program | Success Rate (RMSD < 2 Å) | Key Characteristics |
| --- | --- | --- |
| Glide | 100% | Outstanding pose prediction accuracy |
| GOLD | 82% | Reliable performance |
| AutoDock | 75% | Widely used, moderate performance |
| FlexX | 59% | Lower success rate |
| Molegro Virtual Docker (MVD) | Not specified in top performers | Included in initial evaluation |

Table 2: Performance comparison of MOE scoring functions on CASF-2013 benchmark [5]

| MOE Scoring Function | Type | Key Findings from Pairwise Comparison |
| --- | --- | --- |
| Alpha HB | Empirical | Highest comparability with London dG |
| London dG | Empirical | Highest comparability with Alpha HB |
| ASE | Empirical | Performance varies by output metric |
| Affinity dG | Empirical | Performance varies by output metric |
| GBVI/WSA dG | Force-field | Performance varies by output metric |

Table 3: Virtual screening performance on DUD-E benchmark (102 targets) [16]

| Scoring Function | Type | Hit Rate at Top 1% | Hit Rate at Top 0.1% | Pearson Correlation |
| --- | --- | --- | --- | --- |
| RF-Score-VS | Machine Learning | 55.6% | 88.6% | 0.56 |
| AutoDock Vina | Empirical | 16.2% | 27.5% | -0.18 |

The data reveals that machine-learning scoring functions like RF-Score-VS can substantially outperform classical functions in virtual screening scenarios, showing remarkable enrichment of active compounds, particularly in the top percentage of ranked molecules [16]. For pose prediction, Glide demonstrated exceptional performance in correctly predicting binding modes of COX inhibitors [24].

Experimental Protocols and Benchmarking Methodologies

Dataset Curation and Preparation

Robust benchmarking requires high-quality, curated datasets. The CASF-2013 benchmark subset of the PDBbind database provides a standardized set of 195 protein-ligand complexes with binding affinity data [5]. The DUD-E (Directory of Useful Decoys: Enhanced) dataset offers 102 protein targets with active ligands and property-matched decoys, enabling virtual screening performance assessment [16].

Protein preparation typically involves:

  • Removing redundant chains, water molecules, and ions
  • Adding missing hydrogen atoms and correcting protonation states
  • Generating appropriate receptor grids centered on binding sites

Ligand preparation includes:

  • Generating 3D conformations from molecular structures
  • Assigning appropriate bond orders and formal charges
  • Energy minimization and tautomer enumeration
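A minimal ligand-preparation sketch using RDKit (assuming RDKit is installed; real pipelines add tautomer and protonation-state enumeration, e.g., via LigPrep or OpenBabel):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin as an example
mol = Chem.AddHs(mol)                      # add explicit hydrogens
AllChem.EmbedMolecule(mol, randomSeed=42)  # generate a 3D conformer
AllChem.MMFFOptimizeMolecule(mol)          # MMFF94 energy minimization
Chem.MolToMolFile(mol, "ligand_prepped.mol")
```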

Docking and Evaluation Protocols

For pose prediction, the root mean square deviation (RMSD) between docked poses and experimental reference structures serves as the primary metric. An RMSD value below 2.0 Å generally indicates successful docking [24].

Studies typically evaluate multiple docking outputs [5]:

  • Best docking score: The most favorable predicted affinity
  • Best RMSD: The lowest RMSD between predicted and native poses
  • RMSD of best-score pose: The accuracy of the top-ranked pose
  • Score of best-RMSD pose: The affinity prediction for the most accurate pose
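These four outputs are straightforward to derive from a ligand's pose list, as in this sketch with parallel score and RMSD arrays:

```python
import numpy as np

def docking_outputs(scores, rmsds):
    """scores, rmsds: parallel arrays over all generated poses for one ligand
    (lower is better for both). Returns the four reported outputs."""
    scores, rmsds = np.asarray(scores), np.asarray(rmsds)
    return {
        "best_score": scores.min(),
        "best_rmsd": rmsds.min(),
        "rmsd_of_best_score_pose": rmsds[scores.argmin()],
        "score_of_best_rmsd_pose": scores[rmsds.argmin()],
    }
```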

For virtual screening, performance is measured using:

  • Enrichment factors: The fold-enrichment of active compounds at early screening stages
  • Receiver Operating Characteristic (ROC) curves: Plotting true positive rate against false positive rate
  • Area Under the Curve (AUC): Overall performance metric with values from 0.5 (random) to 1.0 (perfect)

Cross-validation strategies are critical for machine-learning scoring functions [16]:

  • Per-target: Training and testing on the same target
  • Horizontal split: Training and testing sets contain different ligands from the same targets
  • Vertical split: Training and testing on completely different protein targets

Workflow Visualization

The following diagram illustrates the key stages and decision points in a robust docking workflow:

[Diagram: docking workflow. Protein preparation (remove waters/ions, add hydrogens, assign charges) → ligand preparation (generate conformers, optimize geometry, assign charges) → docking setup (define binding site, set search parameters) → pose sampling (generate multiple conformations) → scoring function selection (physics-based: high accuracy, slow computation; empirical: balanced speed/accuracy; knowledge-based: fast, training-dependent; machine learning: high performance, data-intensive) → pose scoring and ranking → analysis & validation → prospective experimental validation (in vitro/in vivo testing).]

Research Reagent Solutions

Table 4: Essential tools for molecular docking workflows

| Tool/Category | Representative Examples | Primary Function |
| --- | --- | --- |
| Commercial Software | MOE (Molecular Operating Environment), Glide, GOLD | Integrated docking platforms with multiple scoring functions |
| Open-Source Docking Tools | AutoDock, AutoDock Vina, smina, DOCK | Molecular docking with customizable parameters |
| Specialized Scoring Functions | RF-Score-VS, NNScore, SFCscore | Machine-learning based scoring and ranking |
| Benchmark Datasets | CASF-2013, CSAR-NRC HiQ, DUD-E | Standardized datasets for method validation |
| Protein Preparation Tools | AutoDock Tools, Schrodinger Protein Preparation Wizard | Structure cleanup, protonation, and optimization |
| Ligand Preparation Tools | OpenBabel, Omega, LigPrep | 2D to 3D conversion, tautomer generation, energy minimization |

Establishing a robust workflow from protein preparation to pose generation requires careful consideration of both the docking tools and scoring functions. The benchmarking data presented reveals that while tools like Glide excel in pose prediction, machine-learning scoring functions like RF-Score-VS offer substantial advantages in virtual screening enrichment.

The optimal workflow depends on the specific research goal: pose prediction versus virtual screening. For pose prediction, emphasis should be placed on sampling algorithms and their integration with accurate scoring functions. For virtual screening, machine-learning scoring functions trained on appropriate data provide superior enrichment of active compounds. By implementing the standardized protocols and controls outlined in this guide, researchers can enhance the reliability and reproducibility of their molecular docking studies.

The accurate prediction of how a small molecule (ligand) binds to a protein target is a cornerstone of structure-based drug design. Central to this molecular docking process are scoring functions, which are mathematical models used to predict the binding affinity and orientation of a ligand within a protein's binding site. The reliability of these scoring functions directly impacts the success of virtual screening and lead optimization campaigns. Given the proliferation of both classical and machine learning-based scoring functions, the question of how to objectively evaluate and compare their performance has become paramount. This is where public benchmark data sets play an indispensable role. These standardized collections of protein-ligand complexes provide a common framework for the comparative assessment of scoring algorithms, enabling researchers to identify strengths, weaknesses, and optimal use cases for different docking tools. This guide focuses on two of the most influential benchmarks in the field: the Directory of Useful Decoys, Enhanced (DUD-E) and the Comparative Assessment of Scoring Functions (CASF) benchmark, detailing their composition, proper application, and how they are used to objectively quantify performance in molecular docking.

Core Public Data Sets for Docking Validation

Directory of Useful Decoys, Enhanced (DUD-E)

The Directory of Useful Decoys, Enhanced (DUD-E) was developed to address limitations identified in its predecessor, DUD. It serves as a community standard for benchmarking docking programs in virtual screening tasks, which focus on distinguishing potential active compounds from non-binders.

  • Design and Composition: DUD-E contains 102 targets across diverse protein classes, including kinases, proteases, nuclear receptors, GPCRs, and ion channels. It comprises 22,886 clustered ligands drawn from ChEMBL, each with confirmed binding affinity. A key feature is that each ligand is paired with 50 property-matched decoys—molecules that are physically similar to the ligands (in terms of molecular weight, logP, number of rotatable bonds, and hydrogen-bond donors/acceptors) but are topologically dissimilar to minimize the likelihood of actual binding. This design creates a challenging and unbiased benchmark for testing a scoring function's ability to enrich true ligands amid a background of deceptively similar non-binders [25].
  • Primary Application: The principal metric for evaluation on DUD-E is enrichment, which measures how effectively a scoring function prioritizes known ligands over decoys in a virtual screening workflow. DUD-E is particularly valued for its focus on screening power [26] [27].

Comparative Assessment of Scoring Functions (CASF)

The Comparative Assessment of Scoring Functions (CASF) benchmark, built upon the PDBbind database, is designed for the comprehensive evaluation of scoring functions across multiple capabilities beyond just virtual screening.

  • Design and Composition: The CASF benchmark (e.g., CASF-2013, CASF-2016) is a high-quality curated set of hundreds of protein-ligand complexes with experimentally determined binding structures and affinities [5]. Its "core set" is selected to maximize structural diversity and the quality of experimental data.
  • Primary Application: CASF provides a framework for evaluating scoring functions across four distinct, critical metrics [26] [27]:
    • Scoring Power: The linear correlation between predicted scores and experimentally measured binding affinities.
    • Ranking Power: The capability to correctly rank the binding affinities of different ligands for a single target.
    • Docking Power: The ability to identify the native binding pose (or one close to it) from a set of computer-generated decoy poses.
    • Screening Power: Similar to DUD-E, it evaluates the enrichment of known binders over non-binders.

The following table summarizes the key characteristics of these two benchmark data sets.

Table 1: Key Characteristics of DUD-E and CASF Benchmark Data Sets

Feature DUD-E CASF
Primary Purpose Virtual Screening / Enrichment Holistic Scoring Function Assessment
Core Application Screening Power Scoring, Ranking, Docking, & Screening Power
Ligands 22,886 clustered ligands with known activity [25] Hundreds of complexes with binding affinity data [5]
Decoys 50 property-matched decoys per ligand [25] Computer-generated decoy poses; non-binders
Key Metrics Enrichment (e.g., AUC, early enrichment) Pearson's R (Scoring), Spearman's ρ (Ranking), Success Rate (Docking) [26]
Target Diversity 102 targets, including GPCRs & ion channels [25] Diverse set from the PDBbind database

Experimental Protocols for Benchmarking

To ensure reproducible and objective comparisons, standardized protocols are employed when using DUD-E and CASF.

Benchmarking with DUD-E

A typical virtual screening benchmark using DUD-E follows a structured workflow, illustrated below.

[Workflow diagram: Start DUD-E benchmark → 1. Prepare data (fetch the 102 targets and 22,886 ligands) → fetch property-matched decoys (50 per ligand) → 2. Dock all ligands and decoys against each target → 3. Score and rank all compounds per target with the scoring function → 4. Calculate enrichment metrics (AUC, ROC, % actives in top 1-2%) → benchmark result]

The process involves preparing the DUD-E data for a specific target, which includes the known active ligands and their matched decoys. The docking program and its scoring function are then used to rank the entire combined set of actives and decoys. The resulting ordered list is analyzed to compute enrichment metrics. A common and telling metric is the area under the receiver operating characteristic curve (AUC), where a perfect enrichment yields an AUC of 1.0 and random performance gives an AUC of 0.5. Early enrichment, such as the fraction of true actives recovered in the top 1% or 2% of the ranked list, is often considered even more critical for assessing practical utility in large-scale virtual screens [15].

Benchmarking with CASF

The CASF benchmark employs a more multi-faceted workflow to evaluate the four key powers of a scoring function.

[Workflow diagram: CASF core set → Scoring Power: Pearson's R between score and experimental affinity; Ranking Power: Spearman's ρ per protein; Docking Power: success rate (% of complexes with the near-native pose ranked top-1 or top-3); Screening Power: enrichment factor (EF)]

For scoring power, the scoring function is applied to the native crystal structures of the complexes in the CASF core set. The predicted scores are then correlated against the experimental binding affinities (e.g., Kd or Ki) using Pearson's correlation coefficient (R). For ranking power, the function is used to score multiple ligands for a single target, and the ranking of these ligands by score is compared to their experimental ranking using Spearman's rank correlation coefficient (ρ). For docking power, a set of decoy poses (including a near-native pose) is generated for each complex. The scoring function's ability to identify the near-native pose as the best-scoring one is measured as a success rate. Finally, the screening power test evaluates the function's ability to identify true binders for a target from a pool of non-binders [26] [27].
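
Assuming the per-complex predictions and experimental affinities have already been collected, each of these four statistics reduces to a standard library call; the arrays below are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

pred = np.array([6.1, 7.8, 5.2, 8.9, 4.3])  # predicted scores (pKd-like units)
expt = np.array([6.5, 7.2, 5.0, 9.1, 4.8])  # experimental binding affinities

# Scoring power: linear correlation between prediction and experiment
print("Pearson R:", pearsonr(pred, expt)[0])

# Ranking power: rank agreement among ligands of a single target
print("Spearman rho:", spearmanr(pred, expt)[0])

# Docking power: fraction of complexes whose top-scored pose is near-native;
# rmsd_top[i] is the RMSD of the best-scoring pose for complex i
rmsd_top = np.array([0.8, 1.5, 3.2, 0.9, 2.6])
print("Top-1 success rate:", np.mean(rmsd_top <= 2.0))
```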

Comparative Performance Data

Benchmarking studies consistently reveal that no single scoring function excels across all tasks, highlighting a performance trade-off.

Performance on DUD-E

Studies using DUD-E demonstrate that performance can vary significantly. For instance, in one evaluation, the Glide (SP) docking program achieved an average AUC of 0.80 across 39 targets from the original DUD set. In terms of early enrichment, it recovered an average of 25% of known actives in the top 1% of its ranked list [15]. This level of performance is considered robust, though top-performing methods can achieve higher metrics. Newer machine-learning consensus methods, such as CoBDock, also report strong performance on DUD-E, leveraging multiple docking algorithms to improve accuracy [28].

Performance on CASF

The multi-faceted nature of CASF makes it an excellent tool for revealing the specialized strengths of different scoring functions. The table below summarizes a hypothetical comparison based on trends observed in the literature.

Table 2: Hypothetical Comparative Performance of Different Scoring Function Types on CASF Metrics

Scoring Function Type Scoring Power (Pearson's R) Docking Power (Top 1 Success Rate) Screening Power (Enrichment Factor) Notes
Classical Empirical (e.g., GlideScore) Moderate (~0.6) High (>85%) High Balanced performance for docking & screening [26] [15]
ML-based Regression Models High (>0.8) Low to Moderate Low Excellent affinity prediction, poor pose ID [26] [27]
ML-based with Δ-ML/Data Augmentation High (>0.8) High (>85%) High Balanced, high performance across tasks [26]
Knowledge-based/Statistical Moderate Moderate Moderate Good balance of speed and accuracy [23]

A specific 2025 study comparing the five scoring functions within the Molecular Operating Environment (MOE) software on the CASF-2013 set found that the lowest RMSD (BestRMSD) was the best-performing docking output for pose prediction. Furthermore, the two empirical scoring functions, Alpha HB and London dG, demonstrated the highest comparability and performance in their analysis [1] [5].

The Scientist's Toolkit: Essential Research Reagents

To conduct rigorous docking benchmarks, researchers rely on a suite of publicly available data and software.

Table 3: Essential Resources for Docking Benchmarking Studies

Resource Name Type Primary Function in Benchmarking Access
DUD-E Benchmark Data Set Provides targets, active ligands, and property-matched decoys for virtual screening enrichment tests [25]. http://dude.docking.org
PDBbind & CASF Benchmark Data Set Provides a comprehensive collection of protein-ligand complexes with binding affinities for holistic scoring function assessment [26]. http://www.pdbbind.org.cn
Smiles2Dock Benchmark Data Set A large-scale, ML-ready dataset with docking scores for over 1.7M ligands against 15 AlphaFold2 proteins [29]. https://huggingface.co/datasets/tlemenestrel/Smiles2Dock
AutoDock Vina Docking Software A widely used, open-source docking program often used as a baseline or component in consensus methods [28]. http://vina.scripps.edu
P2Rank Cavity Detection Tool Predicts ligand binding sites on protein structures, often used to guide blind docking protocols [29] [28]. https://github.com/rdk/p2rank
RDKit Cheminformatics Toolkit An open-source toolkit for cheminformatics used for ligand preparation, descriptor calculation, and file format conversion [27]. https://www.rdkit.org

The rigorous and objective evaluation of molecular docking scoring functions is a critical component of methodological development in computational drug discovery. Public benchmark data sets, most notably DUD-E for virtual screening enrichment and the CASF benchmark for comprehensive multi-task assessment, provide the essential, standardized playgrounds for this evaluation. The consistent application of these benchmarks reveals a clear landscape: classical force-field and empirical functions often provide robust, balanced performance, while modern machine-learning-based functions can achieve superior results in specific tasks, such as binding affinity prediction. The emerging trend is towards balanced multi-task scoring functions, often leveraging machine learning to correct classical scores or to create novel, physics-informed models. For practitioners, the choice of a scoring function should be guided by its proven performance on these benchmarks in the specific task of interest—be it pose prediction, affinity ranking, or virtual screening—ensuring that computational predictions are built upon a foundation of validated performance.

Molecular docking is a cornerstone of computational drug discovery, enabling researchers to predict how small molecule ligands interact with protein targets. While docking algorithms can generate numerous potential binding poses, a critical bottleneck remains: the scoring function (SF) that evaluates these poses and predicts binding affinity. Traditional SFs, whether physics-based, empirical, or knowledge-based, often struggle with accuracy due to their simplified treatment of complex molecular interactions and their reliance on predetermined functional forms [30]. This limitation directly impacts the reliability of virtual screening (VS) campaigns, where the ability to distinguish true binders from non-binders is paramount.

The emergence of machine learning (ML) has introduced a paradigm shift in scoring function development. Unlike classical approaches, ML scoring functions do not assume a fixed relationship between structural features and binding affinity. Instead, they infer this relationship directly from experimental data, capturing complex, non-linear patterns that traditional methods miss [30]. This review explores the powerful strategy of combining conventional docking tools with ML-based rescoring, presenting a comprehensive analysis of performance gains, practical methodologies, and future directions for this integrated approach.

Performance Comparison: Traditional Docking vs. ML-Rescoring

Quantitative benchmarks demonstrate that ML rescoring consistently enhances virtual screening performance across diverse protein targets. The following tables summarize key findings from recent large-scale evaluations.

Table 1: Virtual Screening Enrichment (EF1%) for PfDHFR Antimalarial Target [18]

Docking Method Rescoring SF Wild-Type EF1% Quadruple-Mutant EF1%
PLANTS CNN-Score 28 -
FRED CNN-Score - 31
AutoDock Vina None (Default) Worse-than-random -
AutoDock Vina RF/CNN Better-than-random -

Table 2: Tiered Performance of Docking Paradigms (CASF Benchmark) [17]

Performance Tier Method Class Representative Examples Key Characteristics
1 (Best) Traditional Methods Glide SP High physical validity (>94% PB-valid)
2 Hybrid AI Scoring Interformer Balanced pose accuracy and validity
3 Generative Diffusion SurfDock, DiffBindFR Superior pose accuracy (>70% RMSD ≤2Å)
4 Regression-based Models KarmaDock, QuickBind Often produce physically invalid poses

Table 3: Performance Metrics for Machine Learning Scoring Functions [31]

Scoring Function Baseline SF ML Method Scoring Power (R) Screening Power
ΔLin_F9XGB Lin_F9 XGBoost 0.853 (locally optimized poses) Superior on LIT-PCBA dataset
ΔVinaXGB AutoDock Vina XGBoost Top performer on CASF-2016 Robust across tasks
ΔVinaRF20 AutoDock Vina Random Forest High Good screening power

Experimental Protocols for ML Rescoring

Standard Rescoring Workflow

The typical rescoring pipeline involves sequential execution of traditional docking followed by ML-based evaluation. A recent benchmarking study on PfDHFR inhibitors exemplifies this protocol [18]:

  • Protein Preparation: Crystal structures (PDB IDs: 6A2M for wild-type, 6KP2 for quadruple-mutant) are prepared by removing water molecules, unnecessary ions, and redundant chains. Hydrogen atoms are added and optimized using tools like OpenEye's "Make Receptor".

  • Ligand and Decoy Preparation: Active compounds and decoy molecules from the DEKOIS 2.0 benchmark set are prepared using Omega and OpenBabel to generate multiple conformations and appropriate file formats.

  • Traditional Docking: Three docking tools—AutoDock Vina, PLANTS, and FRED—are used to generate poses. The docking grid is defined to encompass the entire binding site.

  • ML Rescoring: The generated poses are rescored using pretrained ML SFs (CNN-Score and RF-Score-VS v2) without modifying the poses themselves.

  • Performance Evaluation: Enrichment factors (EF1%), pROC curves, and chemotype enrichment are calculated to quantify screening performance.
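
A thin driver script is usually enough to chain these five steps. The sketch below shows the overall shape, using an AutoDock Vina command line for pose generation and a pre-trained scikit-learn model standing in for an ML scoring function; the file names, the model path, and the `featurize_pose` helper are illustrative placeholders rather than the exact tooling of the cited study.

```python
import subprocess

import joblib
import numpy as np

def featurize_pose(pose_file):
    # Placeholder: a real featurizer would compute protein-ligand
    # interaction descriptors (e.g., contact counts) from the pose file.
    return np.zeros(128)

# Steps 1-3: generate poses with a conventional docking tool
subprocess.run(
    ["vina", "--receptor", "target.pdbqt", "--ligand", "lig.pdbqt",
     "--center_x", "10", "--center_y", "4", "--center_z", "-2",
     "--size_x", "20", "--size_y", "20", "--size_z", "20",
     "--out", "poses.pdbqt"],
    check=True,
)

# Step 4: rescore the generated poses with a pre-trained ML scoring function
model = joblib.load("pretrained_rescoring_model.joblib")  # hypothetical path
features = np.vstack([featurize_pose(p) for p in ["poses.pdbqt"]])
ml_scores = model.predict(features)

# Step 5: rank the ligand by its best ML score instead of the native score
print("best ML score:", ml_scores.max())
```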

Delta Machine Learning Protocol

The Δ-machine learning approach has emerged as a particularly effective strategy for developing robust SFs [31]. This method learns a correction term to an existing baseline SF, leveraging the physical principles embedded in classical functions while enhancing accuracy with ML:

  • Training Set Construction: Curate a diverse set of protein-ligand complexes with experimental binding affinities (e.g., from PDBbind). Include crystal poses, locally optimized poses, and docked poses to ensure robustness.

  • Feature Engineering: Develop comprehensive feature sets encompassing protein-ligand interaction descriptors (e.g., polar-polar, polar-nonpolar, and nonpolar-nonpolar interactions in different distance ranges) and ligand-specific features.

  • Model Training: Employ ML algorithms like XGBoost to train a model that predicts the difference between experimental binding affinities and baseline SF predictions.

  • Validation: Rigorously test the resulting Δ-SF using benchmarks like CASF-2016 that evaluate scoring, ranking, docking, and screening power across diverse protein families.
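
As a concrete illustration of the Δ-learning idea (a schematic, not the published ΔLin_F9XGB implementation), the sketch below trains a gradient-boosted model on the residual between experimental affinity and a baseline docking score, with synthetic arrays standing in for a PDBbind-style training set and the xgboost package assumed available.

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(1)
n = 500
features = rng.normal(size=(n, 64))       # interaction descriptors per complex
baseline = rng.normal(loc=-7.0, size=n)   # baseline SF predictions (e.g., Vina)
# Synthetic "experimental" affinities that the baseline only partly explains
expt = baseline + 0.8 * features[:, 0] + rng.normal(scale=0.3, size=n)

# Learn the correction term: delta = experimental - baseline
delta_model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
delta_model.fit(features, expt - baseline)

# Final delta-SF prediction: baseline score plus the learned correction
pred = baseline + delta_model.predict(features)
print("training RMSE:", np.sqrt(np.mean((pred - expt) ** 2)))
```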

[Workflow diagram: Start → Protein Preparation and Ligand & Decoy Preparation → Traditional Docking → Generate Multiple Poses → ML-Based Rescoring → Performance Evaluation → Final Ranked List]

ML Rescoring Workflow

The Scientist's Toolkit: Essential Research Reagents and Software

Table 4: Key Research Tools for Docking and ML Rescoring

Tool Name Type Primary Function Application Context
AutoDock Vina Docking Program Conformational search and scoring Generating initial ligand poses
PLANTS Docking Program Protein-ligand docking Pose generation with swarm algorithm
FRED Docking Program Exhaustive docking High-throughput pose generation
CNN-Score ML Scoring Function Pose rescoring using CNN Improving virtual screening enrichment
RF-Score-VS v2 ML Scoring Function Random forest-based scoring Distinguishing actives from decoys
ΔLin_F9XGB ML Scoring Function Delta machine learning SF Superior scoring/ranking across pose types
PDBbind Database Curated binding affinity data Training and benchmarking SFs
DEKOIS 2.0 Benchmark Set Active/decoy complexes Virtual screening performance evaluation

Discussion and Future Perspectives

The integration of ML rescoring with traditional docking represents a significant advancement in structure-based drug design. The empirical evidence consistently demonstrates that this combined approach outperforms either method in isolation. ML SFs excel at leveraging large datasets to identify complex binding patterns, while traditional docking provides physically plausible starting conformations [30] [18].

Despite these promising results, important limitations and research challenges remain. First, the generalization of ML SFs to novel protein targets or binding pockets outside their training distribution requires further investigation [17]. Second, while ML rescoring improves enrichment, some deep learning methods generate poses with questionable physical validity despite favorable RMSD values [17]. Finally, the computational cost of some ML approaches may limit their application to ultra-large libraries, though this continues to improve with hardware and algorithmic advances.

Future developments will likely focus on target-specific SFs that leverage transfer learning for improved performance on novel targets, multi-task learning that incorporates additional biological data, and explainable AI approaches to interpret the structural basis of ML predictions. The field is also moving toward end-to-end deep learning pipelines that integrate pose generation and scoring in a unified framework.

The powerful synergy between traditional docking tools and machine learning rescoring functions has demonstrably enhanced the accuracy and reliability of structure-based virtual screening. Quantitative benchmarks across diverse protein targets reveal that ML rescoring consistently improves enrichment over traditional scoring functions alone, particularly for challenging drug-resistant targets. The experimental protocols and toolkit resources outlined in this review provide researchers with practical guidance for implementing these methods in their drug discovery pipelines. As machine learning algorithms continue to evolve and structural databases expand, the rescoring paradigm will play an increasingly vital role in accelerating the identification of novel therapeutic compounds.

The persistent global health challenge of malaria is significantly compounded by the emergence of drug-resistant strains of the Plasmodium falciparum parasite. The enzyme Plasmodium falciparum Dihydrofolate Reductase (PfDHFR), crucial for the parasite's DNA synthesis, represents a critical therapeutic target. Resistance to antifolate drugs, such as pyrimethamine, primarily arises from mutations in the PfDHFR active site, most notably the quadruple-mutant (Q) variant (N51I/C59R/S108N/I164L) [18] [32]. This case study examines a comprehensive benchmarking analysis that evaluated the performance of structure-based virtual screening (SBVS) using classical docking tools enhanced by machine learning (ML) re-scoring against both wild-type (WT) and resistant (Q) PfDHFR variants [18]. The findings provide a validated computational framework for accelerating the discovery of novel antimalarial agents effective against resistant malaria.

Experimental Protocols & Workflow

The study employed a rigorous SBVS benchmarking protocol to assess and enhance the prediction of high-affinity binders for PfDHFR [18].

Protein and Benchmark Set Preparation

  • Protein Structures: The crystal structures of the wild-type PfDHFR (PDB ID: 6A2M) and the quadruple-mutant variant (PDB ID: 6KP2) were retrieved from the Protein Data Bank. Protein preparation was performed using OpenEye's "Make Receptor" tool, which involved removing water molecules and extraneous ions, adding hydrogen atoms, and optimizing the resulting structures [18].
  • DEKOIS 2.0 Benchmark Set: A high-quality benchmark set was compiled for each PfDHFR variant using the DEKOIS 2.0 protocol. For each target, 40 known bioactive molecules were curated from the literature and BindingDB. Subsequently, 1,200 structurally similar but physiologically inactive decoy molecules were generated for each set, maintaining a challenging 1:30 ratio of active to decoy compounds [18].
  • Ligand Preparation: The small molecule structures were prepared using the Omega software to generate multiple conformations. File formats were converted as needed for the different docking programs using OpenBabel and SPORES software [18].

Docking and Machine Learning Re-scoring

  • Docking Tools: Three widely used generic docking programs were evaluated:
    • AutoDock Vina (version 1.5.7)
    • PLANTS (version 1.2)
    • FRED (from OpenEye, version 4.3.2.0)
  The docking grid boxes were centered on the active site of the respective protein structures [18].
  • Machine Learning Re-scoring: The ligand poses generated by each docking tool were subsequently re-scored using two pre-trained machine learning scoring functions:
    • CNN-Score: A scoring function based on a convolutional neural network.
    • RF-Score-VS v2: A scoring function for virtual screening based on a random forest algorithm [18].
  • Performance Evaluation: The screening performance and enrichment behavior of the eighteen resulting workflows (3 docking tools × 3 scoring schemes, i.e., the native docking score plus the 2 ML re-scoring functions, × 2 protein variants) were analyzed using several metrics, including pROC-AUC, pROC-Chemotype plots, and early enrichment capability measured by EF 1% (Enrichment Factor at the top 1% of the ranked list) [18].

The following diagram illustrates the integrated workflow of this benchmarking study:

[Workflow diagram: Protein structures (WT: 6A2M, Q: 6KP2) → protein preparation (OpenEye Make Receptor) → docking with AutoDock Vina, PLANTS, and FRED → ML re-scoring with CNN-Score and RF-Score-VS v2 → performance evaluation (EF 1%, pROC) → identification of the optimal workflow]

Comparative Performance Data

The quantitative results from the benchmarking study provide clear guidance for selecting the optimal virtual screening strategy for each PfDHFR variant. The performance was primarily assessed using the Enrichment Factor at 1% (EF 1%), a critical metric for evaluating the early recognition capability of virtual screening campaigns where only a small fraction of a compound library can be selected for experimental testing [18].

Table 1: Virtual Screening Performance (EF 1%) for Wild-Type (WT) PfDHFR

Docking Tool Standard Docking Re-scoring with RF-Score-VS v2 Re-scoring with CNN-Score
AutoDock Vina Worse-than-random Better-than-random Better-than-random
PLANTS Information Not Specified Information Not Specified 28.0
FRED Information Not Specified Information Not Specified Information Not Specified

Table 2: Virtual Screening Performance (EF 1%) for Quadruple-Mutant (Q) PfDHFR

Docking Tool Standard Docking Re-scoring with RF-Score-VS v2 Re-scoring with CNN-Score
AutoDock Vina Worse-than-random Better-than-random Better-than-random
PLANTS Information Not Specified Information Not Specified Information Not Specified
FRED Information Not Specified Information Not Specified 31.0

The data leads to two key conclusions. First, re-scoring docking outputs with machine learning scoring functions, particularly CNN-Score, consistently and significantly enhances performance across the board. This is dramatically evidenced by the improvement of AutoDock Vina's performance from worse-than-random to better-than-random after ML re-scoring for both variants [18]. Second, the optimal docking tool depends on the protein variant: for the WT PfDHFR, the best enrichment (EF 1% = 28) was achieved by PLANTS in combination with CNN-Score, whereas for the resistant Q PfDHFR, the best enrichment (EF 1% = 31) was achieved by FRED combined with CNN-Score [18].

The Scientist's Toolkit

The following table details the key research reagents and computational tools that formed the basis of this successful case study, along with their specific functions in the workflow.

Table 3: Essential Research Reagents and Computational Tools

Item Name Type/Category Primary Function in the Study
PfDHFR (WT, PDB: 6A2M) Protein Structure Three-dimensional crystal structure of the wild-type enzyme used for docking simulations [18].
PfDHFR (Quadruple Mutant, PDB: 6KP2) Protein Structure Three-dimensional crystal structure of the resistant variant (N51I/C59R/S108N/I164L) used for docking simulations [18].
DEKOIS 2.0 Benchmark Set Chemical Dataset A curated collection of 40 known active molecules and 1,200 decoy molecules for each PfDHFR variant, used to rigorously evaluate screening performance [18].
AutoDock Vina Docking Software A widely used open-source molecular docking program evaluated for its virtual screening performance [18].
PLANTS Docking Software A docking software based on an ant colony optimization algorithm, evaluated for its virtual screening performance [18].
FRED Docking Software A docking tool from OpenEye that uses an exhaustive search algorithm, evaluated for its virtual screening performance [18].
CNN-Score Machine Learning Scoring Function A pre-trained convolutional neural network used to re-score docking poses, significantly improving enrichment by better distinguishing high-affinity binders [18].
RF-Score-VS v2 Machine Learning Scoring Function A pre-trained random forest-based scoring function used to re-score docking poses, showing significant performance improvements over standard docking [18].

This case study demonstrates a robust and effective computational strategy for targeting drug-resistant malaria through the PfDHFR enzyme. The integrated workflow, which combines classical docking tools with advanced machine learning re-scoring, successfully addresses the critical challenge of identifying novel inhibitors against resistant PfDHFR variants. The key finding is that CNN-Score consistently augments SBVS performance, enabling the retrieval of diverse and high-affinity binders for both wild-type and quadruple-mutant PfDHFR [18]. The specific recommendations—using PLANTS/CNN-Score for WT PfDHFR and FRED/CNN-Score for the resistant Q variant—provide drug discovery researchers with a validated, high-performance pipeline. This approach offers a powerful path forward for accelerating the discovery of next-generation antimalarial therapies capable of overcoming existing resistance mechanisms.

Overcoming Common Challenges: A Strategic Guide to Enhanced Performance

Molecular docking is a cornerstone of structure-based drug design, primarily serving two critical functions: predicting the correct binding geometry (pose) of a ligand in a protein's binding site and estimating the binding affinity to rank compounds by their potential potency [33] [34]. The accuracy of these tasks hinges on the scoring functions employed. However, a central challenge in computational drug discovery lies in diagnosing whether a failed docking campaign is due to poor pose prediction (an incorrect geometric prediction of how the ligand binds) or incorrect affinity ranking (an accurate pose that is mis-scored, leading to a wrong potency estimate) [33] [35].

This guide objectively compares the performance of different classes of scoring functions in addressing these distinct failure modes. We synthesize data from comparative studies to provide researchers with a clear framework for selecting and evaluating docking protocols, based on experimental benchmarks and emerging methodologies.

Quantitative Performance Comparison of Scoring Functions

Pose Prediction Accuracy Across Scoring Functions

Pose prediction performance is typically measured by the Root Mean Square Deviation (RMSD) between the predicted ligand pose and its experimentally determined co-crystallized structure. A lower RMSD indicates a more accurate prediction, with an RMSD below 2.0 Å generally considered successful [34].
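
Benchmark RMSDs of this kind are normally computed without re-aligning the docked pose and with symmetry correction. A minimal RDKit sketch is shown below; the file paths are illustrative.

```python
from rdkit import Chem
from rdkit.Chem import rdMolAlign

# Crystallographic reference ligand and a docked pose (illustrative paths)
ref = Chem.MolFromMolFile("ligand_crystal.sdf", removeHs=True)
pose = Chem.MolFromMolFile("ligand_docked.sdf", removeHs=True)

# CalcRMS is symmetry-aware and does not superimpose the pose first,
# which is what pose-prediction benchmarks require (GetBestRMS would align).
rmsd = rdMolAlign.CalcRMS(pose, ref)
verdict = "success" if rmsd <= 2.0 else "failure"
print(f"pose RMSD: {rmsd:.2f} Å ({verdict})")
```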

Table 1: Pose Prediction Performance (RMSD in Å) of MOE Scoring Functions on CASF-2013 Benchmark

Scoring Function Type BestRMSD RMSD_BestDS
Alpha HB Empirical Best Performing N/A
London dG Empirical Best Performing N/A
ASE Empirical Intermediate N/A
Affinity dG Empirical Intermediate N/A
GBVI/WSA dG Force-Field Intermediate N/A

Note: Data adapted from a pairwise comparison study using the CASF-2013 benchmark (195 protein-ligand complexes). The "BestRMSD" metric represents the lowest RMSD achieved among all saved poses, while "RMSD_BestDS" is the RMSD of the pose selected as having the best docking score. The study identified Alpha HB and London dG as the most comparable and best-performing for pose prediction [1] [5].

Affinity Ranking Accuracy Across Targets and Methods

Affinity ranking accuracy measures how well a scoring function can rank a series of ligands by their experimentally determined binding affinities or biological activities. This is often evaluated using rank correlation coefficients like Pearson or Spearman.

Table 2: Affinity Ranking Correlation for Different Targets and Methods

Target / Method Scoring Function / Approach Performance (Correlation) Notes
Hydrophilic Targets
Cdk2 Kinase Fitted Pearson: 0.86, Spearman: 0.91 Excellent correlation [34]
Factor Xa, Cdk2, Aurora A Kinase FlexX, GOLDScore Pearson > 0.6 Good correlation [34]
Hydrophobic Targets
COX-2, pla2g2a Multiple Functions Poor Correlation Difficult targets for most functions [34]
cGAS & KRAS Target-Specific GCN Significant Superiority Outperformed generic scoring functions [36]
FXR (D3R Challenge) Smina (Rigid Receptor) Spearman ρ ~ 0.5 Best for targets with open pockets [37]
Cathepsin S (D3R) HADDOCK (Ligand-based predictor) Kendall’s Tau: 0.36 3rd best predictor overall [38]

Experimental Protocols for Benchmarking

To ensure fair and reproducible comparisons, the field relies on standardized benchmark datasets and evaluation protocols.

Benchmarking with the CASF-2013 Dataset

The Comparative Assessment of Scoring Functions (CASF)-2013 benchmark, a subset of the PDBbind database, is a widely accepted standard for evaluating scoring functions [5] [3]. The standard workflow for a comparative assessment study involves:

  • Dataset Preparation: Use the 195 high-quality protein-ligand complexes from the CASF-2013 dataset [5].
  • Molecular Docking: Perform re-docking of each ligand into its corresponding protein structure using the software and scoring functions under investigation (e.g., the five functions in MOE). For each complex, save multiple poses (e.g., 30) [1] [5].
  • Data Extraction: For each complex and scoring function, extract the following key metrics [1] [5]:
    • Best Docking Score (BestDS): The most favorable docking score among all saved poses.
    • Best RMSD (BestRMSD): The lowest RMSD value among all saved poses compared to the crystal structure.
    • RMSD of BestDS (RMSD_BestDS): The RMSD of the pose that had the best docking score.
    • DS of BestRMSD (DS_BestRMSD): The docking score of the pose that had the lowest RMSD.
  • Performance Analysis:
    • Pose Prediction: Use BestRMSD to evaluate the sampling capability and the potential of a scoring function. Use RMSD_BestDS to evaluate a scoring function's ability to identify the correct pose from a set of decoys [5] [33].
    • Affinity Ranking: Use BestDS to evaluate the scoring function's power to predict binding affinity and rank ligands correctly [5].
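
Given a per-pose table of docking scores and RMSDs, the four extracted quantities in step 3 reduce to simple group-wise reductions. A pandas sketch with illustrative values:

```python
import pandas as pd

# One row per saved pose: complex ID, docking score (lower = better), RMSD to crystal
poses = pd.DataFrame({
    "complex": ["1abc"] * 3 + ["2xyz"] * 3,
    "score": [-9.1, -8.7, -8.2, -7.5, -7.9, -6.8],
    "rmsd": [1.2, 0.6, 4.1, 2.8, 1.1, 5.0],
})

def extract(group):
    best_ds = group.loc[group["score"].idxmin()]    # pose with the best score
    best_rmsd = group.loc[group["rmsd"].idxmin()]   # pose closest to the crystal
    return pd.Series({
        "BestDS": best_ds["score"],
        "BestRMSD": best_rmsd["rmsd"],
        "RMSD_BestDS": best_ds["rmsd"],
        "DS_BestRMSD": best_rmsd["score"],
    })

print(poses.groupby("complex").apply(extract))
```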

The D3R Grand Challenge Protocol

The Drug Design Data Resource (D3R) organizes blind Grand Challenges that provide a robust, prospective test for docking methods in a realistic drug discovery scenario [38] [37]. A typical participation workflow for affinity ranking, as demonstrated by top-performing groups, involves:

  • Target & Data Receipt: Receive the target protein sequence (for cross-docking) or crystal structures (for self-docking) and SMILES strings of test ligands [38].
  • Template Selection: Identify and select optimal protein structures from the PDB for docking. Strategies include:
    • "Close" method: Docking each test compound into the receptor structure containing the most chemically similar known ligand (effective for targets with deep pockets) [37].
    • "Cross" method: Docking all test compounds into a single, optimally chosen receptor structure (effective for targets with open pockets) [37].
  • Ligand Preparation: Generate 3D conformers for each ligand, often selecting top conformers based on shape and chemical similarity to a known crystallographic ligand [38].
  • Pose Prediction & Refinement: Dock ligands and refine the poses using methods like short energy minimization and explicit water refinement [38].
  • Scoring & Ranking: Rank the compounds based on the best docking score from the refined poses [37].

[Decision diagram: Given a docking failure, first ask whether the method can identify a pose with RMSD < 2.0 Å. If no, the failure mode is poor pose prediction; mitigation strategies include improving sampling, consensus scoring, and template-based placement. If yes, ask whether the top-scored pose correlates with experimental affinity. If no, the failure mode is incorrect affinity ranking; mitigation strategies include target-specific ML scoring, rigorous receptor selection, and binding free energy methods. If yes, the prediction is accurate.]

Diagram: A diagnostic workflow for identifying docking failure modes and potential mitigation strategies.

The Scientist's Toolkit: Key Research Reagents & Solutions

Table 3: Essential Resources for Docking Scoring Function Research

Resource Name Type / Category Primary Function in Research
PDBbind Database Benchmark Dataset A comprehensive collection of protein-ligand complexes with binding affinity data for training and testing scoring functions [5].
CASF Benchmark Benchmark Dataset A curated subset of PDBbind (e.g., CASF-2013) designed for the comparative assessment of scoring functions [5].
D3R Grand Challenge Community Benchmark A blind competition that provides realistic targets and data to prospectively test docking and scoring methods [38] [37].
MOE (Molecular Operating Environment) Software Suite A commercial drug discovery platform containing multiple scoring functions (London dG, Alpha HB, etc.) for comparative studies [1] [5].
AutoDock Vina / Smina Docking Software Popular, open-source docking tools often used as baselines or integrated into custom scoring pipelines [35] [37].
HADDOCK Docking Software An information-driven docking platform that integrates experimental data and performs well in community challenges [38] [3].
Chemprop Machine Learning Framework A widely used framework for training ML models on molecular properties, including docking scores [8].
Random Forest (RF) / RF-Score Machine Learning Model A classical ML algorithm used to create scoring functions that learn the relationship between structural features and binding affinity [35].
Graph Convolutional Network (GCN) Deep Learning Model A neural network architecture that operates on graph representations of molecules and proteins to create target-specific scoring functions [36].

The limitations of classical scoring functions have spurred significant innovation, particularly using machine learning (ML) and deep learning (DL).

Machine-Learning Scoring Functions

Machine-learning scoring functions do not assume a predetermined mathematical form. Instead, they learn the relationship between a set of descriptive features of the protein-ligand complex and the binding affinity or native pose status. Random Forest-based models like RF-Score demonstrated that using features such as atom interaction counts could substantially improve binding affinity prediction over classical functions [35]. A key finding is that pose generation error generally has a small impact on the accuracy of ML-based affinity prediction, and this error can be further corrected by training the models on re-docked poses rather than crystal structures [35].
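
The original RF-Score featurization is conceptually simple: count protein-ligand element pairs that fall within a distance cutoff (12 Å in the original work). A self-contained numpy sketch of that idea, using a reduced element set and synthetic coordinates:

```python
import numpy as np
from itertools import product

ELEMENTS = ["C", "N", "O", "S"]  # reduced set; RF-Score uses nine element types

def element_pair_counts(lig_xyz, lig_elem, prot_xyz, prot_elem, cutoff=12.0):
    """Count ligand-protein element pairs closer than `cutoff` angstroms."""
    dists = np.linalg.norm(lig_xyz[:, None, :] - prot_xyz[None, :, :], axis=-1)
    lig_elem, prot_elem = np.array(lig_elem), np.array(prot_elem)
    return {
        f"{le}-{pe}": int((dists[np.outer(lig_elem == le, prot_elem == pe)] < cutoff).sum())
        for le, pe in product(ELEMENTS, ELEMENTS)
    }

# Tiny synthetic example: 3 ligand atoms, 4 protein atoms
lig = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.5]])
prot = np.array([[3.0, 0.0, 0.0], [0.0, 4.0, 0.0], [20.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
print(element_pair_counts(lig, ["C", "O", "N"], prot, ["C", "N", "O", "S"]))
```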

Deep Learning for Pose Selection

Deep learning models, particularly Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), can extract relevant information directly from the 3D structure of the protein-ligand complex [33]. These models show a remarkable capability to distinguish correct from incorrect binding poses, a task where classical functions often struggle. For challenging targets like cGAS and KRAS, target-specific scoring functions built with Graph Convolutional Networks (GCNs) have shown significant superiority over generic scoring functions in virtual screening accuracy [36].

Large-Scale Benchmarking and Data Availability

The advent of massive docking campaigns has created new opportunities and benchmarks. Databases such as LSD (lsd.docking.org), which provides docking scores and poses for over 6.3 billion molecules across 11 targets, are invaluable for training and testing sophisticated ML models [8]. Proof-of-concept studies using these datasets show that models trained on larger sets perform better, but also reveal a critical insight: a high correlation with docking scores does not guarantee the ability to enrich for true top-ranking molecules or experimentally confirmed binders [8]. This underscores the continued importance of experimental validation.

The Impact of Receptor Flexibility and Binding Site Mutations

Molecular docking stands as a pivotal element in computer-aided drug design (CADD), consistently contributing to advancements in pharmaceutical research by predicting the interaction between a protein receptor and a small molecule ligand [39]. In essence, it employs algorithms to identify the best match between two molecules, akin to solving intricate three-dimensional jigsaw puzzles [39]. The accuracy of docking predictions, however, is fundamentally governed by two critical and interrelated challenges: receptor flexibility and binding site mutations [40] [41]. The inherent dynamics of proteins and genetic variations between individuals can dramatically alter binding site architecture and energetics, making the accurate prediction of ligand binding a central problem in structural biology and rational drug design [39] [40] [42]. This guide provides a comparative analysis of computational strategies developed to overcome these challenges, evaluating their performance, underlying methodologies, and applicability in modern drug discovery pipelines.

Physical Basis and Challenges in Molecular Recognition

Fundamental Interactions in Protein-Ligand Complexes

Protein-ligand binding is driven by a combination of weak, non-covalent interactions that, when combined, produce stable and specific associations [39]. The major types of interactions include:

  • Hydrogen bonds: Polar electrostatic interactions between an electron donor (D) and acceptor (A), described as D—H…A, with a strength of roughly 5 kcal/mol [39].
  • Ionic interactions: Electronic attraction between oppositely charged ionic pairs, highly specific but influenced by the aqueous solvent environment [39].
  • Van der Waals interactions: Non-specific forces arising from transient dipoles in electron clouds when atoms are in close proximity, with a strength of ~1 kcal/mol [39].
  • Hydrophobic interactions: Entropy-driven aggregation of nonpolar molecules, excluding water from their surroundings [39].

The net binding affinity is quantitatively described by the Gibbs free energy equation: ΔGbind = ΔH - TΔS, where ΔH represents enthalpy from bonds formed/broken, and TΔS represents the entropic contribution [39]. This balance is delicate, and even minor perturbations from flexibility or mutations can significantly impact ΔGbind.

Molecular Recognition Models

Three conceptual models explain how proteins and ligands recognize each other, with implications for docking:

  • Lock-and-Key Model: Theorizes rigid, complementary surfaces between protein and ligand, an entropy-dominated process [39].
  • Induced-Fit Model: Proposes conformational changes in the protein during binding to best accommodate the ligand, adding flexibility to the lock-and-key idea [39].
  • Conformational Selection: Posits that ligands selectively bind to pre-existing favorable conformational states from an ensemble of protein substates [39].

The induced-fit and conformational selection models highlight why treating receptors as static entities is a major limitation in docking simulations.

Computational Strategies for Receptor Flexibility

Incorporating receptor flexibility remains a frontier in docking method development. The following table compares the primary approaches:

Table 1: Comparison of Computational Strategies for Receptor Flexibility

Method Category Key Principle Representative Tools/Protocols Performance Highlights Key Limitations
Ensemble Docking Docking into multiple static receptor conformations from experiments or MD simulations [41] [43]. RCS [43], QR Factorization [43] Reduces false negatives by exploring different pocket shapes [43]. Success depends on proper weighting and coverage of relevant states [41].
On-the-Fly Flexibility Modeling receptor structural changes during the docking process itself [41]. FlexCovDock [41], IFD [41], IFD-MD [41] FlexCovDock increased success rates from 55% to 89% in KRASG12C pose prediction [41]. Computationally intensive; requires predefined flexible residues [41].
Enhanced Sampling & FEP Using molecular dynamics and alchemical transformations for binding affinity calculation [41]. FEP, TI Can achieve high accuracy (MUE ~0.89 kcal/mol) when conformational sampling is sufficient [41]. Struggles with large backbone movements on practical timescales [41].
Case Study: Overcoming KRASG12C Flexibility

The KRASG12C oncogene, once considered "undruggable," features a highly flexible switch-II pocket. This pocket undergoes significant conformational rearrangement upon ligand binding, with its volume increasing from 150 Å³ to 280 Å³ for the inhibitor sotorasib (AMG 510) [41]. Standard covalent docking (CovDock) achieved only a 55% success rate for pose prediction. The developed FlexCovDock protocol, which incorporates a predefined list of flexible residues, improved this rate to 89% across a diverse set of 10 inhibitors [41]. Furthermore, a novel strategy using targeted protein mutations to accelerate conformational transitions of the stubborn switch-II loop reduced the error in FEP binding affinity predictions from 1.44 kcal/mol to 0.89 kcal/mol for 14 compounds [41].

The workflow below illustrates the process of integrating flexibility into docking and affinity calculations for a challenging target like KRAS.

[Workflow diagram: Flexible target (e.g., KRASG12C) → conformational ensemble generation and flexible-residue identification → pose prediction with flexible docking → binding affinity prediction with FEP → if sampling is poor, apply enhanced sampling (e.g., loop mutations) and repeat → final binding pose and ΔG]

Predicting the Impact of Binding Site Mutations

Single-point mutations in a protein's binding site can alter drug response by modulating binding affinity, impacting both drug efficacy and the prevalence of side effects across populations [40]. It is estimated that every individual carries approximately six SNPs affecting five different FDA-approved drugs, and 80% of patients carry at least one functional variant in the targets of the top 100 prescribed drugs [40].

Machine Learning for Mutation Effects

The PSnpBind-ML methodology addresses this by training on a massive database of ~600,000 docking experiments to predict binding affinity changes upon mutation [40]. Its framework and performance are summarized below:

  • Model Architecture: A two-stage regression model; the first predicts wild-type protein-ligand binding affinity, and the second predicts the affinity for the mutated protein [40].
  • Feature Engineering: 256 features numerically represent the protein, binding site, mutation, and ligand, with half manually curated based on domain knowledge [40].
  • Performance: On an independent test set, the best models reported an RMSE of 0.50-0.60 kcal/mol and an R² of 0.87-0.90, outperforming several existing models [40].
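
The two-stage idea can be sketched schematically; the random-forest models and synthetic arrays below mirror the architecture described above (stage two consuming the stage-one prediction is an illustrative design choice, not necessarily the published PSnpBind-ML implementation).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 1000
wt_feats = rng.normal(size=(n, 32))    # wild-type protein + ligand descriptors
mut_feats = rng.normal(size=(n, 16))   # mutation + binding-site descriptors
wt_aff = wt_feats[:, 0] - 8.0 + rng.normal(scale=0.2, size=n)
mut_aff = wt_aff + 0.5 * mut_feats[:, 0] + rng.normal(scale=0.2, size=n)

# Stage 1: predict wild-type protein-ligand binding affinity
stage1 = RandomForestRegressor(n_estimators=200, random_state=0).fit(wt_feats, wt_aff)

# Stage 2: predict the mutant affinity from mutation features plus stage-1 output
stage2_in = np.hstack([mut_feats, stage1.predict(wt_feats)[:, None]])
stage2 = RandomForestRegressor(n_estimators=200, random_state=0).fit(stage2_in, mut_aff)

# The affinity change upon mutation follows from the two predictions
ddg = stage2.predict(stage2_in) - stage1.predict(wt_feats)
print("estimated affinity shifts (first 3):", ddg[:3].round(2))
```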

Table 2: Performance Comparison of Mutation Effect Prediction Methods

Method Name Methodology Key Application Reported Performance
PSnpBind-ML [40] Random Forest on docking-based features General protein-ligand binding RMSE: 0.50-0.60 kcal/mol, R²: 0.87-0.90
mCSM-lig [40] Graph-based signatures & ML General protein-ligand binding Pearson's R: 0.627 (0.737 after outlier removal)
SAAMBE [40] Physics-based & knowledge-based terms Protein-protein binding Pearson's R: 0.62
SARS-CoV-2 Variants of Concern: A Real-World Example

Structural analyses of SARS-CoV-2 spike protein variants provide a clear example of how mutations strategically balance affinity and immune evasion. The N501Y mutation (found in Alpha, Beta, and Gamma variants) directly enhances ACE2 receptor affinity [44]. Conversely, mutations at residues 417 (K417N/T) and 484 (E484K) primarily reduce antibody binding, albeit sometimes at a cost to ACE2 affinity [44]. This demonstrates how VoC evolution is driven by mutations that fine-tune binding interactions for selective advantage.

Comparative Performance of Scoring Functions

Scoring functions are the core of docking protocols, approximating the binding affinity of a ligand by calculating its interaction energy with the protein receptor [1] [3] [45]. They can be categorized into four types:

  • Physics-based: Use classical force fields to calculate binding energy from Van der Waals, electrostatic terms, etc. (High computational cost) [3].
  • Empirical: Estimate binding affinity as a weighted sum of energy terms fit to experimental data (Faster, simpler computation) [3] [45].
  • Knowledge-based: Use pairwise atom/residue distances converted to potentials via Boltzmann inversion (Good balance of speed and accuracy) [3].
  • Machine Learning/Deep Learning-based: Learn complex transfer functions from features and structural data [3].

A pairwise comparison of the five scoring functions in the Molecular Operating Environment (MOE) software using InterCriteria Analysis (ICrA) revealed that the lowest RMSD (root-mean-square deviation between predicted and native poses) was the best-performing docking output [45]. Among the functions, two empirical ones—Alpha HB and London dG—showed the highest comparability [45]. This underscores that the choice of scoring function and the metric used for evaluation are both critical for success.

The Scientist's Toolkit: Essential Research Reagents & Solutions

The following table details key resources and their applications in studying receptor flexibility and mutations.

Table 3: Key Research Reagent Solutions for Docking Studies

Reagent / Resource Type Primary Function in Research
PSnpBind Database [40] Data Resource Provides a large-scale dataset of ~600k docking experiments for training ML models on mutation effects.
CCharPPI Server [3] Computational Tool Allows for the independent assessment of scoring functions, decoupling them from the docking process itself.
FlexCovDock Protocol [41] Computational Method A modified covalent docking workflow that incorporates receptor flexibility for improved pose prediction.
CASF-2013 Benchmark [45] Data Resource A standardized benchmark set of 195 protein-ligand complexes for validating and comparing scoring functions.
Targeted Loop Mutation [41] Computational Strategy A technique to accelerate conformational sampling in FEP calculations for flexible binding sites.
PDBbind Database [45] Data Resource A comprehensive collection of protein-ligand complexes with binding affinity data for general benchmarking.

The accurate computational prediction of protein-ligand binding in the face of receptor flexibility and genetic mutations remains a formidable challenge. No single method universally outperforms all others; rather, the choice depends on the specific biological problem and available resources. For binding pose prediction in highly flexible proteins like KRAS, incorporating explicit flexibility through protocols like FlexCovDock is crucial [41]. For anticipating population-wide drug response or resistance due to mutations, efficient machine learning models like PSnpBind-ML trained on large docking datasets offer a powerful solution [40]. Finally, the selection and evaluation of scoring functions should be guided by rigorous benchmarks like CASF, with empirical functions such as Alpha HB and London dG often providing a strong balance of performance and comparability [45]. As these methods continue to evolve and integrate, they will undoubtedly enhance our ability to design effective and personalized therapeutics.

In computational drug discovery, molecular docking is a cornerstone technique used to predict how a small molecule (ligand) binds to a protein target. The reliability of these predictions critically depends on the scoring functions that estimate the binding affinity. However, no single scoring function is universally superior; each has distinct strengths and weaknesses due to its unique parameterization and training data, leading to variable performance across different protein families and ligand chemotypes [5] [46]. This limitation presents a significant challenge in virtual screening (VS) campaigns, where the goal is to identify genuine active compounds from large chemical libraries.

Consensus scoring (CS) has emerged as a powerful strategy to overcome the limitations of individual scoring functions. This approach combines the results from multiple, independent docking programs or scoring functions to produce a more robust and reliable ranking of potential ligands [47]. The underlying principle is that true active compounds are more likely to be ranked favorably by several different methods, whereas false positives are less consistent. By leveraging the strengths of various functions, consensus scoring mitigates the risk of method-specific errors and data biases, thereby improving the fidelity of ligand-protein docking predictions and enhancing the enrichment of true hits in virtual screening [47] [48]. This guide provides a comparative analysis of consensus scoring methodologies, supported by experimental data and detailed protocols, to inform their application in drug development.

The Need for Consensus: Quantitative Evidence of Performance Variation

Substantial evidence demonstrates that the performance of individual docking scoring functions is highly system-dependent. A pairwise comparison of the five scoring functions within the Molecular Operating Environment (MOE) software—London dG, ASE, Affinity dG, Alpha HB, and GBVI/WSA dG—on the CASF-2013 benchmark revealed significant performance disparities. The analysis, which employed InterCriteria Analysis (ICrA), found that the degree of agreement between these functions was often low, hitting a "dissonance zone," particularly for outputs like the best docking score or the RMSD of the best-score pose [5] [45]. This inconsistency underscores the inherent risk of relying on a single function.

Further independent studies corroborate this variability. Research evaluating six docking programs (AutoDock, ICM, LeDock, rDock, AutoDock Vina, and Smina) across diverse protein targets like CDK2, ESR1, ADRB2, and CAH2 showed that no single program consistently outperformed the others [46]. For instance, while ICM generally presented good performance, it was one of the worst performers for the CAH2 system [46]. This system dependency highlights a critical vulnerability in individual scoring functions, which consensus strategies are designed to address.

Table 1: Performance Variation of Individual Docking Programs

Docking Program Reported Typical Success Rate Performance Notes
Smina >90% [47] High accuracy but performance is system-dependent [46].
AutoDock Vina ~80% [47] Widely used, but performance varies across targets [46].
Gemdock ~79% [47] Success rate is not consistent for all systems.
ICM Not specified Often among the best performers, but failed on CAH2 [46].
rDock Not specified Generally good performance, but system-dependent [46].
LeDock ~75% [47] Performance is not universal.

Performance Comparison: Consensus vs. Individual Scoring

Quantitative benchmarks demonstrate that consensus scoring consistently delivers superior results compared to individual docking methods. A landmark study combining ten docking programs (ADFR, DOCK, Gemdock, Ledock, PLANTS, PSOVina, QuickVina2, Smina, Autodock Vina, and VinaXB) against 29 MRSA-oriented targets showed that CS provides improved ligand-protein docking fidelity [47]. The study demonstrated that a finite, well-chosen combination of docking programs is sufficient to achieve high accuracy, offering a parsimonious and viable alternative to computationally expensive single-method docking [47].

The performance gain from consensus scoring is particularly evident in virtual screening enrichment, which measures the ability to prioritize active compounds over decoys. A study on six protein targets (including PPARG and DPP4) found that a novel consensus scoring workflow integrating QSAR, pharmacophore, docking, and 2D shape similarity methods achieved exceptional Area Under the Curve (AUC) values of 0.90 and 0.84 for PPARG and DPP4, respectively. This consensus approach consistently prioritized compounds with higher experimental pIC50 values and outperformed all individual screening methodologies [48].

Another powerful illustration comes from the introduction of the Exponential Consensus Ranking (ECR) method. Tests on benchmark systems showed that traditional consensus methods (like taking the intersection of top-ranked molecules from multiple programs) could achieve an enrichment factor at 2% (EF2) of up to 25.8 for a single CDK2 structure. However, the novel ECR method matched or surpassed the performance of the top traditional consensus strategies across a wide range of systems, proving to be a simple, robust, and parameter-independent solution [46].

Table 2: Enrichment Factor at 2% (EF2) for Consensus vs. Individual Docking [46]

| Docking Method / Strategy | CDK2 (1AQ1) | ESR1 (3ERT) | ADRB2 (3NYA) | CAH2 (1G52) |
| --- | --- | --- | --- | --- |
| ICM (Individual) | 21.5 | 20.0 | 13.6 | 2.2 |
| AutoDock Vina (Individual) | 11.1 | 14.1 | 7.3 | 11.9 |
| Traditional Consensus (Intersection) | 25.8 | 21.4 | 18.2 | 13.6 |
| Exponential Consensus (ECR) | 25.8 | 21.4 | 18.2 | 15.3 |

Key Methodologies and Experimental Protocols

Consensus scoring strategies can be broadly categorized based on how they process and combine the outputs from individual docking runs. The choice of methodology directly impacts the robustness and success of a virtual screening campaign.

Data Normalization and Combination Strategies

A critical first step in consensus scoring, especially when combining heterogeneous docking programs, is data normalization. Different programs report scores in different units and on different scales, making direct comparison invalid. Common normalization procedures include the following [47] (a minimal sketch follows the list):

  • Rank Transformation: Converting raw docking scores into ranks (e.g., rank 1 for the best score).
  • Minimum-Maximum Scaling: Scaling all scores to a common range, typically [0, 1].
  • Z-score Scaling: Standardizing scores based on the mean and standard deviation of the distribution.
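For illustration, the following minimal NumPy sketch applies all three procedures to a toy list of docking scores; it assumes the usual docking convention that more negative scores are better.

```python
import numpy as np

def normalize_scores(scores):
    """Rank, min-max, and z-score normalizations of raw docking scores.

    Assumes more negative scores are better (the usual docking
    convention), so rank 1 goes to the lowest score.
    """
    s = np.asarray(scores, dtype=float)
    ranks = s.argsort().argsort() + 1             # rank transformation: best = 1
    minmax = (s - s.min()) / (s.max() - s.min())  # scale to [0, 1]
    zscores = (s - s.mean()) / s.std()            # standardize by mean and std
    return ranks, minmax, zscores

ranks, minmax, z = normalize_scores([-9.2, -7.5, -8.8, -6.1])
print(ranks)  # [1 3 2 4] -> the best-scoring compound gets rank 1
```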

After normalization, several core strategies exist for combining the results:

  • Traditional Rank-Based Consensus: This includes methods like "rank-by-rank" (averaging the ranks of a molecule across programs) and "rank-by-vote" (counting how many times a molecule appears in the top-N% of lists from different programs) [46]. The traditional intersection approach selects molecules that are top-ranked by all programs.
  • Exponential Consensus Ranking (ECR): A novel method that overcomes the limitations of traditional consensus. It assigns a score to each molecule based on the sum of exponential distributions of its ranks from each program: \[ P(i) = \frac{1}{\sigma} \sum_{j} \exp\left(-\frac{r_i^j}{\sigma}\right) \] where \( r_i^j \) is the rank of molecule \( i \) by program \( j \), and \( \sigma \) is a parameter setting the consensus threshold. ECR acts like a conditional "OR", rewarding molecules that rank well in any program, not necessarily all [46] (see the sketch after this list).
  • Machine Learning (ML) and Holistic Pipelines: Advanced workflows use ML models to amalgamate scores from diverse screening methods (e.g., docking, pharmacophore, QSAR). A novel approach involves calculating a "w_new" metric to rank fine-tuned ML models, which then generate a final consensus score through a weighted average Z-score [48].
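To make the ECR formula concrete, here is a minimal sketch with toy ranks and an invented \( \sigma \); the library size, number of programs, and parameter value are illustrative only.

```python
import numpy as np

def ecr_scores(rank_matrix, sigma):
    """Exponential Consensus Ranking.

    rank_matrix: array of shape (n_compounds, n_programs), where entry
    [i, j] is the rank r_i^j of compound i assigned by program j
    (rank 1 = best). sigma sets the effective consensus threshold,
    e.g. 5-10% of the library size.
    """
    ranks = np.asarray(rank_matrix, dtype=float)
    # P(i) = (1/sigma) * sum_j exp(-r_i^j / sigma)
    return np.exp(-ranks / sigma).sum(axis=1) / sigma

# Toy example: 4 compounds ranked by 3 docking programs.
ranks = [[1, 2, 1],
         [2, 1, 4],
         [3, 4, 2],
         [4, 3, 3]]
p = ecr_scores(ranks, sigma=2.0)
order = np.argsort(-p)  # highest ECR score first = top consensus hits
print(p.round(3), order)
```

Because the exponential decays quickly with rank, a compound ranked near the top by even one program retains a meaningful contribution, which is exactly the conditional "OR" behavior described above.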

Protocol for Implementing Exponential Consensus Ranking

The following protocol outlines the steps to perform a virtual screening campaign using the ECR method [46]:

  • Docking Execution: Run molecular docking for all compounds in the library against the target protein using at least two, but preferably more, independent docking programs (e.g., AutoDock Vina, Smina, rDock, ICM).
  • Rank Assignment: For each docking program, rank all compounds from best (rank 1) to worst based on their docking scores.
  • ECR Score Calculation: For each compound, calculate its ECR score using the formula above. The parameter \( \sigma \) can be set to the number of compounds to consider (e.g., the top 5-10% of the library size); performance is largely independent of the specific \( \sigma \) value chosen.
  • Final Ranking: Rank all compounds based on their final ECR score, \( P(i) \), in descending order. Compounds with the highest ECR scores are the top consensus hits.
  • Validation (Optional): For compounds with identical ECR scores, shuffle their ranks multiple times and calculate the average enrichment to ensure statistical robustness.

[Workflow diagram: start virtual screening → dock compound library using multiple programs → normalize scores (rank, Z-score, or min-max) → apply consensus strategy (traditional rank-based, ECR, or machine-learning consensus) → generate final ranked hit list → identify top candidates for experimental validation]

Diagram: A generalized workflow for implementing consensus scoring in virtual screening. The process involves docking with multiple programs, normalizing the results, applying a consensus strategy, and generating a final ranked list.

The Scientist's Toolkit: Essential Research Reagents and Software

Implementing a successful consensus scoring strategy requires access to specific computational tools and datasets. The table below details key resources referenced in the experimental studies.

Table 3: Essential Research Reagents and Software for Consensus Docking

| Item Name / Resource | Type / Category | Primary Function in Research | Example Use in Consensus Scoring |
| --- | --- | --- | --- |
| Molecular Operating Environment (MOE) [5] | Commercial Software Suite | Provides a suite of empirical (London dG, Alpha HB) and force-field (GBVI/WSA dG) scoring functions for docking. | Used as a source of multiple scoring functions for pairwise performance comparison and consensus building [5] [49]. |
| AutoDock Vina & Smina [47] [46] | Open-Source Docking Program | Fast, widely-used docking tools for pose prediction and scoring. Smina is a fork of Vina with added features. | Frequently included in consensus ensembles due to their speed, accessibility, and generally good performance [47] [46]. |
| CASF-2013 Benchmark Set [5] | Curated Dataset | A high-quality set of 195 diverse protein-ligand complexes from the PDBbind database, with binding affinity data. | Serves as a standard benchmark for the comparative assessment of scoring functions and consensus methods [5] [45]. |
| DUD-E Repository [47] [48] | Curated Dataset | The Directory of Useful Decoys, Enhanced, provides active ligands and decoys for protein targets to test screening enrichment. | Used to rigorously evaluate the ability of consensus scoring to distinguish true actives from decoys in virtual screening [47] [48]. |
| ICM [46] | Commercial Docking Program | A comprehensive docking and molecular modeling software. | Often used in consensus docking studies due to its high reported performance, though it remains system-dependent [46]. |
| Exponential Consensus Ranking (ECR) [46] | Computational Algorithm | A novel rank-based consensus method that sums exponential distributions of individual ranks. | Provides a robust, parameter-independent strategy for combining results from multiple docking programs [46]. |

The empirical data and comparative analyses presented in this guide unequivocally demonstrate that consensus scoring is a superior strategy for enhancing the reliability of molecular docking outcomes. By integrating predictions from multiple, independent scoring functions, researchers can effectively mitigate the system-dependent weaknesses of any single method, leading to improved virtual screening enrichment, higher success rates in pose prediction, and a more robust identification of true bioactive compounds. Methodologies like Exponential Consensus Ranking and machine learning-powered holistic pipelines represent the cutting edge, offering robust, parameter-independent solutions. For researchers and drug development professionals, adopting a consensus approach is no longer just an option but a necessary best practice to de-risk computational campaigns and improve the odds of success in lead discovery.

Selecting the optimal docking scoring function is a critical step in structure-based drug design, as no single function performs universally best across all protein targets. This guide provides an objective comparison of scoring function performance based on recent benchmarking studies, offering researchers a data-driven strategy for tailoring their selection to specific projects.

Molecular docking is a cornerstone of computer-aided drug design (CADD), simulating how a small molecule ligand binds to a protein target. Scoring functions are the core computational algorithms that approximate the binding affinity by calculating the interaction energy between the ligand and protein [5]. They are essential for predicting the correct binding pose (pose prediction) and estimating the strength of the interaction (affinity prediction) [39] [17].

The performance of these functions varies significantly based on their underlying physical principles and computational methods. Traditional functions are often categorized as:

  • Physics-based: Use classical force fields to calculate terms like van der Waals and electrostatic interactions [3].
  • Empirical: Estimate binding affinity by summing weighted energy terms derived from linear regression to experimental data [5].
  • Knowledge-based: Use statistical potentials derived from the frequency of atom-pair interactions in known structures [3].

Recently, machine learning (ML) and deep learning (DL) methods have emerged, leveraging neural networks to learn complex patterns from large datasets of protein-ligand complexes [17] [50]. The choice of function directly impacts the success of virtual screening and lead optimization campaigns, making function selection a key strategic decision.

Performance Comparison of Scoring Functions

Comparative Data from Benchmarking Studies

Benchmarking studies on standardized datasets like CASF-2013 and CASF-2016 provide critical insights into the relative performance of different scoring functions. The following tables summarize key quantitative comparisons from recent literature.

Table 1: Pose Prediction Performance of MOE Scoring Functions (CASF-2013 Benchmark)

| Scoring Function | Type | BestRMSD (Å) | RMSD_BestDS (Å) | Comparability (ICrA) |
| --- | --- | --- | --- | --- |
| Alpha HB | Empirical | 1.42 | 3.21 | High (with London dG) |
| London dG | Empirical | 1.58 | 3.05 | High (with Alpha HB) |
| Affinity dG | Empirical | 1.63 | 3.14 | Moderate |
| ASE | Empirical | 1.71 | 3.33 | Moderate |
| GBVI/WSA dG | Force-field | 1.66 | 3.28 | Low |

Table 2: Virtual Screening Performance Across Docking Software (CASF-2016 Benchmark)

| Docking Software | Scoring Function | Pearson Correlation (Rc) with Experimental Affinity |
| --- | --- | --- |
| AutoDock Vina | Empirical | 0.604 |
| GOLD | Various | 0.416 - 0.617 |
| MOE | Various | 0.405 - 0.591 |
| Glide | Various | 0.467 - 0.513 |
| AEV-PLIG (ML) | Graph Neural Network | 0.59 - 0.85 (varies by benchmark) |

Table 3: Performance Tiers of Modern Docking Methods (Multi-Dataset Analysis)

| Method Category | Representative Methods | Pose Accuracy (RMSD ≤ 2 Å) | Physical Validity (PB-Valid) | Combined Success Rate |
| --- | --- | --- | --- | --- |
| Traditional | Glide SP, AutoDock Vina | Moderate | >94% | High |
| Hybrid (AI Scoring) | Interformer | High | High | High |
| Generative Diffusion | SurfDock, DiffBindFR | >70% | 40-63% | Moderate |
| Regression-Based | KarmaDock, QuickBind | Low | Low | Low |

Key Findings from Comparative Analysis

  • Trade-offs between Pose Accuracy and Physical Validity: Deep learning methods, particularly generative diffusion models like SurfDock, demonstrate superior pose prediction accuracy (exceeding 70% success rates across multiple datasets). However, they often produce physically implausible structures, with validity rates dropping to 40% on novel targets [17]. Traditional methods like Glide SP maintain high physical validity (>94%) across diverse datasets.

  • Scoring-Power Performance Gap: Classical scoring functions from mainstream docking software (AutoDock Vina, MOE, Glide) show weak to moderate correlation (Rc ~0.4-0.6) with experimental binding affinities [51]. Modern ML-based scoring functions like AEV-PLIG achieve significantly higher correlation (PCC 0.85-0.90) on standard benchmarks [52], though their performance can decrease on out-of-distribution complexes.

  • Function-Specific Strengths: The pairwise comparison of MOE's functions revealed that Alpha HB and London dG showed the highest comparability in performance, particularly for pose prediction measured by BestRMSD [5]. This suggests these functions may be interchangeable for certain applications, while GBVI/WSA dG (a force-field function) exhibited distinct behavior.

Experimental Protocols for Benchmarking

Standardized Evaluation Methodology

To ensure fair comparison across scoring functions, researchers employ standardized benchmarking protocols:

Dataset Preparation:

  • The CASF (Comparative Assessment of Scoring Functions) benchmark is widely used, with CASF-2013 comprising 195 high-quality protein-ligand complexes from the PDBbind database [5]. The dataset covers diverse protein families and ligand chemotypes to ensure broad applicability.
  • For rigorous generalization testing, out-of-distribution (OOD) benchmarks include novel protein binding pockets (DockGen dataset) and congeneric series typical of lead optimization campaigns [17] [52].

Evaluation Metrics:

  • Pose Prediction: Root-mean-square deviation (RMSD) between predicted pose and crystallographic reference, with RMSD ≤ 2.0 Å considered successful [17].
  • Scoring Power: Pearson correlation coefficient (Rc) between predicted scores and experimental binding affinities [51].
  • Physical Validity: Assessed using tools like PoseBusters to check for geometric inconsistencies, steric clashes, and chemical correctness [17].
  • Virtual Screening Performance: Enrichment factors, area under ROC curve (AUC), and ability to discriminate strong from weak binders [51].

Implementation Protocol (a sketch of the per-complex data extraction follows this list):

  • Re-docking: Ligands from crystallized complexes are extracted and re-docked into their protein binding sites.
  • Pose Generation: Multiple poses (typically 30) are generated for each complex using the docking software's sampling algorithm.
  • Scoring: Each pose is scored using the functions being evaluated.
  • Data Extraction: For each complex, researchers record:
    • Best docking score (BestDS)
    • Lowest RMSD between predicted and native pose (BestRMSD)
    • RMSD of the best-scoring pose (RMSD_BestDS)
    • Docking score of the pose with lowest RMSD (DS_BestRMSD)
  • Statistical Analysis: Results are aggregated across the entire dataset and analyzed using correlation metrics and specialized approaches like InterCriteria Analysis (ICrA) [5].
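A minimal sketch of the per-complex data-extraction step, assuming each generated pose is represented by a (docking score, RMSD) pair with lower scores being better:

```python
def extract_complex_metrics(poses):
    """poses: list of (docking_score, rmsd) tuples for one complex.

    Returns the four outputs recorded per complex in the protocol.
    Assumes lower docking scores are better (the usual convention).
    """
    best_ds, rmsd_best_ds = min(poses, key=lambda p: p[0])
    ds_best_rmsd, best_rmsd = min(poses, key=lambda p: p[1])
    return {
        "BestDS": best_ds,            # best docking score
        "RMSD_BestDS": rmsd_best_ds,  # RMSD of the best-scoring pose
        "BestRMSD": best_rmsd,        # lowest RMSD among all poses
        "DS_BestRMSD": ds_best_rmsd,  # score of the lowest-RMSD pose
    }

# Example: four poses for one complex.
metrics = extract_complex_metrics([(-8.2, 3.1), (-7.9, 1.4), (-7.1, 0.9), (-6.5, 5.0)])
print(metrics)  # BestDS=-8.2, RMSD_BestDS=3.1, BestRMSD=0.9, DS_BestRMSD=-7.1
```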

Advanced Methodologies for Robust Assessment

InterCriteria Analysis (ICrA): A multi-criteria decision-making approach that reveals relations between scoring functions by analyzing their performance across multiple docking outputs. ICrA calculates degrees of agreement between functions, helping identify which pairs perform similarly under specific conditions [5].

Multi-Dimensional Evaluation: Comprehensive assessment across five critical dimensions:

  • Pose prediction accuracy (RMSD)
  • Physical plausibility (steric clashes, bond geometry)
  • Interaction recovery (hydrogen bonds, hydrophobic contacts)
  • Virtual screening efficacy (enrichment, hit identification)
  • Generalization across diverse protein-ligand landscapes [17]

Decision Framework for Function Selection

Target-Aware Selection Strategy

The optimal scoring function depends on the specific protein target, available structural information, and project goals. The following diagram illustrates the strategic decision process for selecting scoring functions.

[Decision diagram: protein target analysis → if no high-resolution structure is available, use traditional methods (Glide SP, AutoDock Vina); if a structure is available and the target is similar to proteins in the training data, use ML/DL methods (AEV-PLIG, FeatureDock), otherwise hybrid approaches; when the project goal is pose prediction, prioritize generative diffusion models (SurfDock), and for affinity estimation, prioritize ML scoring functions (AEV-PLIG); limited computational resources favor traditional methods; experimental validation is recommended in all cases]

Practical Recommendations for Different Scenarios

  • For Novel Targets with Limited Structural Information: Traditional methods like Glide SP or AutoDock Vina provide more reliable performance due to their physical foundations and lower sensitivity to training data distribution [17].

  • For Well-Characterized Protein Families: ML-based scoring functions like AEV-PLIG and FeatureDock typically outperform classical functions when the target shares similarity with proteins in the training data [52] [51].

  • When Pose Accuracy is Critical: Generative diffusion models (e.g., SurfDock) currently achieve the highest pose prediction accuracy, though they require post-docking validation for physical plausibility [17].

  • For Virtual Screening campaigns: Hybrid approaches that combine traditional conformational sampling with AI-based rescoring offer the best balance between accuracy and physical validity [17] [50].

  • For Lead Optimization Projects: ML scoring functions trained with augmented data (including docked poses) show improved performance in ranking congeneric series, narrowing the gap with more computationally intensive free energy perturbation (FEP) methods [52].

The Scientist's Toolkit: Essential Research Reagents

Table 4: Key Computational Tools for Scoring Function Evaluation

| Tool/Resource | Type | Primary Function | Application Context |
| --- | --- | --- | --- |
| PDBbind Database | Database | Curated collection of protein-ligand complexes with binding affinity data | Provides standardized benchmark sets (e.g., CASF-2013, CASF-2016) for scoring function evaluation [5] |
| MOE (Molecular Operating Environment) | Software Suite | Comprehensive drug discovery platform with multiple scoring functions | Contains five specialized scoring functions (London dG, ASE, Affinity dG, Alpha HB, GBVI/WSA dG) for comparative studies [5] |
| PoseBusters | Validation Tool | Checks physical plausibility and chemical correctness of docking poses | Detects steric clashes, geometric inconsistencies, and chemical errors in predicted structures [17] |
| CCharPPI Server | Evaluation Platform | Isolated assessment of scoring functions independent of docking protocols | Enables direct comparison of scoring functions without confounding factors from docking algorithms [3] |
| FeatureDock | Deep Learning Framework | Transformer-based docking with physicochemical feature learning | Predicts binding poses and provides strong scoring power for virtual screening [51] |
| AEV-PLIG | ML Scoring Function | Attention-based graph neural network for affinity prediction | Combines atomic environment vectors with protein-ligand interaction graphs for improved binding affinity prediction [52] |

Tailoring scoring function selection to specific protein targets requires careful consideration of multiple factors, including target characteristics, project goals, and available resources. While modern ML-based functions show promising performance on standard benchmarks, traditional methods maintain advantages in physical plausibility and generalization to novel targets.

The most effective strategy involves:

  • Using traditional or hybrid methods for novel targets with limited structural information
  • Leveraging ML-based functions for well-characterized protein families
  • Validating predictions with physical plausibility checks, especially for generative models
  • Considering ensemble approaches that combine multiple scoring functions for critical applications

As the field evolves, the integration of physics-based principles with data-driven approaches and the development of more robust benchmarking protocols will further enhance our ability to select optimal scoring functions for specific protein targets.

This guide objectively compares the performance of AI-powered molecular docking methods with traditional physics-based approaches, focusing on their ability to enrich true active compounds in virtual screening. The evaluation is based on recent, rigorous benchmarking studies to inform researchers and drug development professionals.

Performance Comparison: AI-Powered vs. Traditional Docking Methods

Comprehensive benchmarking reveals a complex performance landscape where no single method dominates across all metrics. The following table summarizes key performance indicators from multi-dimensional evaluations.

Table 1: Overall Docking Performance and Physical Validity Across Benchmark Datasets [17]

| Method Category | Example Methods | Pose Accuracy (RMSD ≤ 2 Å) | Physical Validity (PB-valid rate) | Combined Success (RMSD ≤ 2 Å & PB-valid) |
| --- | --- | --- | --- | --- |
| Traditional | Glide SP, AutoDock Vina | Moderate to High | >94% (consistently high) | Highest tier |
| Hybrid (AI Scoring) | Interformer | Moderate | High | Second-highest tier |
| Generative AI | SurfDock, DiffBindFR | High (e.g., SurfDock: >70%) | Moderate to Low (e.g., SurfDock: 40-63%) | Moderate |
| Regression-based AI | KarmaDock, GAABind | Low | Lowest | Lowest tier |

Table 2: Virtual Screening Enrichment and Computational Efficiency [53] [54] [55]

| Method | Early Enrichment (DUD-E) | Screening Speed | Top 0.01% Recall (logAUC) | Required Training Set |
| --- | --- | --- | --- | --- |
| Glide WS | Superior to Glide SP | ~20x slower than Glide SP | Not reported | Not applicable |
| Machine Learning-Guided (CatBoost) | Not reported | >1000x improvement over full docking | 0.77 (stratified sampling) vs. 0.49 (random) | 1 million compounds |
| AI-Powered Docking (general) | Shows "great potential" [53] | Varies by method | Not reported | Not reported |

Experimental Protocols for Benchmarking

VSDS-vd Benchmark and Evaluation Metrics

A comprehensive virtual screening benchmark test set, VSDS-vd, was constructed to evaluate both physically-based and AI-powered docking methods. The evaluation framework assessed several critical dimensions for virtual screening [53] [17]:

  • Pose Prediction Accuracy: Measured by heavy-atom Root-Mean-Square Deviation (RMSD) between predicted and experimental ligand poses, with success defined as RMSD ≤ 2.0 Å.

  • Physical Validity: Assessed using the PoseBusters toolkit, which checks for chemical and geometric consistency, including bond lengths, angles, stereochemistry, and protein-ligand steric clashes.

  • Virtual Screening Efficacy: Evaluated through early enrichment performance on the DUD-E dataset, measuring the ability to recover true binders near the top of the ranking list.

  • Generalization Capability: Tested across proteins with varying sequence similarity, ligand topologies, and novel binding pockets not encountered during training.

Machine Learning-Guided Docking Workflow

Recent research has developed protocols combining machine learning with molecular docking to screen ultralarge libraries [54] (a schematic sketch follows the list):

  • Training Set Creation: For a given target, 1 million compounds are randomly selected and docked. Docking scores are used to label compounds, with the top 1% typically defined as "active."

  • Classifier Training: Machine learning models (CatBoost with Morgan2 fingerprints recommended) are trained to predict docking scores based on molecular structures.

  • Conformal Prediction: The trained model screens billions of compounds. The Mondrian conformal prediction framework is applied, allowing users to control error rates and define confidence levels for predictions.

  • Docking Validation: The significantly reduced compound set identified by ML (typically ~10% of the original library) undergoes explicit molecular docking.
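The sketch below illustrates the flavor of this workflow under simplifying assumptions: Morgan radius-2 ("Morgan2") fingerprints via RDKit, a CatBoost classifier, and a bare-bones class-conditional (Mondrian) conformal p-value. It is a schematic, not the published pipeline; the parameter choices and helper names are illustrative, and it assumes the rdkit and catboost packages are installed.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from catboost import CatBoostClassifier

def morgan_fp(smiles, n_bits=2048):
    """Morgan radius-2 fingerprint as a numpy bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits))

def train_surrogate(train_smiles, docking_scores, top_percent=1.0):
    """Label the best-scoring top_percent of the docked sample as 'active'
    (more negative docking scores are better) and fit a classifier."""
    X = np.array([morgan_fp(s) for s in train_smiles])
    cutoff = np.percentile(docking_scores, top_percent)
    y = (np.asarray(docking_scores) <= cutoff).astype(int)
    model = CatBoostClassifier(iterations=500, verbose=False)
    model.fit(X, y)
    return model

def mondrian_pvalue(model, X_cal, y_cal, x_new, label=1):
    """Simplified class-conditional (Mondrian) conformal p-value for the
    'active' label, using 1 - P(label) as the nonconformity score.
    X_cal, y_cal: held-out calibration arrays not used for training."""
    alpha_cal = 1.0 - model.predict_proba(X_cal[y_cal == label])[:, label]
    alpha_new = 1.0 - model.predict_proba(x_new.reshape(1, -1))[0, label]
    return (np.sum(alpha_cal >= alpha_new) + 1) / (len(alpha_cal) + 1)

# A compound is forwarded to explicit docking if 'active' remains in its
# conformal prediction set at significance eps, i.e. if p_active > eps.
```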

AI-Powered Docking Workflow

The following diagram illustrates the machine learning-guided docking workflow that enables efficient screening of billion-compound libraries, balancing speed and accuracy.

[Workflow diagram: ultralarge compound library (billions of molecules) → random sample of 1 million compounds → molecular docking → train ML model (CatBoost with Morgan2 fingerprints) → ML prediction with conformal framework over the full library → filtered library (~10% of original size) → final docking validation → identified hits]

Table 3: Key Software and Databases for Docking Research [5] [56] [54]

| Resource Name | Type | Primary Function | Application Context |
| --- | --- | --- | --- |
| CASF-2013 | Benchmark Dataset | Curated set of 195 protein-ligand complexes from PDBbind for scoring function assessment | Method validation and comparison |
| DUD-E | Benchmark Dataset | Directory of useful decoys for virtual screening evaluation with active compounds and property-matched decoys | Enrichment performance testing |
| LSD Database (lsd.docking.org) | Data Repository | Docking scores, poses, and experimental results for 6.3 billion molecules across 11 targets | Training machine learning models, benchmarking |
| PoseBusters | Validation Tool | Checks physical and chemical validity of predicted ligand poses | Quality control for docking results |
| Chemprop/CatBoost | Machine Learning Framework | Predicts docking scores from molecular structures using Morgan fingerprints | Accelerating ultralarge library screening |
| Glide WS | Docking Software | Advanced docking incorporating water energetics from WaterMap | High-accuracy pose prediction and scoring |
| VSDS-vd | Benchmark Dataset | Comprehensive virtual screening test set for evaluating AI and traditional methods | Comparative performance assessment |

Benchmarking the State-of-the-Art: A 2025 Performance Showdown

Molecular docking is a cornerstone of modern computational drug discovery, enabling researchers to predict how small molecules interact with target proteins [57]. The reliability of these predictions, however, hinges on the rigorous evaluation of docking performance using standardized metrics. Three fundamental metrics underpin docking protocol validation: Root Mean Square Deviation (RMSD) for assessing binding pose accuracy, Enrichment Factors (EF) for quantifying virtual screening efficiency, and Receiver Operating Characteristic (ROC) curves for evaluating classification performance in distinguishing active from inactive compounds [24] [58]. These metrics provide complementary insights and, when used collectively, offer a comprehensive framework for comparing the performance of different docking programs and scoring functions. This guide objectively compares current docking tools using these standardized metrics, providing researchers with experimental data to inform their methodological selections for drug discovery pipelines.

Core Performance Metrics Explained

Root Mean Square Deviation (RMSD)

RMSD measures the average distance between the atoms of a docked ligand pose and its experimentally determined conformation in a crystal structure. It is the primary metric for evaluating the accuracy of binding mode predictions [24]. The formula for calculating RMSD involves the square root of the average squared distances between corresponding atoms in the superimposed structures.
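Written out explicitly, the standard definition over the matched atoms is:

\[ \mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left\lVert \mathbf{x}_i^{\mathrm{pred}} - \mathbf{x}_i^{\mathrm{ref}} \right\rVert^{2}} \]

where \( N \) is the number of matched atoms (often restricted to heavy atoms) and \( \mathbf{x}_i^{\mathrm{pred}} \) and \( \mathbf{x}_i^{\mathrm{ref}} \) are the coordinates of atom \( i \) in the predicted and reference structures.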

A lower RMSD value indicates a closer match to the experimental structure. The widely accepted threshold for a successful docking pose is an RMSD value of less than 2.0 Å [24] [34]. Poses below this threshold are considered correctly predicted, as the ligand's orientation and conformation closely resemble the native state. Performance is often reported as the percentage of ligands in a test set for which a docking program can generate a pose with an RMSD below this 2.0 Å cutoff [24].

Enrichment Factor (EF)

Enrichment Factor quantifies the effectiveness of a virtual screening workflow in identifying true active compounds from a large database of decoy molecules. It measures the concentration of active compounds found in a selected top fraction of the ranked database compared to a random selection [58] [34].

The formula for EF is: \[ EF = \frac{N_{hit}^{selected} / N_{total}^{selected}}{N_{hit}^{total} / N_{total}^{total}} \] where:

  • \( N_{hit}^{selected} \) is the number of active compounds found in the top fraction
  • \( N_{total}^{selected} \) is the total number of compounds in the top fraction
  • \( N_{hit}^{total} \) is the total number of active compounds in the database
  • \( N_{total}^{total} \) is the total number of compounds in the database

An EF of 1 corresponds to random selection, and the theoretical maximum, \( N_{total}^{total} / N_{hit}^{total} \), represents perfect enrichment (values below 1 indicate worse-than-random performance). It is common to report EF at specific early cutoffs, such as EF1% or EF2%, which are critical for assessing performance in realistic virtual screening scenarios where only a small fraction of the database can be tested experimentally [58] [34].
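As a worked illustration, the sketch below (with invented toy numbers) computes EF at an arbitrary early fraction from a ranked list of binary activity labels:

```python
def enrichment_factor(ranked_labels, fraction):
    """ranked_labels: 1 for active, 0 for decoy, ordered best-first.
    fraction: e.g. 0.01 for EF1%, 0.02 for EF2%."""
    n_total = len(ranked_labels)
    n_sel = max(1, int(round(n_total * fraction)))
    hits_sel = sum(ranked_labels[:n_sel])
    hits_total = sum(ranked_labels)
    return (hits_sel / n_sel) / (hits_total / n_total)

# Toy library: 1000 compounds, 20 actives; 5 actives recovered in the
# top 1% (10 compounds).
labels = [1]*5 + [0]*5 + [0]*970 + [1]*15 + [0]*5
print(enrichment_factor(labels, 0.01))  # (5/10) / (20/1000) = 25.0
```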

Receiver Operating Characteristic (ROC) Curves

ROC curves provide a comprehensive visualization of a scoring function's ability to discriminate between active and inactive compounds across all possible classification thresholds [24] [58]. A ROC curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1-Specificity).

The Area Under the Curve (AUC) is a single numeric value that summarizes the overall performance of the classifier, with a perfect classifier achieving an AUC of 1.0 and a random classifier achieving 0.5 [24]. In virtual screening, a higher AUC indicates a better ability to prioritize active compounds over inactive ones throughout the entire ranking list, providing a complementary perspective to the early enrichment captured by EF [58].
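As a concrete illustration, the AUC can be computed directly from ranked docking scores; the sketch below uses toy labels and scores and assumes scikit-learn is available. Note that docking scores are negated so that higher values indicate the positive (active) class, as the metric expects.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# 1 = active, 0 = decoy; docking scores where more negative is better.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
docking_scores = np.array([-9.1, -6.2, -8.4, -7.0, -5.5, -7.8, -6.9, -5.1])

# roc_auc_score expects higher values to indicate the positive class,
# so negate the docking scores before scoring.
auc = roc_auc_score(y_true, -docking_scores)
fpr, tpr, thresholds = roc_curve(y_true, -docking_scores)
print(f"AUC = {auc:.2f}")
```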

Experimental Protocols for Metric Evaluation

Standardized Benchmarking Workflow

A robust evaluation of docking performance requires a standardized experimental workflow to ensure fair comparisons between different programs and scoring functions. The following protocol, synthesized from multiple benchmarking studies, outlines the key steps:

  • Dataset Curation: Compile a set of high-quality protein-ligand complexes from the Protein Data Bank (PDB) [24]. The set should encompass diverse protein families and ligand chemotypes. Complexes are typically prepared by removing redundant chains, water molecules, and cofactors, followed by adding missing hydrogen atoms and assigning correct protonation states [24] [59].

  • Pose Prediction (for RMSD assessment): For each complex, separate the ligand from the protein. Use each docking program to re-dock the ligand back into the prepared binding site. Generate multiple poses per ligand and record the RMSD between each predicted pose and the original co-crystallized conformation [24] [58].

  • Virtual Screening Preparation (for EF and ROC): For selected target proteins, create a benchmark dataset by seeding known active compounds (obtained from experimental literature or databases like ChEMBL) into a large collection of decoy molecules (pharmacologically inert but physically similar compounds, available from directories like DUD-E) [58] [59].

  • Virtual Screening Execution: Dock the entire benchmark dataset (actives + decoys) against the target protein using the docking programs and scoring functions under evaluation. Rank all compounds based on their docking scores [24] [58].

  • Performance Calculation: For the ranked lists, calculate:

    • The RMSD of the top-ranked pose for each re-docked ligand to assess pose prediction accuracy [24].
    • The Enrichment Factor at 1% and 2% of the screened database to measure early enrichment [58] [34].
    • The ROC curve and its corresponding AUC value to evaluate overall classification performance [24] [58].

The following diagram visualizes this standardized benchmarking workflow.

[Workflow diagram: curate complexes from the PDB → prepare structures (remove waters, add hydrogens) → separate ligand from protein → for pose accuracy, re-dock the ligand and calculate RMSD vs. the crystal pose (reporting % with RMSD < 2.0 Å); for screening power, mix actives with decoys, dock and rank the full database, then calculate EF1%, EF2%, and the ROC curve with its AUC]

Data Analysis and Validation

To ensure statistical robustness, performance metrics should be calculated across the entire test set rather than on individual complexes. For pose prediction, report the success rate as the percentage of ligands docked with RMSD < 2.0 Å [24]. For virtual screening, report EF and AUC values as averages across multiple targets or different decoy sets [58]. It is critical to use independent test sets not used in the training of any data-driven scoring functions to avoid over-optimistic results [59]. Furthermore, recent benchmarks highlight the importance of also assessing the physical validity of predicted poses (e.g., using tools like PoseBusters) to check for steric clashes and correct bond geometries, which is an aspect sometimes overlooked when focusing solely on RMSD [17].

Comparative Performance of Docking Tools

Pose Prediction Accuracy (RMSD)

The ability to reproduce experimental binding modes is a fundamental requirement for any docking program. Performance is typically measured by the percentage of ligands in a test set for which the top-ranked pose achieves an RMSD below 2.0 Å relative to the crystal structure.

Table 1: Pose Prediction Performance (RMSD < 2.0 Å) Across Different Docking Programs

| Docking Program | Scoring Function | Success Rate (%) | Test Set / Context |
| --- | --- | --- | --- |
| Glide | GlideScore | 100% [24] | COX-1/COX-2 inhibitors |
| GOLD | GoldScore | 82% [24] | COX-1/COX-2 inhibitors |
| AutoDock | AutoDock Score | 71% [24] | COX-1/COX-2 inhibitors |
| FlexX | F-Score | 59% [24] | COX-1/COX-2 inhibitors |
| Surflex | Surflex-Score | Statistically equivalent to Glide [58] | Dihydropteroate Synthase (DHPS) |
| Deep Learning (SurfDock) | N/A | 91.8% (Astex Set) [17] | Diverse Test Set |

Key Findings:

  • Glide demonstrated exemplary performance in a benchmark study on cyclooxygenase (COX) enzymes, correctly predicting the binding poses of all studied co-crystallized ligands [24].
  • Traditional physics-based methods like Glide and GOLD consistently show high pose prediction accuracy, with GOLD achieving an 82% success rate in the same COX study [24].
  • Deep learning-based methods have emerged as strong competitors. The generative diffusion model SurfDock achieved a 91.8% success rate on the Astex diverse set, showcasing superior pose accuracy [17]. However, it is crucial to note that a significant portion of its predicted poses lacked physical validity (e.g., containing steric clashes), highlighting a key limitation of some nascent AI methods [17].

Virtual Screening Performance (EF and ROC)

For virtual screening, the critical metrics are the Enrichment Factor (EF), which measures early recognition of actives, and the Area Under the ROC Curve (AUC), which measures overall classification performance.

Table 2: Virtual Screening Performance Across Docking and Scoring Combinations

| Docking Program | Scoring Function | AUC | Enrichment Factor | Test Context |
| --- | --- | --- | --- | --- |
| Glide | GlideScore | 0.92 [24] | 40-fold [24] | COX Enzymes |
| GOLD | GoldScore | 0.83 [24] | 31-fold [24] | COX Enzymes |
| AutoDock | AutoDock Score | 0.81 [24] | 25-fold [24] | COX Enzymes |
| FlexX | F-Score | 0.61 [24] | 8-fold [24] | COX Enzymes |
| Surflex | Surflex-Score | Statistically equivalent to Glide [58] | High | Dihydropteroate Synthase (DHPS) |
| Fitted | N/A | N/A | Good correlation for Cdk2 [34] | Kinase Inhibitors |
| MOE | Alpha HB, London dG | High comparability [5] [1] | N/A | CASF-2013 Benchmark |

Key Findings:

  • Glide and Surflex are consistently top performers. A study on DHPS found these two programs to be the "best overall performers," with neither showing statistical superiority over the other [58].
  • Performance variability is significant across targets. Scoring functions generally perform better on hydrophilic targets (e.g., Factor Xa, Cdk2 kinase) than on hydrophobic ones (e.g., COX-2) [34].
  • Target-specific scoring functions, particularly those developed using machine learning on data for a single target, can show significant superiority over general-purpose functions [36]. For example, graph convolutional neural networks have been used to create highly accurate target-specific functions for cGAS and kRAS [36].
  • Consensus scoring (combining multiple scoring functions) and post-docking relaxation were not found to consistently improve enrichment in several studies [58].

The Scientist's Toolkit: Essential Research Reagents

Successful docking studies rely on a suite of software tools and data resources. The table below details key solutions and their functions in a typical workflow.

Table 3: Key Research Reagent Solutions for Molecular Docking

| Tool / Resource | Type | Primary Function | Key Feature / Note |
| --- | --- | --- | --- |
| RCSB Protein Data Bank (PDB) | Database | Repository of experimentally determined 3D structures of proteins and nucleic acids. | Primary source of target protein structures for docking [24]. |
| PDBbind | Database | A curated database of protein-ligand complexes with binding affinity data [5] [59]. | Used for training and testing scoring functions. |
| DUD-E / ZINC | Database | Libraries of active compounds and decoy molecules for virtual screening validation [59]. | Essential for benchmarking virtual screening performance (EF, ROC) [58]. |
| Glide | Docking Software | Comprehensive docking suite using a hybrid search algorithm and empirical scoring function. | Consistently high performer in pose prediction and virtual screening [24] [58]. |
| GOLD | Docking Software | Docking program using a genetic algorithm for conformational search. | Strong performance, particularly with its native GoldScore [24]. |
| AutoDock Vina | Docking Software | Widely used open-source program using a gradient optimization algorithm. | Good balance of speed and accuracy [34]. |
| Surflex | Docking Software | Docking program using a molecular similarity-based search algorithm. | Top performer in multiple studies, comparable to Glide [58]. |
| MOE | Modeling Suite | Integrated software platform with docking module and multiple scoring functions. | Offers several scoring functions (London dG, Alpha HB); useful for consensus approaches [5]. |
| Deep Learning Tools (e.g., SurfDock, DiffBindFR) | Docking Software | New generation of docking tools using AI for pose generation and scoring. | Show superior pose accuracy but sometimes lack physical validity [17]. |
| PoseBusters | Validation Tool | Toolkit to check the physical plausibility and chemical validity of docking poses. | Critical for validating AI-generated poses that may have good RMSD but bad geometry [17]. |

The comparative analysis of docking tools using the standard metrics of RMSD, EF, and ROC/AUC reveals a nuanced landscape. Glide, GOLD, and Surflex consistently rank among the top performers for both pose prediction and virtual screening tasks [24] [58]. However, the "best" tool is often context-dependent. Target characteristics, such as binding site hydrophobicity, significantly influence performance [34].

The emergence of deep learning methods presents a promising frontier, offering superior pose prediction accuracy in many cases [17]. However, their current limitations in guaranteeing physically valid structures necessitate careful validation. Furthermore, the development of target-specific scoring functions, powered by machine learning, demonstrates a clear path to enhanced virtual screening accuracy for specific projects [36].

For researchers, the selection of a docking protocol should be guided by their primary objective: high-precision pose prediction for lead optimization or robust virtual screening for lead discovery. This guide provides the foundational metrics and experimental data to inform that critical decision, underscoring the importance of rigorous, target-relevant benchmarking in computational drug discovery.

Comparative Assessment of Classical Scoring Functions (Glide, GOLD, MOE, etc.)

Molecular docking is a cornerstone of structure-based drug design, enabling the prediction of how a small molecule ligand binds to a protein target and the estimation of the binding affinity through scoring functions (SFs). The performance of these SFs—typically categorized as force-field-based, empirical, or knowledge-based—directly impacts the success of virtual screening and binding pose prediction in lead discovery and optimization campaigns [39] [60]. While classical SFs are embedded in widely used docking software like Glide, GOLD, and MOE, their relative performance remains a subject of intense investigation due to the complex trade-offs between computational efficiency and predictive accuracy. This guide provides an objective, data-driven comparison of classical scoring functions, framing their performance within the broader context of rigorous benchmarking studies and the emergent challenge posed by deep-learning-based methods.

Table 1: Key Performance Metrics from Comparative Docking Studies

| Docking Software / Method | Primary Scoring Function | Pose Prediction Success (RMSD ≤ 2 Å) | Virtual Screening Enrichment | Binding Affinity Correlation (w/ Experiment) | Key Strengths / Weaknesses |
| --- | --- | --- | --- | --- | --- |
| Glide (XP) | Empirical (Extra Precision) | High (e.g., >94% physical validity [17]) | Consistently superior enrichments [61] | Varies by system; competitive | Superior physical plausibility and screening enrichment; computationally more expensive [61] [17] |
| GOLD | Empirical (GoldScore, ChemScore) | Moderate to High | Outperforms DOCK on average [61] | Varies by system | Robust performance across diverse targets [61] |
| MOE | London dG, Alpha HB, etc. | N/A (see Table 2) | N/A | N/A | Alpha HB and London dG show high comparability [5] [1] |
| AutoDock Vina | Empirical | Moderate | Moderate | Moderate | Fast and widely used; balance of speed and accuracy [17] |
| MM/GBSA | Physics-Based (End-Point) | Good (e.g., 69.4% pose identification [62]) | Good (via re-scoring) | Moderate to Good (e.g., Spearman R = 0.66 [62]) | Higher accuracy than many docking SFs; computationally intensive; requires parameter optimization [62] [63] |
| MM/PBSA | Physics-Based (End-Point) | Lower (e.g., 45.5% pose identification [62]) | Good (via re-scoring) | Lower than MM/GBSA (e.g., Spearman R = 0.49 [62]) | Theoretically more rigorous but often outperformed by MM/GBSA in practice [62] |

Table 2: Pairwise Comparability of MOE Scoring Functions (InterCriteria Analysis) [5] [1]

| MOE Scoring Function | London dG | Alpha HB | Affinity dG | ASE | GBVI/WSA dG |
| --- | --- | --- | --- | --- | --- |
| London dG | - | High | Low | Low | Low |
| Alpha HB | High | - | Low | Low | Low |
| Affinity dG | Low | Low | - | Low | Low |
| ASE | Low | Low | Low | - | Low |
| GBVI/WSA dG | Low | Low | Low | Low | - |

Detailed Benchmarking Results and Experimental Protocols

Virtual Screening Enrichment: Glide, GOLD, and DOCK

A critical study directly compared the virtual screening performance of Glide, GOLD, and DOCK across 14 pharmaceutically relevant targets [61]. The experimental protocol involved docking large libraries of compounds, each containing known active molecules and decoys, into a rigid receptor structure. Performance was measured by the enrichment factor, which quantifies the method's ability to rank active compounds early in the list.

  • Key Finding: The Glide XP (Extra Precision) methodology consistently yielded enrichment factors superior to those obtained with GOLD and DOCK across the test set [61].
  • Protocol Detail: To ensure a fair evaluation of scoring function accuracy, the study separated "fitting" ligands (those sterically capable of docking correctly into the rigid receptor) from "non-fitting" ligands (those requiring receptor flexibility). Enrichment factors were calculated using only the "fitting" compounds [61].
  • Comparative Performance: On average, GOLD outperformed DOCK, but Glide XP demonstrated the most robust performance in identifying active compounds [61].

Pose Prediction Accuracy and Physical Plausibility

A comprehensive 2025 benchmark evaluated traditional and deep-learning docking methods across multiple datasets, including the Astex diverse set and the more challenging DockGen set containing novel protein pockets [17].

  • Performance Stratification: The study revealed a clear performance hierarchy for pose prediction when considering both accuracy (RMSD ≤ 2 Å) and physical plausibility (as validated by the PoseBusters toolkit). The traditional method Glide SP consistently excelled in producing physically valid poses, with PB-valid rates above 94% across all datasets, coupled with strong pose accuracy [17].
  • Key Insight: While some deep-learning generative models achieved higher raw pose accuracy (e.g., SurfDock with >75% success on novel pockets), they often produced physically implausible structures with steric clashes or incorrect bond geometries. This highlights a critical strength of well-validated classical methods like Glide [17].

The Role of Advanced End-Point Methods: MM/GBSA and MM/PBSA

MM/GBSA and MM/PBSA are more rigorous, physics-based methods often used to re-score docking poses from simpler SFs, offering a balance between computational cost and accuracy [62] [63].

  • Pose Identification: In a systematic evaluation on 98 protein-ligand complexes, MM/GBSA significantly outperformed MM/PBSA and multiple popular docking SFs in identifying correct binding conformations, achieving a success rate of 69.4% compared to 45.5% for MM/PBSA [62].
  • Affinity Prediction: MM/GBSA also demonstrated a better ability to reproduce experimental binding free energies, with the best model yielding a Spearman correlation coefficient of 0.66, compared to 0.49 for MM/PBSA [62].
  • Parameter Dependence: The performance of MM/PB(GB)SA is highly case-specific. Benchmarks on soluble and membrane proteins show that parameters like the solute dielectric constant, GB model, and force field must be optimized for each specific system to achieve highest accuracy, which can be competitive with more expensive methods like free energy perturbation (FEP) [63].

[Workflow diagram: protein-ligand complex preparation → (optional) molecular dynamics simulation with single or multiple trajectories → extract snapshots from the trajectory → calculate the gas-phase interaction energy (ΔE_MM), polar solvation energy (ΔG_PB/ΔG_GB), and non-polar solvation energy (ΔG_SA) per snapshot → average energies across all snapshots, optionally adding the entropic contribution (−TΔS, often omitted in practice) → final predicted binding free energy]

Diagram 1: MM/PB(GB)SA Binding Free Energy Calculation Workflow.
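In standard notation, the snapshot-averaged terms in the workflow above combine as (with the entropic term often omitted, as noted):

\[ \Delta G_{\mathrm{bind}} \approx \langle \Delta E_{\mathrm{MM}} \rangle + \langle \Delta G_{\mathrm{PB/GB}} \rangle + \langle \Delta G_{\mathrm{SA}} \rangle - T\Delta S \]

where \( \Delta E_{\mathrm{MM}} \) itself sums the force-field van der Waals, electrostatic, and (for multi-trajectory setups) internal energy contributions.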

Table 3: Key Resources for Docking and Scoring Function Assessment

| Resource Name | Type | Primary Function in Assessment | Relevance |
| --- | --- | --- | --- |
| CASF Benchmark Sets [5] [1] | Standardized Dataset | Provides a curated set of high-quality protein-ligand complexes with experimental binding affinity data for scoring function testing and validation. | Essential for fair, standardized comparison of different scoring functions. |
| PDBbind Database [5] | Comprehensive Database | A large collection of protein-ligand complexes with binding affinity data, from which benchmark sets like CASF are derived. | Provides the foundational data for training and testing scoring functions. |
| PoseBusters Toolkit [17] | Validation Software | Checks the physical plausibility and chemical correctness of docking poses, including bond lengths, steric clashes, and stereochemistry. | Crucial for identifying methods that produce accurate (low RMSD) but physically invalid poses. |
| gmx_MMPBSA / MMPBSA.py [63] | Computational Tool | Automates the calculation of MM/PBSA and MM/GBSA binding free energies from molecular dynamics trajectories. | Key software for implementing advanced, physics-based re-scoring. |
| Astex Diverse Set [17] | Standardized Dataset | A well-known set of protein-ligand complexes often used as a primary benchmark for evaluating docking pose prediction accuracy. | A standard first test for any new docking method or scoring function. |

The comparative assessment of classical scoring functions reveals a landscape defined by trade-offs. For robust virtual screening enrichment and physically plausible pose prediction, empirical functions like Glide XP and GOLD's scoring functions demonstrate strong, validated performance. For specific pose prediction tasks, the empirical functions within MOE, particularly Alpha HB and London dG, show high concordance. When higher accuracy in binding affinity prediction is required and computational resources permit, MM/GBSA provides a powerful re-scoring alternative that often outperforms standard docking SFs, though its results are parameter-dependent. Despite the rise of deep learning, classical methods, especially those with a strong physical basis or rigorous empirical parametrization, remain competitive, particularly in their reliability and ability to produce chemically valid results. The optimal choice of scoring function ultimately depends on the specific application—whether it is high-throughput virtual screening, precise pose identification, or binding affinity estimation—and the resources available to the researcher.

In the realm of structure-based drug design, molecular docking stands as a pivotal computational technique for predicting how small molecules interact with protein targets. The core of any docking protocol lies in its scoring function—the algorithm that approximates the binding affinity of a ligand for its target. For decades, classical scoring functions, rooted in empirical, force-field, or knowledge-based approaches, have been the standard tools for this task. However, their performance often plateaus due to inherent simplifications and an inability to fully capture the complex physicochemical nature of molecular recognition [39] [3].

The advent of machine learning (ML) has ushered in a new paradigm. ML scoring functions learn the relationship between protein-ligand complex structures and binding affinities directly from large datasets, without relying on pre-defined formulas or parameters. This article provides a comparative performance analysis of two pioneering ML scoring functions—RF-Score-VS and CNN-Score—against established classical methods, offering objective experimental data and methodologies to guide researchers in selecting optimal tools for their virtual screening campaigns.

Methodology & Experimental Protocols in Scoring Function Evaluation

To ensure a fair and rigorous comparison, benchmarking studies follow standardized protocols involving curated datasets and performance metrics. Understanding these methodologies is crucial for interpreting the results.

Benchmark Datasets and Preparation

  • DEKOIS 2.0: This benchmark set is specifically designed for virtual screening evaluation. It contains known bioactive molecules for a target protein and a set of challenging "decoys"—structurally similar but purportedly inactive molecules. The performance of a scoring function is measured by its ability to enrich the bioactive molecules over the decoys during screening [18].
  • DUD-E (Directory of Useful Decoys: Enhanced): A widely used benchmark that includes 102 protein targets, each with a set of confirmed active molecules and property-matched decoys. Its size and diversity make it ideal for robust testing, particularly for machine learning models that require extensive training data [16].
  • CASF (Comparative Assessment of Scoring Functions): A popular benchmark derived from the PDBbind database. It is often used to assess scoring functions on their ability to predict binding poses (pose prediction) and binding affinities (scoring power) [5] [1].

Protein and ligand structures undergo meticulous preparation before docking, which includes removing water molecules, adding hydrogen atoms, optimizing protonation states, and generating multiple conformations for ligands [18] [64].

Performance Metrics

  • Enrichment Factor (EF): This metric, particularly EF1%, quantifies the ratio of true active molecules found within the top 1% of the ranked database compared to a random selection. A higher EF indicates better early enrichment, which is critical for efficient virtual screening [18].
  • Hit Rate: The percentage of true binders identified within a specified top fraction of the screened library [16].
  • Receiver Operating Characteristic (ROC) Analysis: The Area Under the Curve (AUC) of an ROC plot measures the overall ability of a method to distinguish between active and inactive compounds. An AUC of 0.5 represents a random classifier, while an AUC of 1.0 represents a perfect classifier [64].
  • Root Mean Square Deviation (RMSD): Used primarily for pose prediction accuracy, it measures the deviation (in Ångströms) between a docked pose and the experimentally determined native pose [5].

Quantitative Performance Comparison: ML vs. Classical Scoring

The following tables summarize key experimental results from rigorous retrospective virtual screening studies, highlighting the performance gap between ML and classical functions.

Table 1: Virtual Screening Performance on the DUD-E Benchmark

| Scoring Function | Type | Hit Rate (Top 1%) | Hit Rate (Top 0.1%) | Citation |
| --- | --- | --- | --- | --- |
| RF-Score-VS | Machine Learning (Random Forest) | 55.6% | 88.6% | [16] |
| AutoDock Vina | Classical (Empirical) | 16.2% | 27.5% | [16] |
| CNN-Score | Machine Learning (Convolutional Neural Network) | Outperformed Vina in pose ranking and virtual screening | N/A | [65] |

Table 2: Performance on Malaria Drug Target (PfDHFR) Using DEKOIS 2.0

| Target | Docking Tool | Scoring Function | EF1% | Citation |
| --- | --- | --- | --- | --- |
| Wild-Type PfDHFR | PLANTS | CNN-Score (Re-scoring) | 28 | [18] |
| Quadruple-Mutant PfDHFR | FRED | CNN-Score (Re-scoring) | 31 | [18] |
| Wild-Type PfDHFR | AutoDock Vina | Vina (Classical) | Worse-than-random | [18] |

The data consistently demonstrates a significant performance leap achieved by ML scoring functions. RF-Score-VS provides a more than threefold improvement in hit rate at the critical top 1% of the screened library compared to the classical AutoDock Vina [16]. Similarly, re-scoring docking outputs with CNN-Score dramatically enhanced enrichment, transforming worse-than-random performance into highly effective screening pipelines, even for the challenging drug-resistant quadruple-mutant variant of PfDHFR [18].

Inside the Black Box: Architectures of Leading ML Scoring Functions

RF-Score-VS: Harnessing Feature-Based Learning

RF-Score-VS is built on a Random Forest algorithm—an ensemble of decision trees. Its input consists of a vector of features describing the protein-ligand complex. The model was trained on a massive dataset comprising 15,426 active and 893,897 inactive molecules docked to 102 targets from the DUD-E database. A key aspect of its training was the use of different cross-validation strategies, including a "vertical split" where the model is tested on protein targets not seen during training. This approach validates its potential for application to novel targets with no known ligands [16].
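A toy illustration of this feature-based approach: the sketch below builds simplified element-pair contact counts, in the spirit of (but not identical to) the published RF-Score descriptors, whose feature set and distance cutoff differ in detail. The resulting vectors could feed a Random Forest classifier as indicated in the trailing comment.

```python
import numpy as np
from itertools import product
from sklearn.ensemble import RandomForestClassifier

# Simplified element-pair contact-count featurization (illustrative only).
ELEMENTS = ("C", "N", "O", "F", "P", "S", "Cl", "Br", "I")
PAIRS = list(product(ELEMENTS, ELEMENTS))

def contact_features(lig_atoms, prot_atoms, cutoff=12.0):
    """lig_atoms / prot_atoms: iterables of (element_symbol, xyz ndarray).

    Counts ligand-protein element pairs within the distance cutoff.
    """
    counts = dict.fromkeys(PAIRS, 0)
    for le, lx in lig_atoms:
        for pe, px in prot_atoms:
            if (le, pe) in counts and np.linalg.norm(lx - px) <= cutoff:
                counts[(le, pe)] += 1
    return np.array([counts[p] for p in PAIRS])

# X: one feature row per docked complex; y: 1 = active, 0 = decoy.
# clf = RandomForestClassifier(n_estimators=500).fit(X, y)
```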

CNN-Score: Learning from 3D Structural Images

CNN-Score employs a Convolutional Neural Network architecture, which is exceptionally powerful for processing spatial data. It treats a protein-ligand complex as a 3D image. The interaction space is discretized into a 3D grid, where each grid point holds information about the types of protein and ligand atoms present (represented as different "channels," analogous to RGB channels in a 2D image). The network then hierarchically learns the key features of protein-ligand interactions that correlate with binding, from basic atomic properties to more complex interaction patterns, without any human pre-definition of what those features should be [65].
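The voxelization step can be sketched in a few lines; the box size, resolution, and channel scheme below are illustrative placeholders, not CNN-Score's actual settings.

```python
import numpy as np

def voxelize(atoms, channels, box=24.0, resolution=1.0):
    """Map atoms into a 3-D occupancy grid.

    atoms: list of (channel_index, xyz) with coordinates centered on the
    binding site; channels: number of atom-type channels. Returns an array
    of shape (channels, n, n, n), analogous to a multi-channel 3-D image.
    """
    n = int(box / resolution)
    grid = np.zeros((channels, n, n, n), dtype=np.float32)
    for c, xyz in atoms:
        idx = np.floor((np.asarray(xyz) + box / 2) / resolution).astype(int)
        if ((0 <= idx) & (idx < n)).all():  # drop atoms outside the box
            grid[c, idx[0], idx[1], idx[2]] += 1.0
    return grid

# Two toy atoms: ligand-carbon channel 0, protein-oxygen channel 1.
g = voxelize([(0, (0.0, 1.2, -3.4)), (1, (2.5, -0.8, 0.1))], channels=2)
print(g.shape, g.sum())  # (2, 24, 24, 24) 2.0
```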

The Re-scoring Workflow: A Practical Hybrid Strategy

A common and highly effective strategy in modern virtual screening is the re-scoring workflow. This hybrid approach leverages the strengths of both classical and ML methods.

[Workflow diagram: start virtual screening → classical docking (e.g., Vina, FRED, PLANTS) → generate multiple ligand poses → classical scoring and initial ranking → ML re-scoring of top poses (e.g., CNN-Score, RF-Score-VS) → final ranked list of compounds]

This workflow diagram illustrates the standard re-scoring protocol: initial docking and pose generation are performed using classical tools, which are computationally efficient. The generated poses are then re-evaluated using a more accurate ML scoring function, which refines the ranking to better prioritize true active compounds [18].
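In schematic terms, the protocol reduces to a dock-then-rescore loop. In the sketch below, dock_with_vina and cnn_rescore are hypothetical stand-ins for calls into a docking tool and an ML scorer; they are not real library functions, and the convention that higher ML scores are better is an assumption of this sketch.

```python
def rescoring_pipeline(library_smiles, receptor, top_n=1000):
    """Dock each compound classically, keep the top-ranked subset,
    then re-rank that subset with an ML scoring function."""
    scored = []
    for smi in library_smiles:
        poses = dock_with_vina(smi, receptor)            # hypothetical: pose generation
        best = min(poses, key=lambda p: p.score)         # classical score, lower = better
        scored.append((smi, best))
    shortlist = sorted(scored, key=lambda t: t[1].score)[:top_n]
    rescored = [(smi, cnn_rescore(pose, receptor))       # hypothetical: ML re-scoring
                for smi, pose in shortlist]
    return sorted(rescored, key=lambda t: t[1], reverse=True)
```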

Essential Research Reagents & Computational Tools

Table 3: Key Software and Resources for Docking and Scoring Evaluation

| Resource Name | Type | Primary Function & Application | Citation |
| --- | --- | --- | --- |
| AutoDock Vina | Docking Software | Widely used classical docking tool with an empirical scoring function; often used as a baseline or for initial pose generation. | [18] [64] |
| FRED & PLANTS | Docking Software | Other generic docking tools evaluated alongside Vina in benchmarking studies. | [18] |
| GNINA | Docking Software / ML Scorer | A docking program that incorporates a CNN-based scoring function, demonstrating superior performance in some evaluations. | [64] [66] |
| DEKOIS 2.0 | Benchmark Dataset | A public benchmark set used to evaluate virtual screening performance with known actives and challenging decoys. | [18] [16] |
| DUD-E | Benchmark Dataset | A large and diverse benchmark dataset with 102 targets, essential for training and testing ML scoring functions. | [16] |
| PDBbind / CASF | Benchmark Dataset | A comprehensive database of protein-ligand complexes with binding affinity data, used for scoring power assessment. | [5] [66] |
| RF-Score-VS Binary | ML Scoring Function | A ready-to-use implementation of RF-Score-VS for virtual screening. | [16] |

Discussion and Future Perspectives

The experimental evidence leaves little doubt that ML scoring functions like RF-Score-VS and CNN-Score represent a significant performance leap over classical counterparts in structure-based virtual screening. Their ability to learn complex, non-linear relationships from vast structural data allows for superior enrichment and hit identification [18] [16].

The field continues to evolve rapidly, with new architectures such as graph neural networks (GNNs) and pose-ensemble models under active development. For instance, DockBox2 (DBX2), a graph neural network framework that encodes ensembles of docking poses, has shown further improved performance in retrospective experiments, highlighting the potential of learning from multiple conformations rather than single poses [66]. Furthermore, target-specific scoring functions built with graph convolutional networks have shown remarkable robustness and accuracy for specific proteins such as cGAS and KRAS, indicating a move toward more specialized, high-accuracy models [36].

In conclusion, for researchers and drug development professionals, integrating ML scoring functions—particularly via the practical re-scoring workflow—into their virtual screening pipelines is no longer a speculative future step but a present-day best practice to substantially increase the odds of success in lead discovery.

Molecular docking, a cornerstone of structure-based drug discovery, is undergoing a revolutionary transformation with the integration of artificial intelligence (AI). Traditional docking methods rely on physical models that use search strategies to explore conformational space and predefined scoring functions to determine ligand binding modes and strengths [53]. These physics-based tools, while strong on structural rationality, often face limitations in computational efficiency and scoring accuracy due to simplified representations of protein-ligand interactions [53] [50]. The emergence of AI-powered docking tools promises to address these limitations by leveraging machine learning to improve both the speed and accuracy of binding pose predictions, yet questions remain about their physical soundness and real-world applicability in virtual screening (VS) projects [67] [68].

This comparative analysis examines the critical balance between physical plausibility and virtual screening efficacy in next-generation AI docking tools, framed within the broader context of ongoing research on docking scoring functions. As noted in a comprehensive benchmark study by Hou, Kang, and colleagues from Zhejiang University, "AI-driven docking tools have emerged, showing great potential in docking accuracy and speed. However, these tools tend to neglect the physical rationality of the conformations of docked complexes and their actual performance in virtual screening of drugs" [53]. This tension between accuracy and rationality forms the central theme of our investigation, with important implications for researchers, scientists, and drug development professionals seeking to implement these tools in their workflows.

Methodological Frameworks for Benchmarking Docking Tools

Established Benchmarking Datasets and Protocols

Robust benchmarking of docking tools requires carefully curated datasets that represent diverse protein families, ligand chemotypes, and binding affinities [5]. The VSDS-vd benchmark developed by Zhejiang University researchers provides a comprehensive virtual screening test set that enables direct comparison between AI-powered and physics-based approaches [53]. Similarly, the CASF-2013 benchmark subset of the PDBbind database offers a collection of 195 high-quality protein-ligand complexes with binding affinity data that has been widely used for evaluating scoring functions [5]. The TrueDecoy and RandomDecoy sets represent additional important resources, with the latter designed to more closely mimic real-world virtual screening scenarios where prior knowledge of binding sites may be limited [67] [69].

Standardized evaluation metrics are essential for meaningful comparisons between docking tools. The most fundamental metric is the root mean square deviation (RMSD) between predicted poses and experimentally determined co-crystallized ligand structures, typically with success defined as RMSD values below 2 Å [68]. However, as critical research has revealed, RMSD alone provides an incomplete picture of docking performance [68]. The PoseBusters validation framework introduced stricter physical plausibility checks that assess chemical correctness, identifying issues like atom clashes, impossible bond lengths, and distorted angles that may be overlooked by traditional RMSD measurements [68]. For virtual screening applications, enrichment factors (EF) and ROC curves quantify the ability of docking tools to prioritize active compounds over decoys in large compound libraries [70].
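Once per-complex RMSDs and per-compound screening scores are in hand, the headline metrics are a few lines each. A minimal sketch, assuming NumPy/scikit-learn and synthetic data:

```python
# Docking success rate, enrichment factor, and ROC AUC on toy data.
import numpy as np
from sklearn.metrics import roc_auc_score

def docking_success_rate(rmsds, cutoff=2.0):
    """Fraction of complexes whose best pose is within `cutoff` Å."""
    return float(np.mean(np.asarray(rmsds) <= cutoff))

def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate in the top `fraction` of the ranking over the overall hit rate."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)[::-1]          # higher score = predicted better
    n_top = max(1, int(len(scores) * fraction))
    return labels[order][:n_top].mean() / labels.mean()

rng = np.random.default_rng(0)
scores = rng.random(5000)                     # toy screening scores
labels = rng.binomial(1, 0.05, 5000)          # toy actives (1) vs decoys (0)
print(docking_success_rate([0.8, 1.5, 3.2, 1.9]))   # 0.75
print(enrichment_factor(scores, labels), roc_auc_score(labels, scores))
```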

Experimental Workflow for Comparative Assessment

The following diagram illustrates the standardized experimental workflow for benchmarking docking tools, incorporating both pose prediction accuracy and virtual screening performance assessments:

Workflow (schematic): Input Structures → Preprocessing → Docking Execution → Pose Prediction and Virtual Screening (evaluation phase); Pose Prediction additionally feeds Physical Plausibility checks, and all evaluation branches converge on Performance Metrics

Diagram 1: Standardized workflow for benchmarking docking tools.

Research Reagent Solutions: Essential Tools for Docking Benchmarking

Table 1: Key research reagents and computational tools for docking benchmarking

| Resource Category | Specific Tools/Databases | Primary Function | Relevance to Benchmarking |
| --- | --- | --- | --- |
| Benchmark Datasets | VSDS-vd [53], CASF-2013 [5], TrueDecoy/RandomDecoy [67] [69] | Standardized test sets | Provide controlled environments for fair tool comparison |
| Validation Frameworks | PoseBusters [68], ICrA [5] | Quality assessment | Evaluate physical plausibility and statistical reliability |
| Traditional Docking Tools | AutoDock Vina [70], Glide [67], MOE [5] | Physics-based docking | Establish baseline performance metrics |
| AI-Powered Docking Tools | DiffDock [68], CarsiDock [53], KarmaDock [67], HelixDock [68] | Machine learning-based docking | Next-generation tools under evaluation |
| Scoring Functions | London dG, Alpha HB [5], RTMScore [67] | Binding affinity prediction | Critical for virtual screening performance |

Comparative Performance Analysis: AI vs. Traditional Tools

Pose Prediction Accuracy and Physical Plausibility

The fundamental requirement for any docking tool is the accurate prediction of ligand binding modes. Recent benchmarking studies reveal a complex landscape where AI-powered tools demonstrate remarkable performance in traditional RMSD-based metrics but show significant variability in physical plausibility.

Table 2: Pose prediction performance across docking tools

| Docking Tool | Type | Success Rate (RMSD ≤ 2 Å) | PoseBusters Valid (%) | Notable Strengths |
| --- | --- | --- | --- | --- |
| HelixDock [68] | AI-powered | 90.1% (PDBbind Core Set) | 85.6% | Extensive pre-training, high accuracy |
| KarmaDock [67] | AI-powered | 89.1% (PDBbind Core Set) | Not specified | Excellent balance of speed and accuracy |
| CarsiDock [68] | AI-powered | Not specified | 79.7% | Strong structural reliability |
| DiffDock [68] | AI-powered | 44.2% (blind docking) | 47.0% | Diffusion-based approach |
| Physics-based Tools [67] | Traditional | Variable but generally high | ~90-95% | Superior physical plausibility |
| Glide [67] | Traditional | Not specified | Not specified | High enrichment factors |

The benchmarking conducted by Hou and Kang's team revealed that while AI-powered tools like KarmaDock and CarsiDock surpassed all physics-based tools in docking accuracy as measured by RMSD, "all physics-based tools notably outperformed AI-based methods in structural rationality" [67]. This physical plausibility gap represents a significant challenge for AI methods, particularly in applications where chemically realistic models are essential for downstream optimization.

The PoseBusters validation framework highlighted this issue starkly, demonstrating that when predictions were filtered for physical plausibility, the performance of many deep learning methods dropped dramatically. For example, DiffDock's success rate on the Astex Diverse set fell from 72% (by RMSD alone) to just 47% when PB-valid criteria were applied. In contrast, classical tools like AutoDock Vina maintained nearly all their accuracy under the same physical plausibility checks [68].
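The arithmetic behind such drops is easy to reproduce on toy data: a pose only counts as a success if it is both RMSD-accurate and plausibility-valid. The rates below are invented for illustration, not the published DiffDock numbers; with these toy rates the filtered value lands near 0.72 × 0.65 ≈ 47%.

```python
# Toy illustration: physical-plausibility filtering lowers the success rate.
import numpy as np

rng = np.random.default_rng(7)
n = 500
rmsd_ok  = rng.random(n) < 0.72   # ~72% of poses within 2 Å (toy rate)
pb_valid = rng.random(n) < 0.65   # ~65% pass plausibility checks (toy rate)

print(f"RMSD-only success rate:  {rmsd_ok.mean():.0%}")
print(f"RMSD + PB-valid success: {(rmsd_ok & pb_valid).mean():.0%}")
```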

Virtual Screening Performance and Enrichment Capability

Beyond pose prediction, virtual screening performance represents the ultimate test for docking tools in real-world drug discovery applications. The ability to identify active compounds from large libraries of decoys is where AI-powered tools show particular promise.

Table 3: Virtual screening performance comparison

| Docking Approach | Screening Context | Performance | Limitations |
| --- | --- | --- | --- |
| Glide-based Methods [67] | TrueDecoy set | Highest enrichment factors | Computationally intensive |
| AI-powered Tools [67] | RandomDecoy set (real-world scenarios) | Superior to traditional tools | Performance varies by dataset |
| GNINA [70] | Multiple target classes | Outstanding VS performance and pose replication | Specific to CNN scoring |
| RTMScore [67] | Rescoring function | High effectiveness in enrichment | Dependent on initial poses |
| Traditional Tools [68] | Known binding sites | Strong performance (e.g., Surflex-Dock: 68% success) | Require proper configuration |

The benchmarking results reveal an important context-dependent performance pattern. On the TrueDecoy set, Glide-based methods achieved the highest enrichment factors among all docking tools. However, on the RandomDecoy set, which more closely resembles real-world virtual screening scenarios, AI-based tools clearly outperformed Glide [67]. This suggests that AI methods may possess superior generalizability to novel targets and binding sites.

GNINA, which integrates convolutional neural networks (CNNs) for pose scoring, demonstrates the potential of hybrid approaches. Systematic comparison showed that GNINA "outperformed AutoDock Vina in both virtual screening of active ligands and re-docking steps of co-crystallized ligands," with its enhanced ability to accurately distinguish between true positives and false positives confirmed by ROC curves and enrichment factor results [70].

The Training Data Dependency and Generalization Challenge

A critical vulnerability identified in early AI docking tools is their dependence on training data similarity. Research by Jain, Cleves & Walters found that approximately two-thirds of test cases used to evaluate DiffDock had nearly identical protein-ligand examples within its training set. On these "near-neighbor" cases, DiffDock performed reasonably well, but when faced with truly novel proteins, its success rate dropped dramatically from over 50% to just 21-28% [68]. This suggests that instead of learning fundamental principles of molecular interaction, the model had often simply memorized training examples.
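One way to probe for such memorization, sketched below with RDKit, is to flag test ligands whose fingerprint similarity to any training ligand exceeds a cutoff. The cited analysis compared protein-ligand examples jointly; this ligand-only check is a simplified stand-in, and the SMILES strings and the 0.7 cutoff are illustrative.

```python
# Flag "near-neighbor" test ligands by Tanimoto similarity to the training set.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.DataStructs import TanimotoSimilarity

def fingerprint(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, 2048)

train_fps = [fingerprint(s) for s in ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]]

for smi in ["CCCO", "c1ccccc1N"]:             # toy "test set" ligands
    nearest = max(TanimotoSimilarity(fingerprint(smi), fp) for fp in train_fps)
    verdict = "near-neighbor" if nearest > 0.7 else "novel"
    print(f"{smi}: {verdict} (max similarity {nearest:.2f})")
```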

Newer AI tools address this limitation through more extensive and diverse pre-training. HelixDock, for instance, utilized 100 million computer-generated docking conformations requiring approximately 1 million CPU core days of computation, followed by fine-tuning with experimentally validated complexes [68]. This massive scale of pre-training appears to enhance generalizability, though the computational resources required are substantial.

Practical Implementation: Hierarchical Virtual Screening Strategy

Based on the benchmarking results, researchers from Zhejiang University proposed a hierarchical virtual screening strategy that achieves a dynamic balance between screening speed and accuracy [53] [67]. This approach leverages the complementary strengths of AI-powered and physics-based docking tools.

The following diagram illustrates this efficient hierarchical workflow for large-scale virtual screening:

Workflow (schematic): Ultra-Fast AI Screening (100% of library) → (top 10-20%) → Standard Accuracy Verification → (most promising hits) → High-Precision Refinement → (final candidates) → Experimental Validation

Diagram 2: Hierarchical virtual screening workflow.

The hierarchical strategy begins with ultra-fast AI-powered screening of entire compound libraries, leveraging the speed advantages of tools like KarmaDock and DiffDock. This initial phase rapidly filters the library to a manageable subset of potential hits (typically 10-20% of the original library). These candidates then undergo standard accuracy verification using more computationally intensive but reliable physics-based tools like Glide or AutoDock Vina. The most promising compounds from this second phase advance to high-precision refinement, which may involve consensus docking with multiple tools or advanced molecular dynamics simulations. Finally, the top-ranked candidates proceed to experimental validation [67].

This tiered approach achieves an optimal balance by leveraging AI speed for initial filtering while maintaining the physical soundness of traditional methods for critical decision-making stages. The implementation of this strategy has demonstrated excellent performance in actual large-scale drug screening projects [53].
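The funnel logic fits in a short function. A minimal sketch under stated assumptions: the three scoring callables are placeholders standing in for, e.g., KarmaDock (tier 1), Glide or Vina (tier 2), and consensus docking or MD-based rescoring (tier 3), with illustrative retention fractions and higher scores taken as better.

```python
# Hierarchical (tiered) virtual screening funnel with placeholder scorers.
def hierarchical_screen(library, ai_score, physics_score, refine_score,
                        keep_tier1=0.15, keep_tier2=0.01):
    # Tier 1: ultra-fast AI scoring of the entire library.
    tier1 = sorted(library, key=ai_score, reverse=True)
    tier1 = tier1[: max(1, int(len(library) * keep_tier1))]

    # Tier 2: physics-based verification of the surviving 10-20%.
    tier2 = sorted(tier1, key=physics_score, reverse=True)
    tier2 = tier2[: max(1, int(len(library) * keep_tier2))]

    # Tier 3: high-precision refinement; the output goes to the bench.
    return sorted(tier2, key=refine_score, reverse=True)
```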

The comprehensive benchmarking of AI-powered docking tools reveals a field in transition, where AI methods demonstrate superior speed and impressive performance in virtual screening tasks but still trail traditional physics-based tools in generating physically plausible structures. The comparative analysis suggests that the most effective path forward lies not in exclusive adoption of either approach, but in strategic integration that leverages their complementary strengths.

AI-powered tools excel in rapid screening scenarios and situations requiring generalization across diverse protein families, while physics-based methods provide critical reliability for final candidate selection and optimization. The hierarchical virtual screening strategy exemplifies this integrative approach, offering a practical framework for drug discovery professionals to harness the speed of AI without sacrificing the physical soundness essential for successful drug development.

As AI methodologies continue to evolve with improved architectural designs and more diverse training data, the physical plausibility gap is likely to narrow. However, the current benchmarking results indicate that hybrid approaches, which combine the predictive power of machine learning with the physicochemical rigor of traditional methods, represent the most promising direction for the future of molecular docking in drug discovery.

Rescoring with ML Functions Consistently Augments Screening Performance

Molecular docking is a cornerstone of computational drug discovery, enabling researchers to predict how small molecules interact with protein targets. This process typically involves two main stages: sampling, which generates numerous potential binding poses, and scoring, which ranks these poses to identify the most biologically relevant complexes [23]. The accuracy of scoring functions directly impacts the success of structure-based virtual screening (VS), a method used to identify potential drug candidates from vast chemical libraries.

Classical scoring functions, which include physics-based, empirical, and knowledge-based approaches, have historically been used to predict binding affinity. However, these methods often rely on simplified physical models or linear regression, limiting their accuracy and ability to generalize across diverse protein families [23] [71]. The introduction of machine learning (ML) has transformed this landscape. ML-based scoring functions can learn complex, non-linear relationships from large datasets of protein-ligand complexes, leading to significant improvements in both binding pose prediction and virtual screening efficacy [17] [16].

This guide objectively compares the performance of classical and ML-based scoring functions, with a specific focus on rescoring—the process of re-evaluating docking outputs with a more sophisticated scoring function. We present experimental data demonstrating that rescoring with ML functions consistently enhances virtual screening performance across diverse targets and benchmarks.

Performance Comparison: ML-Based vs. Classical Scoring Functions

Multiple independent studies have systematically evaluated the performance of ML-based and classical scoring functions in virtual screening scenarios. The tables below summarize key quantitative findings.

Table 1: Virtual Screening Performance on the DUD-E Benchmark

| Scoring Function | Type | Top 1% Hit Rate | Top 0.1% Hit Rate | Reference |
| --- | --- | --- | --- | --- |
| RF-Score-VS | Machine Learning | 55.6% | 88.6% | [16] |
| AutoDock Vina | Classical (Empirical) | 16.2% | 27.5% | [16] |

Table 2: Pose Prediction Accuracy and Physical Validity Across Datasets

| Method Category | Representative Example | Pose Accuracy (RMSD ≤ 2 Å) | PB-Valid Rate | Combined Success Rate |
| --- | --- | --- | --- | --- |
| Generative Diffusion | SurfDock | 75.7% - 91.8% | 40.2% - 63.5% | 33.3% - 61.2% |
| Traditional | Glide SP | Information missing | > 94% | Information missing |
| Regression-Based ML | KarmaDock, QuickBind | Lower performance tier | Lower performance tier | Lower performance tier |

The data consistently shows that ML-based functions, particularly those trained for virtual screening like RF-Score-VS, dramatically outperform classical functions in identifying active compounds. RF-Score-VS more than triples the hit rate in the top 1% of screened compounds compared to Vina [16]. Furthermore, while some complex ML methods like generative diffusion models achieve high pose accuracy, their physical validity can be variable, whereas traditional methods like Glide SP maintain high physical plausibility [17].

Experimental Evidence: Key Studies and Methodologies

Large-Scale Benchmarking of RF-Score-VS

A seminal study trained and evaluated the RF-Score-VS function on a massive scale using the DUD-E (Directory of Useful Decoys: Enhanced) benchmark. This benchmark includes 102 protein targets, with an average of 224 active molecules per target, each active paired with roughly 50 property-matched decoys [16].

  • Experimental Protocol: The researchers used three docking tools (AutoDock Vina, Smina, and Dock) to generate poses for all actives and decoys. They trained RF-Score-VS using horizontal split cross-validation, in which both the training and test sets contain data from all targets, simulating a scenario where known ligands exist for the targets being screened. This evaluates the model's ability to generalize to new ligands for familiar targets, a common real-world application; a minimal sketch of this split appears after this list [16].
  • Key Finding: The superior performance of RF-Score-VS highlights the ability of ML models to learn complex patterns from large datasets that are inaccessible to classical linear regression models [16].
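
For contrast with the vertical split sketched earlier, a horizontal split is an ordinary random split over ligands, so every target contributes data to both training and test folds. A minimal sketch on placeholder features:

```python
# Horizontal-split sketch: random ligand-level folds, all targets seen in training.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 36))     # placeholder complex descriptors
y = rng.integers(0, 2, size=2000)   # 1 = active, 0 = decoy

for train, test in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    model = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=1)
    model.fit(X[train], y[train])
    print(f"fold AUC: {roc_auc_score(y[test], model.predict_proba(X[test])[:, 1]):.2f}")
```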
Rescoring in Practice: The PADIF Fingerprint Approach

Another study demonstrated that ML models using the Protein per Atom Score Contributions Derived Interaction Fingerprint (PADIF) significantly enhance binder selection in molecular docking over classical scoring functions [72].

  • Experimental Protocol: Researchers trained ML models (e.g., Random Forest) on PADIF representations of protein-ligand complexes. The training data combined active molecules from ChEMBL with strategically selected decoys from the ZINC15 database or recurrent non-binders from high-throughput screening (HTS) assays, known as dark chemical matter; a toy assembly of such a set is sketched after this list [72].
  • Key Finding: Models trained with these decoy sets closely mimicked the performance of models trained with confirmed non-binders. More importantly, all PADIF-based models showed an enhanced ability to explore new chemical spaces and improved the selection of top active compounds compared to classical scoring functions, thereby boosting screening power [72].
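
A toy assembly of such a training set might look as follows; every list here is a placeholder, not the actual ChEMBL, ZINC15, or dark-chemical-matter selections.

```python
# Toy training-set assembly: actives labeled 1, both decoy sources labeled 0.
chembl_actives = ["c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]  # placeholder actives
zinc_decoys    = ["CCCC", "CCOCC"]                    # placeholder decoys
dark_matter    = ["CCN(CC)CC"]                        # placeholder HTS-inactive

dataset = ([(smi, 1) for smi in chembl_actives]
           + [(smi, 0) for smi in zinc_decoys + dark_matter])
```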
Addressing Generalization and Data Bias

A critical challenge for ML scoring functions is generalization to novel targets or binding pockets not represented in the training data. A 2025 analysis revealed that data leakage between popular training sets (e.g., PDBbind) and benchmark sets (e.g., CASF) has led to an overestimation of the generalization capabilities of many deep-learning models [73].

  • Experimental Insight: When top-performing models were retrained on a rigorously filtered dataset (PDBbind CleanSplit) to eliminate this leakage, their benchmark performance dropped substantially. This confirms that previous high performance was partly driven by data memorization [73]. However, a well-designed graph neural network model (GEMS) maintained high performance when trained on this clean data, demonstrating that robust generalization is achievable with appropriate model architecture and unbiased data [73].

The Rescoring Workflow: A Practical Guide

The process of applying an ML rescoring function to enhance virtual screening results can be visualized in the following workflow. This diagram outlines the key stages from initial docking to the final selection of hit compounds.

Workflow (schematic): Docking Output (Library of Poses) → Input Docked Poses & Complex Structures → Feature Extraction (e.g., PLIF, PADIF, ECFP) → ML Model Application (Regression/Classification) → Rescored Ranked List → Hit Selection (Top-Ranked Compounds)

Diagram 1: The ML Rescoring Workflow for Virtual Screening. This process takes initial docking outputs and applies a machine learning model to generate a more reliable ranking of compounds for experimental testing.
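A minimal end-to-end sketch of the feature-extraction and model-application stages, using ligand ECFPs via RDKit and a random forest; a PLIF- or PADIF-based pipeline would swap in protein-ligand interaction features at the fingerprint step. All SMILES strings and labels are placeholders.

```python
# ECFP-based rescoring sketch: featurize docked ligands, rank by ML probability.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def ecfp(smiles, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, n_bits))

# Toy training data: (SMILES, active/decoy label) placeholders.
train = [("CCO", 0), ("c1ccccc1O", 1), ("CC(=O)Nc1ccc(O)cc1", 1), ("CCCC", 0)]
X = np.stack([ecfp(smi) for smi, _ in train])
y = np.array([label for _, label in train])
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rescoring: rank the docked library by predicted probability of activity.
docked = ["CCN", "c1ccccc1C(=O)O"]
probs = model.predict_proba(np.stack([ecfp(smi) for smi in docked]))[:, 1]
ranked = sorted(zip(docked, probs), key=lambda t: t[1], reverse=True)
```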

Implementing a successful ML rescoring strategy requires leveraging specific computational tools and datasets. The table below details key resources mentioned in the cited research.

Table 3: Key Research Reagents and Computational Tools

| Resource Name | Type | Primary Function in Research | Reference |
| --- | --- | --- | --- |
| DUD-E Benchmark Dataset | Dataset | Provides a standardized set of targets with known actives and property-matched decoys for training and evaluating scoring functions. | [16] |
| PDBbind Database | Dataset | A comprehensive collection of protein-ligand complex structures with experimental binding affinity data, used for training ML models. | [73] [52] |
| RF-Score-VS | Software/Model | A ready-to-use random forest scoring function specifically optimized for virtual screening performance. | [16] |
| PADIF Fingerprint | Method/Descriptor | An interaction fingerprint that classifies atom types and assigns numerical values to interactions, capturing nuanced binding interface details for ML models. | [72] |
| Dark Chemical Matter | Dataset | Collections of compounds that show no activity in numerous HTS assays, providing a source of confirmed non-binders for model training. | [72] |
| CCharPPI Server | Web Server | Allows for the assessment of scoring functions independent of the docking process, enabling direct comparison. | [23] |

The accumulated experimental evidence firmly supports the thesis that rescoring with machine learning functions consistently augments screening performance. ML-based scoring functions like RF-Score-VS and models using PADIF fingerprints have demonstrated a superior ability to distinguish active compounds from decoys, significantly increasing hit rates in virtual screening campaigns [72] [16].

Future progress in this field hinges on addressing key challenges such as model generalization and data bias. The development of rigorously curated datasets like PDBbind CleanSplit is a crucial step toward accurately assessing model performance [73]. Furthermore, innovative strategies like data augmentation—using template-based modeling and docking to generate additional training complexes—are showing promise in closing the performance gap between fast ML scoring functions and computationally intensive physics-based methods like free energy perturbation [52]. As these trends continue, ML rescoring is poised to become an even more indispensable tool in computational drug discovery.

Conclusion

The comparative landscape of docking scoring functions reveals a field in rapid evolution, driven by the integration of artificial intelligence. While classical functions remain useful, machine learning and deep learning scoring functions demonstrate a clear and substantial performance advantage in virtual screening tasks and binding affinity prediction, often achieving hit rates several times higher than traditional methods. The most effective modern strategies involve using classical docking tools for initial pose sampling, followed by ML-based rescoring to significantly improve enrichment. Furthermore, novel AI-powered docking methods show great potential, though their physical soundness requires careful validation. For researchers, this underscores the importance of a tiered screening strategy that balances speed and accuracy. Future directions point toward more specialized, target-aware scoring functions, the seamless integration of AI throughout the docking pipeline, and the use of advanced benchmarks to combat overfitting, ultimately accelerating the discovery of novel therapeutics for complex diseases.

References