Propensity Score Methods in Pharmacoepidemiology: A Modern Guide to Confounding Control and Causal Inference

Ethan Sanders · Dec 02, 2025



Abstract

This article provides a comprehensive overview of the application of propensity score (PS) methods in pharmacoepidemiology, targeting researchers and drug development professionals. It covers foundational principles, from defining causal estimands within the potential outcomes framework to implementing key designs like the new-user active comparator design to mitigate biases. The scope extends to practical guidance on methodological execution, including matching, weighting, and the use of high-dimensional propensity scores (hdPS) with machine learning for covariate selection. It also addresses troubleshooting for common challenges such as the 'PSM paradox,' model dependence, and unmeasured confounding, while validating methods through balance assessment and alignment with emerging frameworks like ICH E9(R1) and target trial emulation. The synthesis aims to equip researchers with the knowledge to produce more valid and reliable real-world evidence on drug safety and effectiveness.

Laying the Groundwork: Core Principles and Causal Frameworks for Propensity Scores

The Role of Propensity Scores in Addressing Confounding by Indication

Confounding by indication represents a fundamental methodological challenge in pharmacoepidemiology and comparative effectiveness research. This form of confounding arises when the clinical indications for prescribing a particular medication are themselves associated with the study outcome, creating a spurious association between treatment and outcome that does not reflect a causal relationship [1]. In clinical practice, healthcare professionals appropriately prescribe treatments based on patients' prognostic factors, channeling specific medications toward patients with particular characteristics or disease severities [1]. This channeling phenomenon, while clinically appropriate, creates substantial imbalances in baseline prognosis between treated and untreated groups in observational studies, potentially leading to severely biased treatment effect estimates if not adequately addressed.

Propensity score (PS) methods were specifically developed to address such confounding in observational studies by modeling how prognostic factors influence treatment decisions [1]. The propensity score, defined as the conditional probability of receiving treatment given observed covariates, provides a powerful tool for creating balanced comparison groups that mimic the balance achieved in randomized controlled trials [2]. By balancing measured baseline characteristics across treatment groups, propensity score methods help isolate the true effect of the treatment from the confounding influence of the treatment indications [3]. The application of propensity scores has become increasingly sophisticated, with recent advances including machine learning integration, high-dimensional propensity scores, and extensions for complex treatment regimens [4] [5].

Theoretical Foundations of Propensity Score Methods

The Propensity Score Framework

The theoretical foundation of propensity score analysis rests on the potential outcomes framework for causal inference [1]. For each patient, we consider potential outcomes under treatment (Y(1)) and control (Y(0)) conditions. The propensity score, defined as e(X) = P(Z=1|X), where Z indicates treatment assignment and X represents observed covariates, possesses the key property of being a balancing score [1]. This means that conditional on the propensity score, the distribution of observed baseline covariates is independent of treatment assignment: X ⊥ Z | e(X).

For valid causal inference using propensity scores, two critical assumptions must be satisfied. The first is strong ignorability, which requires that all common causes of treatment and outcome are measured and included in the propensity score model [2]. The second is the positivity assumption, which stipulates that every patient has a non-zero probability of receiving either treatment: 0 < P(Z=1|X) < 1 for all X [2]. When these assumptions hold, the propensity score can be used to remove confounding by measured factors, enabling unbiased estimation of average treatment effects.

Evolution of Propensity Score Applications

Propensity score methods have evolved significantly since their introduction by Rosenbaum and Rubin in 1983 [4]. Initially applied primarily in settings with limited numbers of predefined confounders, these methods have expanded to address the challenges of high-dimensional healthcare databases, which may contain hundreds of potential covariates [1]. This evolution has included the development of automated variable selection algorithms, extensions for time-varying treatments, and incorporation of machine learning techniques for model specification [5].

Table 1: Key Developments in Propensity Score Methodology

Development | Description | Application Context
High-Dimensional Propensity Score (hdPS) | Automated algorithm to select and prioritize covariates from large healthcare databases | Claims data analysis with numerous potential confounders [1]
Generalized Propensity Scores | Extension to categorical and continuous treatments | Comparative effectiveness of multiple treatments [1]
Machine Learning Integration | Use of ensemble methods, autoencoders, and other ML approaches for PS estimation | High-dimensional data with complex nonlinear relationships [4]
Target Trial Emulation | Framework for designing observational studies that mimic randomized trials | Addressing time-related biases and confounding [6]

Practical Implementation of Propensity Score Methods

Propensity Score Estimation Protocols

The first step in implementing propensity score methods involves building an appropriate model for treatment assignment. Traditional approaches typically use logistic regression with investigator-specified covariates based on clinical knowledge and literature review [1]. However, in high-dimensional settings such as healthcare claims databases, several advanced approaches have been developed:

  • High-Dimensional Propensity Score (hdPS) Algorithm: This automated approach identifies and prioritizes covariates from large healthcare databases based on their potential for confounding adjustment [1]. The hdPS algorithm empirically identifies data dimensions (e.g., medication codes, diagnosis codes, procedure codes) and selects covariates based on their prevalence and potential for bias reduction.

  • Dimensionality Reduction Techniques: Recent methodological advances include the application of principal component analysis (PCA), logistic PCA, and autoencoders for propensity score estimation [4]. In a 2025 cohort study comparing dialysis exposure in heart failure patients, autoencoder-based propensity scores achieved superior covariate balance compared to traditional methods, with only 8 covariates showing standardized mean differences >0.1 versus 20-83 covariates with other methods [4].

  • Machine Learning Approaches: Ensemble methods, random forests, and regularized regression can accommodate complex nonlinear relationships and interactions without overfitting [7]. These approaches are particularly valuable when the true treatment assignment mechanism is complex or unknown.
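As a concrete illustration of the estimation step, the sketch below fits a plain logistic-regression propensity model by Newton-Raphson on a simulated cohort. All variable names and the data-generating process are hypothetical; in practice, established packages (e.g., R's MatchIt or Python's scikit-learn) would be used.

```python
import numpy as np

def estimate_ps(X, z, n_iter=25):
    """Estimate propensity scores e(X) = P(Z=1|X) with a plain
    logistic regression fitted by Newton-Raphson (IRLS)."""
    Xd = np.column_stack([np.ones(len(z)), X])  # add intercept
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        W = p * (1 - p)
        # Newton step: beta += (X'WX)^-1 X'(z - p); tiny ridge guards singularity
        H = Xd.T @ (Xd * W[:, None]) + 1e-8 * np.eye(Xd.shape[1])
        beta += np.linalg.solve(H, Xd.T @ (z - p))
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

# Simulated cohort: one prognostic covariate drives treatment choice
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
z = rng.binomial(1, 1 / (1 + np.exp(-1.5 * X[:, 0])))
ps = estimate_ps(X, z)
assert np.all((ps > 0) & (ps < 1))  # positivity holds by construction
```

The same fitted scores would then feed any of the application methods described in the next section.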

Table 2: Comparison of Propensity Score Estimation Methods in a 2025 Cohort Study [4]

Propensity Score Method | Number of Covariates with SMD > 0.1 | Relative Performance
Autoencoder-based PS | 8 | Best balance
Principal Component Analysis (PCA) | 20 | Good balance
Logistic PCA | 25 | Moderate balance
High-Dimensional Propensity Score (hdPS) | 37 | Limited balance
Investigator-specified Covariates | 83 | Poorest balance

Propensity Score Application Protocols

After estimating propensity scores, researchers must select an appropriate method for incorporating these scores into the analysis. The four primary approaches are:

  • Propensity Score Matching: This method creates matched sets of treated and untreated subjects with similar propensity scores [7]. The most common implementation is 1:1 nearest-neighbor matching without replacement, often with a caliper width (typically 0.2 of the standard deviation of the logit of the propensity score) to prevent poor matches [7]. After matching, balance should be assessed using standardized mean differences (target <0.1) and variance ratios.

  • Propensity Score Weighting: Inverse probability of treatment weighting (IPTW) creates a pseudo-population in which treatment assignment is independent of measured covariates [2]. Weights are defined as w = Z/e(X) + (1-Z)/(1-e(X)). Alternative weighting schemes include matching weights and overlap weights, which may improve precision and balance in regions of poor propensity score overlap [1].

  • Propensity Score Stratification: Subjects are stratified into quantiles (typically 5-10 strata) based on their propensity scores, and treatment effects are estimated within each stratum before pooling [2]. This approach works well when the relationship between propensity score and outcome is approximately constant within strata.

  • Covariate Adjustment: The propensity score is included directly as a covariate in the outcome regression model [7]. While straightforward to implement, this approach requires correct specification of both the propensity score model and the outcome model.
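The matching and weighting approaches above can be sketched in a few lines. This is a minimal illustration on simulated data (greedy 1:1 nearest-neighbor matching with the conventional 0.2-SD logit caliper), not a substitute for a production implementation such as MatchIt or PROC PSMATCH.

```python
import numpy as np

def iptw_weights(z, ps):
    """ATE weights: w = Z/e(X) + (1-Z)/(1-e(X))."""
    return z / ps + (1 - z) / (1 - ps)

def match_1to1(z, ps, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbor matching without replacement on the
    logit of the PS, using the conventional 0.2-SD caliper."""
    logit = np.log(ps / (1 - ps))
    caliper = caliper_sd * logit.std()
    treated = np.flatnonzero(z == 1)
    controls = list(np.flatnonzero(z == 0))
    pairs = []
    for t in treated:
        if not controls:
            break
        dists = np.abs(logit[controls] - logit[t])
        j = int(np.argmin(dists))
        if dists[j] <= caliper:          # reject poor matches
            pairs.append((t, controls.pop(j)))
    return pairs

# Hypothetical scores and treatment assignments for illustration
rng = np.random.default_rng(1)
ps = rng.uniform(0.1, 0.9, size=200)
z = rng.binomial(1, ps)
w = iptw_weights(z, ps)
pairs = match_1to1(z, ps)
```

Each matched pair contributes one treated and one distinct control subject; the IPTW weights instead retain the full sample and reweight it.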

[Workflow diagram: Study Population (Treated + Untreated) → Propensity Score Estimation → PS Application Method → Covariate Balance Assessment → (if poor balance, return to PS Estimation; if balance achieved) → Outcome Analysis → Treatment Effect Estimate]

Figure 1: Propensity Score Analysis Workflow. This diagram illustrates the sequential process for implementing propensity score methods, highlighting the iterative balance assessment stage.

Advanced Applications and Case Studies

Addressing Complex Biases in Pharmacoepidemiology

Recent applications of propensity score methods have demonstrated their utility in addressing multiple methodological challenges simultaneously. A 2025 study in multiple sclerosis research implemented a high-dimensional propensity score analysis within a nested case-control framework to simultaneously address immortal time bias and residual confounding [6]. This innovative approach combined the design-based control of immortal time bias through the nested case-control design with the confounding control of hdPS, demonstrating a 28% reduction in mortality risk associated with disease-modifying drugs (HR: 0.72, 95% CI: 0.62-0.84) [6].

The hdPS algorithm was particularly valuable in this context as it could empirically identify hundreds of potential confounders from healthcare claims data, including diagnostic codes, procedure codes, and medication records [6]. The algorithm prioritizes covariates based on their potential for confounding control, allowing researchers to address residual confounding that might remain after adjustment for predefined covariates.

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Methodological Tools for Propensity Score Analysis

Tool Category | Specific Examples | Function in PS Analysis
Statistical Software | R (MatchIt, twang, WeightIt), SAS (PROC PSMATCH), Python (causalinference, psmatching) | Implementation of propensity score estimation and application methods
Balance Metrics | Standardized mean differences, variance ratios, Kolmogorov-Smirnov statistics | Quantifying covariate balance between treatment groups after PS adjustment
High-Dimensional Covariate Algorithms | hdPS, LASSO, Bayesian confounding adjustment | Automated covariate selection in settings with numerous potential confounders
Machine Learning Approaches | Random forests, gradient boosting, autoencoders, principal component analysis | Flexible propensity score estimation with minimal model misspecification
Sensitivity Analysis Methods | Rosenbaum bounds, E-values, propensity score calibration | Assessing robustness to unmeasured confounding

Validation and Sensitivity Analysis Protocols

Assessing Covariate Balance

The critical step in validating any propensity score analysis is assessing whether the method has successfully balanced covariates between treatment groups. The following protocol should be implemented:

  • Calculate standardized mean differences (SMD) for all covariates before and after propensity score adjustment. The SMD should be ≤0.1 for all important confounders to indicate adequate balance [4].

  • Examine variance ratios (ratio of variances between treatment groups) for continuous covariates, with ideal values between 0.5 and 2.0.

  • Visualize balance using Love plots, which display SMDs before and after adjustment, and empirical cumulative distribution function plots for continuous variables.

  • Assess propensity score distribution overlap using histograms or kernel density plots by treatment group.
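The SMD computation in the first step above is straightforward to implement. A minimal sketch using the pooled-standard-deviation formulation:

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardized mean difference with the pooled-SD denominator:
    SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2)."""
    var_t = np.var(x_treated, ddof=1)
    var_c = np.var(x_control, ddof=1)
    return (np.mean(x_treated) - np.mean(x_control)) / np.sqrt((var_t + var_c) / 2)

# Groups whose means differ by exactly one pooled SD give SMD = 1
print(smd(np.array([0.0, 1.0, 2.0]), np.array([-1.0, 0.0, 1.0])))  # -> 1.0
```

In a full analysis this function would be applied to every covariate before and after PS adjustment, flagging any |SMD| above 0.1.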

In the 2025 dialysis study, the superiority of autoencoder-based propensity scores was demonstrated through superior balance metrics, with only 8 covariates showing SMD>0.1 compared to 20-83 with other methods [4].

Sensitivity Analysis for Unmeasured Confounding

Since propensity scores can only adjust for measured confounders, sensitivity analysis is essential to assess potential residual confounding:

  • Propensity Score Calibration: This method uses additional information on a subset of patients to correct for unmeasured confounding [1] [8].

  • External Adjustment: Incorporate estimates of the strength of association between unmeasured confounders and both treatment and outcome from external literature [3].

  • E-Value Calculations: Quantify the minimum strength of association that an unmeasured confounder would need to have with both treatment and outcome to explain away the observed effect [3].
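The E-value for a point estimate has a closed form (VanderWeele and Ding's formula, with protective ratios inverted first). A minimal sketch, applied for illustration to the HR of 0.72 cited earlier:

```python
import math

def e_value(rr):
    """E-value for a risk/hazard ratio (VanderWeele & Ding):
    E = RR + sqrt(RR * (RR - 1)), after inverting ratios below 1."""
    rr = max(rr, 1 / rr)  # protective effects are inverted first
    return rr + math.sqrt(rr * (rr - 1))

# Hazard ratio 0.72 from the multiple-sclerosis example above
print(round(e_value(0.72), 2))  # -> 2.12
```

An unmeasured confounder would therefore need associations of roughly 2.1 with both treatment and outcome to fully explain away that estimate.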

[Causal diagram: Unmeasured Confounders → Treatment, Outcome; Treatment Indication → Treatment, Outcome; Measured Covariates → Propensity Score → Treatment]

Figure 2: Propensity Score Role in Addressing Confounding by Indication. This causal diagram illustrates how propensity scores (derived from measured covariates) help block backdoor paths created by treatment indications.

Propensity score methods represent a powerful approach for addressing confounding by indication in pharmacoepidemiological studies. When properly implemented, these methods can create balanced comparison groups that approximate the balance achieved in randomized trials, substantially reducing bias in treatment effect estimates [3]. The recent methodological advances in propensity score applications—including high-dimensional propensity scores, machine learning integration, and sophisticated weighting approaches—have enhanced their utility in modern healthcare database studies [4] [5].

Future developments in propensity score methodology will likely focus on improving robustness to model misspecification, enhancing approaches for time-varying treatments, and developing more sophisticated integration with machine learning techniques [1]. Additionally, there is growing interest in transparent reporting standards and sensitivity analysis frameworks that appropriately communicate the strength of evidence from propensity score analyses [3]. As these methods continue to evolve, they will play an increasingly important role in generating valid real-world evidence about treatment benefits and harms in clinical practice.

The Potential Outcomes Framework and Assumptions for Causal Inference

The Potential Outcomes Framework (POF), also known as the Rubin Causal Model (RCM), provides a formal mathematical foundation for defining and estimating causal effects. In the context of pharmacoepidemiological studies, which often rely on observational data from sources like health claims databases, this framework is indispensable for estimating the causal effects of drug exposures on patient outcomes while accounting for confounding [9] [10]. The framework augments the observed data by considering the outcomes that would occur under all possible treatment states, thus enabling a rigorous articulation of the assumptions required for causal inference.

Core Concepts of the Potential Outcomes Framework

Potential Outcomes and Causal Estimands

Consider a binary treatment Z (e.g., Z=1 for drug exposure and Z=0 for control). The potential outcomes framework augments the joint distribution of (Z, Y) by introducing two random variables, Y(1) and Y(0), where:

  • Y(1) is the outcome if the unit receives treatment.
  • Y(0) is the outcome if the unit does not receive treatment [9].

The observed outcome Y is then defined as Y = Y(1) if Z = 1 and Y = Y(0) if Z = 0, or, more compactly, Y_obs = Z·Y(1) + (1−Z)·Y(0) [9]. The key problem of causal inference is that for any individual unit, only one of these potential outcomes is observed; the other is counterfactual [9] [10].
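The observed-outcome equation can be made concrete with a toy dataset in which both potential outcomes are (artificially) known; in real data only one of the two columns per unit would ever be visible:

```python
import numpy as np

# Hypothetical toy data: in real studies only one potential outcome
# per unit is ever observed; here both are simulated to illustrate.
y1 = np.array([3.0, 2.0, 5.0, 4.0])  # outcomes under treatment
y0 = np.array([1.0, 1.0, 2.0, 2.0])  # outcomes under control
z = np.array([1, 1, 0, 0])           # actual treatment received

y_obs = z * y1 + (1 - z) * y0        # Y_obs = Z*Y(1) + (1-Z)*Y(0)
print(y_obs)                         # -> [3. 2. 2. 2.]
```

The unobserved entries (y0 for the first two units, y1 for the last two) are exactly the counterfactuals the framework reasons about.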

Within this framework, several causal estimands can be defined. The table below summarizes the most common ones.

Table 1: Key Causal Estimands in the Potential Outcomes Framework

Estimand | Definition | Interpretation
Individual Treatment Effect (ITE) | τ_i = Y_i(1) − Y_i(0) | The causal effect for a single unit i [9].
Average Treatment Effect (ATE) | E[Y(1) − Y(0)] | The expected effect of moving an entire population from control to treatment [9] [10].
Average Treatment Effect on the Treated (ATT) | E[Y(1) − Y(0) ∣ Z=1] | The average effect for those who actually received the treatment [10].

In pharmacoepidemiology, the choice between ATE and ATT depends on the research question. The ATE is often relevant for policy decisions (e.g., what is the effect of introducing a new drug to the entire population?), whereas the ATT is useful for understanding the effect on patients who typically receive a particular treatment [10].

The Fundamental Problem of Causal Inference

The core challenge is that the ITE τ_i is fundamentally unobservable because it requires simultaneously observing both Y_i(1) and Y_i(0) for the same unit [9] [10]. Therefore, statistical methods for causal inference must rely on comparisons between groups under assumptions that allow the unobserved potential outcomes to be inferred from the observed data.

Critical Assumptions for Causal Identification

For causal effects to be identified from observed data, certain assumptions must hold. The following diagram illustrates the core structure of the problem and the role of these assumptions.

[Diagram: Treatment (Z) → Observed Outcome (Y); potential outcomes Y(1) and Y(0) determine Y; Unobserved Confounders (U) → Z, Y(1), Y(0); Observed Covariates (X) → Z, Y(1), Y(0)]

Diagram 1: Potential Outcomes and Confounding. Y(1) and Y(0) are potential outcomes. Solid lines represent observed relationships, while dashed lines represent unobserved influences. Confounders (X, U) affect both treatment and outcomes.

Unconfoundedness (Ignorability)

This is the most critical assumption for causal inference in observational studies. It states that, conditional on observed covariates X, the treatment assignment Z is independent of the potential outcomes [9] [10] [11]: (Y(1), Y(0)) ⊥ Z | X. This means that, within strata defined by the covariates X, the assignment to treatment or control is as good as random [10]. This assumption is also known as the "no unmeasured confounding" assumption. In Diagram 1, this assumption would hold if all common causes of Z and Y (the confounders) are captured in X, with no role for U [12] [13].

Positivity (Overlap)

This assumption requires that every unit has a positive probability of receiving either treatment, given the covariates [10]: 0 < P(Z=1|X) < 1. In practice, this means that for all values of the observed covariates X, there are both treated and untreated units. This ensures that there is a comparable control unit for every treated unit, and vice versa, allowing for meaningful comparisons.

Consistency

The consistency assumption links the potential outcomes to the observed data. It states that the observed outcome for a unit that received treatment Z=z is exactly that unit's potential outcome under z [10]: if Z=z, then Y_obs = Y(z). This implies that the treatment is well-defined and that there is no interference between units (the potential outcome of one unit is not affected by the treatment assignment of other units).

Application in Pharmacoepidemiology: The Role of Propensity Scores

In high-dimensional pharmacoepidemiological studies, directly conditioning on all covariates X is often impractical due to the curse of dimensionality. The propensity score, defined as the probability of treatment assignment conditional on observed covariates, e(X) = P(Z=1|X), addresses this issue [10] [11]. Rosenbaum and Rubin showed that if treatment assignment is unconfounded given X, it is also unconfounded given the propensity score e(X) [10]: (Y(1), Y(0)) ⊥ Z | e(X). This allows researchers to adjust for confounding by using the scalar propensity score instead of the high-dimensional vector X.

Table 2: Common Propensity Score Methods in Pharmacoepidemiology

Method | Protocol Description | Key Considerations
Propensity Score Matching (PSM) | Treated subjects are matched to untreated subjects with similar propensity scores. The ATE or ATT is estimated by comparing outcomes between matched groups [10]. | Requires decisions on matching algorithm (e.g., 1:1 nearest-neighbor), caliper width, and with/without replacement. Assess balance of covariates post-matching.
Inverse Probability of Treatment Weighting (IPTW) | Subjects are weighted by the inverse of their probability of receiving the treatment they actually received. Weights: 1/e(X) for treated, 1/(1−e(X)) for untreated [10] [11]. | Creates a "pseudo-population" where confounding is eliminated. Can be unstable with extreme weights; truncated or stabilized weights are often used.
Stratification (Subclassification) | The sample is divided into strata (e.g., quintiles) based on the propensity score. Treatment effects are estimated within each stratum and then pooled [10]. | Typically, 5 strata remove ~90% of bias from a continuous confounder. Assess balance within strata.
Covariate Adjustment | The propensity score is included as a covariate in a regression model for the outcome [10]. | Simple to implement but relies on correct specification of the outcome model.
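As an illustration of the stratification row above, here is a minimal sketch of the within-stratum estimator on a deterministic toy cohort (the stratum labels stand in for propensity-score quintiles, and the data are constructed so the true effect is exactly 1 in every stratum):

```python
import numpy as np

def stratified_effect(y, z, stratum):
    """Estimate the ATE by averaging within-stratum mean differences,
    weighting each stratum by its share of the sample."""
    total, n = 0.0, len(y)
    for s in np.unique(stratum):
        m = stratum == s
        diff = y[m & (z == 1)].mean() - y[m & (z == 0)].mean()
        total += diff * m.sum() / n
    return total

# Toy data: outcome = stratum value, plus 1 if treated, so effect is 1
stratum = np.array([0, 0, 0, 1, 1, 1])
z       = np.array([1, 0, 0, 1, 1, 0])
y       = np.array([1.0, 0.0, 0.0, 2.0, 2.0, 1.0])
print(stratified_effect(y, z, stratum))  # -> 1.0
```

Weighting strata by sample share targets the ATE; weighting by the number of treated subjects per stratum would instead target the ATT.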

The following workflow diagram illustrates a standard protocol for applying propensity score methods in a pharmacoepidemiological study.

[Workflow diagram: 1. Define Study Cohort & Covariates (X) → 2. Estimate Propensity Score (e.g., via Logistic Regression) → 3. Apply PS Method (Matching, IPTW, Stratification) → 4. Assess Covariate Balance (e.g., Standardized Mean Differences; if balance inadequate, return to Step 2) → 5. Estimate Treatment Effect on Balanced Sample → 6. Conduct Sensitivity Analyses]

Diagram 2: Propensity Score Analysis Workflow. This iterative process ensures confounding is adequately addressed before effect estimation.

Advanced Applications: High-Dimensional Propensity Scores (hdPS)

Pharmacoepidemiological studies using claims data often contain hundreds of potential covariates. The high-dimensional propensity score (hdPS) algorithm is a semi-automated data-driven method to efficiently select and adjust for a large number of covariates from such databases [4]. The protocol involves:

  • Identifying Covariate Proxies: Identifying all recorded diagnoses, procedures, and drug prescriptions as potential covariates.
  • Prioritizing Covariates: Ranking these covariates by their potential for confounding based on their prevalence and association with the treatment and outcome.
  • Score Estimation: Including the top n covariates (e.g., 500) in the propensity score model [4] [6].

Recent research has shown that combining hdPS with dimensionality reduction techniques like autoencoders can further improve covariate balance in such high-dimensional settings [4].
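The prioritization step can be sketched with the Bross bias multiplier that the hdPS algorithm uses for ranking candidate covariates; the covariate names, prevalences, and relative risks below are entirely hypothetical:

```python
import math

def bross_bias_multiplier(p1, p0, rr_cd):
    """Bross (1966) multiplicative confounding bias for a binary covariate:
    p1/p0 = prevalence among exposed/unexposed, rr_cd = covariate-outcome
    relative risk (inverted first if below 1)."""
    rr_cd = max(rr_cd, 1 / rr_cd)
    return (p1 * (rr_cd - 1) + 1) / (p0 * (rr_cd - 1) + 1)

# Rank hypothetical claims-code covariates by |log(bias multiplier)|,
# keeping the top n, as in the hdPS prioritization step
candidates = {
    "dx_diabetes":    (0.40, 0.20, 2.0),  # exposure- and outcome-associated
    "rx_statin":      (0.30, 0.28, 1.1),  # weakly associated with both
    "px_colonoscopy": (0.10, 0.10, 3.0),  # no exposure association
}
ranked = sorted(
    candidates,
    key=lambda k: abs(math.log(bross_bias_multiplier(*candidates[k]))),
    reverse=True,
)
print(ranked[0])  # -> dx_diabetes
```

A covariate equally prevalent in both exposure groups has a multiplier of exactly 1 (zero potential bias), regardless of how strongly it predicts the outcome.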

The Scientist's Toolkit: Essential Reagents for Causal Inference

Table 3: Key Research Reagent Solutions for Causal Inference Studies

Reagent / Tool | Function | Application Notes
Structured Healthcare Databases | Provide longitudinal data on drug exposure, patient outcomes, and confounders. | Examples: claims data (e.g., Optum's Clinformatics Data Mart), electronic health records (EHR). Data quality and completeness are critical [4] [6].
Propensity Score Estimation Algorithms | Model the probability of treatment assignment given covariates. | Logistic regression is standard. Advanced methods: hdPS, machine learning (boosted regression, random forests, autoencoders) for high-dimensional data [4] [10] [11].
Balance Diagnostics | Quantify the similarity of covariate distributions between treated and control groups after PS adjustment. | Primary metric: standardized mean differences (SMD). Target: SMD < 0.1 for all covariates. Visualization: Love plots, overlap plots [4] [10].
Causal Graphical Models | Visually represent assumptions about causal relationships between variables. | Used to identify a sufficient set of confounders and to spot sources of bias like colliders [12] [13].
Sensitivity Analysis Frameworks | Quantify how strong an unmeasured confounder would need to be to explain away an observed effect. | Assesses the robustness of causal conclusions to potential violations of the unconfoundedness assumption [14].

Experimental Protocol: A Case Study on Dialysis and Mortality

To illustrate, here is a detailed protocol based on a real pharmacoepidemiological study investigating the association between dialysis and mortality in older heart failure patients [4].

Aim: To estimate the average treatment effect of dialysis on in-hospital mortality.
Data Source: Optum's de-identified Clinformatics Data Mart Database.
Design: Retrospective cohort study with propensity score matching.

Step-by-Step Protocol:

  • Cohort Definition:
    • Inclusion: Patients with heart failure and advanced chronic kidney disease.
    • Exposure: Initiation of dialysis (Z=1) vs. no dialysis (Z=0).
    • Outcome: All-cause in-hospital mortality (dichotomous).
    • Final Cohort: 485 exposed and 1,455 unexposed after matching.
  • Propensity Score Estimation:

    • Covariates (X): A high-dimensional set of covariates derived from claims data, including demographics, comorbidities, medications, and healthcare utilization.
    • Method Comparison: The study compared PS estimated via:
      • Investigator-specified covariates: Expert-selected variables.
      • hdPS: A data-driven algorithm.
      • Dimensionality Reduction Techniques: Principal Component Analysis (PCA), logistic PCA, and autoencoders [4].
  • Propensity Score Application:

    • Matching: 1:1 matching was performed without replacement within a specified caliper.
    • Balance Assessment: Covariate balance was assessed using SMDs across all PS methods.
  • Outcome Analysis:

    • Effect Estimation: Hazard ratios for mortality were estimated using Cox proportional hazards models on the matched cohorts.
    • Result: Autoencoder-based PS achieved the best covariate balance (only 8 covariates with SMD > 0.1), and hazard ratios for mortality were similar across all PS methods [4].

This case study demonstrates the application of the potential outcomes framework and highlights how advanced PS methods can improve confounding control in real-world research.

In pharmacoepidemiological research, which often relies on large, observational healthcare databases, defining the precise causal question is the critical first step before any analysis begins [1]. The causal estimand is a precise description of the causal quantity one seeks to learn from the data, specifying the target population, the treatment contrast of interest, and the outcome [15]. Within the potential outcomes framework, three fundamental estimands are the Sample Average Treatment Effect (SATE), the Sample Average Treatment Effect on the Treated (SATT), and the Population Average Treatment Effect (PATE) [1] [16]. The choice between them hinges on the underlying clinical question and dictates the analytical approach, the interpretation of results, and the scope of inference. Propensity score methods are a primary tool for estimating these estimands from observational data by attempting to mimic the balance achieved in randomized trials [1] [17]. This document outlines the definitions, applications, and estimation protocols for SATE, SATT, and PATE, framed specifically for pharmacoepidemiology research.

Defining the Estimands

Core Definitions and Formulations

The following table summarizes the core definitions and formulations of the three primary estimands.

Table 1: Core Definitions of SATE, SATT, and PATE

Estimand | Full Name | Definition | Causal Question | Primary Application Context
SATE | Sample Average Treatment Effect | SATE = (1/n) × Σ_i [Y_i(1) − Y_i(0)] over all units i in the study sample [1]. | What is the average treatment effect for all individuals in our study sample? | Efficacy evaluation within a randomized controlled trial (RCT) or a well-defined observational cohort [16].
SATT | Sample Average Treatment Effect on the Treated | SATT = (1/n_t) × Σ_i [Y_i(1) − Y_i(0)] over all units i who actually received the treatment in the sample [1]. | What is the average treatment effect for those individuals who actually received the treatment in our study? | Effectiveness and safety research in pharmacoepidemiology, where inference is for the patients who are prescribed the drug in real-world practice [18] [1].
PATE | Population Average Treatment Effect | PATE = E[Y(1) − Y(0)] over all units in the broader target population [16]. | What is the average treatment effect for the entire target population of interest? | Guiding broad policy or formulary decisions for an entire patient population (e.g., all patients with a specific condition in a country) [16].

These estimands are defined within the potential outcomes framework (or Rubin Causal Model) [1] [15]. For each individual i, there exists a potential outcome under treatment, Y_i(1), and a potential outcome under control, Y_i(0). The individual treatment effect is τ_i = Y_i(1) - Y_i(0). The fundamental problem of causal inference is that we can never observe both Y_i(1) and Y_i(0) for the same individual [15]. Therefore, we define average effects, like SATE, SATT, and PATE, over groups of individuals.
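A toy example with (artificially) complete potential outcomes makes the distinction concrete; when treatment effects happen to be larger among those actually treated, SATT exceeds SATE:

```python
import numpy as np

# Hypothetical complete potential-outcome data (never fully observed
# in practice) to show how SATE and SATT can differ.
y1 = np.array([5.0, 3.0, 4.0, 6.0])
y0 = np.array([2.0, 2.0, 1.0, 3.0])
z = np.array([1, 0, 1, 0])

ite = y1 - y0                 # individual effects tau_i: [3, 1, 3, 3]
sate = ite.mean()             # average over the whole sample
satt = ite[z == 1].mean()     # average over treated units only
print(sate, satt)             # -> 2.5 3.0
```

Real analyses must estimate these quantities from observed data alone, which is where the propensity score methods of the next section come in.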

Conceptual Relationships and Workflow

The following diagram illustrates the conceptual relationship between the different estimands and the general workflow for defining a causal question.

[Diagram: Target Population (PATE) → (sampling) → Study Sample → SATE; Study Sample → Treated Subgroup → SATT. Workflow: 1. Define Causal Estimand → 2. Study Design & Sampling → 3. Analysis & Inference, which generalizes back to the target population (transportability)]

Estimation in Pharmacoepidemiology

The Role of Propensity Scores

In observational pharmacoepidemiology, treatment is not randomly assigned. This leads to confounding, as treated and untreated groups differ in their prognostic characteristics [17]. Propensity score methods are a primary tool to control for this confounding by modeling the probability of treatment assignment given observed covariates [1].

The propensity score for a binary treatment A given covariates X is defined as e(X) = P(A=1 | X) [1]. Rosenbaum and Rubin proved that the propensity score is a balancing score: conditioning on e(X) balances the distribution of the observed covariates X between treatment and control groups. Under the additional assumption of conditional exchangeability (ignorability) given X, this allows causal effects to be estimated as if treatment had been randomized with respect to the measured covariates [1].

Linking Estimands to Propensity Score Methods

The choice of estimand directly influences how propensity scores are applied.

  • SATT: When the goal is to estimate the SATT, the analysis is focused on achieving balance and making inferences for the treated individuals. Propensity score matching (PSM) is a natural choice, where each treated individual is matched to one or more untreated individuals with similar propensity scores. The effect is then estimated by contrasting outcomes within these matched pairs, effectively creating a pseudo-population that resembles the treated group but with similar covariate distributions [1] [17].
  • SATE/PATE: To estimate the SATE or PATE, which pertain to the entire sample or population, inverse probability of treatment weighting (IPTW) is often used. IPTW creates a weighted pseudo-population where the weights, w = A/e(X) + (1-A)/(1-e(X)), are inversely proportional to the probability of receiving the treatment actually received. This effectively simulates a scenario where every unit had the same chance of being treated, thus allowing for the estimation of an effect for the entire group [1]. A key challenge for PATE is that the study sample may not be perfectly representative of the target population, requiring additional transportability methods [19].
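
As an illustration of the IPTW weights defined above (simulated data, Python; the data-generating model and the small Newton-Raphson fit are our own, not from the cited sources), the weighted contrast recovers a treatment effect that the crude comparison overstates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# One confounder x affects both treatment assignment and outcome.
x = rng.normal(size=n)
e_true = 1 / (1 + np.exp(-(0.5 + x)))        # true propensity score
a = rng.binomial(1, e_true)
y = 2.0 * a + 1.5 * x + rng.normal(size=n)   # true treatment effect = 2.0

# Fit the logistic PS model by Newton-Raphson (IRLS) on design [1, x].
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (a - p))
ps = 1 / (1 + np.exp(-X @ beta))

# IPTW weights targeting the SATE: w = A/e(X) + (1-A)/(1-e(X)).
w = a / ps + (1 - a) / (1 - ps)

naive = y[a == 1].mean() - y[a == 0].mean()           # confounded contrast
iptw = (np.average(y[a == 1], weights=w[a == 1])
        - np.average(y[a == 0], weights=w[a == 0]))   # weighted contrast

print(round(naive, 2), round(iptw, 2))
```

The weighted (Hájek-style) contrast is close to the true effect of 2.0, while the crude difference absorbs the confounding through x.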

Table 2: Application of Propensity Score Methods by Estimand

Estimand Recommended Propensity Score Method Intuitive Explanation Key Considerations
SATT Matching (e.g., 1:1 nearest neighbor) Finds a control "twin" for each treated individual based on their probability of being treated. The effect is the average outcome difference between each treated individual and their twin. Preserves the original treated group. The quality of the estimate depends on the ability to find good matches for all treated individuals.
SATE Inverse Probability of Treatment Weighting (IPTW) Uses weights to make the treated group look like the full sample and the control group look like the full sample. The weighted groups represent a pseudo-population where everyone had the same chance to be treated or not. Can be inefficient if some propensity scores are very close to 0 or 1. Requires careful checking of weight distributions.
PATE IPTW + Transportability Methods First uses IPTW to balance the study sample, then uses a second set of weights (inverse odds of sampling weights) to make the balanced study sample resemble the target population. Requires data on covariates from the target population. Relies on the strong assumption that all effect modifiers are measured and accounted for [19].

Detailed Experimental Protocol: Estimating SATT via Propensity Score Matching

The following workflow details the steps for estimating the SATT using propensity score matching, a common application in pharmacoepidemiology.

[Diagram: six-step workflow. 1. Variable definition (treatment, outcome, and confounding variables) → 2. Estimate propensity scores (logistic regression or machine learning such as GBM/BART) → 3. Perform matching (e.g., nearest-neighbor matching without replacement) → 4. Check covariate balance (standardized mean differences; goal: SMD < 0.1 for all covariates) → 5. Estimate treatment effect (paired t-test or conditional logistic regression on the matched dataset) → 6. Sensitivity analysis (assess robustness to unmeasured confounding).]

Protocol Steps:

  • Variable Definition [17]:

    • Treatment Variable: Clearly define the treatment (e.g., initiation of a new drug vs. an active comparator).
    • Outcome Variable: Define the clinical outcome of interest (e.g., hospitalization, mortality).
    • Confounding Variables: Select a set of pre-treatment covariates X that are potential common causes of both the treatment and the outcome. This should include demographic information, clinical comorbidities, medication history, and healthcare utilization measures. Leverage clinical expertise and guidelines to build a plausible model.
  • Estimate Propensity Scores [17]:

    • Fit a model, typically a logistic regression, where the treatment indicator is the dependent variable and all confounding covariates are independent variables.
    • logit(P(A=1 | X)) = β₀ + β₁X₁ + ... + βₖXₖ
    • The predicted values from this model are the estimated propensity scores, ê(X), for each individual.
    • Advanced methods like Generalized Boosted Models (GBM) or Bayesian Additive Regression Trees (BART) can also be used for more flexible estimation.
  • Perform Matching [17]:

    • Use a matching algorithm to pair treated individuals with untreated individuals who have a similar propensity score.
    • Nearest-neighbor matching is the most common. Each treated individual is matched to the untreated individual with the closest propensity score, often within a pre-specified "caliper" (e.g., 0.2 standard deviations of the logit of the propensity score) to prevent poor matches.
    • Unmatched individuals are excluded from the subsequent analysis.
  • Check Covariate Balance [17]:

    • After matching, it is imperative to check if the distribution of covariates is now similar between the treated and control groups.
    • Use quantitative measures like standardized mean differences (SMD) for each covariate. An SMD below 0.1 (10%) is generally considered indicative of good balance.
    • If imbalance persists, the propensity score model or matching parameters may need to be revised (e.g., adding interaction terms, using a different caliper).
  • Estimate Treatment Effect (SATT):

    • In the matched sample, compare the outcomes between the treated and control groups.
    • Since matching creates paired data, a paired t-test (for continuous outcomes) or conditional logistic regression (for binary outcomes) can be used. Alternatively, a generalized linear model (e.g., logistic regression) that includes the matched pairs as strata can be fitted [17].
    • The resulting coefficient for the treatment variable is an estimate of the SATT.
  • Sensitivity Analysis:

    • Conduct analyses to assess how sensitive the results are to unmeasured confounding. Methods like the Rosenbaum bounds can be used to determine how strong an unmeasured confounder would need to be to qualitatively change the study conclusions [1].
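
The protocol steps above can be sketched end-to-end. This is a minimal, illustrative implementation (simulated data; greedy 1:1 nearest-neighbor matching on the known propensity score for brevity, whereas a real analysis would use a fitted score and a dedicated package such as 'MatchIt'):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# Simulated cohort: x confounds treatment and outcome; true effect = 1.0.
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-(x - 0.5)))            # propensity score (known here)
a = rng.binomial(1, ps)
y = 1.0 * a + 2.0 * x + rng.normal(size=n)

# Caliper: 0.2 SD of the logit of the propensity score.
logit_ps = np.log(ps / (1 - ps))
caliper = 0.2 * logit_ps.std()

treated = np.where(a == 1)[0]
available = set(np.where(a == 0)[0].tolist())

# Greedy 1:1 nearest-neighbor matching without replacement, within caliper.
pairs = []
for t in treated:
    cands = np.fromiter(available, dtype=int)
    d = np.abs(logit_ps[cands] - logit_ps[t])
    j = int(cands[np.argmin(d)])
    if d.min() <= caliper:
        pairs.append((t, j))
        available.remove(j)

pairs = np.array(pairs)
naive = y[a == 1].mean() - y[a == 0].mean()            # confounded contrast
satt_hat = np.mean(y[pairs[:, 0]] - y[pairs[:, 1]])    # paired difference

print(len(pairs), round(naive, 2), round(satt_hat, 2))
```

The paired difference approximates the SATT (true value 1.0), while the naive contrast is badly confounded; unmatched treated individuals outside the caliper are excluded, as in the protocol.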

The Scientist's Toolkit

Table 3: Essential Research Reagents and Software for Causal Inference Analysis

Tool / Reagent Category Function in Causal Workflow Examples & Notes
'MatchIt' R Package [17] Software A comprehensive tool for performing propensity score matching and other matching methods. Implements nearest-neighbor, optimal, full, and genetic matching. Integrates with the R ecosystem for balance assessment and outcome analysis.
'cobalt' R Package [17] Software Covariate Balance Assessment Tables and Plots. Provides a wealth of functions and graphics (e.g., love plots) to evaluate covariate balance before and after applying propensity score methods.
High-Dimensional Propensity Score (hdPS) [1] Algorithm Automates the selection of a large number of potential confounders from healthcare claims data. Identifies and prioritizes covariates based on their prevalence and potential for confounding. Useful for dealing with the high dimensionality of administrative databases.
DAGitty [20] Software A browser-based tool for creating, editing, and analyzing causal Directed Acyclic Graphs (DAGs). Helps researchers visually articulate and test their causal assumptions, identify minimal sufficient adjustment sets, and detect potential biases like M-bias.
Stable Unit Treatment Value Assumption (SUTVA) [16] Conceptual Assumption The foundational assumption that one unit's outcome is unaffected by another unit's treatment assignment. Violations (e.g., interference or contagion) complicate causal inference. Must be considered in the study design phase.
Generalized Linear Model (GLM) [17] Statistical Model The standard workhorse for estimating the propensity score (via logistic regression) and for outcome analysis after matching or weighting. Flexible framework for different types of outcomes (binary, continuous, count).

The explicit definition of the causal estimand—SATE, SATT, or PATE—is a fundamental prerequisite for rigorous pharmacoepidemiological research. SATT is often the most relevant estimand for questions of drug effectiveness and safety in real-world practice, as it directly concerns the patients who are actually prescribed the treatment. Propensity score methods, particularly matching for SATT, provide a powerful design-based approach to minimize confounding by indication in observational studies. However, no analytical method can compensate for a poorly defined causal question. By starting with a clear estimand, researchers can select an appropriate methodology, justify their analytical choices, and ultimately produce evidence that is interpretable and meaningful for clinical and regulatory decision-making.

Pharmacoepidemiology bridges clinical pharmacology and epidemiology, studying the use and effects of medications in large human populations [21]. While randomized controlled trials (RCTs) remain the gold standard for establishing efficacy, they have inherent limitations including strict inclusion criteria, short follow-up durations, and limited power for rare adverse events [22] [21]. Consequently, observational studies using real-world data (RWD) provide essential complementary evidence on drug effectiveness and safety in routine clinical practice [23].

However, analyses of observational data face formidable methodological challenges, primarily confounding by indication and various selection biases [1] [24]. In clinical practice, treatments are prescribed selectively based on clinical parameters—healthcare professionals prescribe when anticipating benefit and withhold treatment when concerned about adverse events [1]. This fundamental aspect of clinical decision-making creates systematic differences between treatment groups that, if unaddressed, render crude outcome comparisons uninterpretable [24].

The new-user design and active comparator design constitute a paradigm shift in pharmacoepidemiology that addresses these fundamental methodological challenges [24] [25]. When combined into the active comparator, new user (ACNU) design, these approaches enable observational studies to emulate the design of head-to-head randomized trials, significantly improving the validity of real-world evidence [24] [22]. This article explores the foundational concepts, implementation protocols, and analytical framework of these designs within the context of modern pharmacoepidemiologic research employing propensity score methods.

Conceptual Foundations and Rationale

The Active Comparator Design

The active comparator design compares the drug of interest ('Drug A') to another active drug ('Drug B') used for the same indication, rather than comparing to non-users [22] [25]. This approach provides three distinct methodological advantages:

First, it increases overlap of measured characteristics between treatment groups. By selecting comparator drugs with similar therapeutic indications, the design creates treatment groups that are more similar in terms of measured baseline characteristics, facilitating more effective statistical adjustment [22].

Second, it reduces potential for unmeasured confounding. Non-user groups often include patients with contraindications to treatment or those with very mild disease, introducing systematic differences in unmeasured characteristics like frailty or disease severity [22]. As demonstrated in studies of influenza vaccine, comparisons with non-users can yield implausibly strong protective effects against all-cause mortality due to such confounding [24]. Active comparator groups minimize these differences by restricting comparisons to patients with clear treatment indications [22].

Third, it addresses more clinically relevant questions. For many chronic conditions where some treatment is necessary, the relevant clinical question is not whether to treat but which treatment to choose [22]. The active comparator design directly answers this question by providing evidence on comparative effectiveness and safety between therapeutic alternatives [22].

The New-User Design

The new-user design (also known as incident user design or initiator design) identifies patients at the time of treatment initiation and begins follow-up at this point [22] [25]. This approach offers several critical advantages over prevalent user designs, which include both new and existing users:

This design enables assessment of time-varying hazards and drug effects. The risk of many adverse events changes over time, often highest shortly after treatment initiation [22]. For instance, studies of TNF-α inhibitors in rheumatoid arthritis have demonstrated the highest infection risk occurs within the first 90 days of treatment [22]. Prevalent user designs miss these time-varying hazards because they include persons who have already tolerated the treatment [22].

The new-user approach also ensures appropriate confounding adjustment by clearly establishing a baseline measurement point. This allows investigators to accurately distinguish pretreatment covariates from posttreatment variables, preventing adjustment for mediators that lie on the causal pathway between treatment and outcome [22].

Additionally, this design eliminates immortal time bias when combined with an active comparator. Immortal time refers to a period of follow-up during which the outcome cannot occur because of the study design [22]. By aligning the start of follow-up for both treatment groups at treatment initiation, the new-user design avoids this methodological pitfall [22].

Table 1: Key Advantages of Foundational Design Components

Design Component Key Advantages Methodological Threats Addressed
Active Comparator - Increases overlap of measured characteristics- Reduces unmeasured confounding- Answers clinically relevant questions - Confounding by indication- Healthy user/sick stopper effects- Channeling bias
New-User Design - Captures time-varying hazards- Ensures appropriate covariate measurement- Defines accurate treatment duration - Immortal time bias- Prevalent user bias- Depletion of susceptibles

Implementation Protocols

Cohort Definition and Eligibility Criteria

Implementing the ACNU design begins with defining a source population and establishing eligibility criteria that emulate the inclusion criteria of a target randomized trial [26]. The process involves:

  • Identifying the disease cohort: Define a population of patients with the medical condition for which the study and comparator drugs are indicated [24].
  • Establishing drug exposure definitions: Clearly define the pharmaceutical agents of interest, including specific formulations, dosages, and administration routes [27].
  • Applying inclusion/exclusion criteria: Apply criteria related to demographics, clinical characteristics, and healthcare utilization to create a study population that would be eligible for either treatment in clinical practice [22].
  • Defining the cohort entry date: For both treatment groups, the cohort entry date ("time zero") is the date of treatment initiation after meeting all eligibility criteria [27] [22].

Follow-Up Period and Outcome Assessment

The follow-up protocol must be specified a priori to ensure temporal precedence of exposure before outcome:

  • Start of follow-up: Follow-up begins immediately after the cohort entry date for both treatment groups [27] [22].
  • End of follow-up: Follow-up continues until the earliest of: outcome occurrence, treatment discontinuation, switching to comparator drug, end of study period, loss to follow-up, or death [27].
  • Risk window specification: Define biologically plausible risk windows based on the pharmacodynamics and pharmacokinetics of the drugs under study [27].
  • Outcome ascertainment: Implement validated algorithms to identify outcomes of interest, using specific combinations of diagnosis codes, procedures, and medications to maximize specificity and sensitivity [26].
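
The end-of-follow-up rule ("earliest of" several censoring events) can be implemented directly. A minimal sketch with pandas, using hypothetical column names of our own:

```python
import pandas as pd

# Hypothetical per-patient dates; NaT (None) means that event never occurred.
df = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "outcome_date":         pd.to_datetime(["2021-06-01", None, None]),
    "discontinuation_date": pd.to_datetime([None, "2021-03-15", None]),
    "switch_date":          pd.to_datetime([None, None, "2021-09-01"]),
    "death_date":           pd.to_datetime([None, None, None]),
})
df["study_end"] = pd.Timestamp("2021-12-31")   # administrative end of study

candidates = ["outcome_date", "discontinuation_date",
              "switch_date", "death_date", "study_end"]

# Earliest applicable date per patient (NaT entries are skipped by min).
df["end_of_followup"] = df[candidates].min(axis=1)
# Which event ended follow-up: first column matching the minimum date.
df["reason"] = df[candidates].eq(df["end_of_followup"], axis=0).idxmax(axis=1)

print(df[["patient_id", "end_of_followup", "reason"]])
```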

[Diagram: New-User Cohort Identification and Follow-up. From the source population (patients with the condition of interest), new users of Drug A and new users of Drug B are identified; time zero is treatment initiation for both groups; person-time then accumulates until end of follow-up (outcome, discontinuation, switch, death, or end of study).]

Figure 1: Implementation Workflow for Active Comparator, New-User Design. This diagram illustrates the sequential process of defining study cohorts and follow-up periods according to ACNU principles.

Covariate Selection and Measurement

A critical step in implementing the ACNU design is the appropriate selection and measurement of potential confounders:

  • Define baseline period: Establish a fixed period before treatment initiation (e.g., 6-12 months) for assessing all covariates [22].
  • Identify confounder domains: Consider demographics, clinical characteristics, comorbidities, healthcare utilization patterns, concomitant medications, and laboratory values [1].
  • Apply causal knowledge: Prioritize variables that are risk factors for the outcome and associated with treatment assignment, while avoiding instruments (predictors of treatment only) and mediators (variables on causal pathway) [1] [28].
  • Address high-dimensional data: In databases with numerous potential covariates, consider algorithms like the high-dimensional propensity score (hdPS) that systematically select covariates based on their association with both treatment and outcome [1].

Analytical Framework: Integration with Propensity Score Methods

Propensity Score Estimation in ACNU Studies

The ACNU design creates an ideal foundation for propensity score methods by ensuring appropriate covariate measurement and temporal alignment [1] [22]. The propensity score is defined as the probability of treatment assignment conditional on observed baseline covariates [1] [28]. In ACNU studies:

  • Model specification: The propensity score model includes all measured baseline covariates identified in the design phase, typically using logistic regression for binary treatments [1] [28].
  • Modeling goal: The objective is not prediction perfection but achieving balance in covariates between treatment groups; thus, predictive performance metrics like c-statistics should not guide model specification [1].
  • Extended applications: For multiple active comparators, the generalized propensity score can be estimated using multinomial regression models [1].

Propensity Score Application and Balance Assessment

After estimation, propensity scores can be applied through various methods:

  • Matching: Creates comparable groups by matching each treated individual with one or more comparator individuals with similar propensity scores [28].
  • Weighting: Creates a pseudo-population using inverse probability of treatment weights where covariates are balanced between groups [1].
  • Stratification: Divides the study population into strata based on propensity score quantiles and estimates treatment effects within each stratum [28].
  • Covariate adjustment: Includes the propensity score as a covariate in the outcome regression model [28].
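
As an illustration of the stratification approach (simulated data, Python; the data-generating model and quintile choice are ours, not from the cited sources), subclassification on propensity score quintiles removes most, though not all, of the confounding:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# One confounder x; true treatment effect = 1.5.
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))      # propensity score (known here for brevity)
a = rng.binomial(1, ps)
y = 1.5 * a + x + rng.normal(size=n)

# Subclassify into propensity score quintiles.
edges = np.quantile(ps, [0.2, 0.4, 0.6, 0.8])
stratum = np.digitize(ps, edges)

effects, sizes = [], []
for s in range(5):
    m = stratum == s
    effects.append(y[m & (a == 1)].mean() - y[m & (a == 0)].mean())
    sizes.append(m.sum())

naive = y[a == 1].mean() - y[a == 0].mean()
pooled = np.average(effects, weights=sizes)   # size-weighted pooling

print([round(e, 2) for e in effects], round(naive, 2), round(pooled, 2))
```

The pooled stratified estimate lands near the true 1.5 while the crude contrast does not; the small remaining gap illustrates the residual within-stratum imbalance noted in Table 2.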

Table 2: Propensity Score Applications in ACNU Studies

Application Method Implementation Considerations for ACNU Studies
Propensity Score Matching 1:1 or variable ratio matching with caliper - May reduce sample size- Optimizes comparability at individual level- Targets effect in the treated
Inverse Probability Weighting Weights = 1/PS for treated, 1/(1-PS) for untreated - Maintains original sample size- Creates pseudo-population- Can be unstable with extreme weights
Stratification Subclassification into 5-10 quantiles - Simple implementation- Allows effect modification assessment- May have residual within-stratum imbalance
Covariate Adjustment Include PS as covariate in outcome model - Simple approach- Assumes correct functional form- Less robust than other methods

Crucially, the success of propensity score methods depends on achieving covariate balance between treatment groups after application. Balance should be assessed using standardized differences rather than statistical significance tests, with differences <0.1 generally indicating adequate balance [1] [28].
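
A minimal sketch of this balance check (the helper function is our own, supporting both unweighted and IPTW-weighted standardized differences on simulated data):

```python
import numpy as np

def smd(x, a, w=None):
    """(Weighted) standardized mean difference of covariate x
    between groups a == 1 and a == 0."""
    if w is None:
        w = np.ones_like(x, dtype=float)
    def stats(mask):
        m = np.average(x[mask], weights=w[mask])
        v = np.average((x[mask] - m) ** 2, weights=w[mask])
        return m, v
    m1, v1 = stats(a == 1)
    m0, v0 = stats(a == 0)
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)

rng = np.random.default_rng(4)
n = 20_000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))                 # true propensity score
a = rng.binomial(1, ps)                   # confounded assignment
w = a / ps + (1 - a) / (1 - ps)           # IPTW weights

smd_raw = smd(x, a)          # large imbalance before weighting
smd_weighted = smd(x, a, w)  # should fall below the 0.1 threshold

print(round(smd_raw, 2), round(abs(smd_weighted), 3))
```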

Outcome Analysis and Effect Estimation

Following propensity score application and balance verification, outcome analysis proceeds:

  • Model specification: Choose an appropriate regression model (e.g., Cox proportional hazards, logistic, or Poisson regression) based on the outcome type and distribution [27].
  • Effect measure selection: Report both relative (hazard ratios, risk ratios) and absolute (risk differences) effect measures to facilitate clinical interpretation [27].
  • Precision estimation: Use robust variance estimators for weighting approaches and account for matched design in matched analyses [1].
  • Sensitivity analyses: Conduct analyses to assess potential impact of unmeasured confounding, including quantitative bias analysis and E-values [1].
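
For the E-value specifically, the point-estimate formula of VanderWeele and Ding has a closed form: for a risk ratio RR ≥ 1, E-value = RR + sqrt(RR × (RR − 1)), with RR < 1 first inverted. A small sketch:

```python
import math

def e_value(rr):
    """E-value for a point-estimate risk ratio (VanderWeele & Ding):
    the minimum strength of association an unmeasured confounder would
    need with both treatment and outcome to explain away the estimate."""
    r = rr if rr >= 1 else 1 / rr   # protective estimates are inverted first
    return r + math.sqrt(r * (r - 1))

print(round(e_value(1.8), 2))   # -> 3.0
```

An observed RR of 1.8 thus requires an unmeasured confounder associated with both treatment and outcome by a risk ratio of about 3.0 to fully explain it away.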

[Diagram: ACNU design (cohort definition) → propensity score estimation → covariate balance assessment; if balance is inadequate, return to propensity score estimation; if adequate, proceed to outcome analysis (effect estimation).]

Figure 2: Analytical Framework Integrating ACNU Design with Propensity Scores. This workflow demonstrates the iterative process of propensity score application with balance assessment as the critical decision point.

The Scientist's Toolkit: Essential Research Reagents

Table 3: Essential Methodological Reagents for ACNU Studies with Propensity Scores

Research Reagent Function/Purpose Implementation Considerations
Active Comparator Drugs Therapeutic alternative with similar indications - Should represent viable clinical alternative- Similar mechanism of action preferred but not required- Must have sufficient sample size
New-User Cohort Population of treatment initiators - Requires washout period to establish new use- Clear operational definition of initiation- Captures all eligible initiators in source population
Propensity Score Model Predicts treatment probability given covariates - Includes all measured confounders- Avoids overfitting- Focus on balance rather than prediction
Balance Metrics Assesses comparability after PS application - Standardized differences preferred over p-values- Threshold <0.1 indicates balance- Assess both individual covariates and overall balance
High-Dimensional Propensity Score (hdPS) Algorithmic covariate selection in large databases - Identifies covariates from data dimensions- Particularly useful in claims data- Requires sufficient sample size

The new-user design and active comparator design represent foundational methodological advances in pharmacoepidemiology that substantially strengthen the validity of real-world evidence. When implemented through the structured protocols outlined in this article and integrated with propensity score methods, these approaches enable observational studies to more closely approximate randomized trials, addressing pervasive biases like confounding by indication and healthy user effects.

The ACNU framework provides the methodological foundation for generating high-quality evidence on the comparative effectiveness and safety of medical products across their lifecycle. As pharmacoepidemiology continues to evolve with increasing access to large healthcare databases and advanced analytical techniques, these core design principles remain essential for producing reliable evidence to inform clinical practice and regulatory decision-making.

Pharmacoepidemiology, which assesses the use and effects of drugs in large populations, relies heavily on observational studies using secondary healthcare databases. Unlike randomized controlled trials (RCTs), where randomization balances both known and unknown prognostic factors, observational studies are susceptible to multiple systematic biases that can distort the true relationship between drug exposure and patient outcomes [29] [30]. Failure to adequately identify and mitigate these biases can lead to erroneous conclusions about a drug's safety or effectiveness, with significant implications for clinical practice and public health [29] [31]. Within the broader context of a thesis on propensity score methods in pharmacoepidemiological research, this document provides detailed application notes and protocols for three common and impactful biases: immortal time bias, selection bias (with a focus on channeling bias), and confounding by indication. The aim is to equip researchers with structured, practical methodologies to enhance the validity of their observational studies.

Table 1: Definitions and Impact of Key Biases

Bias Type Definition Common Impact on Effect Estimates
Immortal Time Bias A period of follow-up during which the outcome under study cannot occur, by design, due to how exposure is defined [32] [33]. Often exaggerates treatment benefit; can artificially reverse the direction of effect [33] [34].
Channeling Bias A selection bias where drugs with similar indications are preferentially prescribed to groups of patients with varying baseline prognoses or risk levels [29] [30]. Can create a spurious association by making one drug appear more harmful or beneficial due to the underlying risk profile of its users.
Confounding by Indication Occurs when the underlying diagnosis or clinical reason for prescribing a drug is itself a risk factor for the outcome under study [29]. Severely confounds the exposure-outcome relationship, as the treatment indicator is a marker for the severity of the underlying illness.

Application Notes & Experimental Protocols

Immortal Time Bias

Application Notes

Immortal time bias is a pervasive time-related bias that arises from a misalignment between the start of follow-up (time-zero) and the assignment of exposure [32] [33]. It frequently occurs in pharmacoepidemiology when exposure is defined by a first prescription fill that happens some time after a patient qualifies for cohort entry (e.g., after a diagnosis or hospital discharge). The period between cohort entry and this first prescription is "immortal" because the patient must necessarily have survived and not experienced the outcome to have received the exposure [33]. When this immortal person-time is misclassified as exposed time—or excluded entirely—it artificially inflates the survival time of the exposed group, leading to a spurious protective effect [32]. This bias has been shown to substantially distort findings, sometimes even reversing the conclusions of a study [33].

Experimental Protocol for Identification and Mitigation

Objective: To design and analyze an observational cohort study that accurately accounts for immortal time between cohort entry and first drug exposure.

Materials and Data Requirements:

  • A longitudinal healthcare database with information on patient demographics, drug prescriptions, hospital admissions, and vital status.
  • Clearly defined:
    • Cohort entry date (e.g., date of diagnosis, hospital discharge).
    • Exposure of interest (e.g., a specific drug class).
    • Study outcome (e.g., mortality, hospital readmission).

Procedure:

  • Cohort Definition: Define a population of incident users, where patients enter the cohort at the start of a new treatment episode [35]. Apply a washout period to ensure no prior use of the study drug.
  • Time-Zero Alignment: Align the start of follow-up (time-zero) and exposure assignment. A patient should be classified as exposed from the date of their first prescription fill onwards. All person-time before this date is considered unexposed.
  • Time-Dependent Analysis: Employ a statistical model that treats drug exposure as a time-varying covariate.
    • In a Cox proportional hazards model, this means creating a dataset where each patient's follow-up is split into multiple rows based on changes in exposure status.
    • Person-time from cohort entry until the first prescription is coded as "unexposed."
    • Person-time from the first prescription onward is coded as "exposed."
  • Analysis: Run the Cox regression model using the time-varying exposure variable to estimate the hazard ratio for the outcome.
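
Steps 2-4 above can be sketched as a person-time splitting routine that produces counting-process style (start, stop] rows for a time-varying Cox model. A minimal pandas illustration with hypothetical data (the resulting long table could then be fed to, e.g., lifelines' CoxTimeVaryingFitter):

```python
import pandas as pd

# Hypothetical cohort: follow-up starts at cohort entry; first_rx is the
# date of the first prescription (None = never exposed during follow-up).
cohort = pd.DataFrame({
    "id": [1, 2, 3],
    "entry": pd.to_datetime(["2020-01-01"] * 3),
    "first_rx": pd.to_datetime(["2020-03-01", None, "2020-01-01"]),
    "end": pd.to_datetime(["2020-12-31", "2020-06-30", "2020-08-15"]),
    "event": [1, 0, 1],
})

rows = []
for r in cohort.itertuples():
    if pd.isna(r.first_rx) or r.first_rx >= r.end:
        # Never exposed before end of follow-up: one unexposed interval.
        rows.append((r.id, r.entry, r.end, 0, r.event))
    elif r.first_rx <= r.entry:
        # Exposed from entry onward.
        rows.append((r.id, r.entry, r.end, 1, r.event))
    else:
        # Split: the interval from entry to first_rx is the immortal
        # period and is correctly coded as UNexposed event-free time,
        # then exposed person-time runs from first_rx to end.
        rows.append((r.id, r.entry, r.first_rx, 0, 0))
        rows.append((r.id, r.first_rx, r.end, 1, r.event))

long = pd.DataFrame(rows, columns=["id", "start", "stop", "exposed", "event"])
print(long)
```

Coding the pre-prescription interval as unexposed, rather than assigning exposure at entry, is exactly what prevents immortal person-time from inflating the exposed group's survival.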

Validation: Where possible, specify a target trial that the observational study aims to emulate, ensuring alignment of time-zero, eligibility criteria, and exposure definition to prevent self-inflicted injuries like immortal time bias [5].

The following diagram illustrates the core methodological flaw and the recommended corrective analysis for immortal time bias.

[Diagram: Naive analysis (biased): cohort entry (e.g., hospital discharge) → exposure assigned at entry based on a future prescription → immortal time misclassified as 'exposed' → spurious survival advantage. Time-dependent analysis (correct): cohort entry (time-zero) → classified as 'unexposed' during the immortal period → first prescription filled → classified as 'exposed' from that point forward → unbiased effect estimate.]

Figure 1: Analytical Approaches for Immortal Time Bias

Channeling Bias

Application Notes

Channeling bias is a specific form of selection bias prevalent in comparative drug studies. It occurs when a newly marketed drug is "channeled" toward specific patient subgroups, such as those for whom established treatments have failed, those with more severe disease, or those with specific comorbidities [30]. Conversely, the older drug may be used predominantly in a more stable, "healthier" population. This creates a systematic imbalance in prognostic factors between the treatment groups at baseline. If these factors are also associated with the outcome, the resulting comparison is confounded. For example, the new drug may appear to have a higher rate of adverse events simply because it is prescribed to sicker patients [30].

Experimental Protocol for Mitigation Using Propensity Scores

Objective: To balance measured baseline covariates between patients initiating a new drug versus a comparator drug, thereby reducing channeling bias.

Materials and Data Requirements:

  • A healthcare database with rich information on patient characteristics, comorbidities, concomitant medications, and healthcare utilization.
  • Statistical software capable of performing logistic regression and propensity score matching/weighting.

Procedure:

  • Cohort Definition: Define two cohorts of incident users: one for the new drug and one for the active comparator. The cohorts should be contemporary and have similar indications [35].
  • Covariate Assessment: Identify and measure all potential pre-treatment confounders (e.g., age, sex, disease severity markers, prior medications, comorbidities) in a baseline period prior to the first prescription.
  • Propensity Score (PS) Estimation:
    • Fit a logistic regression model where the dependent variable is treatment assignment (1=new drug, 0=comparator).
    • The independent variables are all measured baseline covariates.
    • The predicted probability from this model is the propensity score—the conditional probability of receiving the new drug given the covariates [30] [36].
  • PS Implementation: Choose one of the following methods:
    • Matching: Match each patient on the new drug to one or more patients on the comparator drug with a similar PS (using a caliper, e.g., 0.2 of the standard deviation of the PS logit) [36].
    • Weighting: Use inverse probability of treatment weighting (IPTW), where weights are created as 1/PS for the new drug group and 1/(1-PS) for the comparator group. This creates a "pseudo-population" where the distribution of covariates is balanced between groups [37] [36].
  • Balance Assessment: After applying the PS method, check the balance of all covariates between the treatment groups. The standardized mean difference for each covariate should be <0.1 to indicate adequate balance [36].
  • Outcome Analysis: Analyze the association between treatment and outcome in the matched or weighted sample, using a regression model that may further adjust for any residual imbalance.

Limitations: Propensity scores can only adjust for measured confounders. Residual confounding from unmeasured variables (e.g., disease severity not fully captured in the database) may persist [30].
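The PS estimation, IPTW, and balance-assessment steps of this protocol can be sketched in Python on simulated data. This is an illustrative sketch only: the covariates (`age`, `severity`), the channeling strengths, and all numeric values are assumptions, not values from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Simulated baseline covariates: age and a disease-severity marker.
age = rng.normal(60, 10, n)
severity = rng.normal(0, 1, n)

# Channeling: the new drug is preferentially prescribed to sicker patients.
logit = -0.5 + 0.03 * (age - 60) + 0.8 * severity
treat = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([age, severity])

# Step: estimate the propensity score with logistic regression.
ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]

# Step: IPTW weights — 1/PS for the new drug, 1/(1-PS) for the comparator.
w = np.where(treat, 1 / ps, 1 / (1 - ps))

def smd(x, t, weights):
    """Weighted standardized mean difference for one covariate."""
    m1 = np.average(x[t], weights=weights[t])
    m0 = np.average(x[~t], weights=weights[~t])
    v1 = np.average((x[t] - m1) ** 2, weights=weights[t])
    v0 = np.average((x[~t] - m0) ** 2, weights=weights[~t])
    return abs(m1 - m0) / np.sqrt((v1 + v0) / 2)

# Step: balance assessment — aim for SMD < 0.1 after weighting.
for name, x in [("age", age), ("severity", severity)]:
    print(f"{name}: SMD before={smd(x, treat, np.ones(n)):.3f}, "
          f"after={smd(x, treat, w):.3f}")
```

In practice the same loop would run over all measured baseline covariates, and any covariate remaining above the 0.1 threshold would prompt respecification of the PS model.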

Confounding by Indication

Application Notes

Confounding by indication is perhaps the most fundamental challenge in pharmacoepidemiology. It arises because drugs are prescribed for specific medical conditions, and those conditions are often strong predictors of the study outcome [29]. For instance, studying the effect of antidepressants on mortality is complicated by the fact that depression itself is associated with an increased risk of death. The "indication" for the drug confounds the relationship between the drug (exposure) and the outcome. This bias is inherent in the non-randomized nature of treatment decisions and must be addressed through careful study design and analysis.

Experimental Protocol for Mitigation with an Active Comparator New-User Design

Objective: To mitigate confounding by indication by comparing two active drugs used for the same condition, and to balance baseline risks using propensity scores.

Materials and Data Requirements:

  • The same data requirements as in Section 2.2.2.

Procedure:

  • Active Comparator Selection: Identify a clinically relevant alternative drug used for the same primary indication as the drug of interest. This helps ensure that the two groups are more comparable in their underlying disease state than if one were compared to non-users [35].
  • New-User Design: Restrict the cohort to incident users of either drug, ensuring patients are observed from the start of their treatment episode. This avoids biases associated with prevalent users, who are "survivors" of the early treatment period and may be healthier [35].
  • Propensity Score Application: Follow the same steps for PS estimation and implementation as outlined in the protocol for channeling bias (Section 2.2.2). The goal is to create a balanced comparison group that has a similar probability of receiving one drug over the other, based on all measured baseline characteristics.
  • Sensitivity Analysis: Conduct sensitivity analyses to quantify how strong an unmeasured confounder would need to be to explain away the observed effect [36].
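One concrete form of such a sensitivity analysis is the E-value of VanderWeele and Ding, computable directly from an observed risk ratio. This is a minimal sketch; the risk ratio of 1.5 is a hypothetical result, not a value from the text.

```python
import math

def e_value(rr: float) -> float:
    """Minimum strength of association (risk-ratio scale) that an unmeasured
    confounder would need with both treatment and outcome to fully explain
    away an observed risk ratio."""
    if rr < 1:
        rr = 1 / rr  # for protective effects, apply to the inverse
    return rr + math.sqrt(rr * (rr - 1))

print(round(e_value(1.5), 2))  # → 2.37
```

An E-value of 2.37 means an unmeasured confounder associated with both treatment choice and the outcome by risk ratios of at least 2.37 each would be needed to explain away an observed RR of 1.5.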

The following workflow integrates the protocols for addressing channeling bias and confounding by indication through the use of propensity score methods within an active comparator, new-user design.

Workflow: define the study question (Drug A vs. Drug B for Condition X); apply an active comparator new-user design; measure baseline covariates (e.g., demographics, comorbidities, healthcare utilization); estimate the propensity score (the probability of receiving Drug A vs. B given all covariates); apply a PS method (matching or IPTW); assess covariate balance (standardized mean difference < 0.1), returning to PS estimation if balance is not achieved; and analyze the outcome in the balanced sample.

Figure 2: Propensity Score Workflow for Channeling Bias and Confounding

The Scientist's Toolkit: Research Reagent Solutions

In the context of methodological research for mitigating bias, "research reagents" refer to the essential conceptual frameworks, study designs, and analytical techniques required to conduct a valid pharmacoepidemiologic study.

Table 2: Essential Methodological Toolkit for Bias Mitigation

Tool Function & Application Key Considerations
Active Comparator New-User Design Foundational design that reduces confounding by indication and selection bias by comparing two active drugs and starting follow-up at treatment initiation [35]. Ensures comparability of treatment groups from the beginning of therapy.
Time-Dependent Cox Model Statistical model used to correctly classify person-time and eliminate immortal time bias by treating drug exposure as a variable that changes over time [32] [33]. Requires careful data management to split patient follow-up into unexposed and exposed periods.
Propensity Score (PS) A single score (probability of treatment) summarizing all measured baseline covariates; used to balance confounders across treatment groups [37] [30] [36]. Balances only measured covariates. Model specification and balance assessment are critical.
High-Dimensional Propensity Score (hd-PS) An algorithm that automatically screens a large number of predefined covariates (e.g., diagnosis codes, procedure codes) in administrative data to supplement researcher-specified confounders [5]. Useful when rich clinical data are lacking; helps control for unmeasured confounding by proxy.
Standardized Difference A balance metric used to assess the effectiveness of PS methods. It is not influenced by sample size, unlike p-values [36]. A value <0.1 after PS adjustment indicates adequate balance for a covariate.
Target Trial Emulation A framework for designing observational studies by explicitly specifying the protocol of a hypothetical RCT that the study aims to emulate [5]. Helps avoid common biases (like immortal time) by forcing alignment of time-zero, eligibility, and treatment strategies.

From Theory to Practice: Implementing Propensity Score Analysis in Real-World Data

A Step-by-Step Guide to Propensity Score Estimation and Covariate Selection

In pharmacoepidemiological studies, researchers routinely use real-world data to assess the safety and effectiveness of pharmaceutical products. Unlike randomized controlled trials, observational studies are prone to confounding bias due to imbalanced baseline characteristics between treatment groups. Propensity score (PS) methods have emerged as a powerful statistical approach to address this challenge by creating pseudo-randomized conditions when analyzing observational data. A propensity score, defined as the probability of treatment assignment conditional on observed covariates, enables researchers to simulate the balancing properties of randomization.

The growing availability of large-scale healthcare databases has expanded opportunities for pharmacoepidemiological research, but it has also intensified methodological challenges related to confounding control and model specification. This guide provides a comprehensive framework for propensity score estimation and covariate selection, with particular emphasis on applications in drug development and safety research. We detail both established and emerging methodologies, including machine learning approaches and hybrid techniques that combine propensity scores with prognostic scores to enhance causal inference from non-randomized study designs.

Theoretical Foundations and Key Concepts

Defining the Propensity Score

The propensity score for subject i (i = 1, ..., N) is defined as the conditional probability of receiving the treatment given the observed covariates: e(X_i) = P(A_i = 1 | X_i), where A_i is the treatment indicator (1 for treatment, 0 for control) and X_i is the vector of observed pre-treatment covariates. Rosenbaum and Rubin demonstrated in 1983 that, under the strong ignorability assumption, treatment assignment and potential outcomes are independent conditional on the propensity score. This foundational property allows researchers to adjust for confounding by balancing the distribution of observed covariates across treatment groups based on the propensity score.

The strong ignorability assumption requires two conditions: first, that the treatment assignment is independent of potential outcomes given the covariates (unconfoundedness), and second, that every subject has a positive probability of receiving either treatment (positivity). In pharmacoepidemiological applications, these assumptions must be carefully considered in the context of the clinical question and available data. Violations of these assumptions, particularly unmeasured confounding, remain a fundamental limitation that propensity scores cannot fully address.

Covariate Selection Principles and Approaches

Causal diagrams (directed acyclic graphs) provide a theoretical framework for identifying an appropriate set of confounders for inclusion in the propensity score model. The goal is to include covariates that are associated with both treatment and outcome while avoiding instruments (variables affecting only treatment) and mediators (variables on the causal pathway between treatment and outcome). Including instrumental variables can increase variance without reducing bias, while including mediators can introduce overadjustment bias by blocking causal pathways.

For pharmacoepidemiological studies, researchers should prioritize covariates with known clinical relevance to the disease process and treatment decision-making. A structured approach to covariate selection might include:

  • Clinical knowledge-based selection: Using subject matter expertise to identify potential confounders
  • High-dimensional propensity score (hdPS): Data-adaptive algorithm for selecting covariates from large healthcare databases
  • Hybrid approaches: Combining clinical knowledge with automated variable selection techniques

Recent methodological research emphasizes that prioritizing covariates strongly associated with the outcome, rather than treatment, generally leads to better bias reduction in treatment effect estimates. This principle has motivated the development of prognostic score-based approaches to propensity score estimation and evaluation.

Methodological Approaches to Propensity Score Estimation

Traditional Statistical Methods

Logistic regression represents the most widely used approach for propensity score estimation. The model specification follows the standard logistic regression framework, with treatment status as the dependent variable and selected covariates as independent variables: logit(P(A_i = 1 | X_i)) = β_0 + β_1 X_1i + ... + β_p X_pi

The reference method in pharmacoepidemiology typically involves logistic regression with covariates selected based on clinical knowledge and prior literature. This approach performs well when the relationship between covariates and treatment assignment is linear and additive, and when all relevant confounders have been correctly identified and measured. However, model misspecification remains a concern, particularly when complex interactions or nonlinear relationships exist in the data.

Regularized regression methods, such as LASSO (Least Absolute Shrinkage and Selection Operator), introduce penalty terms to the likelihood function to handle high-dimensional covariate spaces: β̂ = argmin_β { -l(β | A, X) + λ Σ_j |β_j| }, where l(β | A, X) is the log-likelihood function and λ is the tuning parameter that controls the strength of the penalty. LASSO performs both variable selection and shrinkage, making it particularly useful when dealing with numerous potential confounders. Simulation studies have shown that LASSO performs well in linear settings with small sample sizes and common treatment prevalence [38].
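A hedged sketch of LASSO-penalized PS estimation, here using scikit-learn, where the penalty strength λ corresponds to 1/C. The simulated data and the choice C = 0.5 are illustrative assumptions, not recommendations from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 500, 30
X = rng.normal(size=(n, p))        # 30 candidate covariates
beta = np.zeros(p)
beta[:3] = [0.8, -0.6, 0.5]        # only 3 truly affect treatment
treat = rng.random(n) < 1 / (1 + np.exp(-(X @ beta)))

# L1-penalized logistic regression; liblinear supports the L1 penalty.
lasso_ps_model = LogisticRegression(
    penalty="l1", solver="liblinear", C=0.5  # C = 1/λ
).fit(X, treat)

ps = lasso_ps_model.predict_proba(X)[:, 1]
n_selected = int(np.sum(lasso_ps_model.coef_ != 0))
print(f"covariates retained: {n_selected} of {p}")
```

In a real analysis λ (i.e., C) would typically be chosen by cross-validation rather than fixed in advance.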

Machine Learning Approaches

Machine learning methods offer flexible alternatives to traditional logistic regression, particularly for capturing complex relationships in high-dimensional data. The following table summarizes key machine learning approaches for propensity score estimation:

Table 1: Machine Learning Methods for Propensity Score Estimation

Method Key Mechanism Advantages Limitations Performance Characteristics
LASSO L1 regularization with variable selection Automatic variable selection, handles correlated predictors Shrinks coefficients toward zero, may exclude weak predictors Best in linear settings with small samples and common treatment prevalence [38]
XgBoost Gradient boosted decision trees Captures complex nonlinearities and interactions, robust to outliers Computationally intensive, requires careful tuning Outperforms in nonlinear settings with large samples and low treatment prevalence [38] [39]
Multilayer Perceptron (MLP) Neural network with multiple hidden layers Models complex nonlinear relationships, handles high-dimensional data Requires extensive tuning, prone to overfitting without proper validation Performs similarly to other ML methods in complex data scenarios [38]

Model Averaging and Hybrid Approaches

Model averaging approaches integrate multiple propensity score estimates to improve robustness against model misspecification. The model-averaged propensity score is calculated as: ē(X_i) = Σ_m λ_m ê_m(X_i), where the λ_m are mixing parameters that sum to 1 and ê_m(X_i) represents the propensity score estimate from candidate model m.

Recent methodological developments have introduced prognostic score-based model averaging, which selects the optimal mixing parameters by minimizing between-group differences in prognostic scores (predicted outcomes under control) rather than focusing solely on covariate balance. This approach recognizes that imbalance in prognostic scores is more strongly associated with bias in treatment effect estimates than imbalance in individual covariates. Simulation studies demonstrate that this method consistently yields lower bias and less variability in treatment effect estimates across various scenarios [40].
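The model-averaging formula reduces to a simple convex combination of candidate scores. In this sketch the candidate propensity scores and the mixing parameters λ_m are arbitrary illustrative values; the cited method [40] selects λ_m by minimizing prognostic-score imbalance, which is not implemented here.

```python
import numpy as np

# Hypothetical PS estimates for three subjects from two candidate models.
ps_logistic = np.array([0.20, 0.55, 0.70])  # candidate model 1
ps_boosted  = np.array([0.30, 0.45, 0.80])  # candidate model 2

lam = np.array([0.3, 0.7])                  # mixing parameters, sum to 1
ps_avg = lam[0] * ps_logistic + lam[1] * ps_boosted
print(ps_avg)  # element-wise weighted average, still in (0, 1)
```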

Practical Implementation Protocol

Step-by-Step Estimation Procedure

The following workflow diagram illustrates the comprehensive process for propensity score estimation and application:

Workflow: define the research question; identify the data source and study population; specify treatment and outcome; select covariates; estimate propensity scores; assess balance (returning to covariate selection if balance is inadequate); estimate effects once balance is adequate; and conduct sensitivity analyses.

Diagram 1: Propensity Score Analysis Workflow

Step 1: Define the Causal Question and Target Estimand Clearly specify the treatment comparison, outcome, target population, and causal contrast of interest. Determine whether the target estimand is the average treatment effect in the overall population (ATE), treated population (ATT), or overlap population (ATO). This decision guides the selection of appropriate propensity score methods [41].

Step 2: Assemble the Study Cohort and Define Variables Identify the source population, eligibility criteria, and index dates. Precisely define treatment exposure, outcome, and potential confounders using structured healthcare data. Implement a covariate assessment period preceding treatment initiation to ensure proper temporal ordering.

Step 3: Select Covariates for Inclusion Incorporate covariates that are potential common causes of treatment and outcome. Use clinical knowledge, literature review, and data-driven approaches. Consider using high-dimensional propensity score (hdPS) algorithms when dealing with large-scale healthcare data with numerous potential covariates [42].

Step 4: Estimate Propensity Scores Select an appropriate estimation method based on sample size, treatment prevalence, and data complexity. For conventional analyses, use logistic regression with pre-specified covariates. For high-dimensional data or complex relationships, consider machine learning approaches like LASSO or XgBoost with proper hyperparameter tuning.

Step 5: Evaluate Covariate Balance Assess the success of propensity score estimation by comparing balance between treatment groups before and after adjustment. Use standardized mean differences (aim for <0.1) and variance ratios. Consider incorporating prognostic score balance as an additional metric [40].

Step 6: Estimate Treatment Effects Use the propensity scores to create balanced groups through matching, weighting, or stratification. Estimate the treatment effect in the balanced sample using an appropriate outcome model. For time-to-event outcomes, consider using propensity score adjustment within a nested case-control design to address both immortal time bias and confounding [42].

Step 7: Conduct Sensitivity Analyses Evaluate the robustness of findings to potential unmeasured confounding, model specifications, and methodological choices. Vary propensity score estimation approaches, covariate sets, and balance thresholds to assess consistency of results.

Advanced Considerations for Pharmacoepidemiology

Handling Rare Treatments and Outcomes When treatment prevalence is low (<10%), disease risk scores (DRS) may outperform propensity scores in reducing bias, particularly in nonlinear data settings [38]. DRS represents the probability of the outcome conditional on confounders in the untreated population. In scenarios with very rare treatments, consider using overlap weighting, which naturally focuses on the population with clinical equipoise and avoids extreme weights [41].
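The contrast between overlap weights and IPTW weights can be illustrated numerically; the propensity scores and treatment labels below are hypothetical.

```python
import numpy as np

ps = np.array([0.05, 0.50, 0.95])     # hypothetical propensity scores
treated = np.array([True, True, False])

# Overlap weights: 1 - PS for treated, PS for comparators — bounded in (0, 1),
# so subjects with extreme PS are down-weighted rather than inflated.
ow = np.where(treated, 1 - ps, ps)

# IPTW weights for comparison: can explode near PS = 0 or 1.
iptw = np.where(treated, 1 / ps, 1 / (1 - ps))

print("overlap:", ow)   # largest weight is 0.95
print("IPTW:   ", iptw) # weights reach 20 at PS = 0.05
```

This bounding is why overlap weighting naturally emphasizes the population with clinical equipoise (PS near 0.5).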

Addressing Time-Dependent Confounding and Immortal Time Bias For studies with time-varying exposures, consider implementing propensity score adjustment within a nested case-control framework to simultaneously address immortal time bias and confounding. This approach involves matching cases to controls based on time and then applying propensity score methods to address residual confounding [42].

High-Dimensional Propensity Score Implementation The hdPS algorithm systematically selects and prioritizes covariates from large healthcare databases through seven defined steps: (1) identify data dimensions, (2) identify candidate covariates, (3) assess recurrence, (4) assign priorities, (5) select covariates, (6) estimate propensity scores, and (7) evaluate balance. For time-to-event outcomes, LASSO prioritization may outperform the traditional Bross formula for covariate selection [42].

Performance Evaluation and Validation

Comparative Performance Across Scenarios

Simulation studies provide valuable insights into the relative performance of different propensity score estimation methods under various data conditions. The following table summarizes evidence-based recommendations based on comprehensive simulation studies:

Table 2: Method Selection Guide Based on Data Scenarios

Data Scenario Recommended PS Method Recommended Estimation Approach Evidence
Low treatment prevalence (<0.1) Disease Risk Score (DRS) XgBoost for nonlinear data DRS shows lower bias than PS when treatment prevalence is below 0.1, especially in nonlinear data [38]
Moderate-high treatment prevalence (0.1-0.5) Propensity Score Logistic regression or LASSO PS has comparable or lower bias than DRS in this range [38]
Linear data with small samples Propensity Score LASSO or logistic regression DRS does not outperform PS in linear or small sample data [38]
High-dimensional covariates Propensity Score LASSO or hdPS ML methods can outperform logistic regression for PS estimation [38] [42]
Focus on clinical equipoise population Overlap Weighting Logistic regression with overlap weights Overlap weighting targets ATO with bounded weights and exact balance [41]

Balance Metrics and Diagnostics

Comprehensive balance assessment should include both traditional covariate balance metrics and prognostic score balance. The absolute standardized mean difference (ASMD) for each covariate j is calculated as: ASMD_j = |X̄_j,treatment - X̄_j,control| / σ_j,treatment, where X̄_j,treatment and X̄_j,control are the sample means of covariate j in the treatment and control groups, and σ_j,treatment is the standard deviation in the treatment group. The mean ASMD across all covariates provides a summary measure of overall balance.

Prognostic score balance assessment involves calculating the ASMD for the predicted outcomes under control: ASMD_PS = |Ȳ̂_treatment - Ȳ̂_control| / σ_Ŷ,treatment, where Ȳ̂_treatment and Ȳ̂_control are the mean prognostic scores in the treatment and control groups and σ_Ŷ,treatment is the standard deviation of the prognostic score in the treatment group. Research has demonstrated that imbalance in prognostic scores is more strongly associated with bias in treatment effect estimates than imbalance in individual covariates [40].
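A direct transcription of the ASMD formula above, with made-up data. Note that the denominator is the treatment-group standard deviation, as defined in the text; many implementations use a pooled standard deviation instead.

```python
import numpy as np

def asmd(x_treat, x_ctrl):
    """Absolute standardized mean difference, standardized by the
    treatment-group SD (per the definition in the text)."""
    return abs(np.mean(x_treat) - np.mean(x_ctrl)) / np.std(x_treat, ddof=1)

# Hypothetical covariate (or prognostic score) values by group.
x_t = np.array([1.0, 2.0, 3.0, 4.0])
x_c = np.array([1.5, 2.5, 3.5, 4.5])
print(round(asmd(x_t, x_c), 3))
```

The same function applies unchanged to prognostic scores: pass each group's predicted outcomes under control in place of the raw covariate values.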

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Methodological Tools for Propensity Score Analysis

Tool Category Specific Methods/Software Application Context Implementation Considerations
Statistical Software R (cobalt, MatchIt, glmnet), Python (sklearn, causalinference), SAS (PROC PSMATCH) General implementation R provides the most comprehensive set of specialized packages for propensity score analysis and balance diagnostics
Machine Learning Libraries XgBoost, scikit-learn, mlr3 High-dimensional data, complex relationships Require careful hyperparameter tuning; default settings often suboptimal [39]
High-Dimensional PS Algorithms hdPS package in R Healthcare claims data with numerous potential covariates Implements 7-step algorithm for automated covariate selection and prioritization [42]
Balance Metrics Absolute Standardized Mean Difference (ASMD), Kolmogorov-Smirnov statistic, Prognostic score difference Evaluating covariate balance Prognostic score balance more directly related to bias in effect estimates [40]
Sensitivity Analysis Methods E-value, Rosenbaum bounds, Unmeasured confounder models Assessing robustness to unmeasured confounding E-value increasingly recommended for reporting in pharmacoepidemiological studies

Propensity score methods represent a powerful approach for addressing confounding in pharmacoepidemiological studies, but their successful implementation requires careful attention to model specification, covariate selection, and balance assessment. This guide has outlined a structured framework for propensity score estimation that incorporates both established practices and recent methodological innovations. As the field evolves, researchers should consider scenario-specific recommendations, particularly regarding the choice between propensity scores and disease risk scores based on treatment prevalence, and the integration of machine learning methods for complex data environments. Most importantly, methodological decisions should be guided by the causal question of interest, with transparent reporting of all analytical choices and their potential limitations.

In pharmacoepidemiological studies, researchers are often faced with the challenge of using observational data to estimate the effects of treatments, interventions, and exposures on patient outcomes. Unlike randomized controlled trials (RCTs)—considered the gold standard for estimating treatment effects—observational studies involve treatment assignments that are nonrandom processes, often influenced by patient characteristics [10]. This leads to systematic differences in baseline characteristics between treated and untreated subjects, a phenomenon known as confounding [10] [43]. Propensity score (PS) methods have emerged as a powerful set of statistical tools to achieve comparability between treatment groups in terms of their observed covariates, thereby controlling for confounding in the estimation of treatment effects [43].

The propensity score, defined as the probability of treatment assignment conditional on a subject's observed baseline covariates, serves as a balancing score [10]. Conditional on the propensity score, the distribution of observed baseline covariates is expected to be similar between treated and untreated subjects, allowing the estimation of treatment effects in a way that mimics some of the key characteristics of an RCT [10]. This article provides a detailed comparison of the four primary propensity score methods—matching, stratification, inverse probability of treatment weighting (IPTW), and covariate adjustment—within the context of pharmacoepidemiological research, offering application notes and experimental protocols for implementation.

Theoretical Framework and Key Concepts

The Potential Outcomes Model and Causal Inference

The theoretical foundation for propensity score methods rests on the potential outcomes framework (also known as the Rubin Causal Model) [10]. In this framework, each subject has a pair of potential outcomes: the outcome under the active treatment, Y(1), and the outcome under the control treatment, Y(0). However, only one of these outcomes is observed for each subject, depending on the actual treatment received. The individual treatment effect is defined as Y(1) - Y(0), but this cannot be directly calculated [10].

Instead, researchers focus on aggregate measures, primarily the Average Treatment Effect (ATE), defined as E[Y(1) - Y(0)] across the entire population, and the Average Treatment Effect on the Treated (ATT), defined as E[Y(1) - Y(0)|Z=1], which is the average effect for those who actually received the treatment [10]. In randomized controlled trials, these measures coincide due to random assignment, but in observational studies, they generally differ, and the researcher must decide which is more relevant to their research question [10].

The Propensity Score and Strong Ignorability

The propensity score, formally defined as e = P(Z=1|X), where Z is the treatment indicator and X is a vector of observed covariates, allows for unbiased estimation of treatment effects under the strong ignorability assumption [10]. This assumption requires that: (1) treatment assignment is independent of potential outcomes conditional on observed covariates, (Y(0), Y(1)) ⊥ Z | X (the "no unmeasured confounders" assumption), and (2) every subject has a nonzero probability of receiving either treatment (0 < P(Z=1|X) < 1) [10]. When these conditions are met, conditioning on the propensity score enables the estimation of unbiased average treatment effects.

Estimation of the Propensity Score

In practice, the true propensity score is unknown and must be estimated from the data, most commonly using logistic regression where treatment status is regressed on observed baseline characteristics [10]. The predicted probabilities from this model serve as the estimated propensity scores. While logistic regression remains the most frequently used method, alternative approaches include machine learning techniques such as bagging, boosting, recursive partitioning, random forests, and neural networks, which may offer advantages, particularly in high-dimensional settings [10] [4].

Table 1: Propensity Score Estimation Methods

Method Description Best Use Cases
Logistic Regression Traditional generalized linear model with logit link function Standard settings with limited covariates; requires manual specification of interactions and nonlinear terms
High-Dimensional Propensity Score (hdPS) Automated algorithm for covariate selection and model specification in databases with many potential covariates Pharmacoepidemiological studies using claims data with large number of diagnosis, procedure, and prescription codes
Machine Learning Methods (e.g., boosting, random forests) Data-adaptive algorithms that can capture complex relationships without manual specification High-dimensional data; complex confounding patterns; large sample sizes
Dimensionality Reduction Techniques (e.g., PCA, autoencoders) Projects high-dimensional covariate space into lower-dimensional representation Situations with extreme high-dimensionality where traditional methods may overfit

Recent research has explored dimensionality reduction techniques such as principal component analysis (PCA), logistic PCA, and autoencoders for propensity score estimation in high-dimensional data, such as healthcare claims databases [4]. These approaches have demonstrated superior performance in achieving covariate balance compared to traditional methods in some applications, with autoencoder-based propensity scores showing particular promise [4].

Comparative Analysis of Propensity Score Methods

Four primary propensity score methods are commonly used to remove the effects of confounding in observational studies: propensity score matching, stratification on the propensity score, inverse probability of treatment weighting (IPTW), and covariate adjustment using the propensity score [10]. Each method employs the propensity score differently to create conditions under which treatment effects can be validly estimated.

Performance Comparison Across Study Designs

Evaluations of these methods in real-world scenarios provide critical insights for method selection. A comparison using datasets from four large-scale cardiovascular observational studies found that the performance of these methods varies considerably depending on study characteristics [44].

Table 2: Comparative Performance of Propensity Score Methods Based on Cardiovascular Studies

Method Performance Characteristics Key Limitations Recommended Context
Matching Produced good balance; comparable estimates to covariate adjustment Tended to give less precise estimates in some cases; reduces sample size When ATT is of interest; sufficient overlap between treatment groups
Stratification Performed poorly with few outcome events Increased bias with limited events; suboptimal balance within strata Large studies with ample outcome events across all strata
Inverse Probability Weighting (IPTW) Gave imprecise estimates; undue influence to small number of observations Unstable with substantial confounding; extreme weights problematic When ATE is of interest; well-specified propensity score model
Covariate Adjustment Performed well across all examples; precise estimates Relies on correct model specification for both PS and outcome General purpose application; requires careful model checking

This comparative analysis suggests that propensity score methods are not necessarily superior to conventional covariate adjustment, and care should be taken to select the most suitable method for a given research context [44]. The performance depends on factors such as sample size, prevalence of the outcome, degree of confounding, and overlap between treatment groups.

Application Protocols for Pharmacoepidemiology

Standardized Workflow for Propensity Score Analysis

Implementing propensity score methods requires a systematic approach to ensure valid and reproducible results. The following workflow outlines the key steps in a comprehensive propensity score analysis:

Define Research Question and Causal Contrast → Covariate Selection and Measurement → Propensity Score Estimation → Covariate Balance Assessment → (if balance is poor, return to Propensity Score Estimation; if adequate balance is achieved) → PS Method Implementation (Matching, Stratification, IPTW, Covariate Adjustment) → Treatment Effect Estimation → Sensitivity Analysis

Figure 1. Standardized workflow for conducting a propensity score analysis in pharmacoepidemiological studies.

Protocol 1: Propensity Score Matching

Purpose: To create a matched sample where treated and untreated subjects have similar distributions of observed covariates by matching each treated subject to one or more untreated subjects with similar propensity scores.

Procedures:

  • Estimate propensity scores using an appropriate method (e.g., logistic regression including all pre-specified covariates)
  • Choose matching algorithm: Nearest-neighbor matching (with or without caliper), optimal matching, or genetic matching
  • Set matching parameters: Typically a caliper width of 0.2 standard deviations of the logit of the propensity score is recommended
  • Execute matching without replacement to maximize statistical precision or with replacement to improve balance
  • Assess matching performance by comparing covariate balance before and after matching using standardized mean differences (target <0.1) and variance ratios
  • Analyze matched sample using appropriate statistical methods (e.g., paired t-tests for continuous outcomes, conditional logistic regression for binary outcomes)

Considerations:

  • Most appropriate when the ATT is the causal contrast of interest
  • Matching on the logit of the propensity score often improves performance
  • Always report the number of treated subjects matched and the resulting effective sample size
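The matching steps above can be sketched as a minimal greedy implementation of caliper matching on the logit of the PS. The tie-breaking and processing order here are simplifications; production analyses would use dedicated tools such as MatchIt or PROC PSMATCH, and the input propensity scores are hypothetical.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def caliper_match(ps_treated, ps_control, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbor matching without replacement on the
    logit of the PS; caliper = caliper_sd * SD of the pooled logit PS."""
    lt, lc = logit(np.asarray(ps_treated)), logit(np.asarray(ps_control))
    caliper = caliper_sd * np.concatenate([lt, lc]).std()
    available = set(range(len(lc)))
    pairs = []
    for i in np.argsort(lt):                 # process treated in PS order
        if not available:
            break
        j = min(available, key=lambda j: abs(lt[i] - lc[j]))
        if abs(lt[i] - lc[j]) <= caliper:    # only match within the caliper
            pairs.append((int(i), int(j)))
            available.discard(j)
    return pairs

pairs = caliper_match([0.6, 0.7, 0.9], [0.58, 0.71, 0.2, 0.88])
print(pairs)  # [(0, 0), (1, 1), (2, 3)]
```

Note that the untreated subject with PS 0.2 is left unmatched: it lies outside every treated subject's caliper, which is exactly the behavior that restricts the matched sample to the region of overlap.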

Protocol 2: Stratification on the Propensity Score

Purpose: To divide the study population into strata (typically quintiles) based on the propensity score and then estimate treatment effects within each stratum before pooling.

Procedures:

  • Estimate propensity scores for all subjects
  • Create strata based on propensity score quantiles (typically 5 strata)
  • Assess balance within each stratum to ensure covariates are balanced between treatment groups
  • Check outcome events within each stratum to ensure adequate numbers for stable estimation
  • Estimate stratum-specific treatment effects using outcome models appropriate for the data type
  • Pool stratum-specific estimates to obtain an overall treatment effect estimate

Considerations:

  • Performance deteriorates with few outcome events per stratum [44]
  • May not completely eliminate confounding within strata if propensity score distributions differ substantially
  • Generally requires regression adjustment within strata to address residual confounding
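A minimal sketch of the stratification procedure follows, using toy numbers and pooling stratum-specific risk differences weighted by stratum size (other pooling schemes, such as Mantel-Haenszel weights, are also common; this is an illustrative choice, not the only one).

```python
import numpy as np

def stratified_effect(ps, treated, outcome, n_strata=5):
    """Stratify on PS quantiles and pool stratum-specific risk
    differences, weighting each stratum by its size."""
    edges = np.quantile(ps, np.linspace(0, 1, n_strata + 1))
    strata = np.digitize(ps, edges[1:-1])    # stratum index 0 .. n_strata-1
    effects, weights = [], []
    for s in range(n_strata):
        m = strata == s
        if treated[m].sum() == 0 or (~treated[m]).sum() == 0:
            continue                         # no within-stratum contrast
        rd = outcome[m][treated[m]].mean() - outcome[m][~treated[m]].mean()
        effects.append(rd)
        weights.append(m.sum())
    return np.average(effects, weights=weights)

# Toy example with two strata: risk difference 1.0 below the median PS,
# 0.5 above it, pooled with equal stratum sizes
ps = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
treated = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=bool)
outcome = np.array([0, 1, 0, 1, 0, 1, 1, 1], dtype=float)
effect = stratified_effect(ps, treated, outcome, n_strata=2)
print(effect)  # 0.75
```

The `continue` branch shows the failure mode noted above: strata with no treated (or no untreated) subjects contribute nothing, which is why performance deteriorates with few outcome events per stratum.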

Protocol 3: Inverse Probability of Treatment Weighting (IPTW)

Purpose: To create a pseudo-population in which treatment assignment is independent of observed covariates by weighting subjects by the inverse probability of receiving their actual treatment.

Procedures:

  • Estimate propensity scores for all subjects
  • Calculate weights: For treated subjects, weight = 1/PS; for untreated subjects, weight = 1/(1-PS)
  • Assess weight distribution: Identify extreme weights that may indicate poor overlap or model misspecification
  • Consider stabilized weights to improve efficiency: For treated subjects, stabilized weight = P(Z=1)/PS; for untreated subjects, stabilized weight = P(Z=0)/(1-PS)
  • Evaluate balance in the weighted sample using standardized mean differences
  • Analyze weighted data using appropriate weighted regression models

Considerations:

  • Can produce imprecise estimates and give undue influence to observations with extreme weights when substantial confounding is present [44]
  • Primarily estimates the ATE rather than ATT
  • Weight truncation may be necessary to handle extreme weights
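The weight calculations in this protocol can be collected into one short helper; the function below is a sketch with illustrative inputs, covering the unstabilized, stabilized, and truncated variants described above.

```python
import numpy as np

def iptw_weights(ps, treated, stabilized=True, truncate_pct=None):
    """ATE weights: 1/PS (treated), 1/(1-PS) (untreated); optional
    stabilization by the marginal treatment prevalence and symmetric
    percentile truncation to handle extreme weights."""
    ps, treated = np.asarray(ps, float), np.asarray(treated, bool)
    w = np.where(treated, 1 / ps, 1 / (1 - ps))
    if stabilized:
        p = treated.mean()
        w = w * np.where(treated, p, 1 - p)
    if truncate_pct is not None:
        lo, hi = np.percentile(w, [truncate_pct, 100 - truncate_pct])
        w = np.clip(w, lo, hi)
    return w

print(iptw_weights([0.8, 0.5, 0.2], [1, 1, 0], stabilized=False).tolist())
# [1.25, 2.0, 1.25]
```

A treated subject with a PS near 0 (or an untreated subject with a PS near 1) would receive an enormous weight, which is the instability the truncation option is meant to contain.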

Protocol 4: Covariate Adjustment Using the Propensity Score

Purpose: To control for confounding by including the propensity score as a covariate in the outcome regression model.

Procedures:

  • Estimate propensity scores for all subjects
  • Specify outcome model that includes both treatment assignment and the propensity score as covariates
  • Consider functional form of the propensity score in the model (linear, quadratic, restricted cubic splines)
  • Check model assumptions for the outcome regression (linearity, homoscedasticity, etc.)
  • Estimate treatment effect from the outcome model

Considerations:

  • Performed well across various scenarios in comparative studies [44]
  • Requires correct specification of both the propensity score model and the outcome model
  • Simpler implementation than other methods but provides less transparent balancing of covariates
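To make Protocol 4 concrete, the sketch below simulates data in which the outcome is linear in the logit of the true PS, so entering logit(PS) linearly in the outcome model happens to be correctly specified. All numbers are illustrative; with real data the functional form (splines, polynomials) must be checked, as the procedure notes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)                        # a single confounder
ps = 1 / (1 + np.exp(-x))                     # true propensity score
z = rng.binomial(1, ps)                       # treatment assignment
y = 1.0 * z + 2.0 * x + rng.normal(size=n)    # true treatment effect = 1.0

# Outcome model: Y ~ treatment + logit(PS); the logit enters linearly,
# which is exactly correct for this data-generating process
logit_ps = np.log(ps / (1 - ps))
X = np.column_stack([np.ones(n), z, logit_ps])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[1])   # close to the true effect of 1.0
```

Replacing `logit_ps` with the untransformed `ps` would misspecify the outcome model here, illustrating why the functional form of the propensity score deserves explicit checking.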

Advanced Applications in Pharmacoepidemiology

High-Dimensional Propensity Scores in Claims Data

Pharmacoepidemiological studies frequently utilize administrative claims data, which contain hundreds of potential covariates in the form of diagnosis codes, procedure codes, and prescription records. The high-dimensional propensity score (hdPS) algorithm was developed specifically to address the challenges of such data environments [4] [6]. This algorithm automates the process of covariate selection and prioritization by identifying and selecting the most prevalent and imbalanced codes across a large number of candidate covariates.

A recent application in a study of disease-modifying drugs in multiple sclerosis implemented hdPS within a nested case-control framework to simultaneously address both immortal time bias and residual confounding [6]. This approach demonstrated a 28% reduction in mortality risk associated with exposure to DMDs (HR: 0.72, 95% CI: 0.62-0.84), with consistent results across sensitivity analyses [6].

Dimensionality Reduction Techniques

Emerging methodologies for handling high-dimensional covariates include dimensionality reduction techniques such as principal component analysis (PCA), logistic PCA, and autoencoders [4]. In a comparative study evaluating the association between dialysis and mortality in older heart failure patients, autoencoder-based propensity scores achieved superior covariate balance compared to traditional methods:

Table 3: Performance of Dimensionality Reduction Techniques for PS Estimation

Method Covariates with SMD > 0.1 Balance Performance
Autoencoder-based PS 8 Best
PCA-based PS 20 Good
Logistic PCA-based PS 25 Moderate
High-Dimensional PS (hdPS) 37 Fair
Investigator-Specified PS 83 Poor

These advanced methods may offer improved covariate balance in pharmacoepidemiological studies using propensity score-matched designs in large healthcare databases [4].

The Researcher's Toolkit: Essential Components for PS Analysis

Software and Computational Tools

Implementing propensity score analyses requires appropriate statistical software and packages. While many software environments support basic propensity score methods, specialized packages offer enhanced functionality:

  • R: The MatchIt, optmatch, WeightIt, and CBPS packages provide comprehensive implementations of various propensity score methods
  • SAS: PROC PSMATCH and related procedures offer robust propensity score analysis capabilities
  • Stata: teffects psmatch, psmatch2, and pscore commands facilitate propensity score estimation and application
  • Python: The causalinference, psmpy, and dowhy libraries offer growing support for propensity score methods

Balance Assessment and Diagnostics

A critical step in any propensity score analysis is assessing whether the propensity score model has been adequately specified to achieve balance in observed covariates between treatment groups [10]. Key diagnostics include:

  • Standardized mean differences (SMD): Absolute values <0.1 indicate adequate balance for a covariate
  • Variance ratios: Values between 0.5 and 2 generally indicate acceptable balance
  • Visual assessments: Love plots, jitter plots, and distributional comparisons
  • Statistical tests: Kolmogorov-Smirnov tests for distributional differences

Recent methodological developments emphasize the superiority of balance diagnostics over traditional goodness-of-fit measures for propensity model selection [10]. The goal is not to maximize prediction of treatment assignment but to achieve balance in covariate distributions between treatment groups.
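The two numeric diagnostics above are straightforward to compute; the sketch below uses the pooled-SD convention for the standardized mean difference, which is one common choice among several in the literature.

```python
import numpy as np

def smd(x_t, x_c):
    """Standardized mean difference with the pooled-SD denominator;
    |SMD| < 0.1 is the usual threshold for adequate balance."""
    x_t, x_c = np.asarray(x_t, float), np.asarray(x_c, float)
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2)
    return (x_t.mean() - x_c.mean()) / pooled_sd

def variance_ratio(x_t, x_c):
    """Ratio of sample variances; values between 0.5 and 2 are acceptable."""
    return np.var(x_t, ddof=1) / np.var(x_c, ddof=1)

print(round(smd([1, 2, 3, 4], [2, 3, 4, 5]), 2))  # -0.77
```

In practice these functions would be applied to every covariate before and after matching or weighting, with the paired results plotted as a Love plot.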

Propensity score methods offer powerful approaches for controlling confounding in pharmacoepidemiological studies, each with distinct strengths, limitations, and appropriate applications. Based on current evidence, no single method dominates across all scenarios, and method selection should be guided by study objectives, data characteristics, and the specific causal contrast of interest [44]. Covariate adjustment using the propensity score and matching generally perform well across diverse scenarios, while stratification and IPTW require more specific conditions for optimal performance [44].

Emerging methodologies, particularly those addressing high-dimensional confounding in claims data, continue to enhance the applicability of propensity score methods in pharmacoepidemiology [4] [6]. Regardless of the specific method chosen, rigorous implementation following established protocols—including thoughtful covariate selection, careful balance assessment, and comprehensive sensitivity analyses—remains essential for producing valid and reliable evidence from observational pharmacoepidemiological research [43].

In pharmacoepidemiology, accurate estimation of treatment effects using real-world data is fundamentally challenged by confounding bias. The high-dimensional propensity score (hdPS) addresses this by automating the selection and adjustment for hundreds of candidate covariates from healthcare databases such as administrative claims and electronic health records (EHRs) [45]. This method empirically identifies and prioritizes proxy variables for unmeasured or poorly measured confounders, moving beyond the limitations of investigator-specified covariate sets alone [45] [46]. The growing complexity and volume of real-world data have spurred the integration of machine learning (ML) techniques with the hdPS framework, enhancing its ability to manage high-dimensional data and uncover complex relationships [46]. This document details advanced protocols for implementing hdPS and machine learning for covariate selection, providing researchers and drug development professionals with practical tools for robust comparative effectiveness and safety research.

Background and Theoretical Foundations

The traditional propensity score, defined as the probability of treatment assignment conditional on observed covariates, relies on investigator-specified variables chosen from domain knowledge [46]. This approach may omit crucial confounders, particularly those that are unmeasured or imperfectly captured. The hdPS algorithm, introduced in 2009, builds upon this by conceptualizing information in healthcare databases as proxies for underlying clinical constructs [45] [47]. It is a semi-automated, data-driven procedure that systematically generates and ranks a large number of covariates from predefined data dimensions (e.g., diagnoses, drug prescriptions, procedures) [45] [48].

A key conceptual advantage of hdPS is its use of proxy measure adjustment. For confounders like frailty that are difficult to measure directly, hdPS can leverage proxies such as the use of a wheelchair or oxygen canisters, which are correlated with the underlying confounder [45]. By automatically identifying and incorporating such proxies, hdPS aims to improve confounding control beyond what is possible with traditional methods [45] [47]. The method is data source-independent and has been successfully applied across diverse healthcare systems, including those in North America, Europe, and Japan [45].

Comparative Performance of hdPS and Machine Learning Methods

Recent research has systematically evaluated the performance of traditional hdPS against various multivariate statistical and machine learning methods within the hdPS framework. Performance varies based on epidemiological scenarios, such as the prevalence of exposure and outcome, and the choice of metric (e.g., bias, Mean Squared Error (MSE), or coverage) [46].

The table below summarizes the comparative performance of different methods based on a plasmode simulation study using real-world data structure [46].

Table 1: Performance Comparison of Variable Selection Methods within the hdPS Framework

Method Key Strengths Key Limitations Ideal Use Case
Bross-based hdPS Low bias, balanced approach, well-established [46]. May miss complex variable interactions [46]. Standard applications prioritizing bias minimization.
Hybrid hdPS Balanced bias and MSE, combines strengths of different approaches [46]. Coverage can vary by scenario [46]. Scenarios seeking a robust balance between bias and precision.
XGBoost Strong precision, good coverage, handles complex patterns [46]. Higher bias, especially with rare exposures; "black-box" [46]. Applications where precision is the primary goal.
LASSO / Elastic Net Effective in high-dimensional settings, automatic variable selection [46]. Performance can be outperformed by other ML methods [46]. High-dimensional data with many correlated covariates.
Genetic Algorithm (GA) Automates model selection [46]. Consistently high bias and MSE; least reliable [46]. Not generally recommended based on current evidence.
Forward/Backward Selection Low bias, comparable coverage to sophisticated ML, computationally efficient [46]. May not capture all complex relationships [46]. Computationally efficient alternative with good bias control.

The findings indicate no single method dominates all others. The choice depends on study priorities: XGBoost is effective for precision, while Bross-based hdPS and traditional forward/backward selection are better for minimizing bias [46]. Simpler methods often provide a viable, computationally efficient alternative to complex ML models [46].

Detailed Experimental Protocols

Protocol 1: Implementing the Standard hdPS Algorithm

This protocol outlines the five core steps of the standard hdPS implementation, which requires careful pre-specification of parameters in the study protocol and statistical analysis plan [45] [47].

Table 2: Key Decisions for Standard hdPS Implementation

Step Decision Point Options & Recommendations
1. Specify Data Dimensions Identify types of patient data for variable generation. Typical dimensions: diagnoses, procedures, drug prescriptions. Report each dimension and its clinical aspect [45] [47].
2. Generate Pre-exposure Features Define code granularity and apply prevalence filter. Truncate codes (e.g., 3-digit ICD-10). Select top 200 most prevalent codes per dimension. Justify granularity and filter use [47].
3. Assess Feature Recurrence Create binary indicators for frequency. Standard: indicators for ≥once, ≥median, ≥75th percentile. Can consider proximity to exposure start. Report chosen cut-offs [47].
4. Prioritize Covariates Select ranking method for variable selection. Default: Bross formula (exposure- and outcome-associated). For rare outcomes: exposure-based ranking. Report method used [46] [47].
5. Select Covariates & Estimate PS Determine number of hdPS variables and PS model. Typical: top 200-500 hdPS variables. Combine with investigator-specified confounders in logistic regression. Report final variable count and software [45] [48] [47].
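The covariate prioritization in Step 4 can be sketched numerically. The function below implements the Bross bias multiplier in the direction-agnostic form used by common hdPS implementations (relative risks below 1 are inverted); the candidate codes and their prevalences are hypothetical.

```python
import numpy as np

def bross_bias_multiplier(pc1, pc0, rr_cd):
    """Bross bias multiplier used by hdPS to rank candidate covariates:
    pc1/pc0 = covariate prevalence among exposed/unexposed,
    rr_cd = covariate-outcome relative risk (inverted if < 1)."""
    rr = max(rr_cd, 1 / rr_cd)
    return (pc1 * (rr - 1) + 1) / (pc0 * (rr - 1) + 1)

# Hypothetical candidate codes: (prevalence exposed, unexposed, outcome RR)
candidates = {"code_A": (0.30, 0.10, 2.0), "code_B": (0.20, 0.18, 1.1)}

# Rank by the absolute log bias multiplier, as in Step 4
scores = {k: abs(np.log(bross_bias_multiplier(*v))) for k, v in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # code_A: more imbalanced and more outcome-associated
```

The top-ranked codes (typically the top 200-500) would then be combined with the investigator-specified confounders in the final PS model of Step 5.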

Protocol 2: Integrating Machine Learning for Covariate Prioritization

This protocol modifies Step 4 of the standard hdPS, replacing the Bross formula with an ML-based prioritization to capture complex, multivariate relationships [46].

Procedure:

  • Input Data Preparation: Follow Steps 1-3 of Protocol 1 to generate the pool of candidate binary covariates. Combine this pool with the investigator-specified covariates.
  • Model Training for Prioritization: Instead of univariate Bross scores, use a machine learning algorithm to predict the exposure (treatment). Algorithms such as LASSO, Elastic Net, or XGBoost are suitable as they perform built-in variable selection or importance ranking [46].
  • Variable Ranking: Rank all covariates based on the variable importance metric from the trained ML model (e.g., coefficient magnitude in LASSO/Elastic Net, or gain in XGBoost).
  • Covariate Selection: Select the top k covariates from the ML-based ranking. The value of k can be determined via cross-validation aimed at optimizing the model's predictive performance or set to a predefined number (e.g., 500) for consistency [46].
  • Propensity Score Estimation: Use the selected covariates in a logistic regression model to estimate the propensity score. Alternatively, the predicted probabilities from the ML model itself can be used as the propensity score, though this may complicate variance estimation [46].
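The ML-based ranking step can be sketched with a bare-bones L1-penalized logistic regression fitted by proximal gradient descent; this loop is a stand-in for glmnet's LASSO, and the simulated data, penalty, and step size are all illustrative assumptions.

```python
import numpy as np

def lasso_logistic_rank(X, z, lam=0.05, lr=0.1, n_iter=500):
    """Rank covariates by |coefficient| from an L1-penalised logistic
    regression of exposure z on X, fitted with a basic proximal-gradient
    (ISTA) loop -- a stand-in for glmnet's LASSO."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        pred = 1 / (1 + np.exp(-(X @ beta)))
        beta = beta - lr * (X.T @ (pred - z) / n)          # gradient step
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0)  # soft-threshold
    return np.argsort(-np.abs(beta)), beta

# Simulated claims-like setting: 10 candidate covariates, only the
# first two actually drive exposure
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))
z = rng.binomial(1, 1 / (1 + np.exp(-(1.5 * X[:, 0] - 1.5 * X[:, 1]))))
order, beta = lasso_logistic_rank(X, z)
print(sorted(int(i) for i in order[:2]))  # [0, 1]
```

Taking the top k entries of `order` corresponds to the covariate selection step; an XGBoost-based ranking would substitute gain-based feature importance for the coefficient magnitudes used here.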

Protocol 3: A Hybrid hdPS-ML Workflow

A robust approach combines the data-driven covariate generation of hdPS with the multivariate selection capabilities of ML, followed by a final PS estimation using traditional regression.

Procedure:

  • Initial Covariate Screening: Execute Steps 1-4 of the standard hdPS protocol (Protocol 1) to generate a prioritized list of, for example, 1000 candidate covariates using the Bross formula.
  • Refined ML Selection: Input the pre-selected 1000 covariates into a machine learning algorithm with variable selection properties, such as LASSO or Elastic Net. Use the outcome variable (not the exposure) as the response in this ML model to identify a subset of covariates most predictive of the outcome from the hdPS-prescreened set [46].
  • Final Model Specification: Create the final set for PS estimation by combining:
    • All investigator-specified covariates (deemed essential based on subject-matter knowledge).
    • The subset of hdPS-derived covariates selected by the ML model in the previous step.
  • Estimate Propensity Score: Estimate the propensity score using a logistic regression model with the final covariate set from Step 3. This "doubly refined" approach leverages the strengths of both hdPS and ML while maintaining the interpretability of a parametric PS model [46].

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Software and Analytical Tools for hdPS and ML Implementation

Tool Name Category Function & Application
R hdPS Package [47] Software Package Implements the core hdPS algorithm in R for automated covariate identification and prioritization from data dimensions.
SAS hdPS Macro [47] Software Package Provides a SAS macro for implementing the hdPS procedure within the SAS analytics environment.
Aetion Platform Software Platform A commercial platform that incorporates hdPS capabilities for rapid-cycle analytics on real-world evidence.
XGBoost [46] ML Library A gradient boosting framework that provides high precision in variable selection and can be integrated for hdPS covariate prioritization.
glmnet [46] ML Library A software library for fitting LASSO and Elastic Net models, useful for multivariate variable selection within the hdPS covariate pool.

Workflow Visualization and Diagnostics

hdPS and Machine Learning Integration Workflow

The following diagram illustrates the core steps and decision points in a generalized hdPS workflow, highlighting where machine learning can be integrated.

Start: Define Study & Data Dimensions → 1. Generate & Filter Pre-exposure Features → 2. Create Recurrence Indicators → 3. Prioritize Covariates (ML integration point: a multivariate ML model such as XGBoost may replace the default univariate ranking) → Decision point: choose the number of covariates (k) to select → 4. Select Top k Covariates → 5. Estimate Propensity Score → Treatment Effect Estimation

Essential Diagnostic and Reporting Checklist

Transparent reporting and rigorous diagnostics are critical for validating hdPS analyses and combating "black-box" criticisms [47]. The following checklist synthesizes key reporting items and diagnostic tools.

Table 4: hdPS Reporting Checklist and Diagnostic Tools

Reporting Item Description Diagnostic Tool / Action
Data Dimensions Clearly list all data dimensions used. Report the aspect of care each dimension captures and coding systems used [47].
Covariate Prioritization Specify the method used for ranking. State whether Bross, ML, or other method was used and justify the choice [47].
Covariate Count Report the number of hdPS variables selected. Justify the chosen number (e.g., 500) and conduct sensitivity by varying this number (e.g., k=200, 500) [48] [47].
Balance Diagnostics Assess the reduction in confounding. Create a "Table 1" to compare baseline characteristics before and after PS adjustment. Report standardized mean differences [45] [47].
Software Document the software used. Name the specific software package (R, SAS, Aetion) and version [47].
Sensitivity Analyses Evaluate robustness of findings. Vary key parameters (e.g., number of hdPS covariates) and assess impact on treatment effect estimate [47].

Pharmacoepidemiologic studies investigating the effects of time-dependent drug exposures on rare outcomes face significant methodological challenges. Two predominant issues are immortal time bias and residual confounding, which can severely distort effect estimates if not properly addressed [6]. The nested case-control (NCC) design offers an efficient framework for studying rare events within established cohorts, while propensity score (PS) methods provide powerful approaches to control for confounding. However, traditional PS methods treat exposure as a binary, time-fixed variable, which is often misaligned with real-world clinical practice where treatments are initiated and modified at different times during patient follow-up [49].

The integration of time-dependent PS methods within NCC designs represents a significant methodological advancement for addressing complex exposure scenarios in pharmacoepidemiology. The hybrid "NCC-hdPS" approach (incorporating high-dimensional propensity scores) has recently demonstrated utility in simultaneously dealing with both immortal time bias and residual confounding, substantially improving the validity of causal effect estimates from observational data [6]. These developments are particularly relevant within modern pharmacoepidemiologic frameworks such as the ICH E9(R1) estimand framework, which emphasizes precise definition of treatment effects despite intercurrent events [50].

Methodological Foundations

The Nested Case-Control Design

The nested case-control study incorporates the strengths of both cohort and case-control designs by embedding case-control methodology within an established prospective cohort [51] [52]. In this design, all cases of the outcome event are identified from the cohort, and for each case, a small number of controls are randomly selected from those cohort members who remain at risk at the time of the case's event (the risk set) [52]. This approach has several key advantages over standard case-control designs:

  • Same population source: Cases and controls are sampled from the same underlying population, minimizing selection bias [52]
  • Temporal clarity: Exposure status is ascertained prior to the outcome event, preserving causal interpretation [52]
  • Efficiency: Particularly valuable when exposure assessment is costly or labor-intensive [51]
  • Time-matching: Cases can be matched to controls at the time of the outcome event, addressing time-dependent exposures and confounding [52]

A recent simulation study comparing cohort and NCC designs for time-varying exposures found that once tied event times were correctly accounted for using exact methods, NCC estimates were very similar to those from full cohort analysis, supporting the validity of this approach [53].

Time-Dependent Propensity Scores

Conventional PS methods generate a single probability of treatment assignment for each individual at study entry, ignoring the time-varying nature of many treatments [49]. Time-dependent propensity score methods address this limitation by modeling the probability of treatment initiation at each point in time during follow-up, considering the evolving clinical characteristics of patients [49] [54].

Two primary approaches have been developed for time-dependent PS estimation:

  • Cox-based PS: Treatment initiation is modeled as a time-to-event process using Cox proportional hazards regression, where the probability of receiving treatment at time t is estimated conditional on not having been treated before time t [49] [54]

  • Logistic regression with time strata: Treatment status is modeled using logistic regression within specific time windows, creating a piecewise approach to time-dependent confounding [49]

Simulation studies have demonstrated that conventional PS methods ignoring time-to-exposure property introduce significant bias, while time-dependent PS matching can achieve results approaching the true treatment effect [49]. After time-dependent PS matching, the matched cohort can be analyzed with conventional Cox regression or conditional logistic regression models with time strata, performing comparably to correctly specified Cox regression models with time-varying covariates [49].
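The sequential matching logic described above can be sketched as follows. The `ps_fn` argument stands in for predictions from a time-dependent exposure model (e.g., a Cox model), and the subjects, times, and scores are entirely hypothetical; the `TDPSM()` R function referenced later implements a more complete version of this idea.

```python
import numpy as np

def sequential_ps_match(init_time, follow_up, ps_fn):
    """Time-dependent PS matching sketch: process treatment initiators in
    order of start time; match each to the closest-PS subject who is still
    untreated and under follow-up at that time (without replacement).
    init_time is np.inf for never-treated subjects."""
    n, used, pairs = len(init_time), set(), []
    initiators = np.flatnonzero(np.isfinite(init_time))
    for i in sorted(initiators, key=lambda i: init_time[i]):
        t = init_time[i]
        cands = [j for j in range(n)
                 if j != i and j not in used
                 and init_time[j] > t and follow_up[j] >= t]
        if cands:
            j = min(cands, key=lambda j: abs(ps_fn(i, t) - ps_fn(j, t)))
            pairs.append((int(i), j))
            used.add(j)
    return pairs

# Hypothetical data: subjects 0 and 2 initiate at t=2 and t=5;
# subject 3 leaves follow-up at t=3 and is unavailable at t=5
init_time = np.array([2.0, np.inf, 5.0, np.inf])
follow_up = np.array([10.0, 10.0, 10.0, 3.0])
ps = lambda i, t: [0.60, 0.56, 0.65, 0.90][i]   # toy, time-constant PS
print(sequential_ps_match(init_time, follow_up, ps))  # [(0, 1)]
```

Because candidates must have `init_time[j] > t`, a future initiator can serve as a control before starting treatment, which is precisely how this design avoids immortal time bias.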

Integrated Workflow: NCC with Time-Dependent hdPS

The integration of time-dependent high-dimensional propensity score (hdPS) within a nested case-control framework provides a robust approach to address both immortal time bias and residual confounding simultaneously. The workflow can be visualized as follows:

Original Cohort (N = 19,360) → Define Risk Sets at each event time → Case Definition (event occurrence) and Control Selection from the risk set (1:4) → Time-Dependent PS Estimation via Cox model → PS Matching within risk sets → Conditional Logistic Regression → Hazard Ratio: 0.72 (0.62-0.84)

Figure 1: Integrated workflow for nested case-control design with time-dependent propensity score

Protocol for Implementing the Combined Design

Cohort Definition and Risk Set Specification

  • Base cohort: Define a retrospective cohort from appropriate data sources (e.g., administrative claims, electronic health records, disease registries) with clearly specified inclusion/exclusion criteria [6] [50]
  • Time axis: Establish the time scale for analysis (e.g., time since diagnosis, calendar time, age), with time zero appropriately defined based on the research question [52]
  • Risk sets: At each distinct event time, define the risk set as all cohort members who are still under follow-up and have not experienced the event prior to that time [52]

Case and Control Selection

  • Case definition: Identify all individuals who experience the outcome event during follow-up [52]
  • Control selection: For each case, randomly select controls from the corresponding risk set (typically 3-5 controls per case) using sampling without replacement [6] [52]
  • Matching factors: Consider matching on potential confounders (e.g., age, sex, disease severity) and time factors (calendar time, disease duration) to improve efficiency [52]

Time-Dependent hdPS Estimation

  • Variable selection: Identify candidate covariates for PS estimation, including baseline characteristics, time-fixed confounders, and time-varying covariates [6]
  • Exposure modeling: Model the time to exposure initiation using Cox proportional hazards regression, including all identified covariates as predictors [49] [54]
  • PS estimation: For each individual in the risk set, estimate the probability of being exposed at the event time of the case, conditional on not being exposed before that time [49]

Analysis of Matched Sets

  • Conditional logistic regression: Analyze the matched case-control sets using conditional logistic regression, conditioning on the matching strata [6] [53]
  • Effect estimation: Report hazard ratios with 95% confidence intervals to quantify the association between exposure and outcome [6]
  • Sensitivity analyses: Assess robustness of findings using different PS estimation techniques, control-to-case ratios, and methods for handling tied event times [6] [53]
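The risk-set sampling at the heart of the control-selection step can be sketched in a few lines; the event times and case indicators below are hypothetical, and real implementations would also handle matching factors and tied event times.

```python
import numpy as np

def sample_risk_sets(event_time, is_case, m=4, seed=0):
    """For each case, sample up to m controls at random from its risk set:
    subjects still under follow-up at the case's event time (event or
    censoring time >= t), excluding the case itself."""
    rng = np.random.default_rng(seed)
    matched = {}
    for i in np.flatnonzero(is_case):
        t = event_time[i]
        at_risk = (event_time >= t) & (np.arange(len(event_time)) != i)
        risk_set = np.flatnonzero(at_risk)
        if len(risk_set) >= 1:
            k = min(m, len(risk_set))
            matched[int(i)] = rng.choice(risk_set, size=k, replace=False).tolist()
    return matched

event_time = np.array([5.0, 3.0, 8.0, 10.0, 2.0])
is_case = np.array([False, True, False, False, False])
matched = sample_risk_sets(event_time, is_case, m=4)
print(sorted(matched[1]))  # [0, 2, 3]: everyone still at risk at t=3
```

Subject 4, whose follow-up ended before the case's event time, is correctly excluded from the risk set; the matched sets produced this way feed directly into the conditional logistic regression of the analysis step.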

Application in Multiple Sclerosis Research

Case Study: DMDs and Mortality in MS

A recent study applied the integrated NCC-hdPS approach to examine the relationship between disease-modifying drugs (DMDs) and all-cause mortality in multiple sclerosis patients, demonstrating the utility of this methodology [6].

Table 1: Study characteristics and results from MS mortality study using NCC-hdPS design

Study Component Details
Data Source Retrospective cohort of 19,360 individuals with MS in British Columbia, Canada
Exposure Disease-modifying drugs (DMDs) for MS
Outcome All-cause mortality
NCC Components 3,209 cases matched to 12,293 controls (1:4 ratio)
hdPS Application High-dimensional propensity score to address residual confounding
Primary Result 28% reduction in mortality risk (HR: 0.72, 95% CI: 0.62-0.84)
Sensitivity Analyses Consistent results across different PS techniques (HR range: 0.70-0.77)

The implementation successfully addressed both immortal time bias (through the NCC framework) and residual confounding (through hdPS), providing a more valid estimate of the treatment effect than conventional approaches [6].

Simulation Evidence

Simulation studies have quantitatively compared the performance of cohort and NCC designs for estimating time-varying exposure effects:

Table 2: Performance characteristics of cohort versus nested case-control designs for time-varying exposures

| Performance Metric | Cohort Design | Nested Case-Control Design |
| --- | --- | --- |
| Relative Bias | Small | Bias toward null (decreases with more controls) |
| Precision | Greater | Moderate loss of precision |
| Impact of Event Proportion | Minimal | Marked increase in bias with higher event rates |
| Handling of Tied Events | Robust with exact methods | Bias with Breslow's/Efron's methods; reduced with exact method |
| Confounder Control | Multivariable adjustment | Matching on confounders reduces bias |

These simulations confirm that NCC estimates closely approximate those from a full cohort analysis once tied event times are correctly accounted for, supporting the validity of the NCC design for time-varying exposures [53].

The Scientist's Toolkit

Table 3: Essential methodological tools for implementing NCC with time-dependent PS

| Tool/Resource | Function | Implementation Notes |
| --- | --- | --- |
| TDPSM() Function | Performs time-dependent PS matching [49] | Available R function; iteratively matches treated subjects to at-risk controls across time strata |
| hdPS Algorithm | Automates covariate selection and PS estimation [6] | Identifies candidate covariates from large datasets; requires specification of dimensions and parameters |
| Conditional Logistic Regression | Analyzes matched case-control data | Conditions on matching strata; available in standard statistical packages |
| tmerge() Function | Creates counting process dataset for survival analysis [49] | Expands data into multiple intervals for time-varying exposures and covariates |
| Risk Set Sampling | Selects controls from appropriate population | Must specify sampling with/without replacement and exclusion of cases from own risk set |
| Mahalanobis Distance Matching | Selects controls with similar characteristics [52] | Accounts for correlation between matching factors; useful for collinear variables |

Discussion

Methodological Considerations

The integration of time-dependent PS methods within NCC designs requires careful attention to several methodological aspects. Time scale selection is crucial, as the choice between time-on-study, age, or calendar time can substantially impact results [52]. Control selection strategies must balance statistical efficiency with computational feasibility, with evidence suggesting that increasing the control-to-case ratio to 5:1 or more can reduce bias toward the null [53]. Proper handling of tied event times is essential, as standard approximations (Breslow's, Efron's) can introduce bias, while exact methods perform better [53].

The hdPS component enhances confounding control by incorporating a large number of empirically identified covariates, which is particularly valuable in complex clinical settings with numerous potential confounders [6]. However, this approach requires careful parameter specification and sensitivity analyses to ensure robust findings.

Alignment with Modern Pharmacoepidemiologic Frameworks

The NCC-hdPS approach aligns well with emerging frameworks for observational research. The ICH E9(R1) estimand framework emphasizes precise definition of treatment strategies, intercurrent events, and target populations [50]. The time-dependent PS explicitly addresses the "treatment" attribute by appropriately handling time-varying exposures, while the NCC design's risk set sampling naturally accommodates various strategies for handling intercurrent events.

Similarly, the target trial emulation framework benefits from this integrated approach, as the NCC design embedded within a cohort mirrors the structure of a randomized trial, with cases and controls sampled from a clearly defined study population [50] [52]. The time-dependent PS further strengthens the emulation by ensuring appropriate comparison groups that account for treatment timing.

The integration of time-dependent propensity score methods within nested case-control designs represents a significant methodological advancement for pharmacoepidemiologic studies of time-varying exposures. This hybrid approach simultaneously addresses two major challenges—immortal time bias and residual confounding—that frequently complicate observational drug safety and effectiveness research.

The availability of reproducible code [6] and specialized functions [49] facilitates implementation of these methods, making them increasingly accessible to researchers. As pharmacoepidemiology continues to evolve toward more rigorous causal inference frameworks, this integrated methodology offers a powerful tool for generating valid evidence from real-world data, particularly for studying dynamic treatment regimens and their effects on rare outcomes.

Future methodological development should focus on extending these approaches to more complex exposure patterns (e.g., repeated, intermittent, or cumulative exposures), refining software implementation for computational efficiency, and further integrating with modern causal inference frameworks to enhance the robustness of observational drug safety research.

Propensity score (PS) methods have become a cornerstone in pharmacoepidemiology for addressing confounding bias in non-randomized studies of treatment effectiveness and safety. These methods facilitate the estimation of causal treatment effects from observational data by creating balanced comparison groups, mimicking some properties of randomized controlled trials (RCTs). This application note provides a detailed protocol for the practical implementation of propensity score methods, from initial model building to final treatment effect estimation, specifically tailored for pharmacoepidemiological research.

Theoretical Foundations and Assumptions

Core Concepts

The propensity score is defined as the probability of a study participant being assigned to a treatment group, conditional on their measured baseline covariates [28]. In pharmacoepidemiology, this translates to the probability of receiving a specific drug given patient characteristics, comorbidities, concomitant medications, and other potential confounders. The propensity score serves as a dimension-reducing balancing score, creating treatment and reference groups with comparable distributions of measured pretreatment covariates [28].

Causal Inference Assumptions

The valid application of propensity score methods rests on several critical assumptions [28] [55]:

  • Conditional Ignorability: All common causes of treatment assignment and outcome must be measured and included in the propensity score model.
  • Positivity: Every patient must have a non-zero probability of receiving either treatment, formalized as 0 < P(A=1|X=x) < 1 for all x.
  • Stable Unit Treatment Value Assumption (SUTVA): The potential outcomes of any unit must not be affected by the treatments assigned to other units.

Table 1: Core Assumptions for Valid Causal Inference with Propensity Scores

| Assumption | Definition | Practical Implication in Pharmacoepidemiology |
| --- | --- | --- |
| Conditional Ignorability | No unmeasured confounding given covariates | All clinically relevant confounders must be measured |
| Positivity | All patients have a chance of receiving either treatment | Avoid including patients with absolute contraindications |
| SUTVA | No interference between patients | Treatment of one patient doesn't affect another's outcome |
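In practice, the positivity assumption is checked empirically on the estimated propensity scores: values very close to 0 or 1 indicate near-deterministic treatment assignment. A minimal sketch (the toy scores and the 0.01/0.99 cutoffs are illustrative assumptions; cutoffs should be justified per study):

```python
# Toy estimated propensity scores, including two extreme values (assumed)
ps = [0.004, 0.03, 0.21, 0.45, 0.62, 0.88, 0.997]

# Flag scores near 0 or 1, which suggest practical positivity violations
LOW, HIGH = 0.01, 0.99
violations = [p for p in ps if p < LOW or p > HIGH]
prop_ok = 1 - len(violations) / len(ps)
```

Subjects with flagged scores often correspond to absolute indications or contraindications and are candidates for exclusion before matching or weighting.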

Propensity Score Workflow: A Step-by-Step Protocol

Study Design and Covariate Selection

The initial phase involves emulating a target trial through careful study design [5]. Implement a new-user design to avoid prevalent user bias by identifying patients at the initiation of treatment. Define a clean period without treatment exposure before cohort entry and ensure all covariates are measured during this baseline period.

Covariate selection should be guided by subject matter knowledge rather than purely algorithmic approaches [28] [56]. Include variables that are risk factors for the outcome and associated with treatment assignment. A recent benchmarking study demonstrated that traditional logistic regression with a priori confounder selection based on clinical knowledge produced estimates that more closely aligned with RCT results compared to machine learning approaches with data-driven variable selection [56].

Table 2: Essential Components of Propensity Score Study Design

| Design Element | Protocol Specification | Rationale |
| --- | --- | --- |
| Population | Define inclusion/exclusion criteria | Ensure clinical homogeneity |
| Treatment Groups | New users of treatments being compared | Avoid prevalent user bias |
| Covariate Assessment | Fixed baseline period before treatment initiation | Ensure proper temporal ordering |
| Outcome Definition | Clearly specified with validated algorithms | Maximize validity of endpoint ascertainment |

Propensity Score Estimation

Estimate the propensity score using an appropriate statistical model. For binary treatments, logistic regression is the most common choice, though machine learning methods are increasingly used.
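As a minimal sketch of fitting such a model, the snippet below estimates logit P(A=1|x) = b0 + b1·x by gradient ascent on the log-likelihood over toy data. In practice one would use glm() in R or an established Python library; the hand-rolled fit and all data values here are illustrative assumptions:

```python
import math

# Toy data: one standardized covariate x and binary treatment a
x = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
a = [0, 0, 0, 1, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Maximum-likelihood fit of logit P(A=1|x) = b0 + b1*x via gradient ascent
b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    g0 = sum(ai - sigmoid(b0 + b1 * xi) for xi, ai in zip(x, a))
    g1 = sum((ai - sigmoid(b0 + b1 * xi)) * xi for xi, ai in zip(x, a))
    b0 += lr * g0
    b1 += lr * g1

# Estimated propensity scores for each subject
ps = [sigmoid(b0 + b1 * xi) for xi in x]
```

At the logistic MLE with an intercept, the mean estimated propensity score equals the observed treatment prevalence, which is a useful sanity check on any fitted PS model.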

While machine learning approaches like generalized boosting models have been explored, recent evidence suggests they may not outperform traditional logistic regression and can potentially introduce overadjustment bias when combined with data-driven confounder selection [56].

Propensity Score Application Methods

Choose an appropriate method for implementing the propensity scores:

  • Matching: Pair treated and untreated subjects with similar propensity scores using algorithms (e.g., 1:1 nearest-neighbor matching with caliper)
  • Stratification: Divide subjects into strata based on propensity score quantiles
  • Weighting: Create a pseudo-population using inverse probability of treatment weights
  • Covariate adjustment: Include the propensity score as a covariate in outcome models

Recent methodological research supports matching as it most closely approximates the conditions of a randomized experiment [55]. A caliper width of 0.2 times the standard deviation of the logit of the propensity score has been shown to effectively eliminate over 90% of confounding bias [55].

Balance Assessment and Model Refinement

After applying propensity scores, assess covariate balance between treatment groups using standardized mean differences (aim for <0.1) and variance ratios. Visual assessment using love plots is recommended. If imbalance persists, consider refining the propensity score model by adding interaction terms or higher-order terms for continuous variables.

Outcome Analysis

Once adequate balance is achieved, proceed with outcome analysis. In matched designs, use methods that account for the matched nature of the data, such as conditional logistic regression or robust variance estimators. The specific model should be chosen based on the outcome type (e.g., logistic regression for binary outcomes, Cox regression for time-to-event outcomes).

Workflow Visualization

The following diagram illustrates the complete propensity score analysis workflow:

Study Design & Cohort Definition → Propensity Score Estimation → PS Application (Matching/Weighting/Stratification) → Covariate Balance Assessment → if balance is inadequate, return to propensity score estimation; if adequate → Outcome Analysis & Effect Estimation → Sensitivity Analysis → Interpretation & Reporting

Advanced Applications and Extensions

Multiple Treatment Comparisons

Conventional propensity score methods address binary treatments, but many clinical decisions involve choosing among multiple therapeutic options. The generalized propensity score (GPS) extends the framework to multiple treatments, whether ordinal (e.g., different drug doses) or categorical (e.g., different drug classes) [57]. Research has shown that simple extensions of binary propensity score methods can produce misleading results when applied to multiple treatments, and specialized matching procedures are required [57].

High-Dimensional Covariate Settings

In pharmacoepidemiology utilizing healthcare claims data, researchers often face high-dimensional covariate spaces. The high-dimensional propensity score (hdPS) algorithm provides a systematic approach to empirically identify and adjust for potential confounders from large healthcare databases [5]. However, recent evidence suggests that domain knowledge should guide confounder selection even in high-dimensional settings [56].

Case Study: Cardiovascular Safety of Diabetes Medications

A recent study applied advanced propensity score methods to examine the cardiovascular safety of second-line noninsulin antihyperglycemic treatments added to metformin in type 2 diabetes [57]. Using data from the Clinical Practice Research Datalink (CPRD), researchers compared multiple treatment regimens including metformin plus sulfonylureas, thiazolidinediones, or dipeptidyl peptidase-4 inhibitors.

Implementation Protocol

The analysis employed generalized propensity scores with Mahalanobis distance matching to address the multiple treatment comparisons. The protocol included:

  • Population: Adults with type 2 diabetes initiating second-line therapy after metformin
  • Covariates: Demographics, comorbidities, concomitant medications, healthcare utilization
  • Matching: Fuzzy matching on the generalized propensity score with caliper
  • Outcome: Major adverse cardiovascular events (MACE) and mortality over 3 years

Key Findings

The propensity score analysis revealed that metformin plus gliclazide (sulfonylurea) increased the 3-year risk of MACE compared to metformin plus pioglitazone (thiazolidinedione), and increased mortality risk compared to both metformin plus pioglitazone and metformin plus sitagliptin (DPP-4 inhibitor) [57]. These findings demonstrate how propensity score methods can provide clinically relevant comparative effectiveness evidence from observational data.

Table 3: Research Reagent Solutions for Propensity Score Analysis

| Tool Category | Specific Solutions | Application Context |
| --- | --- | --- |
| Statistical Software | R (package: MatchIt), SAS, Stata | Primary analysis platforms |
| Propensity Score Estimation | Logistic regression, generalized boosting models, random forests | Model building for treatment probability |
| Balance Assessment | Standardized mean differences, variance ratios, Love plots | Evaluating covariate balance post-matching |
| Outcome Analysis | Conditional logistic regression, Cox regression with robust variances, G-computation | Treatment effect estimation |

Validation and Sensitivity Analysis

Benchmarking Against RCTs

Whenever possible, validate propensity score analyses against existing RCT evidence. A recent study benchmarking observational analyses using propensity scores against the PARADIGM-HF randomized trial found that traditional logistic regression with clinical knowledge-based confounder selection most closely aligned with trial results [56].

Sensitivity Analysis for Unmeasured Confounding

Conduct sensitivity analyses to quantify how strong an unmeasured confounder would need to be to explain away the observed treatment effect. Methods such as the E-value approach or probabilistic sensitivity analysis should be routinely implemented.
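For a risk ratio RR ≥ 1, the E-value is RR + √(RR·(RR − 1)) (VanderWeele and Ding); protective estimates are first inverted. Applying the formula to a hazard ratio, as in the sketch below, is an approximation that is most defensible for rare outcomes; the function name and example value are illustrative:

```python
import math

def e_value(rr):
    """E-value for a point estimate on the (approximate) risk-ratio scale."""
    if rr < 1:          # protective effects: invert first
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))

# Example: an HR of 0.72 treated as an approximate risk ratio
ev = e_value(0.72)  # ≈ 2.12
```

An E-value of about 2.1 means an unmeasured confounder would need to be associated with both treatment and outcome by a risk ratio of at least roughly 2.1 (beyond measured covariates) to fully explain away the estimate.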

This application note provides a comprehensive protocol for implementing propensity score methods in pharmacoepidemiological research. The workflow emphasizes proper study design, clinically informed covariate selection, rigorous balance assessment, and appropriate outcome analysis. When correctly applied, propensity score methods offer a powerful tool for generating real-world evidence on treatment effects, though they require careful attention to methodological assumptions and limitations. Recent advances in multiple treatment comparisons and high-dimensional confounding adjustment continue to enhance the utility of these methods for drug development and comparative effectiveness research.

Navigating Pitfalls and Enhancing Robustness in PS Analysis

The "Propensity Score Matching (PSM) Paradox," a phenomenon where increased pruning of matched pairs based on propensity score distance reportedly leads to greater covariate imbalance and bias, has sparked considerable debate in pharmacoepidemiologic methodology. This application note examines this paradox through the lens of practical pharmacoepidemiologic research, where large healthcare databases and complex confounding structures present unique challenges. We synthesize recent empirical evidence suggesting that with proper implementation—including appropriate caliper sizes and balance diagnostics—PSM remains a valuable tool for controlling confounding in observational drug safety and effectiveness studies. Detailed protocols for assessing covariate balance and minimizing model dependence are provided to guide researchers in applying robust PSM methodologies.

Propensity score matching has become a cornerstone method in pharmacoepidemiology due to its ability to control for numerous confounders present in healthcare databases such as insurance claims and electronic health records [37] [5]. The method creates balanced comparison groups by matching treated patients to untreated patients who have similar probabilities (propensity scores) of receiving the treatment given observed covariates [58]. However, King and Nielsen (2019) identified what they termed the "PSM Paradox"—the counterintuitive finding that after achieving initial balance, further pruning of matched pairs with the largest propensity score distances can increase rather than decrease covariate imbalance, model dependence, and bias [59] [60].

This paradox has particularly significant implications for pharmacoepidemiologic studies, which often rely on large, complex datasets to evaluate drug safety and effectiveness in real-world populations [61] [62]. Understanding whether this paradox represents a fundamental methodological flaw or a misuse of PSM is crucial for maintaining the validity of evidence generated from observational pharmacoepidemiologic research.

Examining the Contours of the Paradox

Theoretical Mechanism of the Paradox

The PSM paradox arises from a fundamental property of propensity scores: while PSM guarantees balance on the propensity score itself, it only guarantees balance on the underlying covariates asymptotically [59] [61]. In finite samples, particularly those with already good balance, pruning matched pairs based solely on propensity score distance may inadvertently remove pairs that, despite having slightly different propensity scores, are well-matched on actual covariates [59] [60]. This occurs because PSM attempts to approximate a completely randomized experiment rather than a more efficient fully blocked randomized experiment, making it "uniquely blind" to imbalance that could be eliminated by methods that directly balance covariates [60].

Table 1: Key Studies Investigating the PSM Paradox

| Study | Data Source | Key Findings on PSM Paradox |
| --- | --- | --- |
| King & Nielsen (2019) [59] [60] | Political science data; simulations | PSM increases imbalance, model dependence, and bias by approximating completely randomized experiments |
| Wan (2025) [55] [63] | Simulations and analytical formulas | Paradox stems from misuse of imbalance metrics; not a legitimate concern with proper PSM implementation |
| Franklin et al. (2018) [61] | PACE and MAX insurance claims databases | Imbalance sometimes increased after pruning, but standard calipers prevented deterioration of balance |

Empirical Evidence in Pharmacoepidemiologic Settings

Research specifically examining the PSM paradox in pharmacoepidemiologic contexts has yielded nuanced findings. A 2018 study investigated the paradox using two healthcare claims databases: the Pharmaceutical Assistance Contract for the Elderly (PACE) with 49,919 beneficiaries and the Medicaid Analytic eXtract (MAX) with 886,996 completed pregnancies [61]. The authors created multiple 1:1 propensity-score-matched datasets while manipulating key parameters including covariate set richness, exposure prevalence, and matching algorithms.

The findings demonstrated that while covariate imbalance sometimes increased after progressive pruning of matched sets, the application of standard propensity score calipers (typically 0.2 standard deviations of the logit propensity score) consistently stopped pruning near the lowest region of the imbalance trend [61]. This resulted in improved balance compared to the pre-matched data set, leading the authors to conclude that "PSM does not appear to induce increased covariate imbalance when standard propensity score calipers are applied in these types of pharmacoepidemiologic studies" [61].

Essential Methodological Protocols for PSM Implementation

Propensity Score Estimation and Matching Workflow

The following diagram illustrates the comprehensive workflow for proper PSM implementation, emphasizing balance assessment and model refinement:

Define Study Population and Covariates → Estimate Propensity Scores (Logistic Regression) → Perform Initial Matching (Nearest Neighbor) → Assess Covariate Balance (SMD, Variance Ratios, Prognostic Scores) → if balance is adequate, proceed to Outcome Analysis; if not, refine the PS model (add interactions/non-linear terms) and re-estimate

Balance Assessment Protocol

Comprehensive balance assessment is critical for detecting and addressing the potential PSM paradox. The following protocol should be implemented after initial matching:

  • Calculate Standardized Mean Differences (SMD): For each covariate, compute SMD before and after matching using the formula:

    Continuous variables: SMD = (x̄_treated − x̄_control) / √[(s²_treated + s²_control) / 2]

    Dichotomous variables: SMD = (p_treated − p_control) / √[(p_treated(1 − p_treated) + p_control(1 − p_control)) / 2]

    An SMD <0.1 is generally considered indicative of good balance [64].

  • Evaluate Variance Ratios: For continuous variables, calculate the ratio of variances between treatment groups after matching. A ratio close to 1.0 indicates good balance, with values below 2.0 generally acceptable [64].

  • Assess Prognostic Scores: Regress the outcome on covariates in the control group only, then use this model to predict outcomes for all subjects (prognostic scores). Compare SMD of prognostic scores between groups, as this measure highly correlates with bias [64].

  • Visual Inspection: Create Love plots displaying SMD for all covariates before and after matching, and examine distributional balance through density plots for continuous variables and histograms for categorical variables [64].
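The SMD and variance-ratio calculations above can be sketched directly (toy post-matching values; in practice cobalt::bal.tab() in R computes these for every covariate at once):

```python
import math

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)  # sample variance

def smd_continuous(treated, control):
    pooled_sd = math.sqrt((var(treated) + var(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

def smd_binary(p_t, p_c):
    denom = math.sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)
    return (p_t - p_c) / denom

def variance_ratio(treated, control):
    return var(treated) / var(control)

# Toy post-matching ages (assumed values)
age_t = [61, 64, 58, 70, 66]
age_c = [60, 65, 59, 69, 67]
smd = smd_continuous(age_t, age_c)   # well under the 0.1 threshold
vr = variance_ratio(age_t, age_c)    # close to 1.0
```

Here |SMD| < 0.1 and a variance ratio near 1.0 would both be read as adequate balance on this covariate.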

Table 2: Balance Assessment Metrics and Interpretation

| Metric | Calculation | Interpretation | R Package/Function |
| --- | --- | --- | --- |
| Standardized Mean Difference (SMD) | Difference in means divided by pooled standard deviation | <0.1 indicates balance; >0.1 indicates meaningful imbalance | cobalt::bal.tab() |
| Variance Ratio | Ratio of variances in treatment vs control groups | 1.0 indicates perfect balance; <2.0 generally acceptable | cobalt::bal.tab() |
| Prognostic Score SMD | SMD of predicted outcomes under control condition | Correlates highly with bias; lower values preferred | Custom calculation using outcome model |
| Empirical CDF | Difference in cumulative distribution functions | Smaller values indicate better distributional balance | MatchIt::summary() |

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Software Tools for PSM Implementation in Pharmacoepidemiology

| Tool/Software | Primary Function | Application in PSM | Key Features |
| --- | --- | --- | --- |
| R MatchIt Package [59] [58] | Data preprocessing via matching | Performs various PSM algorithms (nearest neighbor, optimal, full) | Supports multiple distance measures, caliper imposition, and matching with or without replacement |
| R cobalt Package [64] | Covariate balance assessment | Generates balance statistics and Love plots after matching | Computes SMD, variance ratios, and other balance metrics with publication-quality graphics |
| R tableone Package [64] | Descriptive statistics | Creates baseline characteristic tables before and after matching | Automatically calculates SMD for group comparisons appropriate for observational studies |
| High-Dimensional Propensity Score (hdPS) [61] [5] | Automated covariate selection | Identifies potential confounders in large healthcare databases | Uses algorithm to select covariates based on their potential for bias reduction |

Strategic Approaches to Mitigate Paradox Effects

Appropriate Caliper Selection

The PSM paradox primarily manifests when researchers continue pruning matched pairs beyond what is necessary to achieve balance [55]. To prevent this:

  • Implement a caliper width of 0.2 standard deviations of the logit propensity score as this has been shown to eliminate over 90% of confounding bias [55] [63]
  • Avoid excessively narrow calipers that unnecessarily discard valuable data, particularly when balance has already been achieved
  • Use variable ratio calipers when exposure prevalence differs substantially between groups

Model Specification and Refinement

When balance diagnostics indicate residual imbalance, consider these propensity score model refinements:

  • Test for and add clinically relevant interaction terms between covariates
  • Incorporate non-linear terms for continuous covariates using polynomial or spline functions
  • Utilize machine learning algorithms like random forests or generalized boosted models to flexibly estimate propensity scores without strong parametric assumptions [5]
  • Employ high-dimensional propensity score algorithms to systematically select covariates from large healthcare databases [61] [5]

Alternative Matching Strategies

When PSM repeatedly produces unsatisfactory balance despite model refinements:

  • Consider Mahalanobis distance matching either alone or in combination with propensity scores
  • Evaluate coarsened exact matching for key categorical confounders
  • Implement full matching or optimal matching to maximize sample size while improving balance
  • Use propensity score weighting as a complement or alternative to matching

The PSM paradox represents an important methodological consideration rather than a fatal flaw in propensity score approaches. In pharmacoepidemiologic studies using typical large healthcare databases, proper PSM implementation with appropriate calipers and comprehensive balance assessment effectively controls confounding without succumbing to the paradoxical deterioration of balance [61] [55]. Researchers should view the paradox as a reminder of the importance of rigorous balance diagnostics and thoughtful matching strategy selection rather than as a reason to abandon PSM entirely. When properly implemented with attention to caliper selection, model specification, and balance assessment, PSM remains a valuable method for generating valid evidence on drug safety and effectiveness in real-world populations.

Propensity score matching (PSM) is a cornerstone methodological approach in pharmacoepidemiological studies for mitigating confounding bias in observational treatment comparisons. A fundamental challenge in designing a robust PSM study involves optimizing the matched sample size through strategic decisions regarding caliper width, matching ratios, and pruning techniques. These interconnected choices directly impact the bias-variance trade-off, influencing the precision and validity of resultant treatment effect estimates [65] [66]. This application note provides detailed protocols for optimizing sample size in pharmacoepidemiological research, synthesizing current methodological evidence to guide researchers and drug development professionals.

Key Concepts and Definitions

The propensity score is defined as the probability of treatment assignment conditional on a subject's observed baseline covariates [10]. In practice, the propensity score is frequently estimated using logistic regression, though machine learning methods are increasingly employed [10] [67]. PSM uses these scores to construct a matched sample where treated and untreated subjects have similar covariate distributions, thereby approximating the balance achieved in randomized controlled trials [10].

The optimization challenge centers on the intrinsic relationship between sample size and match quality. Excessively narrow matching criteria may prune too many subjects, reducing statistical power and potentially increasing bias if the pruned sample is non-representative [55] [65]. Conversely, overly broad criteria retain more subjects but can produce poor covariate balance, introducing residual confounding [65]. Strategic implementation of caliper widths, matching ratios, and principled pruning is therefore essential for deriving valid causal inferences from observational data.

Optimization Strategies and Evidence-Based Protocols

Caliper Width Selection

The caliper width defines the maximum permitted distance in propensity scores (or their logit transformation) for a valid match. This parameter represents a critical balance between bias reduction and sample retention.

Evidence-Based Recommendation

Extensive Monte Carlo simulations demonstrate that a caliper width of 0.2 standard deviations of the logit of the propensity score generally optimizes the bias-variance trade-off [65]. This specification eliminates at least 98% of the bias in the crude estimator while maintaining confidence intervals with appropriate coverage rates [65]. This finding holds consistently when estimating differences in means for continuous outcomes and risk differences for binary outcomes [65] [68].

Practical Implementation Protocol
  • Estimate Propensity Scores: Fit a logistic regression model with treatment status as the dependent variable and all pre-specified baseline confounders as independent variables.
  • Calculate Logit Scores: Transform the predicted probabilities using the logit function: logit(ps) = ln(ps / (1 - ps)).
  • Compute Pooled Standard Deviation: Calculate the standard deviation of the logit propensity scores separately for treated and untreated groups, then compute the pooled estimate.
  • Apply Caliper: Implement nearest-neighbor matching using the specified caliper width (e.g., 0.2 × pooled SD of logit PS).
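The caliper computation and matching steps above can be sketched as follows. All propensity score values are illustrative assumptions, and the greedy matcher is a deliberately minimal stand-in for MatchIt-style nearest-neighbor matching:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sd(v):
    m = sum(v) / len(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

# Toy estimated propensity scores (assumed)
ps_treated = [0.55, 0.60, 0.70, 0.80]
ps_control = [0.20, 0.35, 0.50, 0.62, 0.75]

lt = [logit(p) for p in ps_treated]
lc = [logit(p) for p in ps_control]

# Pooled SD of the logit propensity score, then the recommended caliper
pooled_sd = math.sqrt((sd(lt) ** 2 + sd(lc) ** 2) / 2)
caliper = 0.2 * pooled_sd

# Greedy 1:1 nearest-neighbor matching within the caliper, no replacement
available = list(range(len(lc)))
pairs = []
for i, t in enumerate(lt):
    if not available:
        break
    j = min(available, key=lambda k: abs(lc[k] - t))
    if abs(lc[j] - t) <= caliper:
        pairs.append((i, j))
        available.remove(j)
```

With these toy values only one treated subject finds a control within the caliper, illustrating how the caliper trades sample size for match quality.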

For studies comparing three treatment groups, similar principles apply, with matching performed based on multiple propensity scores derived from multinomial logistic regression [68].

Matching Order and Ratio

The matching order in greedy nearest-neighbor algorithms can significantly impact result stability, particularly in small-to-medium samples.

Evidence-Based Recommendation

Avoid random order matching due to its potential for cherry-picking and result instability across multiple analyses [69]. Instead, pre-specify deterministic matching orders. Simulation studies indicate that matching from the lowest to the highest propensity score provides superior stability; alternatively, researchers can report the median estimate across multiple random matching orders [69].

Practical Implementation Protocol
  • Pre-specify Matching Order: In your statistical analysis plan, specify a non-random, deterministic matching order (e.g., "smallest", i.e., lowest-to-highest propensity score).
  • Implement 1:1 Matching: Begin with a 1:1 matching ratio without replacement, which is most common in pharmacoepidemiology.
  • Consider Variable Ratio: If control subjects are abundant, variable-ratio matching can increase precision while maintaining balance [67].

Pruning and Common Support

Pruning involves excluding subjects lacking suitable matches in the alternative treatment group, thereby defining the region of common support.

Evidence-Based Recommendation

Pruning should be performed judiciously to eliminate only non-overlapping regions of the propensity score distribution [66]. Excessive pruning after initial balance is achieved is unnecessary and can be counterproductive, potentially increasing imbalance and bias—a phenomenon termed the "PSM paradox" [55]. Once balance on baseline covariates is achieved with a reasonable caliper, further narrowing provides no benefit and sacrifices valuable data [55].

Practical Implementation Protocol
  • Visualize Propensity Score Distributions: Plot density curves or histograms of propensity scores for treated and untreated groups to identify regions of poor overlap.
  • Apply Common Support Restriction: Prune only subjects in the non-overlapping regions of the propensity score distribution.
  • Document Pruning Impact: Report the number of subjects excluded due to common support restrictions and compare baseline characteristics between retained and excluded subjects.
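The common-support restriction can be sketched as follows (illustrative Python; a real analysis would typically also inspect density plots before trimming):

```python
import numpy as np

def trim_to_common_support(ps, treated):
    """Keep only subjects whose PS lies in the overlap of the two groups' ranges."""
    lo = max(ps[treated].min(), ps[~treated].min())
    hi = min(ps[treated].max(), ps[~treated].max())
    keep = (ps >= lo) & (ps <= hi)
    return keep, lo, hi

ps = np.array([0.02, 0.15, 0.40, 0.55, 0.70, 0.97])
treated = np.array([False, False, True, False, True, True])
keep, lo, hi = trim_to_common_support(ps, treated)
print(keep.sum(), "of", len(ps), "retained; support =", (lo, hi))
```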

Comprehensive Optimization Workflow

The integrated workflow for optimizing sample size in propensity score matching studies proceeds as follows:

  • Study design phase: estimate propensity scores (logistic regression).
  • Select the caliper width (0.2 SD of the logit PS recommended).
  • Pre-specify the matching order (avoid random order).
  • Assess common support and prune non-overlapping regions only.
  • Assess covariate balance (standardized differences < 0.1); if balance is inadequate, consider a slight caliper adjustment and re-match.
  • Once balance is adequate, estimate the treatment effect in the final matched analysis sample.

Comparative Strategies Table

The table below summarizes the key optimization strategies, their impact on sample size, and empirical evidence supporting their implementation:

Table 1: Evidence-Based Strategies for Optimizing Sample Size in Propensity Score Matching

| Strategy | Recommended Specification | Impact on Sample Size | Key Evidence |
| --- | --- | --- | --- |
| Caliper Width | 0.2 × SD of logit PS | Eliminates ~98% bias while retaining adequate sample | Austin (2010) [65] |
| Matching Order | Deterministic (lowest to highest) | Reduces cherry-picking & improves stability | Maruo et al. (2025) [69] |
| Pruning Approach | Restrict to common support only | Prevents excessive pruning & PSM paradox | Li et al. (2025) [55] |
| Balance Assessment | Standardized differences < 0.1 | Ensures adequacy of final matched sample | Austin (2011) [10] |

The Scientist's Toolkit

Table 2: Essential Methodological Components for Propensity Score Matching Studies

| Research Component | Function in PSM | Implementation Considerations |
| --- | --- | --- |
| Logistic Regression | Estimates propensity scores | Baseline covariates must be pre-specified; consider machine learning alternatives with high-dimensional data [10] [67] |
| Standardized Differences | Assesses covariate balance | More appropriate than hypothesis tests; target <0.1 for adequate balance [68] |
| Caliper Implementation | Controls match quality | Apply to logit of propensity score; 0.2 SD generally optimal [65] |
| Common Support Restriction | Defines analyzable population | Prune only non-overlapping regions; document excluded subjects [55] [66] |
| Sensitivity Analyses | Assesses unmeasured confounding | Quantifies how strong an unmeasured confounder would need to be to alter conclusions [66] |

Optimizing sample size through strategic implementation of caliper width, matching ratios, and pruning techniques is fundamental to deriving valid causal inferences from pharmacoepidemiological studies. The evidence-based protocols presented herein support the use of a 0.2 standard deviation caliper on the logit of the propensity score, deterministic matching orders, and judicious pruning limited to regions of non-overlap. By adhering to these structured approaches, researchers can enhance the robustness, reproducibility, and regulatory acceptance of real-world evidence generated through propensity score matching methods.

Observational studies using real-world data, such as electronic health records (EHR) and insurance claims data, are essential for assessing treatment effectiveness and safety in real-world populations. However, these studies face significant methodological challenges that can compromise the validity of their findings if not properly addressed. Unmeasured confounding, missing data, and complex time-varying treatment patterns represent three fundamental obstacles to obtaining reliable causal inferences in pharmacoepidemiological research. This article explores advanced propensity score methodologies to address these challenges within the broader context of a thesis on advancing propensity score methods in pharmacoepidemiology. We provide structured protocols, quantitative comparisons, and practical implementation guidance to enhance the rigor of observational research in drug development and outcomes research.

Core Methodological Challenges in Pharmacoepidemiology

The Triad of Analytical Challenges

The validity of pharmacoepidemiological studies is frequently threatened by interconnected methodological challenges that can introduce substantial bias in treatment effect estimates. The following table summarizes these key challenges and their implications:

Table 1: Core Methodological Challenges in Pharmacoepidemiological Studies

| Challenge Category | Specific Manifestations | Impact on Validity | Common Data Sources Affected |
| --- | --- | --- | --- |
| Unmeasured Confounding | Omitted variables, imperfect measurement, unrecorded patient characteristics | Residual selection bias, distorted treatment effect estimates | All observational data sources |
| Data Missingness | Partially observed confounders, incomplete clinical variables, systematic missing data | Incomplete confounding control, selection bias | EHR (especially lifestyle factors, lab values) |
| Complex Treatment Patterns | Treatment switching, dose escalation, combination therapy, non-adherence | Immortal time bias, time-dependent confounding, informative censoring | Claims data, pharmacy records, EHR |

These challenges are particularly pronounced in studies investigating mental health disorders, where treatment pathways are often complex and multidimensional. For example, studies of major depressive disorder (MDD) have documented numerous treatment patterns including persistence, discontinuation, switching, dose escalation, augmentation, and combination therapy [70]. Similarly, research on post-traumatic stress disorder (PTSD) treatment reveals complex patterns of pharmacotherapy management with frequent modifications [71].

Interrelationships Between Challenges

These methodological challenges often coexist and interact in ways that compound their threat to validity. For instance, missing data on key confounders such as ethnicity or chronic kidney disease stage (which can be missing in over 50% of records in EHR studies) exacerbates problems of unmeasured confounding [72]. Simultaneously, complex treatment patterns over time can introduce immortal time bias if not properly accounted for in the study design [6]. The convergence of these issues requires integrated methodological approaches rather than piecemeal solutions.

Advanced Methodological Approaches

High-Dimensional Propensity Score (hdPS) with Nested Case-Control Design

The integration of hdPS with nested case-control (NCC) designs represents a sophisticated approach to simultaneously address unmeasured confounding and immortal time bias. This method was effectively implemented in a study of disease-modifying drugs (DMDs) for multiple sclerosis and all-cause mortality [6].

Experimental Protocol: hdPS-NCC Implementation

Step 1: Cohort Definition and Follow-up

  • Define a retrospective cohort with clear inclusion/exclusion criteria
  • Establish baseline period for covariate assessment
  • Define follow-up period from index date until outcome occurrence, censoring, or end of study
  • Identify all outcome cases (e.g., mortality events) during follow-up

Step 2: Nested Case-Control Sampling

  • For each case, randomly select 4-5 controls matched on time-in-study using incidence density sampling
  • This sampling approach inherently addresses immortal time bias by ensuring comparable follow-up time between cases and controls
  • In the multiple sclerosis application, this resulted in 3,209 cases matched to 12,293 controls [6]
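Incidence density (risk-set) sampling can be sketched as follows (illustrative Python; the function name and toy data are assumptions, and real implementations additionally match on factors such as age and calendar time):

```python
import numpy as np

def incidence_density_sample(time, event, m=4, seed=0):
    """For each case, sample up to m controls from the risk set at the case's
    event time (subjects still under follow-up, including future cases)."""
    rng = np.random.default_rng(seed)
    matched = {}
    for i in np.flatnonzero(event):
        # risk set: everyone still followed at the case's event time, except the case
        risk_set = np.flatnonzero((time >= time[i]) & (np.arange(len(time)) != i))
        k = min(m, len(risk_set))
        matched[int(i)] = rng.choice(risk_set, size=k, replace=False).tolist()
    return matched

time = np.array([5.0, 8.0, 2.0, 9.0, 7.0])   # follow-up times
event = np.array([True, False, True, False, False])
print(incidence_density_sample(time, event, m=2))
```

Because controls are drawn from subjects with at least as much follow-up as the case, cases and controls accrue comparable person-time by construction, which is what removes immortal time bias.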

Step 3: High-Dimensional Propensity Score Estimation

  • Identify all candidate covariates across the five data dimensions (e.g., diagnoses, procedures, and medications)
  • Apply the hdPS algorithm to empirically identify and prioritize covariates based on their potential for confounding
  • Select top covariates for inclusion in the propensity score model (typically 100-500 variables)
  • Estimate propensity scores using logistic regression including both investigator-specified and empirically-identified covariates
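The empirical prioritization step ranks candidate covariates by the Bross multiplicative bias term, computed from each covariate's prevalence among exposed and unexposed subjects and its association with the outcome. A simplified, illustrative sketch (the smoothing constant and helper names are assumptions, not part of the published algorithm):

```python
import numpy as np

def bross_bias(p1, p0, rr_cd):
    """Multiplicative confounding bias for a binary covariate with prevalence
    p1 (exposed) / p0 (unexposed) and covariate-outcome relative risk rr_cd."""
    return (p1 * (rr_cd - 1) + 1) / (p0 * (rr_cd - 1) + 1)

def rank_covariates(X, exposed, outcome):
    """Rank binary covariates by |log(bias)|, strongest potential confounders first."""
    scores = []
    for j in range(X.shape[1]):
        c = X[:, j]
        p1, p0 = c[exposed].mean(), c[~exposed].mean()
        risk1 = outcome[c == 1].mean() if (c == 1).any() else 0.0
        risk0 = outcome[c == 0].mean() if (c == 0).any() else 0.0
        rr = (risk1 + 1e-6) / (risk0 + 1e-6)   # smoothed covariate-outcome RR
        scores.append(abs(np.log(bross_bias(p1, p0, rr))))
    return np.argsort(scores)[::-1]

print(round(bross_bias(0.5, 0.1, 2.0), 3))  # → 1.364
```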

Step 4: Treatment Effect Estimation

  • Conduct conditional logistic regression conditioned on the matched sets
  • Incorporate the estimated propensity score as either a matching variable or regression covariate
  • Calculate hazard ratios with appropriate confidence intervals

Step 5: Sensitivity Analyses

  • Vary parameters of the hdPS algorithm (number of covariates, preprocessing preferences)
  • Test different matching ratios (1:1, 1:3, 1:4)
  • Evaluate robustness of findings across analytical scenarios

In the multiple sclerosis application, this approach demonstrated a 28% reduction in mortality risk associated with DMD exposure (HR: 0.72, 95% CI: 0.62-0.84), with consistent results across sensitivity analyses (HRs: 0.70-0.77) [6].

Workflow Visualization

High-dimensional PS with nested case-control workflow: the full retrospective cohort (n = 19,360) is followed for the outcome (all-cause mortality); cases are identified (n = 3,209) and matched 1:4 to controls (n = 12,293); the high-dimensional propensity score is built from five data dimensions with empirical covariate selection (top 500 covariates); conditional logistic regression adjusted for the hdPS yields the treatment effect estimate (HR: 0.72, 95% CI: 0.62–0.84), followed by sensitivity analyses varying hdPS parameters and matching ratios.

Missingness Pattern Approach for Partial Confounder Data

The Missingness Pattern Approach (MPA) provides a framework for handling partially observed confounders in propensity score analysis, which is particularly relevant for EHR studies where missing data is common.

Theoretical Foundation

MPA operates by estimating propensity scores separately within each missingness pattern present in the data. Unlike simple missing indicator methods, MPA acknowledges that the mechanism of missingness may be informative and incorporates this information into the analysis. The approach requires that the missingness mechanism is conditionally independent of the outcomes given the observed data and treatment assignment [72].

In a study of angiotensin-converting enzyme inhibitors and angiotensin receptor blockers (ACEIs/ARBs) and acute kidney injury, two key confounders had substantial missingness: ethnicity (59.0%) and chronic kidney disease stage (52.9%) [72]. Only 21% of patients had complete data for both variables, making complete-case analysis problematic.

Experimental Protocol: MPA Implementation

Step 1: Missingness Pattern Identification

  • Identify all variables with missing values, particularly potential confounders
  • Document the proportion missing for each variable
  • Create missingness indicator variables for each confounder with missing data
  • Identify all unique missingness patterns in the dataset

Step 2: Propensity Score Estimation Within Patterns

  • Stratify the analysis by missingness pattern
  • Within each pattern, include only the confounders that are observed
  • Estimate propensity scores separately within each stratum using the available confounders
  • Alternatively, include interaction terms between treatment and missingness indicators in a unified model
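Within-pattern propensity score estimation can be sketched as follows (illustrative Python using scikit-learn; in practice each missingness pattern must contain enough subjects in both treatment groups for the model to be fit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mpa_propensity(X, treated):
    """Estimate the PS separately within each missingness pattern, using only
    the covariates observed in that pattern (missingness pattern approach)."""
    X = np.asarray(X, float)
    pattern = np.isnan(X)                         # True where a value is missing
    ps = np.full(len(X), np.nan)
    for pat in np.unique(pattern, axis=0):        # loop over unique patterns
        rows = (pattern == pat).all(axis=1)
        obs_cols = ~pat                           # covariates observed in this pattern
        if obs_cols.any():
            model = LogisticRegression().fit(X[np.ix_(rows, obs_cols)], treated[rows])
            ps[rows] = model.predict_proba(X[np.ix_(rows, obs_cols)])[:, 1]
        else:
            ps[rows] = treated[rows].mean()       # nothing observed: marginal rate
    return ps

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 2))
t_demo = X_demo[:, 0] + rng.normal(size=60) > 0
X_demo[:30, 1] = np.nan                           # one covariate missing for half the cohort
ps = mpa_propensity(X_demo, t_demo)
print(np.isfinite(ps).all())
```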

Step 3: Treatment Effect Estimation

  • Conduct outcome analysis within each missingness pattern
  • Pool stratum-specific estimates using appropriate weights (e.g., inverse variance weighting)
  • Alternatively, include the propensity score and missingness indicators in the outcome model

Step 4: Assumption Validation

  • Evaluate the plausibility of missing-at-random assumptions using causal diagrams
  • Assess whether missingness is associated with treatment assignment or outcomes after accounting for observed variables
  • Conduct sensitivity analyses to evaluate the impact of potential violations of missingness assumptions

Table 2: Performance Comparison of Missing Data Handling Methods in Propensity Score Analysis

| Method | Key Assumptions | Advantages | Limitations | Suitable Scenarios |
| --- | --- | --- | --- | --- |
| Complete Case Analysis | Missing completely at random | Simple implementation | Inefficient, potentially biased | <5% missing, MCAR plausible |
| Missing Indicator Method | Missingness independent of outcome | Uses all observations | Can introduce severe bias | Generally not recommended |
| Multiple Imputation | Missing at random | Uses all information, valid uncertainty | Computationally intensive | MAR plausible, multivariate missingness |
| Missingness Pattern Approach (MPA) | Missingness independent of outcome given observed data | Incorporates missingness information | Complex implementation, pattern sparsity | Large samples, informative missingness |

Dimensionality Reduction for Propensity Score Specification in High-Dimensional Data

High-dimensional healthcare data presents both opportunities and challenges for confounding control. While abundant variables enable more complete confounding control, they also increase the risk of model misspecification and finite-sample bias. Dimensionality reduction techniques offer promising approaches to improve propensity score specification in such settings.

Comparative Performance of Dimensionality Reduction Methods

A recent study compared dimensionality reduction techniques for propensity score estimation in claims data analyzing the association between dialysis and mortality in heart failure patients with advanced chronic kidney disease [73]. The study included 485 dialysis-exposed and 1,455 unexposed individuals after matching.

Table 3: Performance Comparison of Dimensionality Reduction Techniques for Propensity Scores

| Method | Covariates with SMD >0.1 | Key Implementation Features | Relative Advantages | Computational Requirements |
| --- | --- | --- | --- | --- |
| Investigator-Specified | 83 | Domain knowledge-driven selection | Clinical interpretability | Low |
| High-Dimensional Propensity Score (hdPS) | 37 | Algorithmic covariate prioritization | Automated, reproducible | Moderate |
| Principal Component Analysis (PCA) | 20 | Linear dimensionality reduction | Computational efficiency | Moderate |
| Logistic PCA | 25 | Nonlinear dimensionality reduction for binary data | Handles binary features well | High |
| Autoencoders | 8 | Neural network-based representation learning | Optimal balance achievement | Very high |

The study found that autoencoder-based propensity scores achieved superior covariate balance, with only 8 covariates showing standardized mean differences (SMD) > 0.1 compared to 83 for investigator-specified covariates [73]. Despite these differences in balance, hazard ratios for mortality were similar across methods, suggesting that the primary benefit of advanced dimensionality reduction techniques lies in improved covariate balance rather than substantially different effect estimates.

Implementation Protocol: Autoencoder-Based Propensity Scores

Step 1: Data Preprocessing

  • Organize high-dimensional covariates into a structured matrix
  • Handle missing values using appropriate imputation methods
  • Standardize continuous variables to mean 0 and variance 1
  • Convert categorical variables to appropriate representations

Step 2: Autoencoder Architecture Specification

  • Define encoder structure with progressively decreasing layers (e.g., 500 → 250 → 100 units)
  • Create bottleneck layer with significantly reduced dimensions (e.g., 10-50 units)
  • Define symmetric decoder structure for reconstruction
  • Select appropriate activation functions (e.g., tanh, relu)

Step 3: Model Training

  • Implement reconstruction loss minimization (mean squared error)
  • Apply regularization techniques to prevent overfitting (e.g., dropout, L2 penalty)
  • Use adaptive learning rate methods (e.g., Adam optimizer)
  • Employ early stopping based on validation set performance

Step 4: Propensity Score Estimation

  • Extract latent representations from the bottleneck layer
  • Use these reduced-dimensional representations as covariates in propensity score model
  • Estimate propensity scores using logistic regression with latent factors as predictors
  • Apply conventional propensity score methods (weighting, matching) using the estimated scores
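The four steps above can be roughly approximated with scikit-learn as a stand-in for a deep autoencoder (illustrative only; the cited study used dedicated neural-network tooling, all sizes here are toy values, and extracting the hidden layer manually from `coefs_` is a workaround for scikit-learn's lack of a bottleneck API):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p, k = 500, 40, 5                       # subjects, covariates, bottleneck size
X = rng.normal(size=(n, p))
treated = (X[:, :3].sum(axis=1) + rng.normal(size=n) > 0).astype(int)

# Steps 1-3: standardize, then train a single-bottleneck autoencoder (X -> X)
Xs = StandardScaler().fit_transform(X)
ae = MLPRegressor(hidden_layer_sizes=(k,), activation="tanh",
                  max_iter=500, random_state=0).fit(Xs, Xs)

# Step 4: extract bottleneck activations and fit the PS model on them
Z = np.tanh(Xs @ ae.coefs_[0] + ae.intercepts_[0])   # latent representation
ps = LogisticRegression().fit(Z, treated).predict_proba(Z)[:, 1]
print(Z.shape, ps.shape)
```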

The Scientist's Toolkit: Essential Research Reagents

Successful implementation of advanced propensity score methods requires both methodological expertise and appropriate analytical tools. The following table details essential components of the methodological toolkit for addressing confounding and data challenges in pharmacoepidemiology.

Table 4: Essential Research Reagents for Advanced Propensity Score Analysis

| Tool Category | Specific Tools/Techniques | Primary Function | Implementation Considerations |
| --- | --- | --- | --- |
| Bias Addressing Methods | Nested case-control design, Marginal structural models, Sequential conditioning | Address time-related biases (immortal time, time-dependent confounding) | Requires specialized design and analysis techniques |
| Missing Data Handling | Missingness Pattern Approach (MPA), Multiple imputation, Pattern mixture models | Handle partially observed confounders | Dependent on missingness mechanism assumptions |
| High-Dimensional Control | hdPS algorithm, Autoencoders, PCA, Regularized regression | Control for extensive covariate sets | Computational intensity, requires validation |
| Software & Computing | R (hdPS, ggplot2), Python (TensorFlow), SAS macros, High-performance computing | Implement complex analytical workflows | Reproducibility, computational resources |
| Balance Diagnostics | Standardized mean differences, Empirical covariate balance metrics, Love plots | Assess propensity score performance | Pre-specified balance thresholds required |

Integrated Analytical Framework for Complex Treatment Patterns

Analysis of complex treatment patterns requires specialized frameworks that can capture the temporal sequencing of treatment events and their relationship to outcomes. The following workflow illustrates an integrated approach for analyzing pharmacotherapy treatment patterns and their association with outcomes, as applied in a study of 252,179 veterans with major depressive disorder [74].

Treatment pattern analysis framework: raw data sources (pharmacy fills, inpatient medications, outcome events) undergo preprocessing (dose normalization to fluoxetine equivalents, temporal alignment and smoothing via median filtering, gap handling, and abstraction of dosage actions into start/increase/decrease/continue). Pattern identification then applies process mining and inference, sequential rule mining (association rules with outcomes), and trace clustering (agglomerative clustering). Outcome analysis follows (regression adjustment, propensity score methods, time-to-event analysis), with results presented through pattern visualization (network graphs, state transition diagrams, sequence displays).

In the MDD application, this framework revealed that ten prescription patterns accounted for nearly 70% of treatment pathways among individuals starting antidepressants at 20-39 mg fluoxetine equivalents [74]. The analysis further identified specific associations between dosage changes and clinical outcomes, providing insights for personalized treatment monitoring.

Addressing unmeasured confounding, missing data, and complex treatment patterns requires sophisticated methodological approaches that extend beyond conventional propensity score methods. The integrated application of high-dimensional propensity scores with nested case-control designs, missingness pattern approaches, and dimensionality reduction techniques provides a robust framework for generating more valid evidence from pharmacoepidemiological studies. These advanced methods enable researchers to better approximate the conditions of randomized trials using observational data, enhancing the evidence base for drug development and clinical decision-making. Future methodological research should focus on developing more accessible implementations of these techniques and establishing comprehensive guidelines for their application across diverse therapeutic areas and data sources.

In pharmacoepidemiology, intercurrent events (ICEs) are events that occur after treatment initiation and affect either the interpretation or the existence of the measurements associated with the clinical question of interest [50]. In observational studies, common ICEs include treatment discontinuation, switching to alternative medications, and terminal events such as death [50]. These events present significant methodological challenges for estimating causal treatment effects from real-world data, necessitating robust analytical frameworks and precise definitional approaches.

The ICH E9(R1) estimand framework provides a structured approach for precisely defining treatment effects by accounting for ICEs through its five core attributes: (1) the target population, (2) the variable (endpoint), (3) the treatment conditions, (4) the strategy for handling intercurrent events, and (5) the population-level summary measure [75] [50]. This framework brings clarity and transparency to the scientific question of interest, ensuring alignment between study objectives, design, conduct, and analysis. Although originally developed for randomized controlled trials, the estimand framework's principles are increasingly recognized as relevant for observational pharmacoepidemiologic studies, where ICEs are ubiquitous [50].

Propensity score methods serve as a powerful companion to this framework by enabling researchers to adjust for confounding in observational studies, thus creating analysis sets where treated and reference groups have comparable distributions of measured baseline covariates [28] [76]. When combined with a clearly defined estimand, propensity score methods help emulate a target trial, reducing bias in treatment effect estimation even in the presence of complex ICE patterns [5].

Classifying Intercurrent Events and Defining Strategies

Classification of Common Intercurrent Events

In pharmacoepidemiologic studies, ICEs can be systematically categorized to inform appropriate analytical strategies. Table 1 outlines common ICE types, their characteristics, and examples.

Table 1: Classification of Intercurrent Events in Pharmacoepidemiology

| ICE Category | Definition | Common Examples | Key Characteristics |
| --- | --- | --- | --- |
| Treatment Discontinuation | Cessation of the study treatment before the planned endpoint | Early discontinuation due to adverse events, cost, or patient preference [50] | May be informative or non-informative; often requires strategy specification |
| Treatment Switching | Initiation of an alternative or additional therapy | Addition of rescue medication, switch to competitor drug [75] | Complicates isolation of initial treatment effect; common in chronic diseases |
| Terminal Events | Events that preclude subsequent measurement of the outcome | Death from any cause [50] | Precludes existence of outcome measurement for some strategies |
| Administrative Events | Events unrelated to treatment efficacy or safety | Relocation, loss to follow-up due to external factors, pandemic-related disruptions [77] | Often considered treatment-unrelated; may be addressed with hypothetical strategies |

A critical advancement in ICE management involves classifying events as either treatment-related or treatment-unrelated [77]. Treatment-related ICEs (e.g., discontinuation due to adverse events or lack of efficacy) are considered informative about treatment response and often warrant strategies that classify them as treatment failures. Conversely, treatment-unrelated ICEs (e.g., discontinuation due to relocation or insurance changes) may be addressed through hypothetical strategies that envision scenarios where these events did not occur [77].

Estimand Strategies for Handling ICEs

The ICH E9(R1) addendum describes several primary strategies for handling ICEs, each yielding a different interpretation of the treatment effect [75] [50]:

  • Treatment Policy Strategy: The ICE is considered irrelevant to the treatment effect assessment. Outcomes are analyzed regardless of whether ICEs occurred, reflecting the real-world effectiveness of initiating treatment.
  • Hypothetical Strategy: Estimates the treatment effect under a scenario where the ICE would not have occurred, requiring explicit assumptions about the ICE mechanism.
  • Composite Strategy: Incorporates the occurrence of the ICE directly into the outcome definition, typically by assigning a poor outcome value to patients who experience the ICE.
  • While-on-Treatment Strategy: Focuses on the treatment effect before the occurrence of an ICE, capturing only the initial treatment response.
  • Principal Stratum Strategy: Restricts the analysis to the subgroup of patients who would not experience the ICE under either treatment assignment.

Table 2: Mapping ICE Types to Appropriate Handling Strategies

| ICE Type | Recommended Strategy | Interpretation of Treatment Effect | Key Considerations |
| --- | --- | --- | --- |
| Treatment Discontinuation (unrelated) | Hypothetical [77] | Effect if patients had remained on treatment | Requires assumption that ICE is conditionally independent of outcome |
| Treatment Discontinuation (related) | Composite [77] | Effect including discontinuation as failure | Clinically relevant for tolerability assessment |
| Treatment Switching | Treatment Policy or Hypothetical [75] | Effect of initial treatment strategy | Choice depends on whether switching is part of routine care |
| Death | Composite [77] | Effect including mortality | Necessary when death precludes outcome measurement |
| Administrative Censoring | Hypothetical | Effect in absence of administrative constraints | Useful for addressing non-informative censoring |

The decision pathway for selecting an appropriate strategy based on ICE characteristics is as follows:

  • Does the event preclude outcome measurement (e.g., death)? If yes, use the composite strategy.
  • If not, is the event informative about treatment response? If yes, use the composite strategy.
  • If not, is the scientific interest in the treatment policy? If yes, use the treatment policy strategy; if no, use the hypothetical strategy, considering the while-on-treatment strategy or principal stratification as alternatives.

Quantitative Evidence from Recent Studies

Recent large-scale studies have quantified the associations between intercurrent events and clinical outcomes, providing empirical support for their role as potential surrogate endpoints or prognostic factors.

Table 3: Quantified Associations Between Intercurrent Events and Clinical Outcomes in Non-Diabetic CKD (n=504,924) [78]

| Intercurrent Event | Clinical Outcome | Hazard Ratio | 95% Confidence Interval | Clinical Interpretation |
| --- | --- | --- | --- | --- |
| Outpatient heart failure diagnosis | Hospitalization for heart failure | 12.92 | 12.67–13.17 | Strong predictor of future HF hospitalization |
| CKD stage 4 diagnosis | Kidney failure/need for dialysis | 3.75 | 3.69–3.81 | Moderate predictor of renal failure progression |
| Potassium-removing resin dispensation | Worsening of CKD stage | 4.83 | 4.51–5.17 | Electrolyte management as marker of disease severity |
| eGFR decline (laboratory subset, n=295,174) | Hospitalization for heart failure | Progressive increase with eGFR decline | N/A | Continuous relationship between renal function and CV risk |
| eGFR decline (laboratory subset, n=295,174) | Kidney failure/need for dialysis | Progressive increase with eGFR decline | N/A | Strong association between renal function decline and failure |

This study demonstrates how intercurrent events can serve as early indicators of more severe clinical outcomes, supporting their use in pharmacoepidemiologic research to understand disease progression and treatment effects [78].

Propensity Score Methods for Addressing Confounding in ICE Analysis

Propensity score methods are essential tools for addressing confounding in observational studies of treatment effects, particularly when ICEs are present [28]. The propensity score is defined as the probability of treatment assignment conditional on observed baseline covariates [28] [5]. These methods create analysis sets where treated and reference groups have comparable distributions of measured covariates, mimicking the balance achieved through randomization in clinical trials.

The four primary propensity score approaches are:

  • Propensity Score Matching: Creates matched sets of treated and untreated subjects with similar propensity scores, enabling direct comparison of outcomes [76].
  • Propensity Score Weighting: Uses inverse probability of treatment weights to create a pseudo-population where treatment assignment is independent of measured covariates [76].
  • Propensity Score Stratification: Groups subjects into strata based on propensity score quantiles and estimates treatment effects within each stratum.
  • Covariate Adjustment Using Propensity Score: Includes the propensity score as a covariate in outcome regression models to adjust for confounding.

When applied to time-to-event outcomes in the presence of ICEs, propensity score methods can estimate both marginal hazard ratios and absolute risk differences, providing comprehensive information about treatment effects [76].
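The weighting approach can be sketched on simulated data (illustrative Python; the ATE weights shown are the standard 1/PS and 1/(1−PS) forms):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 2000
x = rng.normal(size=(n, 3))                      # baseline confounders
treated = (0.5 * x[:, 0] + rng.normal(size=n) > 0).astype(int)

# Propensity score model on baseline covariates
ps = LogisticRegression().fit(x, treated).predict_proba(x)[:, 1]

# ATE weights: 1/PS for treated, 1/(1-PS) for untreated
w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))

# In the weighted pseudo-population, each arm re-creates the full cohort
print(round(w[treated == 1].sum() / n, 2), round(w[treated == 0].sum() / n, 2))
```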

Implementation Protocol for Propensity Score Analysis with ICEs

Protocol 1: Comprehensive Propensity Score Analysis with ICE Handling

Objective: To estimate the effect of a target treatment on a time-to-event outcome while appropriately addressing intercurrent events through the estimand framework and propensity score methods.

Step 1: Define the Target Estimand

  • Specify all five attributes of the estimand framework:
    • Target Population: Clearly define inclusion/exclusion criteria
    • Treatment Conditions: Define index dates, exposure definitions, and comparator groups
    • Endpoint: Precisely define the outcome measurement and timing
    • Intercurrent Events: Pre-specify all relevant ICEs and handling strategies
    • Population Summary: Define the target measure of association (e.g., hazard ratio) [75] [50]

Step 2: Assemble the Cohort and Define Variables

  • Identify all eligible subjects meeting the target population definition
  • Operationalize all baseline covariates, exposure definitions, and outcome measurements
  • Document the occurrence and timing of all pre-specified ICEs during follow-up

Step 3: Estimate Propensity Scores

  • Select covariates for inclusion based on subject matter knowledge
  • Estimate propensity scores using logistic regression or machine learning methods
  • Assess propensity score distribution overlap between treatment groups
  • Evaluate covariate balance before and after propensity score application [28]

Step 4: Apply Propensity Score Method

  • Implement the chosen propensity score method (matching, weighting, or stratification)
  • Verify that adequate covariate balance has been achieved (standardized differences <0.1)
  • Document the final analysis population characteristics
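Covariate balance in Step 4 can be verified with (weighted) standardized mean differences; a minimal sketch (the function name is an assumption):

```python
import numpy as np

def standardized_difference(x, treated, weights=None):
    """Weighted standardized mean difference for one covariate
    (balance target after matching/weighting: |SMD| < 0.1)."""
    w = np.ones(len(x)) if weights is None else np.asarray(weights, float)
    def wstats(mask):
        m = np.average(x[mask], weights=w[mask])
        v = np.average((x[mask] - m) ** 2, weights=w[mask])
        return m, v
    m1, v1 = wstats(treated)
    m0, v0 = wstats(~treated)
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
treated = np.array([False, False, True, True])
print(standardized_difference(x, treated))  # → 4.0
```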

Step 5: Implement ICE Handling Strategy

  • Apply the pre-specified ICE handling strategy consistent with the estimand:
    • Treatment Policy: Analyze outcomes regardless of ICE occurrence
    • Composite: Incorporate ICEs into the outcome definition
    • Hypothetical: Use appropriate methods (e.g., inverse probability censoring weighting)
  • For competing ICEs, implement appropriate statistical methods to address the competing risks [77]

Step 6: Estimate Treatment Effects

  • Fit appropriate outcome models to estimate the treatment effect
  • For time-to-event outcomes, use Cox regression or other survival models
  • Report both relative and absolute measures of effect [76]
  • Conduct sensitivity analyses to assess robustness to unmeasured confounding
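As a concrete illustration of the weighting option in Step 4, stabilized inverse-probability-of-treatment weights can be computed directly from the estimated propensity scores. This is a minimal sketch in Python, assuming the scores have already been estimated in Step 3; the function and variable names are illustrative, not a prescribed implementation:

```python
def stabilized_iptw(treated, ps):
    """Stabilized inverse-probability-of-treatment weights.

    treated: 0/1 treatment indicators
    ps: estimated propensity scores P(T = 1 | X) for each subject
    Stabilization multiplies by the marginal treatment prevalence,
    which reduces weight variability relative to plain IPTW.
    """
    p_treat = sum(treated) / len(treated)  # marginal P(T = 1)
    return [p_treat / e if t == 1 else (1 - p_treat) / (1 - e)
            for t, e in zip(treated, ps)]

# Four subjects: two treated, two untreated
weights = stabilized_iptw([1, 1, 0, 0], [0.8, 0.6, 0.4, 0.2])
```

These weights would then feed the survival model of Step 6; subjects with propensity scores near 0 or 1 receive extreme weights, which is one reason overlap is checked in Step 3.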

Advanced Considerations and Recent Methodological Developments

Handling Competing Intercurrent Events

A significant methodological challenge arises when multiple ICEs compete to occur first, where the initial ICE censors subsequent events [77]. For example, if a patient discontinues treatment due to relocation (treatment-unrelated ICE), it remains unknown whether an adverse event (treatment-related ICE) would have occurred later in the follow-up period. This competing ICE structure has been largely overlooked in previous methodologies but is ubiquitous in practice [77].

Novel approaches now address this challenge by:

  • Classifying ICEs as treatment-related or treatment-unrelated
  • Applying a composite strategy to treatment-related ICEs
  • Applying a hypothetical strategy to treatment-unrelated ICEs
  • Developing weighting, outcome regression, and doubly robust estimators that account for the competing nature of these events [77]
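The classification logic above can be sketched as a small routine that applies a composite strategy to treatment-related ICEs (counting the ICE as an outcome event) and a hypothetical strategy to treatment-unrelated ICEs (censoring at the ICE time). This is an illustrative sketch of the classification idea, not the estimators of [77]; the ICE labels and record fields are hypothetical:

```python
def apply_ice_strategy(record, treatment_related=("adverse_event", "lack_of_efficacy")):
    """Derive (follow-up time, event indicator) for one subject.

    record: dict with 'event_time', 'event' (1 if the outcome was observed),
            'ice_time' and 'ice_type' (both None if no ICE occurred).
    Treatment-related ICEs -> composite strategy: the ICE counts as an event.
    Treatment-unrelated ICEs -> hypothetical strategy: censor at the ICE time.
    """
    ice_type, ice_time = record["ice_type"], record["ice_time"]
    if ice_type is None or ice_time is None or ice_time >= record["event_time"]:
        # No ICE, or the ICE did not occur before the outcome/censoring time
        return record["event_time"], record["event"]
    if ice_type in treatment_related:
        return ice_time, 1   # composite: treat the ICE as an outcome event
    return ice_time, 0       # hypothetical: censor at the unrelated ICE

# A patient who discontinued for an adverse event on day 120 of a planned
# 365-day follow-up is counted as having an event at day 120:
print(apply_ice_strategy({"event_time": 365, "event": 0,
                          "ice_time": 120, "ice_type": "adverse_event"}))  # -> (120, 1)
```

A patient who relocated on day 120 would instead be censored there, which is exactly the competing-ICE asymmetry the paragraph above describes.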

Table 4: Essential Analytical Tools for ICE Analysis in Pharmacoepidemiology

Tool Category Specific Methods Primary Application Key References
Causal Frameworks Estimand Framework (ICH E9[R1]), Target Trial Emulation Study design and objective specification [75] [50]
Confounding Control Propensity score matching, Inverse probability of treatment weighting, High-dimensional propensity scoring Addressing measured confounding in observational studies [28] [76] [5]
ICE Handling Methods Composite strategy, Hypothetical strategy (using IPCW), Treatment policy strategy Addressing specific ICE types according to classification [77] [50]
Sensitivity Analysis E-value analysis, Monte Carlo sensitivity analysis, Proxy approach Assessing robustness to unmeasured confounding and model assumptions [5]

Integrated Analytical Workflow for ICE Handling

The following diagram illustrates a comprehensive workflow for handling ICEs in pharmacoepidemiologic studies, integrating both the estimand framework and propensity score methods:

  • Phase 1 (Estimand Definition): define the target population and treatment conditions; specify the outcome variable and measurement timing; identify relevant ICEs and handling strategies
  • Phase 2 (Study Design): assemble the cohort with inclusion/exclusion criteria; measure baseline covariates for propensity score estimation; document ICE occurrence during the follow-up period
  • Phase 3 (Analysis): estimate propensity scores and assess balance; apply the ICE handling strategy per the estimand definition; estimate the treatment effect with appropriate uncertainty measures
  • Phase 4 (Sensitivity): assess unmeasured confounding via sensitivity analyses; evaluate robustness to model assumptions; interpret results in the context of the estimand framework

Handling intercurrent events appropriately is essential for generating valid evidence from pharmacoepidemiologic studies. The integration of the ICH E9(R1) estimand framework with robust propensity score methods provides a comprehensive approach for addressing these challenges. By pre-specifying ICE handling strategies through the estimand framework and implementing them using appropriate propensity score methods, researchers can produce more transparent, interpretable, and clinically relevant evidence from real-world data. Future methodological developments should focus on standardized approaches for classifying ICEs, handling competing events, and enhancing sensitivity analyses for unmeasured confounding in complex real-world settings.

In pharmacoepidemiological studies, estimating the causal effect of drug exposures on health outcomes is often challenged by confounding bias. Propensity score (PS) methods have emerged as powerful tools to address this issue by balancing observed baseline covariates between treated and untreated patients, thereby mimicking the conditions of a randomized experiment [10]. The validity of any PS analysis, however, hinges critically on the appropriate selection of covariates for the propensity score model. Mis-specification, including the inclusion of inappropriate variables or model overfitting, can introduce substantial bias or imprecision into treatment effect estimates [79] [80]. This application note provides detailed protocols for covariate selection, focusing specifically on the critical tasks of identifying and excluding instrumental variables and preventing model overfitting, within the context of pharmacoepidemiological research.

Scientific Foundation: Principles of Covariate Selection

Variable Types and Their Impact

The propensity score is defined as a patient's probability of receiving the treatment of interest conditional on their observed baseline covariates [10]. The goal of covariate selection is to include a sufficient set of variables to achieve conditional exchangeability between treatment groups without introducing new biases or statistical inefficiencies. Covariates can be categorized based on their relationships with the treatment and outcome, which determines their appropriateness for inclusion.

Table 1: Types of Covariates and Recommendations for Propensity Score Models

Covariate Type Relationship to Treatment & Outcome Inclusion Recommendation Rationale
True Confounder Associated with both treatment and outcome Always include Necessary to eliminate confounding bias [79]
Predictor of Outcome Only Associated with outcome but not treatment Always include Increases precision without increasing bias [79] [81]
Predictor of Treatment Only (Potential Instrument) Associated with treatment but not outcome Generally exclude Decreases precision without reducing bias; can increase variance [79] [81]
Risk Factor for Outcome Causes or proxies for outcome risk factors Include Helps control for prognostic differences
Pre-Treatment Measures Measured before treatment assignment Include Not affected by the treatment [81]
Post-Treatment Measures Measured after treatment assignment Exclude Could be affected by the treatment itself [81]

Theoretical Rationale for Excluding Instruments

Instrumental variables (IVs)—variables that predict treatment assignment but are independent of the outcome—should generally be excluded from PS models. Although IV methods constitute a distinct causal inference approach that deliberately exploits such variables, including IVs in a PS model is detrimental for both theoretical and practical reasons [82]. Theoretically, since IVs are unrelated to the outcome, they cannot confound the treatment-outcome relationship, so adjusting for them yields no bias reduction [79]. Practically, their inclusion can substantially increase the variability of the treatment effect estimate. This occurs because IVs create divisions in the data that do not correspond to prognostic differences, leading to inefficient comparisons and less precise effect estimates [79] [81]. In scenarios with limited sample size, this variance inflation can be severe enough to overwhelm any minimal bias reduction, resulting in higher mean squared error [80].

Protocol 1: Identifying and Handling Instrumental Variables

Experimental Workflow

The following diagram illustrates the systematic decision process for identifying and handling potential instrumental variables during covariate selection.

  • Start with a candidate covariate and ask: was it measured before treatment assignment? If no, EXCLUDE it as a post-treatment variable.
  • If yes, ask: is it associated with treatment selection?
    • If yes, ask: is it associated with, or does it cause, the outcome? If yes, INCLUDE it as a true confounder; if no, EXCLUDE it as an instrumental variable.
    • If no, assess its theoretical relationship to the outcome: INCLUDE it as an outcome predictor if a relationship exists, INCLUDE it if the relationship is uncertain, and EXCLUDE it if there is no relationship.

Diagram 1: Instrumental Variable Identification Workflow

Detailed Methodology

Step 1: Pre-Specify Covariates Using Subject-Matter Knowledge
  • Action: Before analyzing the data, compile a list of potential covariates based on clinical knowledge, literature review of the disease domain, and understanding of treatment prescribing patterns [83] [81].
  • Rationale: Data-driven variable selection increases the risk of including spurious relationships and instruments.
  • Documentation: Maintain a structured log documenting the theoretical justification for each candidate covariate, specifically noting the hypothesized relationship to both treatment assignment and outcome.
Step 2: Assess Temporal Priority
  • Action: Verify that all considered covariates were measured and recorded prior to the treatment assignment or exposure [81].
  • Rationale: Variables measured after treatment initiation may be mediators or consequences of the treatment, and adjusting for them can introduce bias by blocking causal pathways [81].
  • Protocol: Create a data timeline for each patient, marking the sequence of covariate measurement, treatment initiation, and outcome occurrence.
Step 3: Evaluate Covariate-Outcome Relationships
  • Action: For covariates strongly associated with treatment, formally assess their relationship with the outcome.
  • Statistical Testing: While preliminary analyses (e.g., univariate regression of each covariate on the outcome) can be informative, prioritize theoretical knowledge over statistical significance due to the multiple testing problem [81].
  • Decision Rule: If a covariate has a strong known association with treatment but no plausible direct effect on the outcome (e.g., physician prescribing preference, healthcare system characteristics), classify it as an instrumental variable and exclude it [81].
Step 4: Apply the Inclusion Principle When Uncertain
  • Action: When uncertainty exists about whether a treatment-associated covariate affects the outcome, default to inclusion rather than exclusion [81].
  • Rationale: Excluding a true confounder can perpetuate bias, while including a weak instrument has less severe consequences in large samples [79].

Protocol 2: Preventing Model Overfitting

Experimental Workflow

The following diagram outlines the protocol for preventing overfitting during propensity score model estimation.

  • Specify the initial PS model
  • Calculate the effective sample size
  • Apply the EPV rule (10-20 events per parameter)
  • Prioritize covariates by theoretical importance
  • Use regularization or machine learning if needed
  • Diagnose overfitting by checking covariate balance and standard-error inflation: on success (adequate balance and precise estimates), proceed; on failure (poor performance), revise the model

Diagram 2: Overfitting Prevention Protocol

Quantitative Guidance and Diagnostics

Overfitting occurs when a model captures random noise in the data rather than true underlying relationships, leading to poor performance in subsequent effect estimation. In PS analysis, overfitting can produce propensity scores that fail to balance covariates and inflate the variance of treatment effect estimates [80].

Table 2: Overfitting Prevention Strategies and Diagnostic Metrics

Strategy Operationalization Threshold/Guideline Interpretation
Events-per-Variable (EPV) Number of patients in the rarest treatment category divided by number of parameters in PS model EPV ≥ 10-20 [80] EPV < 10 indicates high overfitting risk; EPV < 5 indicates severe overfitting risk
Covariate Prioritization Rank covariates by theoretical importance as confounders Include strongest confounders first when limiting parameters Ensures most important confounders are adjusted even when sample size is limited
Regularization Methods Use penalized regression (e.g., lasso, ridge) or machine learning (e.g., boosting, random forests) Cross-validate hyperparameters Reduces overfitting by penalizing model complexity; particularly useful with many covariates [10]
Balance Diagnostics Calculate standardized mean differences after PS adjustment Absolute value < 0.1 [83] Good balance indicates well-specified model regardless of statistical fit
Variance Inflation Assessment Compare standard errors of treatment effect from overfit vs. parsimonious models Qualitative comparison Substantially larger SEs in complex model suggest overfitting [80]

Detailed Methodology

Step 1: Calculate Effective Sample Size
  • Action: Determine the number of patients in the smaller treatment group (either treated or untreated), as this represents the limiting sample for PS model estimation.
  • Example: In a study with 2,000 treated and 500 untreated patients, the effective sample size for EPV calculation is 500.
Step 2: Apply Events-per-Variable (EPV) Rule
  • Action: Divide the effective sample size by the number of parameters (including main effects and interaction terms) in the proposed PS model.
  • Protocol Adjustment: If EPV < 10-20, reduce the number of parameters by:
    • Removing lower-priority covariates (e.g., those with weaker confounding evidence)
    • Combining correlated covariates into composite scores
    • Using dimension reduction techniques before PS estimation
Step 3: Implement Regularization for High-Dimensional Data
  • Action: When many important confounders must be considered relative to sample size, use regularization techniques.
  • Options:
    • Penalized Logistic Regression: LASSO or ridge regression to shrink coefficient estimates
    • Machine Learning Algorithms: Boosted regression, random forests, or neural networks for propensity score estimation [10]
  • Validation: Use cross-validation to tune hyperparameters and avoid over-optimism.
Step 4: Diagnose Overfitting Empirically
  • Action: After PS application (matching, weighting, or stratification), assess whether overfitting has occurred.
  • Primary Diagnostic: Check covariate balance using standardized mean differences for all covariates [83].
  • Secondary Diagnostic: Compare the standard error of the treatment effect estimate from the PS-adjusted model to simpler models; substantial inflation suggests overfitting [80].
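Steps 1 and 2 can be combined into a simple calculator. A sketch, using the worked example above (2,000 treated, 500 untreated) with a hypothetical 40-parameter PS model; the "borderline" label reflects the 10-20 gray zone of Table 2:

```python
def events_per_variable(n_treated, n_control, n_parameters):
    """EPV for a propensity score model: size of the rarer treatment
    group divided by the number of model parameters (Steps 1-2)."""
    return min(n_treated, n_control) / n_parameters

def overfitting_risk(epv):
    """Qualitative risk per the thresholds in Table 2."""
    if epv < 5:
        return "severe"
    if epv < 10:
        return "high"
    if epv < 20:
        return "borderline"
    return "low"

# Worked example: 2,000 treated, 500 untreated, 40 PS-model parameters
epv = events_per_variable(2000, 500, 40)  # 500 / 40 = 12.5
risk = overfitting_risk(epv)              # "borderline"
```

An EPV of 12.5 suggests trimming lower-priority covariates or switching to a regularized model, per the protocol adjustments in Step 2.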

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Methodological Tools for Propensity Score Analysis

Research Tool Function Application Notes
Directed Acyclic Graphs (DAGs) Visual representation of causal assumptions Identifies minimal sufficient adjustment set; reveals instrumental variables and mediators [83]
Standardized Mean Differences Quantifies covariate balance between groups Primary diagnostic for PS model adequacy; target <0.1 for all covariates [83]
Events-per-Variable (EPV) Calculator Determines maximum sustainable model complexity Critical for preventing overfitting; implemented in R, SAS, or Stata
Regularized Regression Methods Estimates PS while preventing overfitting Essential for high-dimensional covariate situations; available via glmnet in R [10]
Machine Learning Packages Flexible PS estimation with cross-validation Boosted regression (twang package), random forests; can handle non-linearities [10]
Balance Assessment Software Comprehensive diagnostic tools cobalt package in R provides unified balance assessment across PS methods [83]

Performance Characteristics

The following table summarizes the expected performance outcomes when correctly implementing these covariate selection protocols compared to common pitfalls.

Table 4: Performance Characteristics of Covariate Selection Approaches

Selection Approach Bias Variance Mean Squared Error Balance Achievement
Including all confounders and outcome predictors Low Moderate Low High
Including instruments Unchanged or slightly reduced High High Variable
Excluding weak confounders Moderate to high Low Moderate to high Low to moderate
Overfit model (low EPV) Low to moderate Very high High Poor despite good apparent fit
Protocol-compliant selection Low Moderate Low High

Appropriate covariate selection is fundamental to valid causal inference using propensity score methods in pharmacoepidemiology. By systematically excluding instrumental variables and preventing model overfitting through the protocols outlined herein, researchers can produce more reliable estimates of drug effects from observational data. These application notes provide actionable guidance for implementing these critical methodological safeguards, contributing to more robust evidence generation in drug development and safety research.

Ensuring Validity: Balance Assessment, Framework Alignment, and Method Comparison

Gold Standards for Assessing Covariate Balance After PS Application

Propensity score (PS) methods are fundamental in pharmacoepidemiology to control for confounding in observational studies. The validity of these methods hinges on achieving adequate covariate balance between treatment groups after PS application. This protocol outlines gold-standard methodologies for assessing covariate balance, providing a structured framework for researchers to evaluate the success of propensity score adjustment. We detail diagnostic tools, quantitative metrics, visualization techniques, and interpretation guidelines essential for confirming that balanced distributions of baseline characteristics have been achieved, thereby strengthening causal inference in drug safety and effectiveness research.

In pharmacoepidemiology, propensity score (PS) methods have become a cornerstone for estimating treatment effects from observational data, where large, complex healthcare databases pose formidable analytic obstacles [1]. The propensity score, defined as the probability of treatment assignment conditional on observed baseline covariates, functions as a balancing score [10]. Its core property is that conditional on the true propensity score, the distribution of observed baseline covariates is similar between treated and untreated subjects [10].

The "propensity score tautology" asserts that we know we have a consistent estimate of the propensity score when matching on the propensity score balances the raw covariates [84]. Thus, the appropriateness of the specification of the propensity score is assessed by examining whether its application has resulted in a sample where the distribution of measured baseline covariates is similar between treatment groups [84]. Careful testing of propensity scores is required before using them to estimate treatment effects [85]. This document establishes standardized protocols for conducting these critical balance assessments.

Quantitative Balance Metrics

Standardized Mean Differences

The standardized mean difference (SMD) is the most widely recommended metric for assessing balance in baseline covariates after propensity score application. It quantifies the difference between group means in standardized units, making it comparable across covariates with different measurement scales [84].

Table 1: Standardized Mean Difference Interpretation Guidelines

SMD Value Balance Interpretation Recommended Action
< 0.1 Adequate balance Proceed to outcome analysis
0.1 - 0.25 Moderate imbalance Consider model respecification
> 0.25 Substantial imbalance Revise propensity score model

For a continuous covariate, the SMD is calculated as:

[ \text{SMD} = \frac{\bar{x}_{\text{treat}} - \bar{x}_{\text{control}}}{\sqrt{\frac{s^2_{\text{treat}} + s^2_{\text{control}}}{2}}} ]

where \(\bar{x}\) denotes a group mean and \(s^2\) a group variance [84]. For binary covariates, the SMD is calculated as the difference in proportions divided by the pooled standard deviation.

In small sample sizes, chance imbalance can cause large SMD deviations. Austin suggests examining the empirical distribution of the SMD to account for chance imbalance [86]. For studies with many covariates or small sample sizes, consider formally testing whether the SMD exceeds the 0.1 threshold rather than applying the threshold as a simple nominal cutoff [86].
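The SMD formulas above translate directly into code. A minimal sketch covering both the continuous and binary cases (the pooled-standard-deviation convention for the binary case follows the text above):

```python
import math

def smd_continuous(mean_t, mean_c, var_t, var_c):
    """Standardized mean difference for a continuous covariate."""
    return (mean_t - mean_c) / math.sqrt((var_t + var_c) / 2)

def smd_binary(p_t, p_c):
    """SMD for a binary covariate: difference in proportions over the
    pooled standard deviation of the two Bernoulli distributions."""
    pooled_sd = math.sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)
    return (p_t - p_c) / pooled_sd

# Age imbalance: means 52 vs 50 years, both variances 100
smd = smd_continuous(52.0, 50.0, 100.0, 100.0)  # -> 0.2, moderate imbalance
```

An SMD of 0.2 falls in the 0.1-0.25 band of Table 1, prompting consideration of model respecification.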

Advanced Distributional Diagnostics

Balance assessment should extend beyond means to include the entire covariate distribution. Variance ratios (ratio of variances in treated versus control groups) should be close to 1, with values outside 0.5-2.0 indicating potential imbalance [84]. Higher-order moments and interactions should also be assessed [84].
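The variance-ratio diagnostic is equally simple to compute. A minimal sketch, with the 0.5-2.0 acceptance band taken from the guidance above:

```python
def variance_ratio(var_t, var_c):
    """Ratio of treated to control variances for a continuous covariate."""
    return var_t / var_c

def variance_balanced(var_t, var_c, lo=0.5, hi=2.0):
    """True when the variance ratio falls within the conventional band."""
    return lo <= variance_ratio(var_t, var_c) <= hi

# Treated variance 4.0 vs control variance 2.0: ratio 2.0, at the edge
ok = variance_balanced(4.0, 2.0)
```

A ratio at or beyond the band edges signals that the PS model may be balancing means while leaving distributional spread mismatched.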

Table 2: Comprehensive Balance Assessment Metrics

Metric Category Specific Tests Target Value
Central Tendency Standardized mean differences, t-tests SMD < 0.1
Distributional Shape Variance ratios, quantile-quantile plots Ratio 0.5-2.0
Extreme Values Five-number summaries, boxplots Similar distributions
Multivariate Balance Interaction terms, higher-order moments Non-significant differences

Visual Balance Diagnostics

Visual assessments provide intuitive understanding of balance that complements quantitative metrics.

Love Plots

Love plots (also called balance plots) display SMD values for all covariates before and after propensity score application, providing an immediate overview of balance improvement [87]. They show at a glance whether each covariate meets the balance criteria used to judge the adjustment [87].

Distributional Plots

Covariate distribution plots including histograms, density plots, boxplots, and quantile-quantile (Q-Q) plots allow direct comparison of the entire distribution of key covariates between treatment groups before and after propensity score application [84]. If matching was successful, covariate distributions should be more similar in the matched sample than in the unmatched sample [87].

  • Calculate standardized mean differences
  • Compute variance ratios
  • Generate distribution plots (Q-Q plots, boxplots)
  • Create a Love plot
  • If all metrics fall within their thresholds, proceed to outcome analysis; otherwise, respecify the PS model and repeat the assessment

Covariate Balance Assessment Workflow

Implementation Protocols

Comprehensive Balance Assessment Protocol
  • Calculate standardized mean differences for all baseline covariates included in the propensity score model and other potential confounders [84]
  • Compute variance ratios for continuous covariates to assess distributional similarity beyond means [84]
  • Generate visual diagnostics including Love plots, Q-Q plots, and side-by-side boxplots [87] [84]
  • Assess higher-order terms and interactions if clinically relevant [84]
  • Formally test balance using appropriate statistical tests when sample size permits [86]
  • Document all findings including both pre- and post-adjustment balance metrics
Special Considerations for Pharmacoepidemiology
  • New-user designs: Anchor balance assessment at initial prescription dispensing after period of non-use [1]
  • High-dimensional covariates: With large numbers of covariates, use Bonferroni correction or statistical testing of SMD thresholds to avoid false rejections due to chance imbalance [86]
  • Database heterogeneity: In distributed network studies, assess balance within each database and through meta-analysis [86]
  • Target population: Consider whether balance is needed for average treatment effect (ATE) or average treatment effect on treated (ATT) populations [1]

The Scientist's Toolkit

Table 3: Essential Tools for Balance Assessment

Tool Category Specific Solutions Primary Function
Statistical Software R MatchIt, cobalt packages Implement propensity score matching and generate balance diagnostics [87]
Balance Metrics Standardized mean differences, variance ratios Quantify balance for continuous and categorical covariates [84]
Visualization Packages ggplot2, cobalt balance plots Create Love plots, distribution comparisons [87]
High-Dimensional PS High-dimensional propensity score, machine learning algorithms Address confounding when hundreds of covariates are available [1]

From the matched dataset, generate a Love plot (SMD visualization), distribution plots (boxplots, histograms), and Q-Q plots, then combine them into a comprehensive balance report.

Balance Visualization Framework

Rigorous assessment of covariate balance is not merely an optional diagnostic step but a fundamental requirement for valid causal inference using propensity score methods in pharmacoepidemiology. The gold standards outlined in this document—combining quantitative metrics with visual diagnostics—provide a comprehensive framework for evaluating the success of propensity score applications. Implementation of these protocols will enhance the credibility of observational drug studies and support more reliable decision-making in drug development and safety assessment. Future methodological developments will likely focus on improving balance assessment for high-dimensional covariates and complex longitudinal treatment regimens.

The increasing complexity of causal inference in pharmacoepidemiology demands rigorous frameworks to align observational research with clinical trial standards. This protocol details the integration of ICH E9(R1) estimands with target trial emulation (TTE) and advanced propensity score (PS) methods to enhance the validity of real-world evidence. The ICH E9(R1) framework provides a structured approach to precisely define treatment effects of interest, accounting for intercurrent events that complicate interpretation. Simultaneously, TTE creates a quasi-experimental structure within observational data by emulating a hypothetical randomized trial that would ideally answer the research question. When combined with sophisticated PS approaches like high-dimensional propensity score (hdPS) and machine learning techniques, these frameworks address critical biases including immortal time bias and residual confounding that frequently plague observational studies [6] [88].

This guidance document provides practical application notes and experimental protocols for implementing these integrated frameworks within pharmacoepidemiological studies, with specific examples drawn from recent research advancements. The structured approach ensures clarity in defining causal effects of interest while implementing robust methodological safeguards against common biases.

Conceptual Framework and Integration

Core Principles and Relationships

The integration of ICH E9(R1) estimands, target trial emulation, and propensity score methods creates a synergistic framework for robust causal inference. Estimands provide the precise mathematical definition of the treatment effect, including how intercurrent events are handled, while TTE provides the structural design that mimics a randomized controlled trial (RCT), and PS methods enable statistical adjustment to approximate randomization in observational data [88] [89].

Integration framework for causal inference:

  • ICH E9(R1) estimands supply the causal contrast definition (the estimand)
  • Target trial emulation (TTE) supplies the hypothetical target trial specification
  • Propensity score methods supply bias mitigation via PS adjustment
  • Together, these three components support valid causal inference from observational data

ICH E9(R1) Estimand Components in Observational Studies

The estimand framework comprises five attributes that collectively provide a precise definition of the treatment effect for a specific clinical question: treatment, population, variable, population-level summary, and handling of intercurrent events. In pharmacoepidemiology, intercurrent events such as treatment switching, discontinuation, or initiation of concomitant medications are particularly common and must be addressed through appropriate strategies [89].

Table: Estimand Attributes and Their Application in Pharmacoepidemiology

Estimand Attribute Definition Observational Study Considerations Example
Treatment Intervention and conditions being compared Clearly define exposure timing, dose, and duration Disease-modifying drugs vs. no treatment
Population Target population of interest Specify inclusion/exclusion criteria that can be implemented in real-world data Patients with multiple sclerosis, aged 18-65
Variable Outcome or endpoint Ensure accurate and consistent measurement in real-world data All-cause mortality
Population-level Summary How treatment effect is summarized Choose appropriate statistical measure Hazard ratio
Handling of Intercurrent Events Strategy for events affecting interpretation Implement causal methods to address informative censoring Treatment discontinuation handled via principal stratification
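One lightweight way to enforce the pre-specification discipline in the table above is to capture the five estimand attributes in a single structured record. A sketch, with illustrative field names and example values drawn from the multiple sclerosis study in this section:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Estimand:
    """Pre-specification of the five ICH E9(R1) estimand attributes."""
    treatment: str
    population: str
    variable: str
    population_summary: str
    intercurrent_event_strategies: dict  # ICE name -> handling strategy

ms_estimand = Estimand(
    treatment="DMD initiation vs. no DMD treatment",
    population="MS patients aged 18-65, DMD-naive",
    variable="all-cause mortality",
    population_summary="hazard ratio",
    intercurrent_event_strategies={"treatment discontinuation": "treatment policy"},
)
```

Because every field is required and the record is frozen, the estimand cannot be left implicit or quietly changed mid-analysis.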

Experimental Protocols

Protocol 1: Integrated Estimand Specification and TTE Design

This protocol establishes the foundation for aligning observational studies with the ICH E9(R1) framework through target trial emulation.

Objectives and Hypotheses
  • Primary Objective: To emulate the target trial of disease-modifying drugs (DMDs) on all-cause mortality in multiple sclerosis patients, defining the estimand according to ICH E9(R1) principles.
  • Hypothesis: Compared to no treatment, DMDs reduce all-cause mortality with a hazard ratio of 0.70 after addressing immortal time bias and residual confounding.
Target Trial Protocol
  • Eligibility Criteria: Define inclusion/exclusion criteria mirroring an ideal RCT, including: confirmed MS diagnosis, age 18-65, no prior DMD treatment, and minimum 1-year baseline data.
  • Treatment Strategies: Specify treatment arms as "initiation of DMDs" versus "no DMD treatment" with clear operational definitions for exposure.
  • Treatment Assignment: In the emulation, implement a pseudo-randomization process using hdPS to balance baseline covariates.
  • Outcome Definition: All-cause mortality ascertained from death registries with precise timing.
  • Follow-up Period: Define start (eligibility date) and end (death, end of study period, or 5 years) with intent-to-treat principle maintained.
  • Causal Contrast: Specify the comparison of survival outcomes between treatment strategies regardless of subsequent treatment changes (treatment policy strategy for intercurrent events) [6] [88].

Protocol 2: hdPS with Nested Case-Control for Bias Mitigation

This protocol combines high-dimensional propensity scores with a nested case-control design to simultaneously address immortal time bias and residual confounding, as demonstrated in multiple sclerosis research [6].

Cohort Definition and Follow-up
  • Data Source: Utilize administrative claims or electronic health records with sufficient sample size (e.g., 19,360 patients in the reference study).
  • Baseline Period: Establish a 1-year baseline period for assessing covariates prior to eligibility date.
  • Time-zero: Define cohort entry at the first date meeting all eligibility criteria (addressing immortal time bias by ensuring this occurs after exposure assessment).
  • Follow-up: Track patients from time-zero until outcome occurrence, censoring, or study end.
hdPS Implementation Steps
  • Covariate Selection: Identify candidate covariates across the available data dimensions (e.g., diagnoses, procedures, medications, and other coded domains).
  • Dimension Reduction: Apply autoencoder or other dimensionality reduction techniques for improved covariate balance compared to conventional hdPS [73].
  • PS Estimation: Calculate propensity scores using logistic regression or machine learning algorithms, including both investigator-specified and empirically identified covariates.
  • Covariate Balance: Assess balance using standardized mean differences (target <0.1 for all covariates).
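The balance check above can be sketched numerically. The following is a minimal illustration (not the referenced study's code) that computes covariate-wise standardized mean differences on simulated data and flags any covariate exceeding the 0.1 target; the covariate names are hypothetical:

```python
import numpy as np

def smd(x, treated):
    """Standardized mean difference of one covariate between treatment groups."""
    xt, xc = x[treated == 1], x[treated == 0]
    pooled_sd = np.sqrt((xt.var(ddof=1) + xc.var(ddof=1)) / 2.0)
    return abs(xt.mean() - xc.mean()) / pooled_sd

rng = np.random.default_rng(1)
n = 20000
treated = rng.binomial(1, 0.5, size=n)
balanced = rng.normal(size=n)                  # unrelated to treatment assignment
confounded = rng.normal(loc=0.5 * treated)     # mean shifted in the treated group

smds = {"balanced_cov": smd(balanced, treated),
        "confounded_cov": smd(confounded, treated)}
flagged = [name for name, v in smds.items() if v > 0.1]  # target: SMD < 0.1
print("covariates failing the 0.1 target:", flagged)
```

In a real analysis the same check would be run on every covariate both before and after PS adjustment.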
Nested Case-Control Implementation
  • Case Definition: Identify all patients experiencing the outcome event during follow-up.
  • Control Selection: Randomly select up to 4 controls per case from patients at risk at the time of the case event, matched on time-in-study and other potential confounders.
  • Analysis: Apply conditional logistic regression adjusted for the propensity score to estimate hazard ratios.
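The control-selection step can be sketched as follows. This is a simplified illustration of risk-set (incidence density) sampling on simulated follow-up data, matching on time-in-study only; the up-to-4-controls rule mirrors the protocol above:

```python
import numpy as np

def sample_risk_sets(time, event, m=4, seed=2):
    """For each case, sample up to m controls still at risk at the case's event time."""
    rng = np.random.default_rng(seed)
    risk_sets = []
    for i in np.flatnonzero(event == 1):
        # at risk: follow-up extends at least to the case's event time (case excluded)
        at_risk = np.flatnonzero((time >= time[i]) & (np.arange(len(time)) != i))
        k = min(m, len(at_risk))
        controls = rng.choice(at_risk, size=k, replace=False)
        risk_sets.append({"case": int(i), "controls": [int(c) for c in controls]})
    return risk_sets

rng = np.random.default_rng(3)
n = 500
time = rng.exponential(scale=5.0, size=n)      # follow-up time in years (simulated)
event = rng.binomial(1, 0.1, size=n)           # 1 = outcome occurred (case)
risk_sets = sample_risk_sets(time, event)
print(len(risk_sets), "cases sampled")
```

Each case-control set would then be analyzed with conditional logistic regression, adjusting for the propensity score as described above.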

Table: Comparison of Propensity Score Estimation Techniques

Method Covariates with SMD >0.1 Implementation Complexity Recommended Use Case
Investigator-specified 83 Low Studies with strong prior knowledge of confounders
High-dimensional PS (hdPS) 37 Medium Routine pharmacoepidemiology with claims data
Principal Component Analysis (PCA) 20 Medium-High High-dimensional covariate spaces
Autoencoders 8 High Optimal covariate balance prioritized
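As a rough illustration of the PCA row above, the sketch below reduces a simulated high-dimensional binary claims-like matrix to a handful of component scores via SVD; in a real analysis these scores (together with investigator-specified covariates) would feed the PS model. All dimensions and data here are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 1000, 200, 10                        # patients, claims codes, components kept
latent = rng.normal(size=(n, 5))               # unobserved clinical factors
loadings = rng.normal(size=(5, p))
probs = 1 / (1 + np.exp(-(latent @ loadings))) # code-occurrence probabilities
codes = rng.binomial(1, probs)                 # binary claims-style covariate matrix

centered = codes - codes.mean(axis=0)          # center columns before SVD
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt[:k].T                   # top-k component scores per patient

explained = (s[:k] ** 2).sum() / (s ** 2).sum()  # share of variance captured
print("score matrix:", scores.shape, "variance explained:", round(float(explained), 3))
```

An autoencoder-based variant would replace the linear SVD projection with a learned nonlinear encoder, which is what the cited comparison found to yield the best balance.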

Analytical Workflow

The complete analytical workflow integrates estimand specification, study design, and analytical methods to produce valid causal estimates.

Figure: Analytical Workflow for Integrated Framework
  • 1. Estimand Specification: define treatment, population, variable, summary, intercurrent events
  • 2. Target Trial Emulation: specify eligibility, treatment strategies, assignment, outcome, follow-up
  • 3. Data Structure: align observational data with target trial protocol
  • 4. Propensity Score Estimation: hdPS with dimensionality reduction or autoencoders
  • 5. Nested Case-Control Design: match cases to controls, addressing immortal time bias
  • 6. Effect Estimation: conditional logistic regression with sensitivity analyses
  • 7. Interpretation: align results with estimand and study limitations

The Scientist's Toolkit

Research Reagent Solutions

Table: Essential Methodological Components for Implementation

Component Function Implementation Example
High-dimensional Propensity Score (hdPS) Empirically identifies and adjusts for confounders in large healthcare databases Automatically selects covariates from diagnosis, procedure, and prescription data using predefined algorithms [6] [73]
Nested Case-Control Design Addresses immortal time bias and increases computational efficiency For each case, select 4 controls matched on time-in-study and other potential confounders [6]
Dimensionality Reduction Techniques Improves propensity score specification and covariate balance Autoencoders outperform conventional methods, reducing covariates with imbalance (SMD >0.1) from 83 to 8 [73]
Sensitivity Analysis Framework Tests robustness of findings to key assumptions Vary hdPS parameters, control matching strategies, and model specifications [6]
Reproducible Code Facilitates methodology adoption and verification Share R or Python scripts implementing the complete analytical pipeline [6]

Application Notes

Case Study: Disease-Modifying Drugs in Multiple Sclerosis

A recent study exemplifies the integrated framework, examining DMDs and all-cause mortality in 19,360 MS patients [6]. The implementation demonstrated:

  • Estimand Specification: Treatment effect defined as the difference in survival between DMD initiators and non-initiators, addressing intercurrent events through a treatment policy strategy.
  • Immortal Time Bias Resolution: Using the nested case-control design with appropriate time-zero definition.
  • Residual Confounding Control: hdPS achieved substantial confounding adjustment, with only 37 covariates remaining imbalanced (SMD >0.1) compared to 83 with investigator-specified PS.
  • Mortality Effect: The integrated approach demonstrated a 28% reduction in mortality risk (HR: 0.72, 95% CI: 0.62-0.84) associated with DMD exposure.

Performance Metrics and Validation

The robustness of this approach was validated through comprehensive sensitivity analyses:

  • Effect Consistency: Hazard ratios remained stable across different PS estimation techniques (HR range: 0.70-0.77).
  • Covariate Balance: Dimensionality reduction techniques, particularly autoencoders, achieved superior balance compared to conventional methods.
  • Bias Reduction: The combination of NCC and hdPS simultaneously addressed both time-related and confounding biases.

This structured approach provides pharmacoepidemiologists with a robust framework for generating evidence that aligns with regulatory standards while addressing the inherent limitations of observational data.

Within pharmacoepidemiology, researchers are tasked with estimating the effects of drugs and medical products using observational data, where treatment assignment is not random [28]. This reality introduces confounding by indication, a phenomenon where treatment decisions are influenced by a patient's prognosis, systematically distorting the relationship between treatment and outcome [1]. For decades, traditional multivariable regression has been the cornerstone for controlling confounding in observational studies. More recently, propensity score (PS) methods have emerged as a powerful alternative, offering a different approach to achieving causal inference [90].

This article provides application notes and protocols for comparing these two analytical frameworks. We frame this within a broader thesis on PS methods, positing that while traditional regression adjusts for confounders in the outcome model, PS methods excel by focusing first on modeling the treatment assignment process. This creates a design stage that separates covariate balancing from outcome analysis, potentially offering advantages in transparency, diagnostics, and handling of high-dimensional data [91] [81].

Theoretical Foundations and Key Differences

Conceptual Frameworks

The fundamental goal of both PS methods and traditional regression is to achieve conditional exchangeability—that is, to make the treatment and control groups comparable on all measured baseline covariates, as if randomization had occurred [1].

  • Traditional Regression Adjustment: This method models the outcome directly. It estimates the association between treatment and outcome while simultaneously adjusting for confounding covariates by including them in the outcome model. For a binary outcome, a logistic regression model might take the form: Outcome ~ Treatment + Covariate1 + Covariate2 + ... + CovariateK [90]. The model relies on correctly specifying the functional form between the outcome and each confounder.
  • Propensity Score Methods: The PS is defined as a patient's probability of receiving the treatment of interest, conditional on their measured baseline covariates [28]. It is estimated using a model (e.g., logistic regression) where treatment status is the outcome: Treatment ~ Covariate1 + Covariate2 + ... + CovariateK. The resulting PS, a single composite score, is then used to balance covariates across treatment groups via matching, weighting, or stratification [90]. This approach separates the design of the study (creating balanced groups) from the analysis of the outcome [91].
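The distinction between the two model formulas can be made concrete with a short sketch. The code below fits the treatment model (Treatment ~ Covariates) with a hand-rolled Newton-Raphson logistic fit on simulated data, deliberately avoiding any particular library API; note that the outcome never enters the PS model:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson (IRLS); returns (beta, fitted probs)."""
    Xd = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))     # current fitted probabilities
        W = p * (1.0 - p)                        # IRLS weights
        H = Xd.T @ (Xd * W[:, None]) + 1e-8 * np.eye(Xd.shape[1])
        beta += np.linalg.solve(H, Xd.T @ (y - p))
    return beta, 1.0 / (1.0 + np.exp(-Xd @ beta))

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                      # baseline covariates (simulated)
logit = 0.8 * X[:, 0] - 0.5 * X[:, 1]            # true assignment mechanism
t = rng.binomial(1, 1 / (1 + np.exp(-logit)))    # treatment indicator, not the outcome
beta, ps = fit_logistic(X, t)
print("mean PS:", round(ps.mean(), 3), "vs treatment prevalence:", round(t.mean(), 3))
```

At convergence the mean fitted PS equals the treatment prevalence (a property of maximum-likelihood logistic regression with an intercept), which is a quick sanity check on the fit.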

Comparative Advantages and Disadvantages

The core differences between the methods lead to distinct practical advantages and limitations, which are summarized in Table 1 below.

Table 1: Comparative advantages and disadvantages of traditional regression and propensity score methods.

Feature Traditional Regression Adjustment Propensity Score Methods
Primary Approach Adjusts for confounders in the outcome model [90] Models the treatment assignment process; separates design from analysis [91]
Dimensionality Adjusts for all confounders simultaneously in the outcome model; can be high-dimensional [28] Reduces multiple covariates to a single score for balancing (dimension reduction) [91]
Handling of Rare Outcomes Efficient with a sufficient number of outcome events per covariate (e.g., rule of 10) [28] Advantageous when outcome events are rare, as the treatment model is not limited by outcome rarity [28] [81]
Covariate Balance Assessment Balance of covariates between groups is typically not formally assessed after regression [91] Allows for direct checking of covariate balance between groups after matching/weighting (e.g., using standardized mean differences) [90] [28]
Extrapolation Can extrapolate using the linearity assumption, even in regions with no overlap [92] Limits analysis to regions of common support; avoids extrapolation by trimming non-overlapping PS tails [92] [81]
Key Assumptions Correct model specification (functional form, no interactions) for the outcome [90] Correct model specification for the treatment assignment; requires positivity and overlap [28]

A key conceptual difference lies in their approach to extrapolation. Regression inherently extrapolates based on its modeled assumptions, which can be problematic if the treatment and control groups are very different. In contrast, PS methods allow for a visual inspection of the score distributions, enabling researchers to identify and trim patients with no comparable counterparts in the other group, thus focusing the inference on the region of common support [92] [81].
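A minimal sketch of common-support trimming, using the observed min/max overlap rule on simulated PS values (the percentile-based variant mentioned later in this article works analogously):

```python
import numpy as np

def trim_to_common_support(ps, treated):
    """Keep only units whose PS lies inside both groups' observed PS range."""
    lo = max(ps[treated == 1].min(), ps[treated == 0].min())
    hi = min(ps[treated == 1].max(), ps[treated == 0].max())
    keep = (ps >= lo) & (ps <= hi)
    return keep, (lo, hi)

rng = np.random.default_rng(5)
n = 2000
treated = rng.binomial(1, 0.5, size=n)
# treated PS shifted upward, so the tails of each group lack counterparts
ps = np.clip(rng.beta(2 + 2 * treated, 4 - 2 * treated), 1e-3, 1 - 1e-3)
keep, (lo, hi) = trim_to_common_support(ps, treated)
print(f"kept {keep.sum()}/{n} patients in common support [{lo:.3f}, {hi:.3f}]")
```

Subsequent matching or weighting would then be applied only to the retained patients, so no inference rests on extrapolation outside the overlap region.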

Quantitative Performance Comparison

Empirical studies and simulations have been conducted to evaluate the relative performance of these methods. A clinical case study on gram-negative bloodstream infections directly compared four methods, as summarized in Table 2.

Table 2: Results from a clinical case study comparing methods for estimating the effect of IV-to-oral antibiotic transition on 30-day mortality (Adapted from [90]).

Analytical Method Odds Ratio for 30-Day Mortality Key Interpretation Notes
Multivariable Logistic Regression 0.84 Adjusted for confounders in the outcome model.
Propensity Score Matching (PSM) 0.84 Created matched cohorts with similar baseline characteristics.
Propensity Score Inverse Probability of Treatment Weighting (IPTW) 0.95 Created a pseudo-population; can be influenced by extreme weights.
Propensity Score Stratification 0.87 Stratified analysis by propensity score quintiles.

While the point estimates were broadly similar in this example, the authors noted relevant differences in interpretation, particularly for IPTW, which can be sensitive to patients with very high or low propensity scores [90]. A systematic review of artificial intelligence in pharmacoepidemiology found that in 50% of comparisons, machine learning techniques (including some for PS estimation) outperformed traditional pharmacoepidemiological methods [93].

Detailed Experimental Protocols

Protocol 1: Implementing Propensity Score Matching

Propensity Score Matching (PSM) is a widely used method to create a balanced cohort for analysis. The following workflow outlines the key steps, and Figure 1 provides a visual representation of the process.

  • Start: observational dataset
  • 1. Preprocessing and covariate selection (handle missing data, select pre-treatment covariates)
  • 2. Estimate propensity score (fit logistic model: Treatment ~ Covariates)
  • 3. Perform matching (e.g., 1:1 nearest neighbor with caliper)
  • 4. Assess covariate balance (calculate standardized mean differences); if balance is not achieved, respecify the model and return to step 2
  • 5. Estimate treatment effect (analyze outcome in matched cohort)
  • 6. Conduct sensitivity analysis (assess robustness to unmeasured confounding)

Figure 1: Propensity Score Matching Workflow

Step-by-Step Methodology:

  • Data Preprocessing and Covariate Selection:

    • Aim: Prepare a clean dataset with all necessary variables.
    • Actions: Identify and address outliers and missing values through imputation or other methods [91]. Select covariates for the PS model based on subject matter knowledge. Include variables that are risk factors for the outcome and/or associated with treatment assignment. Avoid variables that are consequences of the treatment [28] [81].
    • Tools: R packages: Hmisc, DMwR for imputation [91].
  • Propensity Score Estimation:

    • Aim: Model the probability of treatment assignment for each patient.
    • Actions: Typically, a logistic regression model is used where the dependent variable is treatment assignment (e.g., 1 for treated, 0 for control). The independent variables are the selected pretreatment covariates. The goal is to achieve balance, not prediction per se, so overfitting is less of a concern and interactions/higher-order terms can be included to improve balance [90] [28].
    • Tools: Logistic regression in R (glm), MatchIt package [91].
  • Matching:

    • Aim: Create a matched sample where treated and control units have similar PS distributions.
    • Actions: Use a matching algorithm such as 1:1 nearest-neighbor matching without replacement. A caliper (e.g., 0.2 standard deviations of the logit of the PS) is often applied to prevent poor matches [90]. This step may drop unmatched units, refining the sample to the region of common support.
    • Tools: MatchIt package in R [91].
  • Assessing Covariate Balance:

    • Aim: Diagnose whether the matching procedure successfully balanced the covariates.
    • Actions: Do not rely on p-values from significance tests. Instead, use standardized mean differences (SMDs) for each covariate. An SMD of less than 10% (or 0.1) generally indicates good balance [90] [28]. Generate love plots or other balance diagnostic charts.
    • Tools: tableone, cobalt packages in R [91]. If balance is inadequate, return to Step 2 and respecify the PS model.
  • Estimating the Treatment Effect:

    • Aim: Quantify the association between treatment and outcome in the balanced sample.
    • Actions: In the matched cohort, analyze the outcome using methods appropriate for the matched data structure. For a binary outcome, a simple logistic regression model (Outcome ~ Treatment) can be used. For a more robust estimate, use a model that adjusts for any residual imbalance in key covariates (doubly robust estimation) [90].
    • Tools: Standard regression models in R or Stata.
  • Sensitivity Analysis:

    • Aim: Assess how sensitive the findings are to potential unmeasured confounding.
    • Actions: Perform analyses such as testing how strong an unmeasured confounder would need to be to nullify the observed effect [91]. Trimming the tails of the PS distribution (e.g., excluding patients outside the 2.5th-97.5th percentiles of the opposite group's PS) is another sensitivity check for residual confounding [81].
    • Tools: Specialist R packages or functions.
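Steps 2-4 of the protocol can be sketched as follows: greedy 1:1 nearest-neighbor matching without replacement, with a caliper of 0.2 standard deviations of the logit of the PS. For brevity the PS values are simulated rather than fitted, and this is a plain-numpy illustration rather than the MatchIt implementation:

```python
import numpy as np

def match_nearest(ps, treated, caliper_sd=0.2, seed=6):
    """Greedy 1:1 nearest-neighbor matching without replacement on logit(PS)."""
    lp = np.log(ps / (1 - ps))                       # logit of the propensity score
    caliper = caliper_sd * lp.std(ddof=1)            # 0.2 SD of the logit PS
    t_idx = np.flatnonzero(treated == 1)
    c_idx = list(np.flatnonzero(treated == 0))
    rng = np.random.default_rng(seed)
    rng.shuffle(t_idx)                               # random processing order
    pairs = []
    for i in t_idx:
        if not c_idx:
            break
        dists = np.abs(np.asarray([lp[c] for c in c_idx]) - lp[i])
        j = int(np.argmin(dists))
        if dists[j] <= caliper:                      # enforce caliper, else drop the unit
            pairs.append((int(i), int(c_idx.pop(j))))  # without replacement
    return pairs, caliper

rng = np.random.default_rng(7)
n = 1000
treated = rng.binomial(1, 0.4, size=n)
ps = np.clip(rng.beta(2 + treated, 3 - treated), 0.01, 0.99)
pairs, caliper = match_nearest(ps, treated)
print(f"{len(pairs)} matched pairs, caliper {caliper:.3f} on the logit scale")
```

After matching, the balance diagnostics of step 4 (SMDs on the matched sample) decide whether the PS model needs respecification.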

Protocol 2: Traditional Multivariable Regression Adjustment

This protocol serves as a benchmark comparison for the PSM approach.

Step-by-Step Methodology:

  • Model Specification:

    • Aim: Define the regression model that relates the outcome to the treatment and confounding variables.
    • Actions: Specify a model where the outcome is a function of the treatment indicator and all pre-specified confounders. For a binary outcome, this is typically a logistic model: Outcome ~ Treatment + Covariate1 + Covariate2 + ... + CovariateK. The functional form (e.g., linear, squared terms) for continuous covariates must be considered [90].
  • Model Fitting and Assumption Checking:

    • Aim: Fit the model and check for violations of regression assumptions.
    • Actions: Fit the model to the entire dataset (no matching). Check for multicollinearity among covariates and assess model fit using appropriate diagnostics (e.g., residual plots for linear models). Ensure the number of outcome events per variable is adequate to prevent overfitting (a common rule of thumb is 10-15 events per variable) [90] [28].
  • Treatment Effect Estimation:

    • Aim: Obtain the adjusted estimate of the treatment effect.
    • Actions: Extract the coefficient for the treatment variable and its confidence interval from the fitted model. This represents the log-odds ratio (for logistic regression) of the outcome associated with the treatment, adjusted for all other covariates in the model.
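For comparison, the benchmark analysis reduces to fitting one outcome model and exponentiating the treatment coefficient. The sketch below uses a hand-rolled Newton-Raphson logistic fit on simulated data with a built-in treatment effect; it is illustrative only, not the case-study analysis:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic fit by Newton-Raphson; returns coefficients."""
    Xd = np.column_stack([np.ones(len(X)), X])       # intercept plus predictors
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-Xd @ beta))
        H = Xd.T @ (Xd * (p * (1 - p))[:, None]) + 1e-8 * np.eye(Xd.shape[1])
        beta += np.linalg.solve(H, Xd.T @ (y - p))
    return beta

rng = np.random.default_rng(8)
n = 3000
x = rng.normal(size=(n, 2))                          # measured confounders
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))      # treatment depends on x1
# outcome depends on treatment (true log-OR = -0.4) and the same confounders
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.4 * t + 0.8 * x[:, 0] + 0.3 * x[:, 1]))))

design = np.column_stack([t, x])                     # Outcome ~ Treatment + Covariates
beta = fit_logistic(design, y)
odds_ratio = np.exp(beta[1])                         # exponentiated treatment coefficient
print(f"adjusted OR for treatment: {odds_ratio:.2f}")
```

Note that, unlike the PSM workflow, the entire sample is used and no balance diagnostic is produced; validity rests on the outcome model being correctly specified.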

Successful implementation of these methods requires both statistical software and methodological rigor. Table 3 lists key "research reagents" for a pharmacoepidemiologist.

Table 3: Essential tools and resources for comparative analyses of PS methods and traditional regression.

Tool / Resource Type Function / Purpose Key Considerations
R Statistical Software Software Environment Free, open-source platform for statistical computing and graphics. Supports all PS methods and traditional regression. Extensive package ecosystem (e.g., MatchIt, cobalt) [91]. Steep learning curve but highly flexible.
Stata Software Environment Commercial software popular in epidemiology and economics. Has dedicated modules for PS analysis (e.g., psmatch2). Often favored for its reproducibility and command-based interface [91] [94].
MatchIt Package (R) Software Tool Implements a variety of matching methods for causal inference, including PSM. Simplifies the process of matching and subsequent balance checking [91]. A core package for PSM in R.
cobalt Package (R) Software Tool Designed for covariate balance assessment and presentation. Provides superior plots and tables for balance diagnostics (e.g., love plots) after matching or weighting [91].
Standardized Mean Difference (SMD) Metric Quantifies the balance of a covariate between treatment groups, standardized by the pooled standard deviation. The primary metric for assessing balance in PS analyses. SMD <0.1 indicates good balance [90] [28].
"Table One" Reporting Standard A table presenting the baseline characteristics of the study population, overall and by treatment group. Essential for both before and after matching/weighting to demonstrate the success of the balancing method [91].

Both traditional regression adjustment and propensity score methods are valid approaches for controlling confounding in pharmacoepidemiological studies. The choice between them is not a matter of one being universally superior, but rather depends on the specific research context, data structure, and analytical goals.

PS methods offer distinct advantages in their ability to visually inspect and ensure covariate balance, to separate study design from analysis, and to handle scenarios with rare outcomes or a large number of confounders. Traditional regression remains a powerful, efficient, and straightforward tool, particularly when the model is correctly specified and sample sizes are large.

The modern pharmacoepidemiologist should be proficient in both methodologies. The protocols and tools outlined herein provide a foundation for conducting rigorous comparative analyses, ultimately leading to more reliable evidence on the real-world effects of medicinal products.

In pharmacoepidemiological research, the validity of findings from non-randomized, post-market safety studies is paramount. The propensity score (PS), defined as the probability of a patient receiving a specific treatment conditional on their measured baseline covariates, has become a cornerstone method for controlling for confounding in such analyses [28]. By balancing the distribution of covariates between treated and reference groups, PS methods emulate the random allocation of a clinical trial, thus providing more reliable estimates of treatment effects on safety outcomes in observational data [1] [28]. This document outlines application notes and detailed protocols for implementing propensity score methods, framed within the context of recent case studies from the pharmacoepidemiological literature.

Contemporary Practice in Safety Studies: A Case Review

A recent review of 25 post-marketing safety studies published in 2020 provides a snapshot of current methodological practices [95] [96]. The findings demonstrate the integral role of confounding control methods in the field.

Table 1: Characteristics of Recent Pharmacoepidemiologic Safety Studies (2020)

Study Characteristic Category Number of Studies (N=25) Percentage
Study Design Cohort Studies 19 76%
Nested Case-Control Studies 6 24%
Primary Confounder Control Method Propensity Score (PS) Methods 7 (of 19 cohort) ~37%
Covariate Adjustment via Modeling 9 (of 19 cohort) ~47%
Both PS & Covariate Adjustment 2 (of 19 cohort) ~11%
Matching & Covariate Adjustment 6 (of 6 case-control) 100%

This review confirms that all recent studies employed robust methods to control for confounding, with propensity score techniques being a prevalent choice [95]. Furthermore, the analysis through the lens of the ICH E9(R1) estimand framework revealed that while studies consistently defined key attributes like treatment, outcome, and population, the handling of intercurrent events (ICEs)—such as drug discontinuation or treatment switching—was often discussed without using the formal ICE terminology. This highlights an area where methodological reporting can be further standardized [95].

Experimental Protocols for Propensity Score Application

The following protocol provides a step-by-step guide for implementing propensity score methods in a pharmacoepidemiologic safety study.

Protocol 1: Propensity Score-Based Analysis for a Post-Market Safety Study

Objective: To estimate the comparative risk of a specific safety outcome (e.g., gastrointestinal bleeding) between initiators of a new drug (e.g., COX-2 inhibitors) and initiators of an active comparator (e.g., non-selective NSAIDs) using real-world data.

Key Assumptions:

  • Exchangeability: All common causes of treatment assignment and the outcome are measured in the data (no unmeasured confounding).
  • Positivity: Every patient has a non-zero probability of receiving either treatment, conditional on their covariates.
  • SUTVA: The treatment and outcome of one patient are not affected by the treatment assignment of another [28].

Procedure:

  • Study Population Definition & Design

    • Implement a new-user (incident user) design to minimize selection bias. Define the cohort entry date as the first prescription dispensing of either the study drug or comparator after a defined period of non-use [1].
    • Clearly define the target population based on the clinical question of interest. Common targets are the sample average treatment effect (SATE) or the sample average treatment effect in the treated (SATT) [1].
  • Covariate Selection and PS Model Building

    • Identify Covariates: Select pre-treatment covariates (e.g., demographics, comorbidities, prior medications, healthcare utilization) that are risk factors for the outcome and/or associated with treatment selection. This should be guided by subject matter knowledge and clinical expertise [1] [28].
    • Model Specification: Estimate the propensity score using a logistic regression model where the dependent variable is treatment assignment (e.g., 1=COX-2 inhibitor, 0=NSAID) and the independent variables are the selected pre-treatment covariates.
    • Goal: The goal is not prediction but covariate balance. Avoid using model fit statistics (e.g., c-statistic) to guide covariate selection [1].
  • Implementing the Propensity Score

    • Choose one of the following primary methods:
      • Matching: Create matched sets (e.g., 1:1 nearest-neighbor matching within a caliper) of treated and reference patients with similar PS. Discard unmatched patients.
      • Weighting: Create a pseudo-population using inverse probability of treatment weights (IPTW). The weight for a patient is defined as 1/PS for the treated and 1/(1-PS) for the reference group.
      • Stratification: Stratify the entire study population into quintiles or deciles based on the PS. Analyze outcomes within each stratum.
  • Assessing Covariate Balance

    • After applying the PS (via matching, weighting, or stratification), assess the balance of all pre-treatment covariates between the treatment groups.
    • Use standardized mean differences (SMD), where a value < 0.1 after adjustment generally indicates good balance. Visually inspect the distribution of covariates and PS before and after adjustment.
  • Outcome Analysis

    • In the balanced sample (matched, weighted, or stratified), estimate the association between treatment and the safety outcome.
    • For time-to-event outcomes (e.g., time to GI bleed), use a Cox proportional hazards model. For binary outcomes, use a logistic model. Incorporate the PS method used (e.g., use a robust variance estimator for IPTW models; include stratification variables in the outcome model).
  • Sensitivity Analysis

    • Conduct analyses to assess the robustness of findings, such as using different PS models, calipers for matching, or accounting for unmeasured confounding.
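The weighting option in step 3, together with the balance check in step 4, can be sketched as follows. For compactness the true PS is used directly (in practice it would be estimated), and the single baseline covariate is simulated:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 20000
x = rng.normal(size=n)                                # one baseline confounder
true_ps = 1 / (1 + np.exp(-x))                        # assignment depends on x
t = rng.binomial(1, true_ps)

# ATE weights: 1/PS for the treated, 1/(1-PS) for the reference group
w = np.where(t == 1, 1 / true_ps, 1 / (1 - true_ps))

def weighted_smd(x, t, w):
    """SMD of x between groups in the weighted pseudo-population."""
    def wstats(mask):
        m = np.average(x[mask], weights=w[mask])
        v = np.average((x[mask] - m) ** 2, weights=w[mask])
        return m, v
    m1, v1 = wstats(t == 1)
    m0, v0 = wstats(t == 0)
    return abs(m1 - m0) / np.sqrt((v1 + v0) / 2)

raw = weighted_smd(x, t, np.ones(n))                  # imbalance before weighting
adj = weighted_smd(x, t, w)                           # imbalance after IPTW
print(f"SMD before weighting: {raw:.3f}; after IPTW: {adj:.3f}")
```

The outcome analysis would then run on this weighted pseudo-population with a robust variance estimator, as noted in step 5; extreme weights should be inspected and, if necessary, truncated as a sensitivity analysis.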

The workflow for this protocol is summarized in the diagram below.

  • 1. Define study question and target population
  • 2. Implement new-user design
  • 3. Select pre-treatment covariates
  • 4. Estimate propensity score (logistic regression)
  • 5. Apply PS (matching, weighting, or stratification)
  • 6. Assess covariate balance (standardized mean differences); if balance is inadequate, return to step 4 and respecify the model
  • 7. Analyze safety outcome in balanced sample
  • 8. Conduct sensitivity analyses

The Scientist's Toolkit: Essential Reagents & Materials

Successful implementation of propensity score analyses requires both conceptual and technical tools. The following table details key "research reagents" for this field.

Table 2: Essential Research Reagents for Propensity Score Analysis

Item / Concept Category Function / Explanation
High-Dimensional Propensity Score (hdPS) Algorithm An automated algorithm to select and adjust for a large number of candidate covariates from administrative databases, acting as a proxy to reduce unmeasured confounding [1].
Standardized Mean Difference (SMD) Diagnostic Metric A statistical measure used to quantify the balance of a covariate between treatment groups after PS application, independent of sample size. An SMD <0.1 indicates good balance.
Inverse Probability of Treatment Weighting (IPTW) PS Application Method A weighting technique that creates a pseudo-population where the distribution of measured covariates is independent of treatment assignment, allowing for direct estimation of marginal risk differences [28].
Stable Unit Treatment Value Assumption (SUTVA) Core Assumption The fundamental assumption that one patient's outcome is unaffected by the treatment assignment of another patient, and that there is only one version of each treatment [1] [28].
Estimand Framework (ICH E9(R1)) Regulatory & Conceptual Framework A structured approach to ensure alignment between a study's objective and its design, analysis, and interpretation. It precisely defines how to handle intercurrent events (e.g., treatment discontinuation) when estimating the treatment effect [95].

Advanced Application: Integrating the Estimand Framework

Modern pharmacoepidemiology is moving towards integrating the ICH E9(R1) estimand framework with established causal inference methods like propensity scores. This integration brings clarity to the definition of the causal question, particularly regarding intercurrent events (ICEs) [95].

An ICE-handling strategy must be pre-specified. The following diagram illustrates the logical relationship between a common ICE, strategies to handle it, and the corresponding scientific question, all within a study employing PS for confounding control.

  • Intercurrent event (ICE), e.g., treatment discontinuation
  • Treatment policy strategy: What is the effect of treatment assignment in the real world, regardless of adherence?
  • Hypothetical strategy: What would the effect be if all patients had remained on treatment?
  • While-on-treatment strategy: What is the effect while patients are actively treated, before discontinuation?

For example, to answer a question aligned with the treatment policy strategy (addressing the effect of treatment assignment regardless of discontinuation), the outcome analysis would include events that occur after the ICE, with no special adjustment. In contrast, a hypothetical strategy might require the use of complex g-methods like inverse probability of censoring weighting to adjust for post-baseline factors associated with the ICE [95].

The growing use of real-world evidence (RWE) in pharmacoepidemiology has created an urgent need for standardized approaches that ensure methodological rigor and reproducibility. Simultaneously, advances in propensity score (PS) methods have provided powerful tools to address confounding in observational studies, but their application remains inconsistent. This application note explores the integration of two harmonized protocol templates—STaRT-RWE and HARPER—with advanced propensity score methods to create a robust framework for generating reliable real-world evidence on treatment effects. These frameworks address critical gaps in transparency and reproducibility while providing structured guidance for implementing complex confounding adjustment techniques [97] [98] [99].

The synergy between standardized protocols and sophisticated analytic methods represents a path forward for pharmacoepidemiological research. By embedding advanced propensity score techniques within structured research protocols, investigators can enhance study validity, improve communication of methodological choices, and strengthen the evidence base for healthcare decision-making [4] [41].

Understanding the Protocol Frameworks

The STaRT-RWE Framework

The Structured Template for Planning and Reporting on the Implementation of Real World Evidence Studies (STaRT-RWE) was developed by a public-private consortium to guide the design and conduct of reproducible RWE studies. This template serves as a comprehensive tool for communicating study methods with sufficient specificity to reduce misinterpretation [98]. The framework is compatible with multiple study designs, data sources, reporting guidelines, and bias assessment tools, making it particularly valuable for complex pharmacoepidemiologic studies investigating treatment safety and effectiveness.

STaRT-RWE addresses a critical need in the field, as evidenced by a recent systematic review which uncovered significant deficiencies in RWE study reporting. This review found that studies inadequately reported empirically defined covariates, power and sample size calculation, attrition, sensitivity analyses, and other key methodological parameters [100]. By providing a structured approach to documenting these elements, STaRT-RWE has the potential to substantially improve the robustness and credibility of RWE studies.

The HARPER Template

The HARmonized Protocol Template to Enhance Reproducibility (HARPER) emerged from a joint task force convened by the International Society for Pharmacoepidemiology (ISPE) and ISPOR–The Professional Society for Health Economics and Outcomes Research. This template builds upon existing efforts, including STaRT-RWE, to create a harmonized protocol structure specifically for RWE studies that evaluate treatment effects [97] [101] [99].

HARPER is designed to create a shared understanding of intended scientific decisions through a common text, tabular, and visual structure. The template's over-arching principle is to achieve sufficient clarity regarding data, design, analysis, and implementation to help investigators thoroughly consider and document their choices and rationale for key study parameters that define the causal question [99]. The template includes nine main sections with structured free text, tables, and figures, encouraging researchers to provide context and rationale for investigative decisions [101].

Table: Comparison of STaRT-RWE and HARPER Protocol Templates

| Feature | STaRT-RWE | HARPER |
| --- | --- | --- |
| Developer | Public-private consortium | Joint ISPE/ISPOR Task Force |
| Primary Focus | Planning and reporting RWE studies | Enhancing reproducibility of RWE studies |
| Structure | Guided template with tables, design diagram | Common text, tabular, and visual structure |
| Key Innovation | Library of published studies | Integration of rationale and context for decisions |
| Compatibility | Multiple study designs, data sources | Builds on existing templates including STaRT-RWE |
| Implementation Status | Used for critical appraisal of published studies | Pilot testing with international stakeholders |

The Role of Standardized Protocols in Propensity Score Applications

Addressing Methodological Heterogeneity

Standardized protocols like STaRT-RWE and HARPER play a crucial role in addressing the methodological heterogeneity that currently plagues propensity score applications in pharmacoepidemiology. These templates provide structured guidance for documenting key decisions in the propensity score process, including variable selection, model specification, and implementation approach (matching, weighting, or stratification) [102].

Recent research demonstrates the significant impact of propensity score methodological choices on study results. For example, a 2025 study compared various dimensionality reduction techniques for propensity score estimation in high-dimensional claims data and found that autoencoder-based PS achieved the best covariate balance (8 covariates with SMD > 0.1), followed by principal component analysis (PCA) (20 covariates), logistic PCA (25 covariates), high-dimensional propensity score (hdPS) (37 covariates), and investigator-specified approaches (83 covariates) [4]. Without standardized documentation of these methodological choices, as facilitated by STaRT-RWE and HARPER, comparing results across studies and assessing their validity becomes challenging.

Enhancing Causal Inference Transparency

The HARPER template specifically enhances transparency in causal inference by requiring researchers to explicitly define their target estimand and align their analytic methods with this causal question [102] [41]. This is particularly important for propensity score weighting methods, where different approaches target different populations and estimands.

For example, overlap weighting has gained popularity for its ability to produce bounded, stable weights and achieve exact mean covariate balance. However, this method specifically targets the average treatment effect in the overlap population (ATO): a statistically defined subgroup of patients whose treatment probabilities cluster near 0.5 and who can therefore be regarded as being in clinical equipoise [41]. The HARPER protocol requires researchers to explicitly justify why this estimand is appropriate for their research question, preventing misinterpretation of results that may not generalize to the entire study population.
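The contrast between overlap weights and conventional inverse-probability weights can be illustrated with a minimal NumPy sketch. The propensity scores below are simulated for illustration (not estimated from real data); the sketch only demonstrates the bounded-weight property described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
e = rng.uniform(0.05, 0.95, n)        # simulated propensity scores (illustrative)
z = rng.binomial(1, e)                # simulated treatment assignment

# Overlap weights: treated units receive 1 - e, untreated units receive e,
# so patients near clinical equipoise (e ~ 0.5) contribute the most.
ow = np.where(z == 1, 1 - e, e)

# ATE-targeting inverse-probability weights, shown for contrast: these can
# explode as e approaches 0 or 1, whereas overlap weights stay within (0, 1).
ipw = np.where(z == 1, 1 / e, 1 / (1 - e))

print(f"overlap weight range: ({ow.min():.2f}, {ow.max():.2f})")
print(f"IPW maximum: {ipw.max():.1f}")
```

Because no overlap weight can exceed 1, no single patient with an extreme propensity score can dominate the weighted analysis, which is the stability property that motivates documenting the ATO choice explicitly.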

Advanced Propensity Score Methods: Case Studies and Protocols

Dimensionality Reduction Techniques for Propensity Scores

High-dimensional healthcare data presents both opportunities and challenges for propensity score estimation. A 2025 cohort study compared various dimensionality reduction techniques for improving propensity score specification when studying the association between dialysis and mortality in older patients with heart failure and advanced chronic kidney disease [4].

Table: Performance Comparison of Propensity Score Methods in Achieving Covariate Balance

| Propensity Score Method | Covariates with SMD > 0.1 | Key Features | Applications |
| --- | --- | --- | --- |
| Autoencoder-based PS | 8 | Best performance in covariate balance | High-dimensional claims data |
| Principal Component Analysis (PCA) | 20 | Linear dimensionality reduction | Large-scale pharmacoepidemiology |
| Logistic PCA | 25 | Nonlinear dimensionality reduction | Complex confounding patterns |
| High-dimensional Propensity Score (hdPS) | 37 | Automated variable selection | Routine healthcare data |
| Investigator-specified | 83 | Subject-matter knowledge | Targeted confounding adjustment |

The study utilized Optum's de-identified Clinformatics Data Mart Database and included 485 dialysis-exposed and 1,455 unexposed individuals after matching. While hazard ratios for in-hospital mortality were similar across PS methods, the substantial differences in achieved covariate balance highlight the importance of method selection and documentation [4].

Experimental Protocol: Autoencoder-Based Propensity Score Estimation

Research Reagent Solutions:

  • Data Source: Optum's de-identified Clinformatics Data Mart Database or equivalent claims data
  • Software: R or Python with keras/tensorflow for autoencoder implementation
  • Balance Assessment: Standardized Mean Differences (SMD) calculation package
  • Matching Algorithm: Optimal matching with caliper restriction

Methodology:

  • Data Preparation: Structure data according to the OMOP Common Data Model or similar standardized structure
  • Feature Selection: Identify high-dimensional covariates including diagnoses, procedures, medications
  • Autoencoder Architecture: Implement a symmetric neural network with bottleneck layer for dimensionality reduction
  • Propensity Score Estimation: Train logistic regression model using encoded features as inputs
  • Matching: Perform 1:3 matching without replacement using nearest neighbor algorithm
  • Balance Assessment: Calculate SMD for all covariates before and after matching
  • Outcome Analysis: Apply Cox proportional hazards model to estimate hazard ratios

This protocol can be documented within the HARPER template's "Data Analysis" section, with detailed rationale provided for each architectural decision and parameter specification [4] [102].
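The estimation steps of this protocol can be sketched end to end. The sketch below is illustrative only: it simulates claims-style binary covariates, substitutes a truncated SVD for the autoencoder bottleneck (the pipeline shape is otherwise the same), fits the propensity score with a hand-rolled logistic regression on the encoded features, and applies the SMD balance check. All data, dimensions, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 500, 40, 5                          # patients, raw covariates, encoded dims

X = rng.binomial(1, 0.2, (n, p)).astype(float)            # claims-style indicators
z = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - X[:, 1]))))  # simulated treatment

# Dimensionality reduction: truncated SVD stands in for the autoencoder bottleneck.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
H = Xc @ Vt[:k].T                             # encoded features

# Propensity score estimation: logistic regression on the encoded features,
# fit by simple gradient ascent on the log-likelihood.
B = np.c_[np.ones(n), H]
beta = np.zeros(k + 1)
for _ in range(2000):
    e = 1 / (1 + np.exp(-B @ beta))
    beta += 0.1 * B.T @ (z - e) / n
ps = 1 / (1 + np.exp(-B @ beta))

# Balance assessment: standardized mean difference for each raw covariate.
def smd(x, z):
    m1, m0 = x[z == 1].mean(), x[z == 0].mean()
    v = 0.5 * (x[z == 1].var(ddof=1) + x[z == 0].var(ddof=1))
    return abs(m1 - m0) / np.sqrt(v) if v > 0 else 0.0

smds = np.array([smd(X[:, j], z) for j in range(p)])
print(f"{(smds > 0.1).sum()} of {p} covariates imbalanced before matching")
```

In a full implementation the SMDs would be recomputed after the 1:3 matching step, and the before/after comparison would be reported in the balance-assessment section of the protocol.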

Addressing Complex Biases with Integrated Designs

Advanced propensity score applications often require integrated designs to address multiple biases simultaneously. A 2025 study in multiple sclerosis research implemented a high-dimensional propensity score approach within a nested case-control framework to address both immortal time bias and residual confounding when examining the relationship between disease-modifying drugs and all-cause mortality [6].

The study used a retrospective cohort of 19,360 individuals with multiple sclerosis in British Columbia, Canada. The nested case-control analysis addressed immortal time bias, while hdPS was applied to handle residual confounding. This integrated approach demonstrated a 28% reduction in mortality risk associated with exposure to DMDs (HR: 0.72, 95% CI: 0.62-0.84) [6].

Retrospective Cohort (n=19,360) → Nested Case-Control Sampling → Address Immortal Time Bias → High-Dimensional Propensity Score → Matching (1:4 Ratio) → Survival Analysis (Cox Model) → Effect Estimate: HR 0.72 (0.62-0.84)

Diagram: Integrated NCC-hdPS Framework for Addressing Multiple Biases

Experimental Protocol: Integrated NCC-hdPS Design

Research Reagent Solutions:

  • Cohort Data: Retrospective cohort with comprehensive follow-up (e.g., administrative claims, electronic health records)
  • hdPS Algorithm: Automated variable selection for confounder identification
  • Matching Framework: Incidence density sampling for case-control selection
  • Sensitivity Analysis Package: Multiple parameter specifications for robustness testing

Methodology:

  • Cohort Definition: Apply inclusion/exclusion criteria to define study population
  • Case Identification: Identify all cases with outcome of interest during follow-up
  • Control Sampling: Select up to 4 controls per case using incidence density sampling
  • hdPS Implementation:
    • Identify candidate covariates from data dimensions (diagnoses, procedures, medications)
    • Apply predetermined algorithm for variable prioritization
    • Estimate propensity scores using selected covariates
  • Matching: Implement propensity score matching within NCC framework
  • Outcome Analysis: Conditional logistic regression to estimate odds ratios
  • Sensitivity Analyses: Test robustness across different hdPS parameters and matching strategies

This integrated design can be comprehensively documented using the STaRT-RWE template's structured approach to study design specification, particularly through the use of design diagrams and detailed variable definitions [6] [98].
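The control-sampling step of this design can be sketched in a few lines. The sketch below uses simulated follow-up data and hypothetical identifiers; it illustrates incidence density sampling only, in which each case is matched to up to four subjects still at risk at the case's event time.

```python
import random

random.seed(0)

# Simulated follow-up data: (id, end of follow-up, had the event of interest).
cohort = [(i, random.uniform(1, 10), random.random() < 0.1) for i in range(200)]
cases = [(i, t) for i, t, ev in cohort if ev]

matched_sets = []
for case_id, t in cases:
    # Risk set: everyone still under follow-up at the case's event time,
    # excluding the case itself. Subjects who become cases later remain
    # eligible as controls now (incidence density sampling).
    risk_set = [i for i, ft, _ in cohort if ft >= t and i != case_id]
    controls = random.sample(risk_set, min(4, len(risk_set)))
    matched_sets.append((case_id, controls))

print(len(matched_sets), "matched sets with up to 4 controls each")
```

Sampling controls from the risk set at each event time is what removes immortal time bias here: exposure is assessed over the same window for cases and their controls, and the hdPS adjustment is then applied within the matched sets.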

Implementation Framework: Integrating Advanced Methods with Standardized Protocols

Structured Documentation of Propensity Score Workflows

The integration of advanced propensity score methods with standardized protocols requires systematic documentation of key methodological decisions. The HARPER template provides specific sections for this purpose, enabling researchers to clearly communicate their analytic choices and the rationale behind them [102].

Table: HARPER Template Sections for Propensity Score Documentation

| HARPER Section | Propensity Score Application | Documentation Requirements |
| --- | --- | --- |
| Research Question | Target estimand definition | Specify whether estimating ATE, ATT, or ATO |
| Variables | Confounding variables | Rationale for covariate selection approach |
| Data Analysis | PS estimation and implementation | Detailed specification of model and matching/weighting |
| Sensitivity Analyses | Methodological robustness | Alternative PS specifications or implementations |
| Limitations | Positivity violations and model misspecification | Assessment of PS method limitations |

For overlap weighting specifically, researchers should document:

  • The explicit justification for targeting the ATO
  • Characteristics of both the unweighted and overlap-weighted populations
  • Assessment of the clinical interpretability of the overlap population
  • Comparison with other weighting methods when appropriate [41]

Visualizing the Integrated Framework

The STaRT-RWE template emphasizes the use of visualizations to communicate study design, an approach that can be extended to propensity score workflows.

STaRT-RWE/HARPER Protocol Template → Study Design Specification → Propensity Score Method Selection → PS Implementation (Matching/Weighting), with a Dimensionality Reduction step inserted for high-dimensional data (standard settings proceed directly to implementation) → Covariate Balance Assessment → (poor balance) revise PS Implementation; (adequate balance) Outcome Analysis → Sensitivity Analyses

Diagram: Integrated PS-Protocol Framework for RWE Studies

The integration of STaRT-RWE and HARPER protocol templates with advanced propensity score methods represents a significant advancement for pharmacoepidemiological research. These harmonized frameworks provide the structure necessary to ensure methodological transparency, enhance reproducibility, and facilitate critical appraisal of study validity. As propensity score methods continue to evolve in complexity—from dimensionality reduction techniques to integrated bias-reduction designs—standardized protocols offer the foundation for their responsible implementation and clear communication.

The path forward requires widespread adoption of these templates throughout the research community, along with continued refinement based on implementation experience. By embedding advanced propensity score applications within these structured frameworks, researchers can generate more reliable real-world evidence capable of informing healthcare decisions with greater confidence.

Conclusion

Propensity score methods are indispensable tools for mitigating confounding and strengthening causal inference in pharmacoepidemiology. When grounded in robust design principles like the new-user design and carefully executed through matching, weighting, or advanced techniques like hdPS, they can yield estimates that more closely approximate those from randomized trials. Success hinges on rigorous balance assessment, thoughtful handling of intercurrent events, and transparency in reporting. Future progress lies in the continued integration of machine learning for variable selection, broader adoption of formalized frameworks like ICH E9(R1) and target trial emulation, and the development of robust methods to address unmeasured confounding. Embracing these practices and directions will enhance the credibility and utility of real-world evidence for informing drug safety and regulatory decision-making.

References