Formulation

Given the empirical distribution \(\hat P\) of the training data \(\{(x_i, y_i)\}_{i \in [N]}\), we consider the following distance-based distributionally robust optimization (DRO) formulations in the machine learning context. In general, DRO minimizes the worst-case expected loss and has the following structure:

\[ \min_{f \in \mathcal{F}}\max_{Q \in \mathcal{P}}\mathbb{E}_Q[\ell(f(X), Y)], \]

where \(\mathcal{P}\) is called the ambiguity set. It usually takes the following form:

\[ \mathcal{P}(d, \epsilon) = \{Q: d(Q, \hat P) \leq \epsilon\}. \]

Here, \(d(\cdot, \cdot)\) is a notion of distance between probability measures and \(\epsilon\) captures the size of the ambiguity set.
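To make the inner maximization concrete, the sketch below evaluates the worst-case expected loss for one choice of \(d\), the KL divergence, for a fixed vector of per-sample losses. It relies on the standard dual representation \(\sup_{Q: \mathrm{KL}(Q \| \hat P) \leq \epsilon}\mathbb{E}_Q[\ell] = \inf_{\lambda > 0}\{\lambda\epsilon + \lambda\log\mathbb{E}_{\hat P}[\exp(\ell/\lambda)]\}\) and minimizes over a logarithmic grid of \(\lambda\). The function name kl_worst_case_loss and the grid bounds are assumptions made for illustration; this is not the dro implementation.

```python
import math

def kl_worst_case_loss(losses, eps, grid_size=2001):
    """Approximate sup of E_Q[loss] over {Q : KL(Q || P_hat) <= eps}.

    Hypothetical helper: minimizes the dual objective
    lam * eps + lam * log(mean(exp(loss / lam))) over a grid of lam > 0.
    Every grid value is an upper bound on the true worst case.
    """
    n = len(losses)
    mean = sum(losses) / n
    if eps <= 0:
        return mean  # the ambiguity set collapses to P_hat
    m = max(losses)  # shift by the max for a stable log-mean-exp
    best = float("inf")
    for j in range(grid_size):
        lam = 10 ** (-4 + 8 * j / (grid_size - 1))  # lam in [1e-4, 1e4]
        log_mean_exp = math.log(sum(math.exp((l - m) / lam) for l in losses) / n)
        best = min(best, lam * eps + m + lam * log_mean_exp)
    return best

losses = [0.1, 0.5, 2.0, 0.3]
for eps in (0.0, 0.1, 0.5):
    print(eps, kl_worst_case_loss(losses, eps))
```

The value grows with \(\epsilon\), starting from the empirical mean at \(\epsilon = 0\) and never exceeding the largest observed loss, which is the behavior expected of an ambiguity set whose size is controlled by \(\epsilon\).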

For each function class \(\mathcal{F}\), we group the models into the cases below; within each case, models are further distinguished by the distance type \(d\).

We design our package around the pipeline “Data -> Model -> Evaluation / Diagnostics” and discuss each stage in turn:

Data Module

Synthetic Data Generation

Following the general pipeline of “Data -> Model -> Evaluation / Diagnostics”, we first integrate several synthetic data-generating mechanisms into dro, including:

| Python Module | Function Name | Description |
|---|---|---|
| dro.src.data.dataloader_classification | classification_basic | Basic classification task |
| | classification_DN21 | Following Section 3.1.1 of "Learning Models with Uniform Performance via Distributionally Robust Optimization" |
| | classification_SNVD20 | Following Section 5.1 of "Certifying Some Distributional Robustness with Principled Adversarial Training" |
| | classification_LWLC | Following Section 4.1 (Classification) of "Distributionally Robust Optimization with Data Geometry" |
| dro.src.data.dataloader_regression | regression_basic | Basic regression task |
| | regression_DN20_1 | Following Section 3.1.2 of "Learning Models with Uniform Performance via Distributionally Robust Optimization" |
| | regression_DN20_2 | Following Section 3.1.3 of "Learning Models with Uniform Performance via Distributionally Robust Optimization" |
| | regression_DN20_3 | Following Section 3.3 of "Learning Models with Uniform Performance via Distributionally Robust Optimization" |
| | regression_LWLC | Following Section 4.1 (Regression) of "Distributionally Robust Optimization with Data Geometry" |

Model Module

Exact Fitting: Linear

We discuss the implementations of different classification and regression losses, where \(f(X) = \theta^{\top}X + b\).

Classification:

  • SVM (Hinge) Loss (svm): \(\ell(f(X), Y) = \max\{1 - Y f(X), 0\}.\)

  • Logistic Loss (logistic): \(\ell(f(X), Y) = \log(1 + \exp(-Y f(X))).\)

Note that in classification tasks, \(Y \in \{-1, 1\}\).

Regression:

  • Least Absolute Deviation (lad): \(\ell(f(X), Y) = |Y - f(X)|\).

  • Ordinary Least Squares (ols): \(\ell(f(X), Y) = (Y - f(X))^2\).

The names in parentheses above are the corresponding values of model_type.

Throughout the linear module, the vector \(\theta = (\theta_1,\ldots, \theta_p)\) is referred to as theta and the intercept \(b\) as b.
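The four losses can be written out directly for a single sample. The following is a minimal sketch in plain Python, not the dro implementation: the helper names linear_score and sample_loss are hypothetical, and only the keys svm, logistic, lad, ols and the names theta and b are taken from the text above.

```python
import math

def linear_score(theta, b, x):
    """f(x) = theta^T x + b for one feature vector x."""
    return sum(t * xi for t, xi in zip(theta, x)) + b

def _softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Per-sample losses keyed by the model_type names used above.
LOSSES = {
    "svm": lambda s, y: max(1.0 - y * s, 0.0),   # hinge, y in {-1, 1}
    "logistic": lambda s, y: _softplus(-y * s),  # y in {-1, 1}
    "lad": lambda s, y: abs(y - s),
    "ols": lambda s, y: (y - s) ** 2,
}

def sample_loss(model_type, theta, b, x, y):
    """Loss of the linear model (theta, b) on one sample (x, y)."""
    return LOSSES[model_type](linear_score(theta, b, x), y)

theta, b = [1.0, -1.0], 0.5
x = [2.0, 1.0]  # f(x) = 2 - 1 + 0.5 = 1.5
for name in ("svm", "logistic", "lad", "ols"):
    print(name, sample_loss(name, theta, b, x, 1.0))
```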

Beyond these four losses, other loss types are also supported.

Solver support: the built-in solvers in cvxpy (we used Mosek in our tests).

We support DRO methods including:

  • WDRO: (Basic) Wasserstein DRO, Satisficing Wasserstein DRO;

  • Standard \(f\)-DRO: KL-DRO, \(\chi^2\)-DRO, TV-DRO;

  • Generalized \(f\)-DRO: CVaR-DRO, Marginal DRO (CVaR), Conditional DRO (CVaR);

  • MMD-DRO;

  • Bayesian-based DRO: Bayesian-PDRO, PDRO;

  • Mixed-DRO: Sinkhorn-DRO, HR-DRO, MOT-DRO, Outlier-Robust Wasserstein DRO (OR-Wasserstein DRO).

Exact or Approximate Fitting: Kernel

We allow kernelized distributionally robust regression or classification via .update_kernel(), for all four loss types (svm, logistic, lad, ols). In each case above, we replace \(f(X) = \theta^{\top}X + b\) with \(f(X) = \sum_{i \in [N]}\alpha_i K(X, x_i)\), where \(K(\cdot,\cdot)\) is the kernel and \(\{\alpha_i\}_{i \in [N]}\) are the parameters to be determined in the optimization problem.

We mimic the standard scikit-learn kernel interface with the following hyperparameters:

  • metric: the kernel metric used when computing the kernel between instances in a feature array; options include additive_chi2, chi2, linear, poly, polynomial, rbf;

  • kernel_gamma: the parameter gamma of the pairwise kernel specified by metric. It should be a positive number, or one of scale or auto;

  • n_components: None for exact fitting; an int for approximate fitting, giving the reduced number of data points used to construct the kernel mapping in the Nystroem approximation (recommended when \(N\) is large).
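As an illustration of the kernelized predictor \(f(x) = \sum_{i \in [N]}\alpha_i K(x, x_i)\), the sketch below evaluates it for an RBF kernel \(K(x, z) = \exp(-\gamma\|x - z\|_2^2)\) with given coefficients. The function names rbf_kernel and kernel_predict are hypothetical and the coefficients are made up for the example; in the package they would come out of the DRO optimization problem.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def kernel_predict(x, train_points, alpha, gamma=1.0):
    """f(x) = sum_i alpha_i * K(x, x_i) over the training points x_i."""
    return sum(a_i * rbf_kernel(x, x_i, gamma)
               for a_i, x_i in zip(alpha, train_points))

train_points = [[0.0, 0.0], [1.0, 0.0]]
alpha = [1.0, -2.0]  # illustrative coefficients, not fitted values
print(kernel_predict([0.0, 0.0], train_points, alpha))
```

Exact fitting keeps one coefficient per training point, so the cost of a prediction grows with \(N\); this is the motivation for restricting the kernel mapping to n_components points when \(N\) is large.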

Approximate Fitting: Neural Network

Given the complexity of neural networks, many of the explicit optimization algorithms are not applicable. We therefore implement four DRO methods in an “approximate” way, including:

  • \(\chi^2\)-DRO;

  • CVaR-DRO;

  • Wasserstein DRO: we approximate it via adversarial training;

  • Holistic Robust DRO.
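To give a sense of what an approximate objective can look like, the sketch below computes the empirical CVaR of a batch of per-sample losses, that is, the average of the worst \(\alpha\)-fraction of the batch. This is a generic illustration under our own assumptions (the function name cvar_loss and the parameter alpha are hypothetical) and is not necessarily how dro implements CVaR-DRO for neural networks.

```python
import math

def cvar_loss(losses, alpha):
    """Empirical CVaR at level alpha in (0, 1]: mean of the worst alpha-fraction.

    When alpha * n is not an integer, the boundary sample receives a
    fractional weight so that the total weight equals alpha * n.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(losses)
    ordered = sorted(losses, reverse=True)  # largest losses first
    k = alpha * n                           # possibly fractional tail size
    full = int(math.floor(k))
    total = sum(ordered[:full])
    if full < n and k > full:
        total += (k - full) * ordered[full]
    return total / k

batch_losses = [1.0, 2.0, 3.0, 4.0]
print(cvar_loss(batch_losses, 0.5))  # average of the two largest losses
print(cvar_loss(batch_losses, 1.0))  # plain average of the batch
```

With alpha equal to 1 the objective reduces to the ordinary average loss, and smaller values of alpha concentrate the objective on the hardest samples in the batch.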

Furthermore, the model architectures supported in dro include:

  • Linear Models;

  • Vanilla MLP;

  • AlexNet;

  • ResNet18.

Users can also supply their own model architecture (please refer to the update function in BaseNNDRO).

Approximate Fitting: Tree-based Ensemble DRO

Because tree-based ensemble models are widely used in many applications, we implement three DRO methods in an “approximate” way as a preliminary setting, including:

  • KL-DRO

  • CVaR-DRO

  • \(\chi^2\)-DRO

The current model architectures of \(f(X)\) supported in dro include:

  • LightGBM

  • XGBoost

Evaluation

For some of the linear DRO models, we provide additional interfaces for understanding the worst-case model performance (see the worst_distribution function in each derived DRO class) and for evaluating the true model performance, measured as the true MSE estimated from the fitted data (see the evaluate function in BaseLinearDRO).

Reference

  • Daniel Kuhn, Soroosh Shafiee, and Wolfram Wiesemann. Distributionally robust optimization. arXiv preprint arXiv:2411.02549, 2024.