Formulation¶
Given the empirical distribution \(\hat P\) from the training data \(\{(x_i, y_i)\}_{i \in [N]}\), we consider the following (distance-based) distributionally robust optimization formulations in the machine learning context. In general, DRO minimizes the worst-case loss over an ambiguity set:
\[
\min_{f \in \mathcal{F}} \max_{P \in \mathcal{P}} \mathbb{E}_{P}\big[\ell(f(X), Y)\big],
\]
where \(\mathcal{P}\) denotes the ambiguity set. Usually, it takes the form
\[
\mathcal{P} = \{P : d(P, \hat P) \le \epsilon\}.
\]
Here, \(d(\cdot, \cdot)\) is a notion of distance between probability measures and \(\epsilon\) captures the size of the ambiguity set.
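As a concrete instance, taking \(d\) to be the Kullback–Leibler divergence (used by the KL-DRO method below) yields
\[
\min_{f \in \mathcal{F}} \; \max_{P:\, D_{\mathrm{KL}}(P \,\|\, \hat P) \le \epsilon} \; \mathbb{E}_{P}\big[\ell(f(X), Y)\big].
\]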
Given each function class \(\mathcal{F}\), we classify the models into the following cases, where each case can be further divided according to the distance type \(d\).
We design our package based on the general pipeline "Data -> Model -> Evaluation / Diagnostics" and discuss each component in turn:
Data Module¶
Synthetic Data Generation¶
Following the general pipeline of "Data -> Model -> Evaluation / Diagnostics", we first integrate different kinds of synthetic data generating mechanisms into `dro`, including the following (a usage sketch follows the table):
| Python Module | Function Name | Description |
|---|---|---|
| `dro.src.data.dataloader_classification` | `classification_basic` | Basic classification task |
| `dro.src.data.dataloader_classification` | `classification_DN21` | Following Section 3.1.1 of "Learning Models with Uniform Performance via Distributionally Robust Optimization" |
| `dro.src.data.dataloader_classification` | `classification_SNVD20` | Following Section 5.1 of "Certifying Some Distributional Robustness with Principled Adversarial Training" |
| `dro.src.data.dataloader_classification` | `classification_LWLC` | Following Section 4.1 (Classification) of "Distributionally Robust Optimization with Data Geometry" |
| `dro.src.data.dataloader_regression` | `regression_basic` | Basic regression task |
| `dro.src.data.dataloader_regression` | `regression_DN20_1` | Following Section 3.1.2 of "Learning Models with Uniform Performance via Distributionally Robust Optimization" |
| `dro.src.data.dataloader_regression` | `regression_DN20_2` | Following Section 3.1.3 of "Learning Models with Uniform Performance via Distributionally Robust Optimization" |
| `dro.src.data.dataloader_regression` | `regression_DN20_3` | Following Section 3.3 of "Learning Models with Uniform Performance via Distributionally Robust Optimization" |
| `dro.src.data.dataloader_regression` | `regression_LWLC` | Following Section 4.1 (Regression) of "Distributionally Robust Optimization with Data Geometry" |
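For instance, a minimal sketch of drawing a synthetic dataset from the basic classification generator; the keyword arguments `num_samples` and `d` and the returned `(X, y)` layout are illustrative assumptions, so consult the function signatures for the exact interface:

```python
# Hedged sketch: generate basic synthetic classification data with dro.
# The keyword arguments (num_samples, d) and the returned (X, y) layout are
# assumptions for illustration; see dro.src.data.dataloader_classification.
from dro.src.data.dataloader_classification import classification_basic

X, y = classification_basic(num_samples=1000, d=10)  # assumed: feature matrix X, labels y in {-1, 1}
print(X.shape, y.shape)
```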
Model Module¶
Exact Fitting: Linear¶
We discuss the implementations of different classification and regression losses, where \(f(X) = \theta^{\top}X + b\).
Classification:

- SVM (Hinge) Loss (`svm`): \(\ell(f(X), Y) = \max\{1 - Y f(X), 0\}\).
- Logistic Loss (`logistic`): \(\ell(f(X), Y) = \log(1 + \exp(-Y f(X)))\).

Note that in classification tasks, \(Y \in \{-1, 1\}\).

Regression:

- Least Absolute Deviation (`lad`): \(\ell(f(X), Y) = |Y - f(X)|\).
- Ordinary Least Squares (`ols`): \(\ell(f(X), Y) = (Y - f(X))^2\).
Above, we designate the `model_type` as the names in parentheses. Across the linear module, we designate the vector \(\theta = (\theta_1,\ldots, \theta_p)\) as `theta` and \(b\) as `b`.
Besides this, we support other loss types.
Solver support: the built-in solvers in `cvxpy` (we use `MOSEK` in our tests).
We support DRO methods including (a fitting sketch follows this list):

- WDRO: (Basic) Wasserstein DRO, Satisficing Wasserstein DRO;
- Standard \(f\)-DRO: KL-DRO, \(\chi^2\)-DRO, TV-DRO;
- Generalized \(f\)-DRO: CVaR-DRO, Marginal DRO (CVaR), Conditional DRO (CVaR);
- MMD-DRO;
- Bayesian-based DRO: Bayesian-PDRO, PDRO;
- Mixed-DRO: Sinkhorn-DRO, HR-DRO, MOT-DRO, Outlier-Robust Wasserstein DRO (OR-Wasserstein DRO).
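As an illustration, here is a minimal sketch of fitting a linear KL-DRO model with the hinge loss. The class name `KLDRO`, the import path, the constructor arguments, and the `fit` call are illustrative assumptions; only the `model_type` values and the `theta`/`b` attributes follow the conventions stated above:

```python
# Hedged sketch: fit a linear KL-DRO model with the SVM (hinge) loss.
# The class name KLDRO, the import path, and the constructor/fit signatures are
# illustrative assumptions; model_type, theta, and b follow the conventions above.
import numpy as np
from dro.src.linear_model.kl_dro import KLDRO  # hypothetical import path

X = np.random.randn(200, 10)
y = np.where(np.random.rand(200) > 0.5, 1.0, -1.0)  # classification labels in {-1, 1}

model = KLDRO(input_dim=10, model_type='svm')  # model_type: 'svm', 'logistic', 'lad', or 'ols'
model.fit(X, y)                                # solved with the built-in cvxpy solvers (e.g. MOSEK)
print(model.theta, model.b)                    # fitted coefficients theta and intercept b
```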
Exact or Approximate Fitting: Kernel¶
We allow kernelized distributionally robust regression or classification via `.update_kernel()`. All four loss types (`svm`, `logistic`, `lad`, `ols`) are supported. In each case above, we replace \(f(X) = \theta^{\top}X + b\) with \(f(X) = \sum_{i \in [N]}\alpha_i K(x, x_i)\), where \(K(\cdot,\cdot)\) is the kernel and \(\{\alpha_i\}_{i \in [N]}\) are the parameters to be determined in the optimization problem.
We mimic the standard scikit-learn kernel interface with the following hyperparameters (a usage sketch follows this list):

- `metric`: standard kernel metrics for computing the kernel between instances in a feature array, including `additive_chi2`, `chi2`, `linear`, `poly`, `polynomial`, `rbf`;
- `kernel_gamma`: the parameter gamma of the pairwise kernel specified by `metric`. It should be positive, or `scale`, or `auto`;
- `n_components`: exact fitting – `None`; approximate fitting – an int denoting the reduced number of data points used to construct the kernel mapping in the Nystroem approximation (recommended when \(n\) is large).
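For example, continuing the linear sketch above, a kernelized fit might look as follows; the keyword-argument spelling is an assumption, while the hyperparameter names mirror the list above:

```python
# Hedged sketch: kernelize the linear DRO model above via update_kernel.
# The keyword-argument names mirror the hyperparameters listed above; their exact
# spelling in the API is an assumption.
model.update_kernel(metric='rbf', kernel_gamma='scale', n_components=None)   # exact kernel fit
# model.update_kernel(metric='rbf', kernel_gamma='scale', n_components=100)  # Nystroem approximation
model.fit(X, y)
```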
Approximate Fitting: Neural Network¶
Given the complexity of neural networks, many explicit optimization algorithms are not applicable. Therefore, we implement four DRO methods in an "approximate" way, including:

- \(\chi^2\)-DRO;
- CVaR-DRO;
- Wasserstein DRO: we approximate it via adversarial training;
- Holistic Robust DRO.
Furthermore, the model architectures supported in `dro` include:

- Linear Models;
- Vanilla MLP;
- AlexNet;
- ResNet18.
Users can also plug in their own model architectures (please refer to the `update` function in `BaseNNDRO`).
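A minimal sketch of plugging a custom architecture into the neural-network module follows; the import path, the constructor arguments, and the exact signature of `update` are assumptions, since the documentation only states that `BaseNNDRO` exposes an `update` function for this purpose:

```python
# Hedged sketch: use a custom torch architecture with the neural-network DRO models.
# The import path, constructor arguments, and update() signature are assumptions;
# only the existence of an update function on BaseNNDRO is stated above.
import torch.nn as nn
from dro.src.neural_model.base_nn import BaseNNDRO  # hypothetical import path

class MyNet(nn.Module):
    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, num_classes))

    def forward(self, x):
        return self.net(x)

model = BaseNNDRO(input_dim=10, num_classes=2)  # hypothetical constructor
model.update(model=MyNet(10, 2))                # hypothetical keyword for swapping in the architecture
```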
Approximate Fitting: Tree-based Ensemble DRO¶
Due to the popular use of tree-based ensemble models in many applications, we implement three DRO methods in an "approximate" way as a preliminary setting, including:

- KL-DRO;
- CVaR-DRO;
- \(\chi^2\)-DRO.
The current model architectures of \(f(X)\) supported in `dro` include (a sketch follows this list):

- LightGBM;
- XGBoost.
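A minimal sketch of a tree-based DRO fit follows, reusing `X` and `y` from the earlier sketch; the class name `XGBKLDRO`, its import path, and the `eps` argument are purely hypothetical, since only the supported methods and backends above are fixed by the documentation:

```python
# Hedged sketch: approximate tree-based (XGBoost backend) KL-DRO fit.
# The class name XGBKLDRO, its import path, and the eps argument are hypothetical;
# only the supported methods and backends listed above are stated in the documentation.
from dro.src.tree_model import XGBKLDRO  # hypothetical

model = XGBKLDRO(eps=0.1)  # hypothetical ambiguity-set radius
model.fit(X, y)
pred = model.predict(X)
```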
Evaluation¶
In some of the linear DRO models, we provide additional interfaces for understanding the worst-case model performance (refer to the `worst_distribution` function in each derived DRO class) and for evaluating the true model performance, e.g., the true MSE estimated from the fitted data (refer to the `evaluate` function in `BaseLinearDRO`).
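For instance, continuing the linear sketch above; only the function names are fixed by the documentation, so the argument lists and return values shown here are assumptions:

```python
# Hedged sketch: diagnostics for a fitted linear DRO model.
# worst_distribution and evaluate are the interfaces named in this section; the
# argument lists and return values shown here are assumptions.
dist = model.worst_distribution(X, y)  # worst-case distribution from the inner maximization
score = model.evaluate(X, y)           # estimated true performance (e.g. MSE) on the fitted data
print(dist, score)
```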