Wasserstein DRO¶
In Wasserstein DRO, \(\mathcal{P}(d, \epsilon) = \{Q: W(Q, \hat P_n)\leq \epsilon\}\), where \(\hat P_n\) denotes the empirical distribution.
Specifically, the Wasserstein distance between two distributions \(Q_1\) and \(Q_2\) is defined as

\[W(Q_1, Q_2) = \inf_{\pi \in \Pi(Q_1, Q_2)} \mathbb{E}_{(Z_1, Z_2)\sim \pi}[d(Z_1, Z_2)],\]

where \(\Pi(Q_1, Q_2)\) denotes the set of couplings of \(Q_1\) and \(Q_2\), and \(Z_1 = (X_1, Y_1), Z_2 = (X_2, Y_2)\). For `lad`, `svm`, and `logistic`, the inner distance \(d\) is captured by the norm: \(d((X_1, Y_1), (X_2, Y_2)) = \|(X_1 - X_2, Y_1 - Y_2)\|.\)
For `ols`, the inner distance is captured by the squared norm:
\(d((X_1, Y_1), (X_2, Y_2)) = \|(X_1 - X_2, Y_1 - Y_2)\|^2.\)
In both cases, the norm is defined on the product space \(\mathcal{X} \times \mathbb{R}\) by

\[\|(x, y)\| = \|x\|_{\Sigma, p} + \kappa |y|,\]

where \(\|x\|_{\Sigma, p} = \|\Sigma^{1/2}x\|_p\). Furthermore, we can show that the dual norm \(\|(\cdot, \cdot)\|_*\) is given by

\[\|(u, v)\|_* = \max\left\{\|\Sigma^{-1/2}u\|_q, \frac{|v|}{\kappa}\right\},\]

with \(\frac{1}{p} + \frac{1}{q} = 1\). The norm is thus specified by the parameter tuple \((\Sigma, p, \kappa)\).
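To make the metric concrete, below is a minimal NumPy sketch of the transport cost \(d\) and the dual norm above. It is illustrative only: the function names (`transport_cost`, `dual_norm`) are not part of any library API, and the identity \(\Sigma\) in the usage example is an arbitrary choice.

```python
import numpy as np
from scipy.linalg import sqrtm

def transport_cost(x1, y1, x2, y2, Sigma, p=2.0, kappa=1.0):
    """Ground cost d((x1, y1), (x2, y2)) = ||x1 - x2||_{Sigma, p} + kappa * |y1 - y2|."""
    root = np.real(sqrtm(Sigma))                      # Sigma^{1/2}
    return np.linalg.norm(root @ (x1 - x2), ord=p) + kappa * abs(y1 - y2)

def dual_norm(u, v, Sigma, p=2.0, kappa=1.0):
    """Dual norm ||(u, v)||_* = max{||Sigma^{-1/2} u||_q, |v| / kappa}, 1/p + 1/q = 1."""
    q = p / (p - 1.0) if p > 1.0 else np.inf          # Hoelder conjugate of p
    inv_root = np.linalg.inv(np.real(sqrtm(Sigma)))   # Sigma^{-1/2}
    return max(np.linalg.norm(inv_root @ u, ord=q), abs(v) / kappa)

# Usage with an identity Sigma, p = 2, kappa = 1:
Sigma = np.eye(3)
x1, x2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(transport_cost(x1, 1.0, x2, -1.0, Sigma))          # sqrt(2) + 2
print(dual_norm(np.array([1.0, 2.0, 2.0]), 0.5, Sigma))  # max{3, 0.5} = 3
```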
Standard Version¶
Hyperparameters¶
\(\Sigma\): the feature-importance perturbation matrix, of shape (dimX, dimX).
\(p\): the norm order controlling the perturbation moment of \(X\).
\(\kappa\): Robustness parameter for the perturbation of Y.
\(\epsilon\): Ambiguity size of Wasserstein ball.
For OLS, Theorem 1 in [1] provides the corresponding reformulation. There we set \(\kappa = \infty\), i.e., we do not allow changes in \(Y\).
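As a rough illustration, the sketch below solves a convex program of the square-root regularization form that this reformulation takes under \(\kappa = \infty\) (worst-case MSE \(= (\sqrt{\mathrm{MSE}} + \sqrt{\epsilon}\,\|\theta\|_{\Sigma^{-1}, q})^2\)); the identity \(\Sigma\), \(p = q = 2\), the synthetic data, and all variable names are assumptions made for this example, not the library's API.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

eps = 0.1                    # Wasserstein radius epsilon
q = 2                        # Hoelder conjugate of p = 2
Sigma_inv_root = np.eye(d)   # Sigma^{-1/2}; identity Sigma for simplicity

theta = cp.Variable(d)
rmse = cp.norm(y - X @ theta, 2) / np.sqrt(n)            # sqrt of empirical MSE
reg = np.sqrt(eps) * cp.norm(Sigma_inv_root @ theta, q)  # dual-norm penalty
# Minimizing rmse + reg also minimizes the worst-case MSE (rmse + reg)^2.
cp.Problem(cp.Minimize(rmse + reg)).solve()
print(theta.value)
```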
For the loss functions in the LAD, logistic regression, and SVM models, [2] provides the corresponding reformulations.
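For intuition, here is a sketch of the LAD case, where the reformulation reduces to the empirical loss plus \(\epsilon\) times the dual-norm Lipschitz term \(\|(\theta, -1)\|_* = \max\{\|\Sigma^{-1/2}\theta\|_q, 1/\kappa\}\); the data and parameter values below are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_t(df=3, size=n)

eps, kappa, q = 0.1, 1.0, 2
Sigma_inv_root = np.eye(d)   # identity Sigma for simplicity

theta = cp.Variable(d)
lad = cp.sum(cp.abs(y - X @ theta)) / n                  # empirical LAD loss
# Lipschitz modulus of the LAD loss w.r.t. d: ||(theta, -1)||_*
lip = cp.maximum(cp.norm(Sigma_inv_root @ theta, q), 1.0 / kappa)
cp.Problem(cp.Minimize(lad + eps * lip)).solve()
print(theta.value)
```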
Worst-Case Distribution¶
When we set `compute_type == asymp`, [2] provides asymptotic worst-case distributions for linear regression and classification (Theorems 9 and 20), where we take \(\gamma \in (0, 1]\) as the hyperparameter to tune in each case. `compute_type == exact` is deprecated right now.
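The exact constructions are given in Theorems 9 and 20 of [2]. As a generic stand-in (not the theorems' construction), the sketch below builds an approximate worst-case distribution for linear regression by shifting each sample along the loss gradient, splitting the transport budget equally across samples and keeping \(Y\) fixed (\(\kappa = \infty\), \(p = 2\), identity \(\Sigma\)):

```python
import numpy as np

def worst_case_samples(X, y, theta, eps):
    """Gradient heuristic: move each x_i within the per-sample budget
    ||dx_i||^2 <= eps in the direction increasing (y_i - theta^T x_i)^2."""
    r = y - X @ theta                                    # residuals
    step = np.outer(-np.sign(r), theta / np.linalg.norm(theta))
    return X + np.sqrt(eps) * step                       # perturbed features

# The approximate worst-case distribution is uniform on {(x_i', y_i)}:
# each residual magnitude grows by sqrt(eps) * ||theta||_2.
```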
Robust Satisficing Version¶
For the satisficing Wasserstein DRO model [3], we solve the following constrained optimization problem, where the DRO worst case appears as the constraint counterpart:

\[\max_{\theta, \epsilon \geq 0} \epsilon \quad \text{s.t.} \quad \sup_{Q: W(Q, \hat P_n) \leq \epsilon} \mathbb{E}_Q[\ell(\theta;(X, Y))] \leq \tau \cdot \mathbb{E}_{(X, Y)\sim \hat P_n}[\ell(\theta_{ERM};(X, Y))].\]
For the (approximate) regression / classification reformulations above, we can show the optimization problem is equivalent to

\[\max_{\theta} \ \frac{\tau \cdot \mathbb{E}_{(X, Y)\sim \hat P_n}[\ell(\theta_{ERM};(X, Y))] - \mathbb{E}_{(X, Y)\sim \hat P_n}[\ell(\theta;(X, Y))]}{\mathrm{Lip}(\theta)},\]

where \(\mathrm{Lip}(\theta)\) is the dual-norm Lipschitz modulus of \(z \mapsto \ell(\theta; z)\) under \(d\) (e.g., \(\max\{\|\Sigma^{-1/2}\theta\|_q, 1/\kappa\}\) for LAD).
Hyperparameters¶
In the satisficing version, we do not set \(\epsilon\) as a hyperparameter; instead, \(\epsilon\) becomes the optimization objective, i.e., we seek the largest ambiguity radius under which the worst-case performance still meets the target.
\(\Sigma\): the feature-importance perturbation matrix, of shape (dimX, dimX).
\(p\): the norm order controlling the perturbation moment of \(X\).
\(\kappa\): Robustness parameter for the perturbation of Y.
\(\tau\): \(\tau \geq 1\) sets the performance target as a multiple of the best empirical performance \(\mathbb{E}_{(X, Y)\sim \hat P_n}[\ell(\theta_{ERM};(X, Y))]\).
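Putting the pieces together, here is a sketch of the satisficing version for LAD under the same assumptions as above (identity \(\Sigma\), \(p = q = 2\)); it bisects on \(\epsilon\) instead of solving the fractional reformulation directly, and every name is illustrative rather than part of any library API.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

tau, kappa, q = 1.1, 1.0, 2

# Step 1: the ERM benchmark and the target tau * E[l(theta_ERM)].
theta_erm = cp.Variable(d)
erm = cp.Problem(cp.Minimize(cp.sum(cp.abs(y - X @ theta_erm)) / n))
erm.solve()
target = tau * erm.value

# Step 2: for fixed eps, the worst-case LAD loss is the empirical loss
# plus eps times the dual-norm Lipschitz term (see the standard version).
def min_worst_case(eps):
    theta = cp.Variable(d)
    lad = cp.sum(cp.abs(y - X @ theta)) / n
    lip = cp.maximum(cp.norm(theta, q), 1.0 / kappa)
    prob = cp.Problem(cp.Minimize(lad + eps * lip))
    prob.solve()
    return prob.value

# Step 3: bisect for the largest eps whose worst-case loss meets the target.
lo, hi = 0.0, 10.0
for _ in range(30):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if min_worst_case(mid) <= target else (lo, mid)
print("largest feasible eps ~", lo)
```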
References¶
[1] Blanchet, Jose, et al. “Data-driven optimal transport cost selection for distributionally robust optimization.” 2019 winter simulation conference (WSC). IEEE, 2019.
[2] Shafieezadeh-Abadeh, Soroosh, Daniel Kuhn, and Peyman Mohajerin Esfahani. “Regularization via mass transportation.” Journal of Machine Learning Research 20.103 (2019): 1-68.
[3] Long, Daniel Zhuoyu, Melvyn Sim, and Minglong Zhou. “Robust satisficing.” Operations Research 71.1 (2023): 61-82.