DRO with a mixture of distance metrics¶
Recall that the training distribution is denoted by \(\hat P\). We further implement several DRO methods based on a mixture of distance metrics, including Sinkhorn-DRO, Holistic Robust DRO, DRO based on the OT discrepancy with moment constraints (MOT-DRO), and Outlier-Robust Wasserstein DRO (OR-WDRO).
Sinkhorn-DRO¶
In Sinkhorn-DRO [1], \(\mathcal{P}(W_{\epsilon};\rho,\epsilon)= \{P: W_{\epsilon}(P,\hat{P})\leq \rho \}\). Here \(W_{\epsilon}(\cdot,\cdot)\) denotes the Sinkhorn distance, defined as \(W_{\epsilon}(P,Q)= \inf_{\gamma\in\Pi(P,Q)} \big\{\mathbb{E}_{(x,y)\sim\gamma}[c(x,y)] + \epsilon H(\gamma \,\|\, \mu\otimes\nu)\big\}\), where \(\Pi(P,Q)\) is the set of couplings of \(P\) and \(Q\), \(H(\cdot\,\|\,\cdot)\) is the relative entropy, and \(\mu,\nu\) are reference measures satisfying \(P\ll \mu\) and \(Q\ll \nu\).
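The definition can be made concrete on discrete distributions. Below is a small, purely illustrative numpy sketch (not part of this package) that computes \(W_{\epsilon}(P,Q)\) for two toy discrete distributions with a squared-distance cost, taking the reference measures to be \(P\) and \(Q\) themselves (which trivially satisfy the absolute-continuity requirement) and solving the entropic OT problem with Sinkhorn iterations.

```python
import numpy as np

# Toy discrete distributions P (3 atoms) and Q (4 atoms) with squared-distance cost.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.25, 0.25])
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 0.5, 1.5, 2.5])
C = (x[:, None] - y[None, :]) ** 2          # transport cost c(x, y)
eps = 0.1                                   # entropic regularization strength

K = np.exp(-C / eps)                        # Gibbs kernel
u = np.ones_like(p)
for _ in range(500):                        # Sinkhorn fixed-point iterations
    v = q / (K.T @ u)
    u = p / (K @ v)
gamma = u[:, None] * K * v[None, :]         # optimal entropic coupling

transport_cost = np.sum(gamma * C)
# Relative entropy of the coupling w.r.t. the product reference measure P x Q.
entropy_term = np.sum(gamma * np.log(gamma / (p[:, None] * q[None, :])))
print("W_eps(P, Q) =", transport_cost + eps * entropy_term)
```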
Hyperparameters¶
reg_param: Dual parameter \(\lambda\); see Equation (14) in [1].
lambda_param: Variance \(\epsilon\) of the Gaussian noise distribution added to each sample; see Remark 8 in [1].
k_sample_max: \(l\) in Equation (14) of [1], where \(2^l\) points are sampled to approximate the sub-gradient.
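To see how these three parameters fit together, here is a purely illustrative numpy sketch: it draws \(2^l\) Gaussian-perturbed copies of a single training point with variance lambda_param and combines their losses through a \(\lambda\)-smoothed soft-max. It only sketches the sampling idea, not the package's implementation of Equation (14); the loss function and all values are made up.

```python
import numpy as np

# Illustrative config mirroring the keys documented above (values are arbitrary).
config = {"reg_param": 0.1,      # dual parameter lambda
          "lambda_param": 0.05,  # variance epsilon of the Gaussian noise
          "k_sample_max": 5}     # l, i.e. 2**l perturbed copies per sample

rng = np.random.default_rng(0)
x = rng.normal(size=10)                                  # one toy training sample
n_draws = 2 ** config["k_sample_max"]                    # 2**l Monte-Carlo points
noise = rng.normal(scale=np.sqrt(config["lambda_param"]),
                   size=(n_draws, x.size))
perturbed = x + noise                                    # samples around x

def loss(z, theta):                                      # hypothetical loss l(theta; z)
    return np.maximum(0.0, 1.0 - z @ theta)              # a hinge-type placeholder

theta = np.zeros(x.size)
lam = config["reg_param"]
# Soft-max smoothed surrogate of the worst-case loss around x, estimated from
# the 2**l perturbed copies.
weights = np.exp(loss(perturbed, theta) / lam)
smoothed = lam * np.log(weights.mean())
print(smoothed)
```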
Holistic-DRO¶
In Holistic-DRO [2], \(\mathcal{P}(LP_{\mathcal N}, D_{KL}; \alpha, r) = \{P\in\mathcal{P}: \exists Q\in\mathcal{P},\ LP_{\mathcal N}(\hat{P},Q)\leq \alpha,\ D_{KL}(Q\|P)\leq r \}\), which combines two discrepancy measures:
the Levy-Prokhorov metric \(LP_{\mathcal N}(P,Q) = \inf_{\gamma\in\Pi(P,Q)} \int \mathbb{I}(\xi-\xi'\notin \mathcal{N})\,d\gamma(\xi, \xi')\), where \(\mathcal N\) denotes the perturbation neighborhood (specified below);
the KL-divergence \(D_{KL}(Q\|P) = \int \log \frac{dQ}{dP}\,dQ\).
We support linear losses (SVM for classification and LAD for regression), following Appendices D.1 and D.2 in [2], and we set the worst-case domain \(\Sigma = \{(X_i, Y_i): i \in [n]\} + B_2(0,\epsilon') \times \{0\}\) and the neighborhood \(\mathcal N = B_2(0,\epsilon) \times \{0\}\).
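This construction simply perturbs the feature vector inside an \(\ell_2\) ball while leaving the label untouched. A toy numpy sketch of the idea, with hypothetical data and radii (not the package's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X_i, Y_i = rng.normal(size=4), 1.0       # one hypothetical training pair
eps_prime = 0.3                          # domain radius epsilon'

# Sample a feature perturbation uniformly from the l2 ball B_2(0, eps_prime);
# the label coordinate stays fixed, matching B_2(0, eps') x {0}.
direction = rng.normal(size=X_i.size)
direction /= np.linalg.norm(direction)
radius = eps_prime * rng.uniform() ** (1.0 / X_i.size)
X_perturbed, Y_perturbed = X_i + radius * direction, Y_i

assert np.linalg.norm(X_perturbed - X_i) <= eps_prime and Y_perturbed == Y_i
```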
Hyperparameters¶
\(r\): Robustness parameter for the KL-DRO, denoted as r in the model config.
\(\alpha\): Robustness parameter for the Levy-Prokhorov metric DRO, denoted as alpha in the model config.
\(\epsilon\): Robustness parameter for the model noise (perturbed ball size), denoted as epsilon in the model config.
\(\epsilon'\): Domain parameter, denoted as epsilon_prime in the model config.
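Putting the four parameters together, a model config might look like the following sketch; only the key names come from the list above, and the values are arbitrary placeholders.

```python
# Hypothetical values; only the key names (r, alpha, epsilon, epsilon_prime)
# come from the documentation above.
hr_dro_config = {
    "r": 0.1,              # KL-divergence radius
    "alpha": 0.05,         # Levy-Prokhorov radius
    "epsilon": 0.1,        # perturbation ball size for the model noise
    "epsilon_prime": 0.5,  # domain parameter for the worst-case domain Sigma
}
```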
MOT-DRO¶
In MOT-DRO [3], \(\mathcal{P}(M_c;\epsilon) = \{(Q, \delta): M_c((Q, \delta), \tilde P) \leq \epsilon\}\) uses the OT discrepancy with moment constraints, defined as \(M_c(P, Q) = \inf_{\pi} \{\mathbb{E}_{\pi}[c((Z,W),(\hat Z,\hat W))] : \pi_{(Z,W)}=P,\ \pi_{(\hat Z, \hat W)}=Q,\ \mathbb{E}_\pi[W]=1\}\), i.e., the infimum runs over joint distributions \(\pi\) with marginals \(\pi_{(Z,W)}=P\) and \(\pi_{(\hat Z, \hat W)}=Q\) that satisfy \(\mathbb{E}_\pi[W]=1\). The cost function \(c\) is chosen as in [3], combining a \(\theta_1\)-weighted transport cost on the outcome and a \(\theta_2\)-weighted cost on the likelihood weight \(W\) (see the hyperparameters below), and the reference distribution is \(\tilde{P} =\hat{P} \otimes \delta_1\).
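Concretely, \(\tilde P = \hat P \otimes \delta_1\) lifts each empirical sample \(z_i\) to the pair \((z_i, 1)\), so every atom of \(\hat P\) receives the constant weight coordinate \(W = 1\). A toy numpy sketch with illustrative data only:

```python
import numpy as np

Z_hat = np.random.default_rng(0).normal(size=(5, 3))    # toy empirical samples z_i
# P_tilde = P_hat x delta_1: attach the constant coordinate W = 1 to every atom,
# so each of the n atoms (z_i, 1) keeps its empirical mass 1/n.
lifted_atoms = np.hstack([Z_hat, np.ones((Z_hat.shape[0], 1))])
print(lifted_atoms.shape)                                # (5, 4); last column is W
```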
We support linear losses (SVM for classification and LAD for regression), where we follow Theorem 5.2 and Corollary 5.1 in [3].
Hyperparameters¶
\(\theta_1\), \(\theta_2\): Relative penalties of the Wasserstein (outcome) perturbation and the likelihood perturbation, respectively, satisfying \(\frac{1}{\theta_1} + \frac{1}{\theta_2} = 1\).
\(\epsilon\): Robustness radius for the OT ambiguity set, denoted as epsilon in the model config.
\(p\): Cost penalty of the outcome perturbation, where we only implement the case of \(p \in \{1, 2\}\).
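For illustration, a config sketch consistent with these constraints; the epsilon key name appears above, while the remaining key names and all values are hypothetical.

```python
theta1 = 2.0                          # penalty on the outcome (Wasserstein) perturbation
theta2 = theta1 / (theta1 - 1.0)      # enforces 1/theta1 + 1/theta2 = 1
mot_dro_config = {
    "theta1": theta1,
    "theta2": theta2,
    "epsilon": 0.1,                   # OT ambiguity radius
    "p": 2,                           # outcome-perturbation cost penalty, p in {1, 2}
}
assert abs(1.0 / theta1 + 1.0 / theta2 - 1.0) < 1e-12
```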
Outlier-Robust Wasserstein DRO¶
In Outlier-Robust Wasserstein DRO (OR-WDRO) [4], \(\mathcal{P}(W_p^{\eta};\epsilon) = \{Q: W_p^{\eta}(Q, \hat P)\leq \epsilon\}\), where \(W_p^{\eta}\) denotes the outlier-robust \(p\)-Wasserstein distance of [4], which robustifies the standard \(p\)-Wasserstein distance against an \(\eta\)-fraction of corrupted mass, and \(\eta \in [0, 0.5)\) denotes the corruption ratio.
Hyperparameters¶
\(p\): Norm parameter controlling the perturbation moment of \(X\); we only allow the dual norm \(\frac{p}{p - 1}\) to be in \(\{1, 2\}\).
\(\eta\): Contamination level, in \([0, 0.5)\).
We only consider SVM (for classification) and LAD (for regression), based on the convex reformulation in Theorem 2 of [4]. Note that the model also requires an input \(\sigma\), which we set to \(\sigma = \sqrt{d_x}\) by default.
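A small sketch of a consistent parameter choice, including the default \(\sigma = \sqrt{d_x}\); all key names and values here are hypothetical.

```python
import numpy as np

d_x = 20                                   # feature dimension of the training data
or_wdro_config = {
    "eta": 0.1,                            # contamination level in [0, 0.5)
    "p": 2,                                # dual norm p/(p-1) = 2, which is allowed
    "sigma": np.sqrt(d_x),                 # default sigma = sqrt(d_x) noted above
}
assert 0.0 <= or_wdro_config["eta"] < 0.5
```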
References¶
[1] Wang, Jie, Rui Gao, and Yao Xie. “Sinkhorn distributionally robust optimization.” arXiv preprint arXiv:2109.11926 (2021).
[2] Bennouna, Amine, and Bart Van Parys. “Holistic robust data-driven decisions.” arXiv preprint arXiv:2207.09560 (2022).
[3] Blanchet, Jose, Daniel Kuhn, Jiajin Li, and Bahar Taskesen. “Unifying distributionally robust optimization via optimal transport theory.” arXiv preprint arXiv:2308.05414 (2023).
[4] Nietert, Sloan, Ziv Goldfeld, and Soroosh Shafiee. “Outlier-Robust Wasserstein DRO.” NeurIPS 2023.