OSCAR: Optical-aware Semantic Control for Aleatoric Refinement in SAR-to-Optical Translation

¹Kyungpook National University · ²Korea Aerospace Research Institute

Figure: Results comparison on BENv2
Figure: Results comparison on SEN12MS

Abstract

Synthetic Aperture Radar (SAR) provides robust all-weather imaging capabilities; however, translating SAR observations into photo-realistic optical images remains a fundamentally ill-posed problem. Current approaches are often hindered by the inherent speckle noise and geometric distortions of SAR data, which frequently result in semantic misinterpretation, ambiguous texture synthesis, and structural hallucinations.

To address these limitations, we propose OSCAR (Optical-aware Semantic Control for Aleatoric Refinement), a novel SAR-to-Optical (S2O) translation framework that integrates three core technical contributions:

  1. Cross-Modal Semantic Alignment: Establishes an Optical-Aware SAR Encoder by distilling robust semantic priors from an Optical Teacher into a SAR Student.
  2. Semantically-Grounded Generative Guidance: Realized by a ControlNet that integrates class-aware text prompts for global context with hierarchical visual prompts for local spatial guidance.
  3. Uncertainty-Aware Objective: Explicitly models aleatoric uncertainty to dynamically modulate the reconstruction focus, effectively mitigating artifacts caused by speckle-induced ambiguity.

Extensive experiments demonstrate that OSCAR achieves superior perceptual quality and semantic consistency compared to state-of-the-art approaches.

Overview

1. Optical-Aware SAR Encoder


The Optical-Aware SAR Encoder bridges the fundamental cross-modal gap by aligning SAR features with a rich optical semantic manifold.

  • Cross-Modal Distillation: We utilize DINOv3-SAT—a foundation model pre-trained on 493M satellite images—as an Optical Teacher to guide the SAR Student.
  • Multi-level Alignment: The framework aligns logit-level probabilities, intermediate features (attention maps & CLS tokens), and structural cues to capture robust, modality-agnostic priors (a minimal sketch follows this list).
  • Parameter Efficiency: We integrate Low-Rank Adaptation (LoRA) to adapt the large-scale encoder to the SAR domain without heavy computational overhead (a minimal LoRA layer is also sketched below).
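
As a concrete illustration of the multi-level alignment, the sketch below combines a temperature-softened logit-matching term, a CLS-token cosine term, and an attention-map term. This is not OSCAR's exact objective: the dictionary keys, tensor layouts, and loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_out, teacher_out, tau=2.0,
                   w_logit=1.0, w_cls=0.5, w_attn=0.5):
    """Multi-level alignment between a frozen Optical Teacher and the SAR Student.

    Both *_out dicts are assumed to hold 'logits' (B, K), 'cls' (B, D),
    and 'attn' (a list of same-shaped attention maps, one per matched layer).
    """
    # Logit level: KL between temperature-softened class distributions.
    p_teacher = F.softmax(teacher_out["logits"] / tau, dim=-1)
    log_p_student = F.log_softmax(student_out["logits"] / tau, dim=-1)
    l_logit = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau**2

    # Feature level: cosine alignment of CLS tokens.
    l_cls = 1.0 - F.cosine_similarity(student_out["cls"],
                                      teacher_out["cls"], dim=-1).mean()

    # Attention level: match attention maps layer by layer.
    l_attn = sum(F.mse_loss(a_s, a_t)
                 for a_s, a_t in zip(student_out["attn"], teacher_out["attn"]))
    l_attn = l_attn / len(student_out["attn"])

    return w_logit * l_logit + w_cls * l_cls + w_attn * l_attn
```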
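
For the LoRA adaptation, a minimal self-contained layer is shown below. The rank, scaling, and initialization follow common LoRA practice and are assumptions, not the paper's reported settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A ~ small noise and B = 0 at init,
    so training starts exactly at the pretrained behavior."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the low-rank factors are trained
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```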

2. Semantically-Grounded ControlNet


The Semantically-Grounded ControlNet performs the translation by injecting dual-path semantic guidance into the diffusion process.

  • Global & Local Guidance:
    • Class-aware Text Prompts: Establish the global semantic tone and style based on high-confidence predictions.
    • Hierarchical Visual Prompts: Provide dense spatial anchors via the Semantically-Grounded Guidance Module (SGGM) to ensure structural fidelity (both guidance paths are sketched after this list).
  • Uncertainty-Aware Objective: To handle the inherent noise in SAR, we explicitly model aleatoric uncertainty. By estimating a pixel-wise confidence map, the model learns to discount speckle-induced ambiguity and focus on sharp, accurate reconstruction (one common formulation is sketched below).
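
To make the dual-path guidance concrete, here is a minimal sketch of both paths: a class-aware prompt assembled from high-confidence predictions, and a per-scale projection that injects hierarchical visual prompts into the conditioning branch. The names (build_class_prompt, HierarchicalPromptInjector) are hypothetical, not OSCAR's actual API, and the SGGM internals are not reproduced here.

```python
import torch
import torch.nn as nn

def build_class_prompt(class_probs, class_names, thresh=0.5):
    """Turn high-confidence land-cover predictions into a global text prompt."""
    keep = [n for n, p in zip(class_names, class_probs.tolist()) if p >= thresh]
    scene = ", ".join(keep) if keep else "mixed land cover"
    return f"a satellite optical image of {scene}"

class HierarchicalPromptInjector(nn.Module):
    """Projects multi-scale SAR encoder features into residuals that are
    added to the matching blocks of the conditioning (ControlNet) branch."""
    def __init__(self, feat_channels, ctrl_channels):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(c_in, c_out, kernel_size=1)
            for c_in, c_out in zip(feat_channels, ctrl_channels))

    def forward(self, sar_feats, ctrl_hiddens):
        # One visual prompt per scale: coarse levels anchor layout, fine levels texture.
        return [h + p(f) for h, p, f in zip(ctrl_hiddens, self.proj, sar_feats)]
```

For example, class probabilities [0.9, 0.2, 0.7] over ("forest", "urban", "water") would yield the prompt "a satellite optical image of forest, water".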
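
The uncertainty-aware objective can be sketched with the standard heteroscedastic formulation of Kendall & Gal (2017), assuming the decoder predicts a per-pixel log-variance alongside the image. Whether OSCAR uses an L1 or L2 residual is not specified here, so take the Laplace-style version below as illustrative.

```python
import torch

def uncertainty_aware_loss(pred, log_var, target):
    """Heteroscedastic (Laplace-style) reconstruction loss.

    Pixels the model flags as ambiguous (large log_var, e.g. under heavy
    speckle) are down-weighted by exp(-log_var), while the +log_var penalty
    keeps the model from declaring every pixel uncertain.
    """
    return (torch.exp(-log_var) * (pred - target).abs() + log_var).mean()
```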

Experimental Results

Quantitative Results


We conduct a quantitative comparison with state-of-the-art SAR-to-optical translation methods on both the BENv2 and SEN12MS datasets. Our OSCAR framework establishes a new state of the art across all metric categories.

  • BigEarthNet-v2: OSCAR reduces the FID score by 32.5% compared to BBDM.
  • SEN12MS: OSCAR achieves an FID reduction of over 50.2% compared to StegoGAN.
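
For reference, perceptual scores like these are typically computed with off-the-shelf implementations; below is a minimal evaluation sketch using torchmetrics' FID and the lpips package, assuming (B, 3, H, W) float images in [0, 1]. This is not the paper's exact evaluation script.

```python
import torch
import lpips                                              # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048, normalize=True)  # floats in [0, 1]
lpips_fn = lpips.LPIPS(net="alex")                        # expects inputs in [-1, 1]

def evaluate(real_batches, fake_batches):
    """Accumulate FID over the test set and average per-batch LPIPS."""
    scores = []
    for real, fake in zip(real_batches, fake_batches):
        fid.update(real, real=True)
        fid.update(fake, real=False)
        scores.append(lpips_fn(real * 2 - 1, fake * 2 - 1).mean())
    return fid.compute().item(), torch.stack(scores).mean().item()
```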

Ablation Study - Quantitative


We analyze the individual contributions of our three core technical pillars:

  • Cross-modal Alignment (Aln.): Essential for ensuring that the generated images follow the spectral and semantic logic of the optical domain.
  • Hierarchical Visual Prompts (Hier.): Provides dense spatial guidance to anchor textures correctly.
  • Class-aware Text Prompts (Text): Establishes the global semantic tone and domain context.

The integration of all components (Full Model) achieves the highest quantitative performance across DISTS, LPIPS, and SAM metrics, demonstrating a clear synergy between spatial and global guidance.
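SAM here refers to the Spectral Angle Mapper, which measures the per-pixel angle between predicted and reference spectra (lower is better). A minimal implementation, assuming (B, C, H, W) tensors:

```python
import torch

def spectral_angle_mapper(x, y, eps=1e-8):
    """Mean spectral angle (radians) between two (B, C, H, W) images:
    each pixel's C-dim spectrum is treated as a vector and the angle
    between prediction and reference is averaged over all pixels."""
    dot = (x * y).sum(dim=1)
    cos = dot / (x.norm(dim=1) * y.norm(dim=1) + eps)
    return torch.acos(cos.clamp(-1.0, 1.0)).mean()
```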

Ablation Study - Qualitative


The qualitative results visualize how each component contributes to the final synthesis. As Cross-modal Alignment, Hierarchical Visual Prompts, and Class-aware Text Prompts are progressively integrated, we observe:

  • A significant improvement in structural fidelity.
  • Effective suppression of artifacts caused by SAR speckle noise.

Ultimately, the Full Model (v) produces the most photorealistic results, with the most accurate color distributions and the sharpest boundaries of all configurations.