Vehicle Benchmark
ETRI Visual Intelligence Lab
Overview
This project proposes a CCTV-only vehicle trajectory forecasting framework for real urban infrastructure environments.
Unlike prior work that depends on ego-vehicle sensors, LiDAR-built HD maps, or controlled cooperative setups, this work focuses on single-view urban CCTV streams and builds an end-to-end pipeline for dataset construction and trajectory prediction.
Key Challenge
Reliable trajectory prediction from infrastructure-only CCTV is difficult due to viewpoint inconsistency, missing depth/LiDAR cues, noisy tracking jitter, and the absence of pre-built HD maps.
Approach
The full pipeline includes camera geometry recovery, dataset generation, map integration, and forecasting model adaptation:
| Component | Description |
|---|---|
| Monocular Camera Calibration | Estimate the homography that maps each CCTV view to bird's-eye view (BEV), using learning-based auto-calibration with RANSAC constraints |
| Dataset Construction | Convert detections/tracks into V2X-Seq-style trajectory records (type, x, y, theta, timestamp), then generate infrastructure-aligned scenes |
| HD-Map Integration | Build map annotations directly from converted BEV space for infrastructure-only prediction |
| Model Benchmarking | Evaluate infrastructure-only variants of HiVT, V2X-Graph (without IA), and proposed V2ITrajNet |
| Interestingness Target Selection | Rank target agents by motion pattern importance (turning/lateral behavior) to improve training focus |
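As a rough illustration of how the calibration, dataset construction, and target selection rows fit together, the sketch below projects a pixel-space track into BEV with a given homography, emits records with the (type, x, y, theta, timestamp) fields, and scores a track by its accumulated heading change. It assumes the 3x3 matrix `H` has already been estimated; the calibration step itself is not shown. The names `TrajectoryRecord`, `pixel_to_bev`, `track_to_records`, and `turning_score` are hypothetical, and both the finite-difference heading and the heading-change score are stand-ins chosen for the example, not the project's actual definitions.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


@dataclass
class TrajectoryRecord:
    # Field names mirror the record layout listed in the table above.
    type: str
    x: float
    y: float
    theta: float
    timestamp: float


def pixel_to_bev(H: Matrix3, u: float, v: float) -> Tuple[float, float]:
    """Apply a 3x3 homography to one pixel and divide out the scale."""
    xh = H[0][0] * u + H[0][1] * v + H[0][2]
    yh = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to a point at infinity under H")
    return xh / w, yh / w


def track_to_records(
    H: Matrix3,
    agent_type: str,
    pixels: Sequence[Tuple[float, float]],
    timestamps: Sequence[float],
) -> List[TrajectoryRecord]:
    """Convert one pixel-space track into BEV trajectory records."""
    pts = [pixel_to_bev(H, u, v) for u, v in pixels]
    records = []
    for i, ((x, y), t) in enumerate(zip(pts, timestamps)):
        # Assumption: heading is the direction of the forward difference;
        # the last point falls back to the backward difference.
        j = min(i + 1, len(pts) - 1)
        k = j - 1 if j > 0 else 0
        theta = math.atan2(pts[j][1] - pts[k][1], pts[j][0] - pts[k][0])
        records.append(TrajectoryRecord(agent_type, x, y, theta, t))
    return records


def turning_score(records: Sequence[TrajectoryRecord]) -> float:
    """Hypothetical interestingness score: total absolute heading change."""
    total = 0.0
    for a, b in zip(records, records[1:]):
        d = b.theta - a.theta
        d = math.atan2(math.sin(d), math.cos(d))  # wrap into [-pi, pi]
        total += abs(d)
    return total


# Toy homography (pure scaling) and a track that turns left once.
H_example = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]]
turning = track_to_records(
    H_example, "VEHICLE", [(0, 0), (2, 0), (4, 0), (4, 2)], [0.0, 0.1, 0.2, 0.3]
)
print(turning_score(turning))
```

Sorting candidate agents by such a score in descending order would put turning tracks ahead of straight ones. A real pipeline would likely smooth the BEV points first, since the tracking jitter noted under Key Challenge inflates any raw heading-change sum.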
Key Results
The proposed V2ITrajNet (with map) achieved the best performance on the dataset built from Cheonan CCTV (lower is better for all three metrics):
| Model | minADE | minFDE | MR |
|---|---|---|---|
| HiVT (w map) | 1.033 | 1.543 | 0.204 |
| V2X-Graph (w map) | 0.996 | 1.517 | 0.212 |
| V2ITrajNet (w map, Ours) | 0.943 | 1.288 | 0.162 |
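For readers unfamiliar with the three columns, the sketch below shows one common way these metrics are computed for multi-modal forecasts. The source does not state the number of candidate trajectories per agent or the miss threshold, so the 2.0 threshold and the per-metric minimum over candidates used here are assumptions for illustration, and all function names are hypothetical. Some benchmarks instead report the ADE of the candidate with the smallest final error.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def ade(pred: Trajectory, gt: Trajectory) -> float:
    """Average displacement error: mean L2 distance over all time steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred: Trajectory, gt: Trajectory) -> float:
    """Final displacement error: L2 distance at the last time step."""
    return math.dist(pred[-1], gt[-1])


def min_ade(candidates: Sequence[Trajectory], gt: Trajectory) -> float:
    """Best ADE among the candidate trajectories for one agent."""
    return min(ade(c, gt) for c in candidates)


def min_fde(candidates: Sequence[Trajectory], gt: Trajectory) -> float:
    """Best FDE among the candidate trajectories for one agent."""
    return min(fde(c, gt) for c in candidates)


def miss_rate(
    scenarios: Sequence[Tuple[Sequence[Trajectory], Trajectory]],
    threshold: float = 2.0,  # assumed threshold; not given in the source
) -> float:
    """Fraction of scenarios whose best final error exceeds the threshold."""
    misses = sum(1 for cands, gt in scenarios if min_fde(cands, gt) > threshold)
    return misses / len(scenarios)


# Toy data: one ground-truth path and three hand-made candidates.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
cand_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
cand_b = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]
cand_c = [(0.0, 0.0), (1.0, 0.0), (2.0, 3.0)]

print(min_ade([cand_a, cand_b], gt))  # 0.5
print(min_fde([cand_a, cand_b], gt))  # 0.5
print(miss_rate([([cand_a, cand_b], gt), ([cand_c], gt)]))  # 0.5
```

Dataset-level minADE and minFDE would then be the averages of the per-agent values over all evaluated targets.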
Performance Highlight
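Compared with the map-enabled baselines, V2ITrajNet achieved the lowest minADE (0.943 vs. 0.996 for the best baseline), minFDE (1.288 vs. 1.517), and MR (0.162 vs. 0.204), relative reductions of roughly 5%, 15%, and 21%. The largest gains are in final-position error and miss rate, which suggests the model is well suited to infrastructure-only forecasting.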
Dataset Statistics (Cheonan CCTV)
| Item | Value |
|---|---|
| Total Scenes | 7,364 |
| Total Rows | 21,389,113 |
| Total Actors | 528,501 |
| Total Targets | 1,501,136 |
| Timestamps per Scene | 80 |