Vehicle Benchmark

ETRI Visual Intelligence Lab

Overview

This project proposes a CCTV-only vehicle trajectory forecasting framework for real urban infrastructure environments.
Unlike prior work that depends on ego-vehicle sensors, LiDAR-built HD maps, or controlled cooperative setups, this work focuses on single-view urban CCTV streams and builds an end-to-end pipeline for dataset construction and trajectory prediction.

Key Challenge
Reliable trajectory prediction from infrastructure-only CCTV is difficult due to viewpoint inconsistency, missing depth/LiDAR cues, tracking jitter, and the absence of pre-built HD maps.

Approach

The full pipeline includes camera geometry recovery, dataset generation, map integration, and forecasting model adaptation:

| Component | Description |
| --- | --- |
| Monocular Camera Calibration | Estimate homography from CCTV views to BEV using learning-based auto-calibration and RANSAC constraints |
| Dataset Construction | Convert detections/tracks into V2X-Seq-style trajectory records (type, x, y, theta, timestamp), then generate infrastructure-aligned scenes |
| HD-Map Integration | Build map annotations directly from converted BEV space for infrastructure-only prediction |
| Model Benchmarking | Evaluate infrastructure-only variants of HiVT, V2X-Graph (without IA), and proposed V2ITrajNet |
| Interestingness Target Selection | Rank target agents by motion-pattern importance (turning/lateral behavior) to improve training focus |
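The calibration step above can be illustrated with a plain direct-linear-transform (DLT) homography fit and a pixel-to-BEV projection. This is a minimal NumPy sketch with assumed hand-picked point correspondences; it stands in for, but is not, the learning-based auto-calibration and RANSAC loop used in the actual pipeline.

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT fit of H such that dst ~ H @ src in homogeneous coordinates.

    src, dst: (N, 2) arrays of matched points, N >= 4, no 3 collinear.
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H's 9 entries.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # H is the right singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def pixels_to_bev(H, pts):
    """Project (N, 2) pixel coordinates into BEV ground-plane coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    proj = homog @ H.T
    return proj[:, :2] / proj[:, 2:3]  # divide out the homogeneous scale
```

In practice the correspondences would come from the auto-calibration model and be filtered with RANSAC before the final fit.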

Key Results

The proposed V2ITrajNet (with map) achieved the best performance on the Cheonan CCTV dataset:

| Model | minADE | minFDE | MR |
| --- | --- | --- | --- |
| HiVT (w/ map) | 1.033 | 1.543 | 0.204 |
| V2X-Graph (w/ map) | 0.996 | 1.517 | 0.212 |
| V2ITrajNet (w/ map, Ours) | 0.943 | 1.288 | 0.162 |

Performance Highlight

Compared with map-enabled baselines, V2ITrajNet achieved the lowest displacement errors and miss rate, indicating strong suitability for infrastructure-only forecasting.
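For reference, the three metrics in the table can be computed per agent as follows. This is a generic sketch assuming K candidate trajectories per agent and a fixed miss threshold (2.0 m is the common Argoverse-style convention); it is not the project's exact evaluation code.

```python
import numpy as np

def evaluate(pred, gt, miss_threshold=2.0):
    """Compute minADE, minFDE, and a miss flag for one agent.

    pred: (K, T, 2) candidate future trajectories.
    gt:   (T, 2) ground-truth future trajectory.
    """
    errors = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T) per-step L2 error
    min_ade = errors.mean(axis=1).min()                # best candidate's average error
    min_fde = errors[:, -1].min()                      # best candidate's final-step error
    missed = float(min_fde > miss_threshold)           # MR is this flag averaged over agents
    return min_ade, min_fde, missed
```

Dataset-level minADE/minFDE are means over agents, and MR is the fraction of agents whose best final-step error exceeds the threshold.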

Dataset Statistics (Cheonan CCTV)

| Item | Value |
| --- | --- |
| Total Scenes | 7,364 |
| Total Rows | 21,389,113 |
| Total Actors | 528,501 |
| Total Targets | 1,501,136 |
| Timestamps per Scene | 80 |
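Tying the record format from the Approach section (type, x, y, theta, timestamp) to the interestingness-based target selection, a minimal sketch might look like this. The field names and the heading-change score are illustrative assumptions, not the project's actual schema or scoring rule.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class TrajectoryRecord:
    # One V2X-Seq-style row; field names are assumed for illustration.
    obj_type: str
    x: float
    y: float
    theta: float      # heading angle in radians
    timestamp: float  # capture time in seconds

def interestingness(thetas):
    """Score an actor by total heading change: turning or lateral
    maneuvers score higher than straight driving."""
    unwrapped = np.unwrap(np.asarray(thetas, dtype=float))
    return float(np.abs(np.diff(unwrapped)).sum())

def rank_targets(tracks):
    """Sort {actor_id: [theta, ...]} so the most interesting actors come first."""
    return sorted(tracks, key=lambda aid: interestingness(tracks[aid]), reverse=True)
```

Ranking actors this way lets training batches favor turning and lateral behavior over the straight-through traffic that dominates raw CCTV footage.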

Resources