TaCOS: Task-Specific Camera Optimization with Simulation
Designing camera payloads for robots is challenging and expensive. We introduce an end-to-end optimization approach for automatically co-designing cameras with specific robotic tasks. This work leverages recent computer graphics techniques and physical camera characteristics to prototype cameras in software simulation. The main contributions of this work are:
- An end-to-end camera design method that combines derivative-free and gradient-based optimization to automatically co-design cameras with perception tasks, allowing continuous, discrete, and categorical camera variables
- A camera simulation that includes a physics-based noise model and a virtual environment, including procedurally generated environments
- Validation through comparison of synthetic imagery to imagery captured with physical cameras
- Demonstration of camera designs with improved performance over a state-of-the-art design method and common off-the-shelf alternatives
This work is a key step in simplifying the process of designing cameras for autonomous systems like robots, emphasizing task performance and manufacturability constraints.
Publications
• C. Yan and D. G. Dansereau, “TaCOS: Task-specific camera optimization with simulation,” in Winter Conference on Applications of Computer Vision (WACV), 2025. Available here.
Citing
If you find this work useful please cite
@inproceedings{yan2025tacos,
  title     = {{TaCOS}: Task-Specific Camera Optimization with Simulation},
  author    = {Chengyang Yan and Donald G. Dansereau},
  booktitle = {Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2025}
}
This work was carried out in the Robotic Imaging Group at the Australian Centre for Robotics, University of Sydney.
Acknowledgments
We would like to thank both ARIA Research Pty Ltd and the Australian government for their funding support via a CRC Projects Round 11 grant.
Downloads
The code is available here.
Gallery
We establish a virtual environment and capture scene renders using a ray-traced scene capture camera. We then add physics-based, sensor-specific noise to the renders and input them into perception tasks for evaluation. In our optimization process, we jointly optimize the camera parameters using a fitness function F with a derivative-free optimizer (blue arrow), as well as the parameters of perception tasks (if trainable) on their corresponding loss functions with gradient-based optimizers (red arrow).
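The loop below is a minimal sketch of this joint optimization, not the released implementation; render_scene, add_sensor_noise, task_loss, and fitness are hypothetical placeholders standing in for the UE renderer, the noise model, the task loss, and the task-performance metric.

```python
# Minimal sketch of the joint optimization loop (not the released implementation).
# render_scene, add_sensor_noise, task_loss, and fitness are hypothetical
# placeholders for the UE renderer, the noise model, the task loss, and the
# task-performance metric.
import torch

def evaluate_candidate(camera_params, perception_model, scenes, train_task=True):
    """Score one candidate camera design over a set of scenes."""
    optimizer = torch.optim.Adam(perception_model.parameters(), lr=1e-4)
    scores = []
    for scene in scenes:
        render = render_scene(scene, camera_params)      # ray-traced scene capture
        image = add_sensor_noise(render, camera_params)  # physics-based noise model
        prediction = perception_model(image)
        if train_task:                                   # gradient-based update (red arrow)
            loss = task_loss(prediction, scene.ground_truth)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scores.append(fitness(prediction, scene.ground_truth))
    return sum(scores) / len(scores)

# A derivative-free optimizer (blue arrow), e.g. a genetic algorithm, proposes
# successive camera_params and maximizes the returned average fitness.
```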
We chose to use Unreal Engine (UE) with real-time ray tracing to build the virtual environment in this work. We deploy candidate cameras on an autonomous agent that simulates the platform carrying the camera. The agent navigates the virtual environment on its own, enabling a fully automated design process.
We employ a ray-traced UE camera to capture scene irradiance. The UE5 camera allows configuration of parameters associated with the camera's placement, optics, image sensor, exposure settings, and multi-camera designs, as well as configuration of algorithms in the image processing pipeline. Our optimizer can handle all parameters captured in the camera simulation. The genetic algorithm is designed to accommodate both continuous and discrete parameters, enhancing the generalizability of our method.
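As an illustration of how a single evolutionary operator can cover continuous, discrete, and categorical variables, the sketch below mutates a mixed camera parameter set; the parameter names, bounds, and options are invented for the example and are not the design space used in the paper.

```python
# Illustrative mutation operator for a mixed design space; parameter names,
# bounds, and options below are invented for the example.
import random

PARAM_SPACE = {
    "focal_length_mm": ("continuous", 2.0, 8.0),
    "pixel_size_um":   ("continuous", 1.0, 4.0),
    "resolution":      ("discrete", [(640, 480), (1280, 720), (1920, 1080)]),
    "shutter_type":    ("categorical", ["global", "rolling"]),
}

def mutate(individual, rate=0.2):
    """Return a mutated copy of one candidate camera design."""
    child = dict(individual)
    for name, spec in PARAM_SPACE.items():
        if random.random() > rate:
            continue
        if spec[0] == "continuous":        # Gaussian perturbation, clipped to bounds
            lo, hi = spec[1], spec[2]
            child[name] = min(hi, max(lo, child[name] + random.gauss(0.0, 0.1 * (hi - lo))))
        else:                              # discrete / categorical: resample an option
            child[name] = random.choice(spec[1])
    return child
```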
Additional parameters such as geometric distortion and defocus blur could be added by augmenting the renderer; our noise model serves as an example of such an augmentation.
Image noise is a fundamental limiting factor for many robotic vision tasks and is tightly coupled to camera design parameters. As the UE camera simulation lacks a realistic noise model, we incorporate a post-render image augmentation that introduces noise. We employ thermal and signal-dependent Poisson noise following the affine (heteroscedastic Gaussian) noise model. The noise model is calibrated using a FLIR Flea3 camera with a Sony IMX172 image sensor and generalized to other exposure settings and image sensors.
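A minimal sketch of this augmentation under the affine model is shown below: the per-pixel variance is an affine function of the clean signal. The constants are placeholders rather than the values calibrated for the FLIR Flea3 / Sony IMX172.

```python
# Affine (heteroscedastic Gaussian) noise sketch: variance is affine in the
# clean signal, combining signal-dependent (shot) and signal-independent
# (thermal/read) terms. gain, k_shot, and sigma_read are placeholder values,
# not the calibrated sensor constants.
import numpy as np

def add_affine_noise(render, gain=1.0, k_shot=0.005, sigma_read=0.01, rng=None):
    """render: clean irradiance image, normalized to [0, 1]."""
    rng = rng or np.random.default_rng()
    variance = gain * k_shot * render + sigma_read ** 2
    noisy = render + rng.normal(0.0, np.sqrt(variance))
    return np.clip(noisy, 0.0, 1.0)
```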
We first validate our simulator by establishing equivalence of both low-level image statistics and high-level task performance between synthetic imagery and imagery captured with physical cameras.
Comparison of captured and synthetic images in terms of variance in pixel intensities. Despite differences in color intensities due to manufacturing variations of the test target, the variances of pixel values in synthetic images match those in captured images, validating the accuracy of our noise model.
We compare performance on a feature extraction task using ORB [1] between our synthetic images and images captured with 3 robotic/machine vision cameras, using a test target from the literature. The graph shows that the ranking of the cameras' performance in our simulation aligns with that of the physical cameras, and that the differences in performance between captured and synthetic images are consistent.
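The feature-extraction score relies on ORB [1]; the snippet below shows the standard OpenCV usage we assume, with the number of detected keypoints as a simple per-image proxy.

```python
# Standard OpenCV ORB usage assumed for the feature-extraction comparison.
import cv2

def orb_keypoint_count(image_path, n_features=1000):
    """Count ORB keypoints detected in a grayscale image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, _ = orb.detectAndCompute(gray, None)
    return len(keypoints)
```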
In the first design experiment, we apply our method to design the horizontal FOVs and baseline of a stereo camera (two RGB cameras) on an autonomous vehicle for the task of depth estimation. This experiment is conducted in the CARLA simulator [2]. The stereo camera is mounted on a car that moves automatically, and depth estimation is performed with PSMNet [3].
We compare the performance of stereo cameras designed by our method with two off-the-shelf models, the Intel RealSense D450 and ZED 2i, as well as a camera designed with the Reinforcement Learning-based method DISeR [4]. We also compare results from jointly optimizing the camera design and perception task against optimizing the camera design alone with fixed perception model parameters. In the table on the left, green dots indicate optimized parameters and grey dots indicate fixed parameters. Performance is evaluated using the Average Log Error and Root Mean Square Error between estimated and ground-truth depths in meters.
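For reference, the two metrics can be computed as below over per-pixel depths in meters; the valid-pixel masking and the log base are our assumptions.

```python
# Average Log Error and RMSE over per-pixel depths in meters.
# Masking of invalid pixels and the base-10 logarithm are assumptions.
import numpy as np

def depth_metrics(estimated, ground_truth, eps=1e-6):
    valid = ground_truth > 0
    est, gt = estimated[valid], ground_truth[valid]
    avg_log_error = np.mean(np.abs(np.log10(est + eps) - np.log10(gt + eps)))
    rmse = np.sqrt(np.mean((est - gt) ** 2))
    return avg_log_error, rmse
```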
Training curves compare cameras designed by our method with and without joint optimization against the Reinforcement Learning-based method DISeR [4]. Zoomed-in windows from 0 to 50 and from 950 to 1000 timesteps are provided for visualization. Our method converges within 45 steps (2 minutes) and completes 1000 steps in 38 minutes on an NVIDIA RTX 4070 GPU, while DISeR takes 700 steps (67 minutes) to converge and 97 minutes to complete 1000 steps.
Comparison of captured left images, estimated depth maps, and log errors for cameras designed by our method with and without joint optimization, the RL-based method (DISeR), and the off-the-shelf RealSense D450 and ZED 2i. Depths and metrics are in meters, and depth maps are capped at 1000 m. The off-the-shelf cameras fail on objects at long distances, whereas the cameras designed by our method and DISeR perform well across all distance ranges.
In our second experiment, we apply our method to design a monocular RGB camera for an MR headset. Object detection, obstacle avoidance, and feature extraction for 3D reconstruction are selected as example tasks, as they are essential for most MR devices. The pitch mounting angle of the camera on the headset, the camera's focal length, and the image sensor's dimensions and pixel size are optimized in this experiment. We use Faster R-CNN [5] as the object detector and extract ORB [1] features for the tasks.
We establish an indoor environment in UE 5 with 10 object classes. Floorplans and object locations are randomly generated to introduce variability.
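The object detector is Faster R-CNN [5]; a typical torchvision setup is sketched below, where the backbone, pretrained weights, and head replacement are assumptions for illustration rather than the exact configuration used here.

```python
# Typical torchvision Faster R-CNN setup; backbone and weights are assumptions.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_detector(num_classes=11):  # 10 object classes + background (assumed)
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model
```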
We apply our method to design cameras under two illumination conditions: a well-illuminated daytime scenario (20 lux) and a low-light nighttime scenario (2 lux). The camera gain is set to a higher value in the nighttime scenario to achieve brighter images.
Comparison of the FOVs, object detection, and feature extraction performance of cameras optimized with our method and those designed by humans, for the daytime scenario. Our method designs a camera with the largest FOV, balancing FOV against effective resolution so that the camera can detect features, obstacles, and objects.
Comparison of the parameters and performance of cameras designed with our method, using fully discrete and quantized-continuous schemes under daytime and nighttime scenarios, and 3 robotic/machine vision cameras. Optimized parameters are labelled with green dots and fixed ones with grey dots. The optimized cameras achieve compelling results compared to the human-designed cameras, while the quantized-continuous schemes and joint optimization with the object detector achieve higher performance.
Comparison of object detection performance using cameras designed by our method and the off-the-shelf cameras. The cameras designed by our method show improved performance on small objects, objects at long distances, and partly occluded objects, because the FOV and pixel size are optimized to obtain a more suitable effective resolution and signal-to-noise ratio for the task.
Comparison of feature extraction and matching on images captured by cameras designed with the proposed method and by the off-the-shelf cameras. We display the features that are successfully matched with features in the next frame and retained after filtering (inliers); the cameras designed by our method yield the highest number of inliers.
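The matching-and-filtering step behind these inlier counts can be sketched as below: ORB descriptors are matched between consecutive frames and filtered with a RANSAC geometric check. The fundamental-matrix model and thresholds are our assumptions for the filtering stage.

```python
# ORB matching between consecutive frames with RANSAC inlier filtering;
# the fundamental-matrix model and thresholds are assumptions.
import cv2
import numpy as np

def count_inlier_matches(frame1, frame2, n_features=1000):
    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, des1 = orb.detectAndCompute(frame1, None)
    kp2, des2 = orb.detectAndCompute(frame2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    if len(matches) < 8:                 # need at least 8 correspondences for the model
        return 0
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    _, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    return 0 if mask is None else int(mask.sum())
```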
References
[1] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In International Conference on Computer Vision (ICCV), pages 2564–2571. IEEE, 2011.
[2] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Conference on Robot Learning (CoRL), pages 1–16. PMLR, 2017.
[3] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5410–5418, 2018.
[4] Tzofi Klinghoffer, Kushagra Tiwary, Nikhil Behari, Bhavya Agrawalla, and Ramesh Raskar. DISeR: Designing imaging systems with reinforcement learning. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 23632–23642, 2023.
[5] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 28, 2015.