
Research Posts

MCUBench: A Benchmark of Tiny Object Detectors on MCUs

Papers with Code

By Javier Vásquez

Posted on: September 30, 2024

We introduce MCUBench, a benchmark featuring over 100 YOLO-based object detection models evaluated on the VOC dataset across seven different MCUs. This benchmark provides detailed data on average precision, latency, RAM, and Flash usage for various input resolutions and YOLO-based one-stage detector...
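
As a rough illustration of how benchmark tables like MCUBench's might be consumed, the sketch below filters a small, made-up set of detector results down to their Pareto front over accuracy, latency, and RAM; the field names and numbers are placeholders, not values from the paper.

```python
# Hypothetical sketch: picking Pareto-optimal detectors from MCU benchmark
# results (mAP vs. latency vs. RAM). Fields and values are illustrative.
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    map50: float       # mean average precision at IoU 0.5
    latency_ms: float  # per-image inference latency on the MCU
    ram_kb: float      # peak RAM usage

def pareto_front(results):
    """Keep models not dominated on (higher mAP, lower latency, lower RAM)."""
    front = []
    for r in results:
        dominated = any(
            o.map50 >= r.map50 and o.latency_ms <= r.latency_ms and o.ram_kb <= r.ram_kb
            and (o.map50, o.latency_ms, o.ram_kb) != (r.map50, r.latency_ms, r.ram_kb)
            for o in results
        )
        if not dominated:
            front.append(r)
    return front

results = [
    Result("yolo-nano-96px", 0.41, 180.0, 310.0),
    Result("yolo-small-160px", 0.55, 620.0, 780.0),
    Result("yolo-nano-160px", 0.49, 410.0, 520.0),
]
for r in pareto_front(results):
    print(r.model, r.map50, r.latency_ms, r.ram_kb)
```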

Read More

StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation

Papers with Code

By Naomi Wilson

Posted on: September 25, 2024

Tuning-free personalized image generation methods have achieved significant success in maintaining facial consistency, i.e., identities, even with multiple characters. However, the lack of holistic consistency in scenes with multiple characters hampers these methods' ability to create a cohesive nar...

Read More

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

Papers with Code

By Naomi Wilson

Posted on: September 25, 2024

We present 3DGS-LM, a new method that accelerates the reconstruction of 3D Gaussian Splatting (3DGS) by replacing its ADAM optimizer with a tailored Levenberg-Marquardt (LM). Existing methods reduce the optimization time by decreasing the number of Gaussians or by improving the implementation of the...
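
For readers unfamiliar with the optimizer family involved, here is a minimal, generic Levenberg-Marquardt step on a toy least-squares problem; it only illustrates the damped Gauss-Newton update that LM performs and is not the paper's tailored, GPU-parallel solver for Gaussian splats.

```python
# Minimal Levenberg-Marquardt step for a generic least-squares problem,
# shown only to illustrate the optimizer family 3DGS-LM swaps in for ADAM.
import numpy as np

def lm_step(params, residual_fn, jacobian_fn, lam=1e-2):
    """One damped Gauss-Newton (Levenberg-Marquardt) update.

    Solves (J^T J + lam * I) delta = -J^T r and returns params + delta.
    """
    r = residual_fn(params)          # residual vector, shape (m,)
    J = jacobian_fn(params)          # Jacobian, shape (m, n)
    A = J.T @ J + lam * np.eye(J.shape[1])
    g = J.T @ r
    delta = np.linalg.solve(A, -g)
    return params + delta

# Toy example: fit y = a * exp(b * x) to a few points.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(1.5 * x)

def residual(p):
    a, b = p
    return a * np.exp(b * x) - y

def jacobian(p):
    a, b = p
    return np.stack([np.exp(b * x), a * x * np.exp(b * x)], axis=1)

p = np.array([1.0, 1.0])
for _ in range(10):
    p = lm_step(p, residual, jacobian)
print(p)  # approaches [2.0, 1.5]
```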

Read More

Training Language Models to Self-Correct via Reinforcement Learning

Papers with Code

By Kate Martin

Posted on: September 25, 2024

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision...
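
As a hedged sketch of one way a self-correction reward could be shaped, the snippet below rewards a second attempt for improving on the first; the verifier, bonus weight, and two-attempt setup are illustrative assumptions, not the paper's actual multi-stage RL recipe.

```python
# Illustrative self-correction reward: score the revised answer, with a bonus
# for fixing a wrong first attempt and a penalty for breaking a correct one.
# The exact-match verifier and weighting are placeholder assumptions.
def correctness(answer, reference):
    # Stand-in verifier; a real setup might use exact match or unit tests.
    return float(answer.strip() == reference.strip())

def self_correction_reward(first_answer, second_answer, reference, alpha=0.5):
    r1 = correctness(first_answer, reference)
    r2 = correctness(second_answer, reference)
    return r2 + alpha * (r2 - r1)

print(self_correction_reward("41", "42", "42"))  # 1.5: fixed a wrong first attempt
print(self_correction_reward("42", "41", "42"))  # -0.5: degraded a correct first attempt
```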

Read More

3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

Papers with Code

By Naomi Wilson

Posted on: September 25, 2024

The increasing demand for high-quality 3D assets across various industries necessitates efficient and automated 3D content creation. Despite recent advancements in 3D generative models, existing methods still face challenges with optimization speed, geometric fidelity, and the lack of assets for phy...

Read More

Colorful Diffuse Intrinsic Image Decomposition in the Wild

Papers with Code

By Naomi Wilson

Posted on: September 25, 2024

Intrinsic image decomposition aims to separate the surface reflectance and the effects from the illumination given a single photograph. Due to the complexity of the problem, most prior works assume a single-color illumination and a Lambertian world, which limits their use in illumination-aware image...
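
To make the modeling assumptions concrete, the toy snippet below contrasts the common single-color, Lambertian formulation (image = albedo × grayscale shading × one global light color) with a per-pixel colorful diffuse shading term; all arrays are synthetic and the variable names are illustrative.

```python
# Illustrative intrinsic model: image = albedo * shading. Under the usual
# single-color / Lambertian assumption, shading is a grayscale map times one
# global light color; "colorful diffuse" shading varies in color per pixel.
import numpy as np

H, W = 4, 4
albedo = np.random.rand(H, W, 3)          # surface reflectance
gray_shading = np.random.rand(H, W, 1)    # scalar shading per pixel
light_color = np.array([1.0, 0.9, 0.8])   # single global illuminant color

# Single-color illumination reconstruction:
image_single = albedo * gray_shading * light_color

# Colorful diffuse shading: a full RGB shading map per pixel.
colorful_shading = np.random.rand(H, W, 3)
image_colorful = albedo * colorful_shading

print(image_single.shape, image_colorful.shape)  # (4, 4, 3) (4, 4, 3)
```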

Read More

CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

Papers with Code

By Naomi Wilson

Posted on: September 25, 2024

Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D ...
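
A rough sketch of the geometric-correspondence idea: warp each current-frame voxel center into a past frame with the relative ego pose and fetch the past feature there. The nearest-neighbor lookup and shapes below are illustrative stand-ins, not CVT-Occ's actual cost-volume construction.

```python
# Illustrative temporal voxel correspondence: transform current voxel centers
# into a past frame and sample that frame's feature volume at those points.
import numpy as np

def warp_voxel_centers(centers_xyz, T_past_from_curr):
    """centers_xyz: (N, 3) voxel centers in the current frame.
    T_past_from_curr: (4, 4) rigid transform from current to past frame."""
    homog = np.concatenate([centers_xyz, np.ones((len(centers_xyz), 1))], axis=1)
    return (homog @ T_past_from_curr.T)[:, :3]

def sample_nearest(feature_volume, coords, voxel_size, origin):
    """Nearest-neighbor lookup into a (X, Y, Z, C) past feature volume."""
    idx = np.round((coords - origin) / voxel_size).astype(int)
    idx = np.clip(idx, 0, np.array(feature_volume.shape[:3]) - 1)
    return feature_volume[idx[:, 0], idx[:, 1], idx[:, 2]]

# Toy setup: 8x8x4 feature grid with 16 channels, identity ego motion.
feat_past = np.random.rand(8, 8, 4, 16)
centers = np.random.rand(100, 3) * np.array([8.0, 8.0, 4.0])
warped = warp_voxel_centers(centers, np.eye(4))
past_feats = sample_nearest(feat_past, warped, voxel_size=1.0, origin=np.zeros(3))
print(past_feats.shape)  # (100, 16): per-voxel features fetched from the past frame
```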

Read More

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Papers with Code

By Naomi Wilson

Posted on: September 22, 2024

Visual data comes in various forms, ranging from small icons of just a few pixels to long videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual inputs to a fixed resolution for visual encoders and yield similar numbers of tokens for LLMs. This approach is non-opti...
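
A back-of-the-envelope comparison of why fixed-resolution encoding can be wasteful or lossy: resizing everything to one side length always yields the same token count, while native-resolution patching scales with the input. The patch and image sizes are illustrative, not Oryx's actual encoder settings.

```python
# Token-count comparison under assumed settings: a 448x448 fixed resize with
# 14-pixel patches vs. patching the image at its native resolution.
def tokens_fixed(patch=14, side=448):
    return (side // patch) ** 2

def tokens_native(width, height, patch=14):
    return (width // patch) * (height // patch)

for w, h in [(56, 56), (448, 448), (1792, 1008)]:
    print(f"{w}x{h}: fixed={tokens_fixed()} tokens, native={tokens_native(w, h)} tokens")
# A tiny icon gets the same 1024 tokens as a large image under fixed resizing;
# native patching gives 16 tokens for the icon and 9216 for the large image.
```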

Read More