
Research Posts

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Papers with Code

By Naomi Wilson

Posted on: December 04, 2024

We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-...

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Papers with Code

By Kate Martin

Posted on: December 02, 2024

Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities, e.g., the state-of-the-art model Qwen2-VL achieves only a 43.9 recall rate o...

StableAnimator: High-Quality Identity-Preserving Human Image Animation

Papers with Code

By Javier Vásquez

Posted on: November 29, 2024

Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a...

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Papers with Code

By Javier Vásquez

Posted on: November 29, 2024

Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or an accessibility tree), they show limitations in perceiving UI visual...

Star Attention: Efficient LLM Inference over Long Sequences

Papers with Code

By Javier Vásquez

Posted on: November 29, 2024

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention ac...

TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability

Papers with Code

By Naomi Wilson

Posted on: November 29, 2024

The rapid development of large language models (LLMs) has significantly advanced large multimodal models (LMMs), particularly in vision-language tasks. However, existing video-language models often overlook precise temporal localization and struggle with videos of varying lengths. We introduce ...

Exploring Discrete Flow Matching for 3D De Novo Molecule Generation

Papers with Code

By Kate Martin

Posted on: November 27, 2024

Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Flow matching is a recently proposed generative modeling framework that has achieved impressive performance on a variety of tasks including those on biomolecular structures. The semina...

On the Efficiency of NLP-Inspired Methods for Tabular Deep Learning

Papers with Code

By Kate Martin

Posted on: November 27, 2024

Recent advancements in tabular deep learning (DL) have led to substantial performance improvements, surpassing the capabilities of traditional models. With the adoption of techniques from natural language processing (NLP), such as language model-based approaches, DL models for tabular data have also...
