GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot
By Naomi Wilson
Posted on: December 04, 2024
We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-...
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
By Kate Martin
Posted on: December 02, 2024
Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities, e.g., the state-of-the-art model Qwen2-VL only achieves a 43.9 recall rate o...
StableAnimator: High-Quality Identity-Preserving Human Image Animation
By Javier Vásquez
Posted on: November 29, 2024
Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a...
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
By Javier Vásquez
Posted on: November 29, 2024
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility tree), they show limitations in perceiving UI visual...
Star Attention: Efficient LLM Inference over Long Sequences
By Javier Vásquez
Posted on: November 29, 2024
Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention ac...
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
By Naomi Wilson
Posted on: November 29, 2024
The rapid development of large language models (LLMs) has significantly advanced multimodal large language models (LMMs), particularly in vision-language tasks. However, existing video-language models often overlook precise temporal localization and struggle with videos of varying lengths. We introduce ...
Exploring Discrete Flow Matching for 3D De Novo Molecule Generation
By Kate Martin
Posted on: November 27, 2024
Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Flow matching is a recently proposed generative modeling framework that has achieved impressive performance on a variety of tasks including those on biomolecular structures. The semina...
On the Efficiency of NLP-Inspired Methods for Tabular Deep Learning
By Kate Martin
Posted on: November 27, 2024
Recent advancements in tabular deep learning (DL) have led to substantial performance improvements, surpassing the capabilities of traditional models. With the adoption of techniques from natural language processing (NLP), such as language model-based approaches, DL models for tabular data have also...