Research Posts

VoiceRestore: Flow-Matching Transformers for Speech Recording Quality Restoration

Papers with Code
By Javier Vásquez

Posted on: January 06, 2025

We present VoiceRestore, a novel approach to restoring the quality of speech recordings using flow-matching Transformers trained in a self-supervised manner on synthetic data. Our method tackles a wide range of degradations frequently found in both short and long-form speech recordings, including ba...

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Papers with Code
By Naomi Wilson

Posted on: January 06, 2025

Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutio...

MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization

Papers with Code
By Naomi Wilson

Posted on: January 06, 2025

Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation...

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Papers with Code
By Javier Vásquez

Posted on: January 06, 2025

Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substan...

Hierarchical Banzhaf Interaction for General Video-Language Representation Learning

Papers with Code
By Javier Vásquez

Posted on: January 03, 2025

Multimodal representation learning, with contrastive learning, plays an important role in the artificial intelligence domain. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. ...

VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control

Papers with Code
By Kate Martin

Posted on: January 03, 2025

While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, comp...

Calibre: Towards Fair and Accurate Personalized Federated Learning with Self-Supervised Learning

Papers with Code
By Kate Martin

Posted on: January 01, 2025

In the context of personalized federated learning, existing approaches train a global model to extract transferable representations, based on which any client could train personalized models with a limited number of data samples. Self-supervised learning is considered a promising direction as the gl...

Open-Sora: Democratizing Efficient Video Production for All

Papers with Code
By Kate Martin

Posted on: January 01, 2025

Vision and language are the two foundational senses for humans, and they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, is far la...
