VoiceRestore: Flow-Matching Transformers for Speech Recording Quality Restoration
By Javier Vásquez
Posted on: January 06, 2025
We present VoiceRestore, a novel approach to restoring the quality of speech recordings using flow-matching Transformers trained in a self-supervised manner on synthetic data. Our method tackles a wide range of degradations frequently found in both short and long-form speech recordings, including ba...
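The abstract names flow matching as the training framework. As a general technique (not VoiceRestore's exact formulation), conditional flow matching regresses a model onto the velocity of a straight path between a noise sample and a data sample; a minimal sketch of how the training pair is constructed:

```python
import numpy as np

# Hedged sketch of the conditional flow-matching training target
# (the general technique; not VoiceRestore's specific setup).
# A point on the straight path between noise x0 and data x1 at time t is
#   x_t = (1 - t) * x0 + t * x1,
# and the regression target for the model's velocity field is
#   v = x1 - x0.

def flow_matching_pair(x0, x1, t):
    """Return (x_t, target velocity) for conditional flow matching."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)   # noise sample
x1 = rng.standard_normal(4)   # stand-in for a clean-speech feature vector
x_t, v = flow_matching_pair(x0, x1, t=0.5)
```

A Transformer trained on many such `(x_t, t) -> v` pairs can then restore a degraded recording by integrating the learned velocity field from noise toward clean data.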
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
By Naomi Wilson
Posted on: January 06, 2025
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutio...
MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization
By Naomi Wilson
Posted on: January 06, 2025
Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation...
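The title's "Mel Residual Vector Quantization" builds on residual vector quantization (RVQ), a standard multi-stage tokenization scheme. A minimal sketch of the general RVQ idea (not MuQ's actual tokenizer): each stage quantizes the residual left by the previous stage against its own codebook.

```python
import numpy as np

# Minimal residual vector quantization (RVQ) sketch — the general
# technique, not MuQ's specific Mel-based tokenizer.

def rvq_encode(x, codebooks):
    """Quantize vector x with a list of codebooks; return the per-stage
    indices and the reconstruction (sum of the chosen codewords)."""
    residual = x.copy()
    indices, recon = [], np.zeros_like(x)
    for cb in codebooks:
        # pick the codeword nearest to the current residual
        i = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(i)
        recon += cb[i]
        residual = residual - cb[i]
    return indices, recon

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((8, 3)) for _ in range(4)]  # 4 stages, 8 codewords each
x = rng.standard_normal(3)
idx, recon = rvq_encode(x, codebooks)
```

Stacking stages lets a small set of codebooks represent a fine-grained space, which is why RVQ-style tokenizers are popular targets for self-supervised audio models.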
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
By Javier Vásquez
Posted on: January 06, 2025
Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substan...
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
By Javier Vásquez
Posted on: January 03, 2025
Multimodal representation learning with contrastive learning plays an important role in artificial intelligence. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. ...
VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control
By Kate Martin
Posted on: January 03, 2025
While diffusion models show extraordinary capabilities in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and real-world aesthetic images in finer-grained dimensions including color, lighting, comp...
Calibre: Towards Fair and Accurate Personalized Federated Learning with Self-Supervised Learning
By Kate Martin
Posted on: January 01, 2025
In the context of personalized federated learning, existing approaches train a global model to extract transferable representations, based on which any client could train personalized models with a limited number of data samples. Self-supervised learning is considered a promising direction as the gl...
Open-Sora: Democratizing Efficient Video Production for All
By Kate Martin
Posted on: January 01, 2025
Vision and language are two foundational modalities for humans, and they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, is far la...