Historical video archives often suffer from degraded or entirely missing audio tracks, whether from deterioration of the storage media, the recording limitations of their era, or loss during archival processing. Similarly, silent films and performance documentation may lack synchronized sound altogether. Emerging generative artificial intelligence techniques have demonstrated the potential to reconstruct missing audio from visual information alone, a capability particularly valuable for restoring cultural heritage materials and historical performance recordings. When applied to complex activities such as musical instrument performance, however, existing methods capture the nuances of sound production with only limited accuracy. Prior research has established that SpecVQGAN architectures combined with Transformer-based mechanisms can improve video-to-audio generation. This work introduces an enhanced model that augments SpecVQGAN with human skeletal pose features, designed specifically to improve the quality of generated musical instrument sounds. Through evaluation using both subjective user studies and objective quantitative metrics, we demonstrate that the proposed framework significantly outperforms existing approaches in reconstructing authentic instrumental audio from archival and silent performance videos.
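The abstract describes conditioning a SpecVQGAN/Transformer video-to-audio pipeline on human skeletal pose features. As a minimal illustrative sketch only (not the authors' implementation), the snippet below shows one common way such conditioning can be wired up: per-frame 2D keypoints are flattened and projected into the same feature space as the visual frame features, fused by addition, and then mixed across time by a single self-attention layer before being handed to an audio-token decoder. All names, dimensions, the 17-joint skeleton, and the fusion-by-addition choice are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T, J, D = 16, 17, 64  # frames, joints (COCO-style skeleton, assumed), feature dim

def embed_pose(keypoints, W):
    """Flatten per-frame (J, 2) keypoints and linearly project to D dims."""
    flat = keypoints.reshape(keypoints.shape[0], -1)   # (T, J*2)
    return flat @ W                                    # (T, D)

def self_attention(x):
    """Single-head scaled dot-product self-attention over the time axis."""
    scores = x @ x.T / np.sqrt(x.shape[-1])            # (T, T) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ x                                 # (T, D) mixed tokens

# Hypothetical inputs: per-frame RGB features and detected 2D keypoints.
visual_feats = rng.standard_normal((T, D))
keypoints = rng.standard_normal((T, J, 2))
W_pose = rng.standard_normal((J * 2, D)) * 0.01

# Fuse visual and pose streams, then mix information across frames.
fused = visual_feats + embed_pose(keypoints, W_pose)
context = self_attention(fused)  # (T, D) conditioning tokens for a decoder
print(context.shape)  # (16, 64)
```

In a full system these conditioning tokens would drive an autoregressive Transformer that predicts SpecVQGAN spectrogram codebook indices; here the point is only how pose information can enter the feature stream alongside the visual features.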
- Cite this work
- Haruka Okano (Author), Yuichi Sei (Author), Yasuyuki Tahara (Author), Akihiko Ohsuga (Author), 2026, Generating Instrument Sounds Aligned with Video via Human Body Keypoints, Munich, GRIN Verlag, https://www.hausarbeiten.de/document/1696412