Historical video archives often suffer from degraded or entirely missing audio tracks, whether from deterioration of the storage media, the recording limitations of their era, or loss during archival processing. Similarly, silent films and performance documentation may lack synchronized sound altogether. Emerging generative artificial intelligence techniques have demonstrated the potential to reconstruct missing audio from visual information alone, a capability particularly valuable for restoring cultural heritage materials and historical performance recordings. When applied to complex activities such as musical instrument performance, however, existing methods capture the nuances of sound production with only limited accuracy. Prior research has established that SpecVQGAN architectures combined with Transformer-based mechanisms can improve video-to-audio generation. This work introduces an enhanced model that augments SpecVQGAN with human skeletal pose features, designed specifically to improve the quality of generated musical instrument sounds. Through evaluation using both subjective user studies and objective quantitative metrics, we demonstrate that the proposed framework significantly outperforms existing approaches in reconstructing authentic instrumental audio from archival and silent performance videos.
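The abstract describes conditioning a SpecVQGAN/Transformer video-to-audio pipeline on human skeletal pose features. As a minimal illustrative sketch only (not the authors' implementation), the snippet below shows one common way such conditioning can be wired up: per-frame 2D keypoints are flattened and projected into the same feature space as the visual frame features, fused by addition, and then mixed across time by a single self-attention layer before being handed to an audio-token decoder. All names, dimensions, the 17-joint skeleton, and the fusion-by-addition choice are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T, J, D = 16, 17, 64  # frames, joints (COCO-style skeleton, assumed), feature dim

def embed_pose(keypoints, W):
    """Flatten per-frame (J, 2) keypoints and linearly project to D dims."""
    flat = keypoints.reshape(keypoints.shape[0], -1)   # (T, J*2)
    return flat @ W                                    # (T, D)

def self_attention(x):
    """Single-head scaled dot-product self-attention over the time axis."""
    scores = x @ x.T / np.sqrt(x.shape[-1])            # (T, T) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ x                                 # (T, D) mixed tokens

# Hypothetical inputs: per-frame RGB features and detected 2D keypoints.
visual_feats = rng.standard_normal((T, D))
keypoints = rng.standard_normal((T, J, 2))
W_pose = rng.standard_normal((J * 2, D)) * 0.01

# Fuse visual and pose streams, then mix information across frames.
fused = visual_feats + embed_pose(keypoints, W_pose)
context = self_attention(fused)  # (T, D) conditioning tokens for a decoder
print(context.shape)  # (16, 64)
```

In a full system these conditioning tokens would drive an autoregressive Transformer that predicts SpecVQGAN spectrogram codebook indices; here the point is only how pose information can enter the feature stream alongside the visual features.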
- Cite this work
- Haruka Okano (Author), Yuichi Sei (Author), Yasuyuki Tahara (Author), Akihiko Ohsuga (Author), 2026, Generating Instrument Sounds Aligned with Video via Human Body Keypoints, Munich, GRIN Verlag, https://www.hausarbeiten.de/document/1696412