Music Processing, Audio Segmentation, User Interface
Introduction
The Ehrenreich Collection is a unique archive of private opera recordings amassed by the New York opera enthusiast Leroy Ehrenreich, documenting live performances from major opera venues between 1965 and 2010. It contains thousands of hours of bootleg recordings, including live captures, radio tapes, and some commercial sources, reflecting over four decades of vocal performance, repertoire, and interpretive practice. [1][2]
Since 2018, the Hochschule der Künste Bern (Bern Academy of the Arts) has been digitizing, cataloguing, and researching the collection within the project Ehrenreich Collection — Identity, Voice, Sound. This collaborative effort also involves EPFL’s Cultural Heritage and Innovation Center, which provides essential support for the digitization process. The work not only preserves these rare acoustic documents but also supports long-term study of cultural, interpretative, and reception phenomena in live opera, covering aspects such as performance variation, audience sound, and the broader context of opera bootlegging culture.
The collection serves as an important resource for musicological and computational research, opening perspectives on opera interpretation and facilitating efforts, like those in this project, to analyze and segment recordings using automated audio processing techniques.
Even with such a rich collection, studying operas can be tedious. One of the first things a musicologist might want to do is segment an opera recording into its constituent movements. No such segmentation exists for the Ehrenreich Collection, and producing one is the goal of this project.
Tackling this problem requires considering different axes of research and evaluating the trade-offs of each. Two complementary approaches were explored:
- Audio-only segmentation, relying solely on acoustic features extracted from the opera recordings:
  - Basic energy-based methods (silence and applause detection)
  - Novelty-curve-based methods using various audio features (chromagram, MFCCs, tempogram); a minimal sketch of this idea follows the list.
- Segment alignment, where pre-existing opera segments are aligned to full-length recordings.
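To give an idea of what a novelty-curve method involves, here is a minimal sketch of a chroma-based novelty curve using a checkerboard kernel over a self-similarity matrix. It assumes librosa and numpy are available; the project's own pipeline lives in src and may use different features and parameters.

```python
# Minimal chroma-based novelty-curve sketch (assumes librosa and numpy);
# the project's own feature pipeline in src may differ.
import librosa
import numpy as np

y, sr = librosa.load("index_data/report_sample.wav", sr=None)
hop = 512

# Chromagram as the underlying feature; MFCCs or a tempogram could be swapped in.
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)

# Cosine self-similarity matrix over feature frames.
norm = chroma / (np.linalg.norm(chroma, axis=0, keepdims=True) + 1e-9)
ssm = norm.T @ norm

# Slide a checkerboard kernel along the main diagonal to obtain the novelty curve.
half = 16
signs = np.sign(np.arange(-half, half) + 0.5)
kernel = np.outer(signs, signs)
padded = np.pad(ssm, half, mode="constant")
novelty = np.array([
    np.sum(padded[i:i + 2 * half, i:i + 2 * half] * kernel)
    for i in range(ssm.shape[0])
])
novelty = np.maximum(novelty, 0.0)
novelty /= novelty.max() + 1e-9

# Peaks in the novelty curve are candidate segment boundaries.
peaks = librosa.util.peak_pick(novelty, pre_max=20, post_max=20,
                               pre_avg=20, post_avg=20, delta=0.2, wait=40)
print(librosa.frames_to_time(peaks, sr=sr, hop_length=hop))
```

The same structure applies to the other features: only the feature matrix (MFCCs, tempogram) changes, while the self-similarity and peak-picking steps stay the same.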
In addition to algorithmic development, a software application was implemented with PyQt6 [3] to allow interactive exploration and comparison of all proposed methods. Video demonstrations of the application can be found under the ▶ Application Usage sections throughout the report. The application can be downloaded at this link.
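For readers unfamiliar with PyQt6, the following minimal window skeleton illustrates the kind of structure such an application builds on; the class name and widget content are hypothetical and do not come from the actual application code.

```python
# Illustrative PyQt6 skeleton only; class name and content are hypothetical.
import sys
from PyQt6.QtWidgets import QApplication, QLabel, QMainWindow

class SegmentationExplorer(QMainWindow):
    def __init__(self) -> None:
        super().__init__()
        self.setWindowTitle("Segmentation Explorer")
        self.setCentralWidget(QLabel("Load a recording to compare segmentation methods"))

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = SegmentationExplorer()
    window.show()
    sys.exit(app.exec())
```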
Note: All the code in this report (imported from src) can be found here.
from src.audio.audio_file import AudioFile
from src.audio.signal import Signal
from src.io.ts_annotation import TSAnnotations

# Load the demonstration audio and its annotated content parts.
signal: Signal = AudioFile("index_data/report_sample.wav").load()
content_parts = TSAnnotations.load_annotations("index_data/transitions.csv")
Note 2: This report uses a reference audio file named report_sample.wav for demonstration purposes. This file is a short example made from two songs and one applause segment. The first song is Mumbo Sugar by Arc de Soleil [4] and the second is one of the themes from Princess Mononoke by Joe Hisaishi [5]. The file is constructed as follows:
- 0s - 20.5s: Mumbo Sugar by Arc de Soleil
- 20.5s - 23.5s: Silence
- 23.5s - 37s: Mumbo Sugar sped up by a factor of 2
- 37s - 45s: Applause sound effect
- 45s - 66s (1:06): Mumbo Sugar at normal speed
- 66s (1:06) - 92s (1:32): Princess Mononoke by Joe Hisaishi
This construction provides several types of transitions that mimic those found in opera recordings (change of tempo, silence, applause, change of timbre or harmonic structure, etc.). We define the ground-truth transitions at the following timestamps: 22s, 41s, and 66s (1:06).
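As a small illustration, the ground truth can be stored in seconds and compared against detected boundaries within a tolerance window. The helper below is a plain-Python sketch, and the 2-second tolerance is an arbitrary value chosen for the example, not one taken from the project.

```python
# Sketch only: ground-truth transitions in seconds and a tolerance-based check.
# The 2-second tolerance is an arbitrary value chosen for illustration.
GROUND_TRUTH = [22.0, 41.0, 66.0]  # 22s, 41s, 1:06

def matched(detected: list[float], truth: list[float], tol: float = 2.0) -> list[bool]:
    """For each ground-truth transition, is there a detection within tol seconds?"""
    return [any(abs(d - t) <= tol for d in detected) for t in truth]

print(matched([21.4, 40.2, 70.0], GROUND_TRUTH))  # [True, True, False]
```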
The different “content” parts of the audio are colored on the plots to help visualize the segmentation results (transitions remain white).
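To show what this coloring looks like, here is a sketch using matplotlib with the part boundaries hard-coded from the construction above; the interface of the loaded TSAnnotations object is not documented here, and leaving the silence and applause regions uncolored is an assumption about what counts as a transition region.

```python
# Illustration only (assumes matplotlib and numpy). Part boundaries are hard-coded
# from the construction above; leaving silence and applause uncolored is an assumption.
import matplotlib.pyplot as plt
import numpy as np

duration = 92.0  # report_sample.wav is 1:32 long
t = np.linspace(0.0, duration, 2000)
y = np.zeros_like(t)  # placeholder waveform; plot the real samples in practice

# (start, end, color) for each colored content part.
parts = [
    (0.0, 20.5, "tab:blue"),     # Mumbo Sugar
    (23.5, 37.0, "tab:orange"),  # Mumbo Sugar, 2x speed
    (45.0, 66.0, "tab:green"),   # Mumbo Sugar, normal speed
    (66.0, 92.0, "tab:purple"),  # Princess Mononoke
]

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(t, y, color="black", linewidth=0.5)
for start, end, color in parts:
    ax.axvspan(start, end, color=color, alpha=0.3)
ax.set_xlabel("Time (s)")
ax.set_ylabel("Amplitude")
plt.show()
```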