Ehrenreich Collection Segmentation

Digital and Cognitive Musicology Laboratory, EPFL

Published

January 13, 2026

Keywords

Music Processing, Audio Segmentation, User Interface


Introduction

The Ehrenreich Collection is a unique archive of private opera recordings amassed by the New York opera enthusiast Leroy Ehrenreich, documenting live performances from major opera venues between 1965 and 2010. It contains thousands of hours of bootleg recordings, including live captures, radio tapes, and some commercial sources, reflecting over four decades of vocal performance, repertoire, and interpretive practice. [1] [2]

Since 2018, the Hochschule der Künste Bern (Bern Academy of the Arts) has been digitizing, cataloguing, and researching the collection within the project Ehrenreich Collection — Identity, Voice, Sound. This collaborative effort also involves EPFL’s Cultural Heritage and Innovation Center, whose help in the digitization process is essential. This work not only preserves these rare acoustic documents but also supports long-term study of cultural, interpretative, and reception phenomena in live opera, exploring such aspects as performance variation, audience sound, and the broader context of opera bootlegging culture.

The collection serves as an important resource for musicological and computational research, opening perspectives on opera interpretation and facilitating efforts, like those in this project, to analyze and segment recordings using automated audio processing techniques.

Even with such a rich collection, studying the operas remains laborious. One of the first things a musicologist might want is to segment an opera recording into its constituent movements. No such segmentation exists yet for the Ehrenreich collection, and producing one is precisely the goal of this project.

Tackling this problem requires considering different research directions and weighing their trade-offs. Two complementary approaches were explored:

  1. Audio-only segmentation, relying solely on acoustic features extracted from the opera recordings.
    1. Basic energy-based methods (silence and applause detection)
    2. Novelty-curve based methods using various audio features (chromagram, MFCCs, tempogram)
  2. Segment alignment, where pre-existing opera segments are aligned to full-length recordings.

In addition to algorithmic development, a software application using PyQt6 [3] was implemented to allow interactive exploration and comparison of all proposed methods. Video demonstrations of the application usage can be found under the ▶ Application Usage sections throughout the report. The application can be downloaded at this link.

Note: All the code in this report (imported from src) can be found here.

from src.audio.audio_file import AudioFile
from src.audio.signal import Signal
from src.io.ts_annotation import TSAnnotations
signal: Signal = AudioFile("index_data/report_sample.wav").load()
content_parts = TSAnnotations.load_annotations("index_data/transitions.csv")
* Loaded Audio Path 'index_data/report_sample.wav' 
* Samples Number: 4083923 
* Sample Rate: 44100 Hz 
* Duration: 00:01:32 
* File Size: 7.79 MB

Note 2: This report uses a reference audio file named report_sample.wav for demonstration purposes. This file is a short example made from two songs and one applause segment. The first song is Mumbo Sugar by Arc de Soleil [4] and the second one is one of the themes of Princess Mononoke by Joe Hisaishi [5]. The segment is constructed as follows:

  • 0s - 20.5s: Mumbo Sugar by Arc de Soleil
  • 20.5s - 23.5s: Silence
  • 23.5s - 37s: Mumbo Sugar sped up by a factor 2
  • 37s - 45s: Applause sound effect
  • 45s - 1:06s: Mumbo Sugar at normal speed
  • 1:06s - 1:32s: Princess Mononoke by Joe Hisaishi

This construction produces several types of transitions that mimic those found in operas (tempo changes, silence, applause, changes of timbre and harmonic structure, etc.). We define the ground-truth transitions at the following timestamps: 22s, 41s, and 1:06s.

The different “content” parts of the audio are colored on the plots to help visualize the segmentation results (transitions remain white).

1.a – Basic Segmentation Methods

Sections 1.a and 1.b present segmentation algorithms relying exclusively on audio features extracted from the opera recordings.

This section focuses on energy-based algorithms.

Silence Curve

Principle

Silence-based segmentation identifies structural boundaries by detecting low-energy regions in audio signals. These quiet moments often mark natural transitions between musical movements, sections, or pieces in opera recordings. The method works by measuring audio energy levels over time and highlighting regions where the signal power drops significantly below the average level.

Two main approaches exist for computing energy levels, each with distinct characteristics and applications:

Amplitude-based detection measures energy directly in the time domain using the Root Mean Square (RMS) of the audio signal:

E_{amp}(n) = \sqrt{\frac{1}{N} \sum_{k=0}^{N-1} |x(n \cdot H + k)|^2}

where E_{amp}(n) is the energy at frame n, x(k) is the audio signal, N is the frame length, and H is the hop length between frames.

Spectral-based detection computes energy in the frequency domain by analyzing the Short Time Fourier Transform (STFT):

E_{spec}(n) = \sum_{f=0}^{F-1} |X(n, f)|^2

where E_{spec}(n) is the spectral energy at frame n, X(n, f) is the STFT coefficient at frame n and frequency bin f, and F is the total number of frequency bins.

The amplitude method excels at detecting clear silences such as pauses between movements, while the spectral method proves more sensitive to subtle changes in orchestral texture and can identify quiet sustained notes that might be missed by amplitude analysis alone. The silence curve is then computed by inverting the energy values, creating peaks at low energy regions that indicate potential structural boundaries.
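To make the amplitude-based formula concrete, here is a minimal NumPy/librosa sketch, independent of the project’s SilenceCurveBuilder; the frame and hop values simply mirror the builder call below, and the min-max normalization choice is illustrative.

import numpy as np
import librosa

def silence_curve_amplitude(x: np.ndarray, frame_length: int = 44100, hop_length: int = 22050) -> np.ndarray:
    """Inverted, min-max normalized RMS energy: peaks correspond to quiet regions."""
    # E_amp(n): frame-wise RMS energy (time-domain formula above)
    energy = librosa.feature.rms(y=x, frame_length=frame_length, hop_length=hop_length)[0]
    energy = (energy - energy.min()) / (energy.max() - energy.min() + 1e-12)
    return 1.0 - energy  # invert so that low-energy frames become peaks

# Usage on the report sample:
# x, sr = librosa.load("index_data/report_sample.wav", sr=44100)
# curve = silence_curve_amplitude(x)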

Implementation

Energy is computed over short-time frames and smoothed to obtain a silence curve.

from src.audio_features.builders import SilenceCurveBuilder
from src.audio_features.features import SilenceCurve

builder = SilenceCurveBuilder(silence_type="spectral", frame_length=44100, hop_length=22050)
silence_curve = builder.build(signal)
peaks = silence_curve.find_peaks(threshold=0.8, distance_seconds=5)
silence_curve.plot(original_signal=signal, time_annotations=content_parts, peaks=peaks, figsize=(8,8))

Silence curve showing low-energy regions in the audio signal

As can be observed in the plot, the silence curve reaches its highest peak at the spot with the lowest spectral energy, effectively finding the “silence” transition at ~22 seconds described in the Introduction.

Applause Curve (HRPS)

Principle

Harmonic/Percussive separation, originally developed by Dr. Derry FitzGerald [6], is a powerful technique that splits an audio signal into two new ones, one containing the harmonic components and the other the percussive ones. The algorithm has been further developed to include a residual part in between the two components, capturing elements of a sound that can be classified as neither harmonic nor percussive [7]. This extended technique is called Harmonic-Residual-Percussive Separation (HRPS) and turns out to be very effective at finding applause in audio recordings. The intuition behind why HRPS works in that situation is that applause is inherently percussive but also lasts a long time, placing it exactly in between the two categories.

Once the three components are extracted from the audio signal, the energy over time can be computed on each of them, similarly to the silence curve method seen above. The residual component’s energy over time forms an applause curve, which is used to detect applause. In the following plot, all three components have been kept for visualization purposes, unlike the application, which only shows the residual curve.
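As an illustration, the sketch below uses librosa’s median-filtering HPSS with a margin parameter, which plays a role analogous to (though not identical to) the β threshold of the project’s HRPSBuilder: with margin > 1, part of the spectrogram is assigned to neither mask and forms the residual.

import numpy as np
import librosa

x, sr = librosa.load("index_data/report_sample.wav", sr=44100)

# Complex STFT, then harmonic/percussive masking with a safety margin.
# With margin > 1 the two masks become stricter, so H + P no longer covers everything.
D = librosa.stft(x, n_fft=4410, hop_length=2205)
H, P = librosa.decompose.hpss(D, margin=2.0)
R = D - H - P  # residual: neither clearly harmonic nor clearly percussive (e.g. applause)

# Frame-wise energy of the residual forms the applause curve
applause_curve = np.sum(np.abs(R) ** 2, axis=0)
applause_curve /= applause_curve.max() + 1e-12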

Implementation

from src.audio_features.builders import HRPSBuilder
from src.audio_features.features import HRPS

builder = HRPSBuilder(L_h_frames=30, L_p_bins=100, beta=1.75, frame_length=4410, hop_length=2205, downsampling_factor=5)
hrps = builder.build(signal)
hrps.plot(original_signal=signal, time_annotations=content_parts, figsize=(8,8))
Computing STFT... Computed STFT
Applying Median for Harmonic Component (1/2)
Y.shape: (2206, 371)
Median filtering took 0.27 seconds
Applying Median for Percussive Component (2/2)
Median filtering took 1.18 seconds
Computing Masks... Computed Masks
Computing Inverse STFT for x_h (1/3)
Computing Inverse STFT for x_r (2/3)
Computing Inverse STFT for x_p (3/3)
Computing Local Energy for Harmonic Signal (1/3)
Computing Local Energy for Residual Signal (2/3)
Computing Local Energy for Percussive Signal (3/3)
N bins: 3

HRPS curve showing high-energy regions in each of the three components. top: Percussive component, middle: Residual component, bottom: Harmonic component.

As observed in the plot, the HRPS technique successfully identifies the applause segment at approximately 41 seconds, corresponding to the applause sound effect described in the Introduction. The residual component captures this percussive yet sustained characteristic, which sits between the purely harmonic and purely percussive categories, without picking up any other musical content of the sample.

▶ Application Usage

The following demonstration video shows how to use the application to explore the basic segmentation methods presented in this section. After a module outputs a curve with its detected transitions, you can add either all of that module’s transitions or only specific ones: right-click on the plot (or on a specific transition) and choose “add all transitions” or “add this transition”, respectively. All added transitions are then displayed on the main timeline at the bottom of the application window, which constitutes the final segmentation output that can be exported.

1.b – Novelty Curve Segmentation Methods

This section presents segmentation methods based on novelty curves, a sophisticated approach that identifies structural boundaries by analyzing patterns of change in audio features over time.

The Novelty Curve Pipeline

The novelty curve approach follows a systematic four-step pipeline: raw signal → feature extraction → self-similarity matrix → novelty curve. This methodology can work with any type of audio feature that captures meaningful musical characteristics.

Self Similarity Matrix (SSM): The foundation of novelty curves lies in the self similarity matrix [8], which compares every moment in the audio with every other moment using the extracted features. For a feature sequence F = [f_1, f_2, ..., f_N], the SSM is computed as:

S(i,j) = \text{similarity}(f_i, f_j)

where the similarity function (typically cosine similarity or Euclidean distance) measures how similar two feature vectors are. The resulting matrix reveals the internal structure of the music: similar sections create block patterns along the diagonal, while transitions between different sections appear as changes in these patterns (more details here).

Novelty Curve Computation: The novelty curve extracts boundary information from the SSM by applying a checkerboard kernel that detects corner points between adjoining blocks. This process identifies sudden changes in the similarity patterns, which correspond to structural transitions:

NC(n) = \sum_{k,\ell} K(k,\ell) \cdot S(n+k, n+\ell)

where K(k,\ell) is a checkerboard kernel and NC(n) represents the novelty value at time frame n. When the kernel is positioned within uniform regions, positive and negative values cancel out, resulting in low novelty. When positioned at structural boundaries, all values align positively, creating peaks that indicate potential segmentation points (more details here).
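The following self-contained NumPy sketch summarizes the whole pipeline for an arbitrary (d × N) feature matrix. It is a simplified stand-in for the project’s SSMBuilder / compute_novelty_curve chain, and the Gaussian taper parameterization is illustrative.

import numpy as np

def novelty_from_features(F: np.ndarray, L: int = 20, variance: float = 2.0) -> np.ndarray:
    """F: (d, N) feature matrix. Returns a min-max normalized novelty curve of length N."""
    # 1) Self-similarity matrix via cosine similarity of the feature columns
    Fn = F / (np.linalg.norm(F, axis=0, keepdims=True) + 1e-12)
    S = Fn.T @ Fn                                   # S[i, j] = similarity(f_i, f_j)

    # 2) Gaussian checkerboard kernel: +1 in the diagonal quadrants, -1 in the cross quadrants
    ax = np.arange(-L, L + 1)
    taper = np.exp(-((ax / (variance * L)) ** 2))
    K = np.outer(np.sign(ax + 0.5), np.sign(ax + 0.5)) * np.outer(taper, taper)

    # 3) Correlate the kernel along the main diagonal of the (zero-padded) SSM
    N = S.shape[0]
    S_pad = np.pad(S, L, mode="constant")
    nc = np.array([np.sum(K * S_pad[n:n + 2 * L + 1, n:n + 2 * L + 1]) for n in range(N)])
    nc = np.maximum(nc, 0.0)
    return (nc - nc.min()) / (nc.max() - nc.min() + 1e-12)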

Feature Selection and Complementarity

While this methodology can accommodate any audio feature, this project focuses on three complementary representations that together provide comprehensive musical description coverage:

  • Chromagram: Captures harmonic and tonal changes, tracking key transitions and chord progressions
  • Mel Frequency Cepstral Coefficients (MFCCs): Captures timbral variations, detecting changes in instrumentation and vocal texture
  • Tempogram: Focuses on rhythmic changes, identifying tempo shifts and rhythmic pattern variations

The choice of these three features is motivated by their complementary nature: they cover the primary dimensions of musical perception (harmony, timbre, rhythm) while remaining computationally efficient compared to, for example, high-dimensional spectrograms.

Novelty Curve: Chromagram

Principle

A chromagram represents how strong each of the twelve musical notes (C, C#, D, …, B) appears over time, regardless of which octave they’re played in. This makes it perfect for tracking harmonic changes and key transitions in music, since it focuses on the musical “color” rather than the exact pitch height.

The chromagram is computed by grouping frequency components into pitch classes:

C(n, p) = \sum_{f \in F_p} |X(n, f)|^2

where C(n, p) is the chroma value for time frame n and pitch class p, X(n, f) is the frequency content from the STFT, and F_p contains all frequency bins that correspond to pitch class p across different octaves.
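For reference, the same quantity can be obtained directly with librosa; this is only a sketch and not the project’s ChromagramBuilder, whose exact preprocessing chain is shown below.

import librosa

x, sr = librosa.load("index_data/report_sample.wav", sr=44100)
# chroma_stft pools STFT energy into the 12 pitch classes, yielding a (12, N) matrix
C = librosa.feature.chroma_stft(y=x, sr=sr, n_fft=4410, hop_length=2205)
C = librosa.util.normalize(C, norm=2, axis=0)  # L2-normalize each frame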

Implementation

from src.audio_features.builders import ChromagramBuilder, SSMBuilder
from src.audio_features.features import Chromagram
import numpy as np

builder = ChromagramBuilder(frame_length=4410, hop_length=2205)
chromagram = builder.build(signal=signal)
chromagram = chromagram.normalize(norm='2', threshold=0.001)
chromagram = chromagram.smooth(filter_length=11, window_type='boxcar')
chromagram = chromagram.downsample(factor=5)
chromagram = chromagram.ensure_positive() # Important for log compression
chromagram = chromagram.log_compress(gamma=20)

ssm = SSMBuilder(
    smoothing_filter_length=1,
    smoothing_filter_direction='both',
    shift_set=np.array([0]),
    tempo_relative_set=np.array([1]),
).build(chromagram)

ssm = ssm.threshold(thresh=0.8)
ssm.plot(x_axis_type='time', original_base_feature=chromagram, time_annotations=content_parts)

(top) Chromagram computed from the raw signal showing the 12 pitch classes, (bottom) self-similarity matrix derived from this chromagram. The intensity of the first plot tells how much a pitch is present at a given time, while the second plot shows how similar different time frames are in terms of harmonic content.
from src.audio_features.features import NoveltyCurve

nc_chromagram: NoveltyCurve = ssm.compute_novelty_curve(kernel_size=20, variance=2, exclude_borders=True)

peaks = nc_chromagram.find_peaks(threshold=0.4, distance_seconds=30)
nc_chromagram.plot(x_axis_type='time', novelty_name='Chromagram Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

nc_chromagram_smooth = nc_chromagram.smooth(sigma=10)
nc_chromagram_smooth.plot(x_axis_type='time', novelty_name='Smoothed Chromagram Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

Novelty curve computed from the chromagram self-similarity matrix. Peaks indicate high novelty points in terms of harmonic features, possibly corresponding to structural boundaries. The curve has been normalized using min-max normalization, so that peak heights can be loosely read as transition likelihoods.

As can be observed, the highest peak is located at the 3rd transition (~1:06s), which corresponds to the change between the two songs and thus to a significant harmonic change.

Novelty Curve: MFCC

Principle

Mel-Frequency Cepstral Coefficients (MFCCs) capture the “color” or texture of sound: what makes a piano sound different from a violin playing the same note. They work by mimicking how our ears naturally process sound.

The human ear doesn’t hear all frequencies equally. We’re more sensitive to differences in low frequencies than high ones. MFCCs replicate this by using the Mel scale (Wiki), which spaces frequencies the same way our ear perceives them. Additionally, our hearing works logarithmically; the difference between 100Hz and 200Hz sounds similar to the difference between 1000Hz and 2000Hz. MFCCs apply this logarithmic compression to match our natural hearing.

The computation follows our ear’s processing steps:

Short-Time Fourier Transform (STFT) → Mel filterbank → log compression → Discrete Cosine Transform (DCT)

The Mel filterbank applies a series of triangular filters spaced according to the mel scale, effectively grouping frequency components the way our ear naturally organizes sound. This converts the linear frequency spectrum into perceptually meaningful frequency bands. The Discrete Cosine Transform (DCT) then compresses this information by extracting the most important patterns from the mel-filtered spectrum, creating a compact representation that captures the essential spectral shape while reducing dimensionality.

In opera recordings, MFCCs capture the overall sonic texture created by the blend of voices and orchestra. When the instrumentation changes, when a new singer enters, or when the recording conditions shift, MFCCs detect these timbral transitions effectively, making them excellent for finding section boundaries based on “how the music sounds” rather than “what notes are played.”

\mathrm{MFCC}(n,k) = \sum_{m=1}^{M} \log\!\left( \sum_{f} |X(n,f)|^2 \, H_m(f) \right) \cos\!\left( \frac{\pi k}{M}\left(m - \frac{1}{2}\right) \right)

where:

  • \mathrm{MFCC}(n, k) is the k-th MFCC coefficient at frame n,
  • X(n,f) is the Short-Time Fourier Transform (STFT) of the signal,
  • H_m(f) is the m-th Mel filterbank,
  • M is the number of Mel filters,
  • k is the cepstral coefficient index (typically k=0,\dots,K-1).
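As a point of reference, the STFT → Mel filterbank → log compression → DCT chain described above maps onto a few librosa calls; this is a sketch only, and the project’s MFCCBuilder below encapsulates its own parameters.

import librosa

x, sr = librosa.load("index_data/report_sample.wav", sr=44100)
# STFT -> Mel filterbank (power spectrogram pooled into Mel bands)
M = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=4410, hop_length=2205, n_mels=64)
# log compression -> DCT, keeping the first 12 coefficients
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(M), n_mfcc=12)  # shape: (12, N)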

Implementation

from src.audio_features.builders import MFCCBuilder, SSMBuilder
from src.audio_features.features import MFCC
import numpy as np

builder = MFCCBuilder(n_mfcc=12, frame_length=4410, hop_length=2205)
mfcc = builder.build(signal=signal)
mfcc = mfcc.normalize(norm='2', threshold=0.001)
mfcc = mfcc.smooth(filter_length=11, window_type='boxcar')
mfcc = mfcc.downsample(factor=5)
mfcc = mfcc.ensure_positive() # Important for log compression
mfcc = mfcc.log_compress(gamma=1)

ssm = SSMBuilder(
    smoothing_filter_length=1,
    smoothing_filter_direction='both',
    shift_set=np.array([0]),
    tempo_relative_set=np.array([1]),
).build(mfcc)

ssm = ssm.threshold(thresh=0.8)
ssm.plot(x_axis_type='time', original_base_feature=mfcc, time_annotations=content_parts)

(top) First 12 MFCCs computed from the raw signal, (bottom) self-similarity matrix derived from these MFCCs. The intensity of the first plot indicates the magnitude of each MFCC coefficient at a given time, while the second plot shows how similar different time frames are in terms of timbral content.
from src.audio_features.features import NoveltyCurve

nc_mfcc: NoveltyCurve = ssm.compute_novelty_curve(kernel_size=20, variance=2, exclude_borders=True)

peaks = nc_mfcc.find_peaks(threshold=0.2, distance_seconds=1)
nc_mfcc.plot(x_axis_type='time', novelty_name='MFCC Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

nc_mfcc_smooth = nc_mfcc.smooth(sigma=10)
nc_mfcc_smooth.plot(x_axis_type='time', novelty_name='Smoothed MFCC Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

Novelty curve computed from the MFCC self-similarity matrix. Peaks indicate high novelty points in terms of timbral features, possibly corresponding to structural boundaries. The curve has been normalized using min-max normalization, so that peak heights can be loosely read as transition likelihoods.

On this plot, the highest peak is located at the 2nd transition (~41s), which corresponds to the applause segment that introduces a significant timbral change in the audio. The other two transitions are also detected, but with lower intensity.

Novelty Curve: Tempogram

Principle

A tempogram reveals the rhythmic patterns and tempo changes in music over time. It works by detecting repeating patterns in the timing of musical events (such as notes or beats), helping identify tempo shifts and rhythmic transitions.

The computation process follows two main steps:

  1. Onset Detection: First, we identify when musical events happen by computing an onset strength envelope that highlights note attacks and rhythmic events (Wiki).

  2. Pattern Analysis: Then, we analyze these onset patterns using localized autocorrelation to find repeating rhythmic cycles at different time scales.

Mathematically, the tempogram is computed as:

T(n, \tau) = \sum_{k=0}^{w-1} O(n+k) \cdot O(n+k+\tau)

where T(n, \tau) is the tempogram value at time frame n and lag \tau, O(n) is the onset strength at frame n, and w is the analysis window length.
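A minimal librosa sketch of this two-step process (onset strength followed by local autocorrelation); the window length is illustrative and unrelated to the TempogramBuilder parameters below.

import librosa

x, sr = librosa.load("index_data/report_sample.wav", sr=44100)
# Step 1: onset strength envelope O(n)
onset_env = librosa.onset.onset_strength(y=x, sr=sr, hop_length=2205)
# Step 2: localized autocorrelation of O(n) -> tempogram T(n, tau), shape (win_length, N)
T = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr, hop_length=2205, win_length=128)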

Implementation

from src.audio_features.builders import TempogramBuilder, SSMBuilder
from src.audio_features.features import Tempogram
import numpy as np

builder = TempogramBuilder(frame_length=4410, hop_length=2205)
tempogram = builder.build(signal=signal)
tempogram = tempogram.normalize(norm='2', threshold=0.001)
tempogram = tempogram.smooth(filter_length=11, window_type='boxcar')
tempogram = tempogram.downsample(factor=5)
tempogram = tempogram.ensure_positive() # Important for log compression
tempogram = tempogram.log_compress(gamma=20)

ssm = SSMBuilder(
    smoothing_filter_length=1,
    smoothing_filter_direction='both',
    shift_set=np.array([0]),
    tempo_relative_set=np.array([1]),
).build(tempogram)

ssm = ssm.threshold(thresh=0.8)
ssm.plot(x_axis_type='time', original_base_feature=tempogram, time_annotations=content_parts)

(top) Tempogram computed from the raw signal showing the rhythmic patterns, (bottom) self-similarity matrix derived from this tempogram. The intensity of the first plot tells how much a certain tempo is present at a given time, while the second plot shows how similar different time frames are in terms of rhythmic content.
from src.audio_features.features import NoveltyCurve

nc_tempogram: NoveltyCurve = ssm.compute_novelty_curve(kernel_size=20, variance=2, exclude_borders=True)

peaks = nc_tempogram.find_peaks(threshold=0.4, distance_seconds=10)
nc_tempogram.plot(x_axis_type='time', novelty_name='Tempogram Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

nc_tempogram_smooth = nc_tempogram.smooth(sigma=10)
nc_tempogram_smooth.plot(x_axis_type='time', novelty_name='Smoothed Tempogram Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

Novelty curve computed from the tempogram self-similarity matrix. Peaks indicate high novelty points in terms of rhythmic features, possibly corresponding to structural boundaries. The curve has been normalized using min-max normalization, so that peak heights can be loosely read as transition likelihoods.

Here, the highest peak is located at the 1st transition (~22s), which corresponds to the most abrupt tempo change (normal speed to double speed). The second transition (~41s) also shows a peak, but a less pronounced one; a possible reason is that the applause adds noisy rhythmic content that confuses the tempogram. The last transition (~1:06s) is also detected, since the tempo changes between the two songs as well.

Note: In this example, the tempogram appears to be the most effective feature for detecting tempo changes, but in real opera recordings its performance tends to be less reliable than that of the other two features (chromagram and MFCC), due to the complex and varying rhythmic structures present in operatic music (see the Results section for more details).

Novelty Curve: Combination of Features

Principle

The combination approach leverages the complementary strengths of different audio features by intelligently merging their individual novelty curves into a unified segmentation result. Rather than relying on a single feature type, this method recognizes that different musical aspects—harmonic changes (chromagram), timbral variations (MFCC), and rhythmic shifts (tempogram)—provide different but valuable information for detecting structural boundaries.

The combination process works in two main steps:

  1. Weighted Feature Integration: Each novelty curve is assigned a weight (0.0 to 1.0) that reflects its importance for the specific analysis.

  2. Mathematical Combination: The weighted curves are combined using one of two methods:

    • Mean combination: Computes the weighted average, requiring consensus across features
    • Max combination: Takes the maximum value, emphasizing the strongest transitions from any feature

Mathematically, for mean combination: NC_{combined}(t) = \frac{\sum_{i} w_i \cdot NC_i(t)}{\sum_{i} w_i}

And for max combination: NC_{combined}(t) = \max_i(w_i \cdot NC_i(t))

where w_i are the feature weights and NC_i(t) are the individual novelty curves.
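A direct NumPy transcription of these two formulas; it is a sketch, and the project’s NoveltyCurve.combine used below is assumed to behave similarly but may differ in details such as renormalization.

import numpy as np

def combine_novelty_curves(curves, weights, method="mean"):
    """Weighted fusion of equal-length novelty curves."""
    C = np.vstack(curves)                      # shape: (n_features, N)
    w = np.asarray(weights, dtype=float)[:, None]
    if method == "mean":
        return (w * C).sum(axis=0) / w.sum()   # weighted average: requires consensus
    if method == "max":
        return (w * C).max(axis=0)             # strongest single-feature response
    raise ValueError(f"Unknown combination method: {method}")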

Implementation

from src.audio_features.features import NoveltyCurve
import numpy as np

# Define combination parameters (optimized weights from research)
chromagram_weight = 1.0
mfcc_weight = 1.0
tempogram_weight = 1.0
combination_methods = ["mean", "max"]

# Collect the available novelty curves (normal and smoothed variants) and their weights

curves = [
  ("normal", [nc_chromagram, nc_mfcc, nc_tempogram]),
  ("smoothed", [nc_chromagram_smooth, nc_mfcc_smooth, nc_tempogram_smooth]),
]
corresponding_weights = [chromagram_weight, mfcc_weight, tempogram_weight]

for method in combination_methods:
  for curve_name, available_curves in curves:
    # Combine novelty curves
    nc_combined = NoveltyCurve.combine(
        available_curves, 
        weights=corresponding_weights, 
        method=method
    )

    # Find peaks in combined curve
    peaks = nc_combined.find_peaks(threshold=0.5, distance_seconds=15)
    nc_combined.plot(
        x_axis_type='time', 
        novelty_name=f'Combined {curve_name.upper()} Novelty Curves using {method.upper()} method', 
        peaks=peaks, 
        time_annotations=content_parts, 
        figsize=(8, 4)
    )

The combined approach demonstrates superior performance by detecting transitions that individual features might miss while reducing false positives through the consensus mechanism. The weighted combination allows emphasizing the most reliable feature types for the specific musical content being analyzed.

We see that the mean approach captures all three transitions effectively while smoothing the false positives (e.g. the right peak detected by the chromagram inside the yellow band was effectively smoothed out).

The max approach also captures all three transitions but is more sensitive to false positives since it only requires one feature to signal a transition. This makes this method more aggressive, which can be beneficial in some contexts but may also lead to over-segmentation.

Note: Optuna optimization shows that the combination approach significantly outperforms individual features (0.80 recall vs. 0.58-0.69) while maintaining a precision of 0.51.

▶ Application Usage

The following demonstration video shows how to use the application to explore the novelty curve segmentation methods presented in this section. After a module outputs a curve with its detected transitions, you can add either all of that module’s transitions or only specific ones: right-click on the plot (or on a specific transition) and choose “add all transitions” or “add this transition”, respectively. All added transitions are then displayed on the main timeline at the bottom of the application window, which constitutes the final segmentation output that can be exported.

In the video, from top left to bottom right, the following modules are shown: Chromagram novelty curve, MFCC novelty curve, Tempogram novelty curve, and Combined novelty curve using the mean method. As can be observed, as the different features’ novelty curves are computed, the combined novelty curve module updates its output accordingly. At the end, we see that the Chromagram and MFCC reduced the spikes of the Tempogram, keeping only the tempogram’s most confident transitions (which turned out to be true positives).

Also, we see that both Chromagram and MFCC novelty curves share a high peak on the right side, outputting, by consensus, a very confident transition in the combined novelty curve (which is a real transition in the example).

2 – Segment Alignment

In contrast to audio-only segmentation, this approach assumes the availability of external opera segments and leverages them to identify structural boundaries in full-length recordings. Rather than analyzing audio content directly to discover transitions, segment alignment works by matching known musical segments with their corresponding positions within complete opera recordings.

Approach Overview and Motivation

The segment alignment method addresses a fundamental challenge in opera analysis: while audio-only methods must infer structural boundaries from acoustic features alone, many opera works already have established segmentations available through commercial databases, score annotations, or musicological analyses. By leveraging these existing segmentations, we can potentially achieve more accurate boundary detection while also providing semantic labels for identified segments (such as “Overture”, “Act I Scene 2”, etc.).

This approach is particularly valuable for the following reasons:

  1. Accuracy: When external segments are of high quality, alignment can provide more reliable boundary detection than purely audio-only methods
  2. Semantic information: Beyond just finding boundaries, alignment provides meaningful labels for each identified segment

Methodology and Pipeline

The segment alignment process follows a systematic pipeline designed to match external segments with positions in full recordings:

  1. Data Collection: Gather external opera segments (typically preview clips from commercial databases like Naxos) along with metadata about segment ordering and duration
  2. Temporal Interpolation: Estimate approximate positions of segments within the full recording using duration ratios and cumulative timing information
  3. Feature Extraction: Convert both external segments and the full recording into comparable feature representations (chromagrams in our case)
  4. Window-based Alignment: For each external segment, search within a localized window around its estimated position to find the best feature-based match (cost matrix) using dynamic time warping (DTW) (Wiki).
  5. Boundary Detection: Extract segment start times from alignment results to generate structural boundaries

The key insight is that rather than searching the entire recording for each segment (which would be computationally expensive and prone to false matches), temporal interpolation provides intelligent search windows that dramatically reduce complexity while improving accuracy.
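The sketch below illustrates steps 2 and 4 with plain NumPy/librosa: duration-ratio interpolation of a segment’s expected start, followed by subsequence DTW of the segment chromagram against a local window of the full recording’s chromagram. It is a simplified stand-in for the project’s ChromagramAligner; all names here are illustrative.

import numpy as np
import librosa

def interpolate_start(seg_start: float, segments_total: float, recording_total: float) -> float:
    """Step 2: map a start time from the segments' timeline onto the full recording by duration ratio."""
    return seg_start * recording_total / segments_total

def align_in_window(rec_chroma: np.ndarray, seg_chroma: np.ndarray,
                    est_start_sec: float, window_sec: float, hop_sec: float) -> float:
    """Step 4: subsequence DTW of the segment against a window centred on the estimated start."""
    lo = max(0, int((est_start_sec - window_sec / 2) / hop_sec))
    hi = min(rec_chroma.shape[1], int((est_start_sec + window_sec / 2) / hop_sec))
    # subseq=True lets the query start anywhere inside the window
    D, wp = librosa.sequence.dtw(X=seg_chroma, Y=rec_chroma[:, lo:hi], subseq=True, metric="cosine")
    start_frame = lo + wp[-1, 1]   # the warping path is returned end-to-start
    return start_frame * hop_sec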

Technical Considerations

This approach faces several practical challenges that influence its design:

  • Quality mismatch: External segments (often commercial recordings) may differ significantly in audio quality from bootleg opera recordings
  • Version differences: Different performances of the same opera may have timing variations, key changes, interpretive differences, or skipped sections
  • Database limitations: External segment availability varies significantly across different opera works and composers

For this project, the alignment module uses naxos.com 30s preview clips as external segments by scraping the website. These clips are publicly available.

Chromagram-based Alignment

Each segment and the full opera recording are represented using chromagrams.
Alignment is performed by maximizing similarity over time shifts.

Implementation

First, we load the main audio track (Ehrenreich). In this example, we only take the first 500 seconds of Giulio Cesare by Handel (BAR103, track 2, channel 2).

from src.audio.audio_file import AudioFile
from src.audio.signal import Signal
from src.io.ts_annotation import TSAnnotations
import numpy as np

ehrenreich_audio_filepath = "index_data/alignment/bar103-t2-c2-0-900.wav"
ehrenreich_ground_truths_filepath = "index_data/alignment/bar103-t2-c2-timestamps.txt"

# Load Ehrenreich audio (500 first seconds) and ground truths
ehrenreich_signal: Signal = AudioFile(ehrenreich_audio_filepath).load().subsignal(0, 500)
ehrenreich_ground_truths = TSAnnotations.load_transitions_txt(ehrenreich_ground_truths_filepath)
* Loaded Audio Path 'index_data/alignment/bar103-t2-c2-0-900.wav' 
* Samples Number: 43200000 
* Sample Rate: 48000 Hz 
* Duration: 00:15:00 
* File Size: 164.79 MB

After that, we load the first three previews (30s each) of this same opera from the Naxos database, namely Overture, Act I Scene 1: Caesar! Caesar! Egypt acclaims thee (Chorus) and Act I Scene 1: Kneel in tribute, fair land of Egypt (Caesar) (link). From these, knowing the duration of the whole piece (Ehrenreich) and of each segment, we can perform a temporal interpolation estimating each preview’s start within the full piece. This gives us a good first approximation that we later refine using the chromagram alignment algorithm.

import os
from src.audio_features.aligners import ChromagramAligner

# Load Naxos preview signals from directory
naxos_previews_dir = "index_data/alignment/previews"
naxos_signals = []

for filename in os.listdir(naxos_previews_dir):
    if filename.endswith(".wav"):
        filepath = os.path.join(naxos_previews_dir, filename)
        naxos_signal: Signal = AudioFile(filepath).load()
        naxos_signals.append(naxos_signal)

print(f"Loaded {len(naxos_signals)} Naxos preview signals.\n")

# Load duration data for temporal interpolation
naxos_durations_filepath = "index_data/alignment/audio_full_durations.npy"
naxos_preview_durations = np.load(naxos_durations_filepath, allow_pickle=True)

# Calculate cumulative start times (original Naxos timeline)
naxos_cumulative_starts = np.cumsum(naxos_preview_durations)
naxos_total_duration = naxos_cumulative_starts[-1]  # Total duration of all previews
naxos_preview_starts = np.insert(naxos_cumulative_starts, 0, 0)[:-1]  # Insert 0 at start, remove last

print("Original Naxos preview timeline:")
for i, (start, duration) in enumerate(zip(naxos_preview_starts, naxos_preview_durations)):
    print(f"     - Preview {i+1}: starts at {start:.1f}s (duration: {duration:.1f}s)")

# Temporal interpolation: map Naxos timeline to Ehrenreich signal duration
interpolated_starts = [
    ChromagramAligner.get_relative_time(
        time_sec=start, 
        total_duration=naxos_total_duration, 
        reference_duration=ehrenreich_signal.duration_seconds()
    )
    for start in naxos_preview_starts
]

print(f"\nInterpolated timeline (mapped to {ehrenreich_signal.duration_seconds():.1f}s Ehrenreich signal):")
for i, start in enumerate(interpolated_starts):
    print(f"     - Preview {i+1}: estimated start at {start:.1f}s")
* Loaded Audio Path 'index_data/alignment/previews\30s_preview_01.wav' 
* Samples Number: 1323008 
* Sample Rate: 44100 Hz 
* Duration: 00:00:30 
* File Size: 2.52 MB
* Loaded Audio Path 'index_data/alignment/previews\30s_preview_02.wav' 
* Samples Number: 1323008 
* Sample Rate: 44100 Hz 
* Duration: 00:00:30 
* File Size: 2.52 MB
* Loaded Audio Path 'index_data/alignment/previews\30s_preview_03.wav' 
* Samples Number: 1323008 
* Sample Rate: 44100 Hz 
* Duration: 00:00:30 
* File Size: 2.52 MB
Loaded 3 Naxos preview signals.

Original Naxos preview timeline:
     - Preview 1: starts at 0.0s (duration: 182.0s)
     - Preview 2: starts at 182.0s (duration: 98.0s)
     - Preview 3: starts at 280.0s (duration: 125.0s)

Interpolated timeline (mapped to 500.0s Ehrenreich signal):
     - Preview 1: estimated start at 0.0s
     - Preview 2: estimated start at 224.7s
     - Preview 3: estimated start at 345.7s

After loading all the data, we convert the raw audio into chromagrams using the same processing parameters.

from src.audio_features.builders import ChromagramBuilder
from src.audio_features.features import Chromagram

def preprocess(chroma: Chromagram) -> Chromagram:
    ch_p = chroma.normalize("2")
    ch_p = ch_p.smooth(21)
    ch_p = ch_p.log_compress(500)
    return ch_p


ehrenreich_chroma: Chromagram = ChromagramBuilder().build(ehrenreich_signal)
ehrenreich_chroma: Chromagram = preprocess(ehrenreich_chroma)
ehrenreich_chroma.plot(figsize=(9, 2), title_override="Ehrenreich Chromagram")

naxos_chromas = []
for i, naxos_signal in enumerate(naxos_signals):
    naxos_chroma: Chromagram = ChromagramBuilder().build(naxos_signal)
    naxos_chroma: Chromagram = preprocess(naxos_chroma)
    naxos_chromas.append(naxos_chroma)
    naxos_chroma.plot(figsize=(9, 2), title_override=f"Naxos Preview Chromagram {i+1}")

Finally, we can start the chromagram alignment. Each preview chromagram is aligned with the reference (the Ehrenreich chromagram) within an optimized search window. Those windows are approximated using the temporal interpolation mentioned just before. This avoids trying to align the opening of an opera against its ending, and drastically reduces the algorithm’s time complexity. For example, if an opera lasts three hours and we define the search window to be 15 minutes, we divide the time complexity by 12.

import numpy as np

# Initialize the chromagram aligner
aligner = ChromagramAligner(sigma=np.array([[2, 1], [1, 2], [1, 1]]))

transition_predictions = []

for i, naxos_chroma in enumerate(naxos_chromas):

    # Use the aligner to find the best alignment
    # The aligner expects (reference, query): we pass ehrenreich_chroma as ref and naxos_chroma as query
    start_s, end_s = aligner.align(
        ref=ehrenreich_chroma,
        query=naxos_chroma,
        expected_start_sec=interpolated_starts[i],
        window_size_sec=1200,
        use_gaussian_filter=True,
        filter_sigma=3,
        output_type="time",
        plot_cost_matrix=True,
    )

    # For transition detection, we use the start of the alignment
    transition_predictions.append(start_s)

print("All detected transitions (seconds):")
for i, transition in enumerate(transition_predictions):
    print(f"    * Preview {i+1} starts at {transition:.2f}s on Ehrenreich's track")

All detected transitions (seconds):
    * Preview 1 starts at 4.27s on Ehrenreich's track
    * Preview 2 starts at 188.02s on Ehrenreich's track
    * Preview 3 starts at 278.52s on Ehrenreich's track

Finally, we can compare the detected transitions with the ground truths annotated from a score.

from src.metrics.metrics import TS_Evaluator

print(
    "Ehrenreich segment number:",
    len(ehrenreich_ground_truths),
    "| Preview segment number:",
    len(naxos_signals),
)

evaluator = TS_Evaluator(tolerance_seconds=15)
evaluation = evaluator.evaluate(ehrenreich_ground_truths, transition_predictions)
evaluator.plot_evaluation(ehrenreich_ground_truths, transition_predictions, figsize=(8, 4))
Ehrenreich segment number: 4 | Preview segment number: 3

The chromagram-based alignment demonstrates very solid performance in this example, achieving perfect precision (100%) by accurately aligning all detected Naxos preview segments with their corresponding positions in the Ehrenreich recording.

However, a fundamental challenge emerges from the inherent disagreement between different sources regarding segment boundaries. In this 8-minute example, the Ehrenreich recording contains 4 structural movements, while the Naxos database provides only 3 corresponding preview segments for the same temporal range. This discrepancy, discussed in the Alignment Segmentation Results section, represents an unavoidable limitation when multiple sources define different segmentation schemes for the same musical work—a common occurrence in opera, where no absolute structural truth exists. Consequently, one genuine transition remains undetected, resulting in a recall of 75% and an overall F1 score of 0.857.

Despite this inherent limitation, the alignment technique proves highly effective when both recordings maintain sufficient quality. However, practical constraints limit its broader applicability to the Ehrenreich Collection. Not all operas are represented in the Naxos database, and the bootleg nature of Ehrenreich recordings, captured illegally during live performances as described in the Introduction, often results in poor audio quality that can complicate reliable feature alignment.

The selection of chromagram features for alignment was motivated by extensive comparative analysis. Unlike spectrograms, which contain high-dimensional redundant information that increases computational complexity, chromagrams provide a compact 12-dimensional representation focusing on harmonic content. Compared to MFCCs, chromagrams demonstrate superior robustness across different opera versions and recording conditions, as MFCCs prove more sensitive to timbral variations introduced by recording quality differences, background noise, and acoustic environments, factors particularly relevant when comparing commercial studio recordings with bootleg live captures.

▶ Application Usage

The following demonstration video shows how to use the application to perform chromagram-based alignment. First, search for an opera on naxos.com in the catalogue section (https://www.naxos.com/Catalogue). From there, navigate to the opera page (https://www.naxos.com/CatalogueDetail/?id=…) and ensure the page contains playable audio previews; if not, the module will not be able to retrieve anything.

Then, paste this Naxos catalogue URL (https://www.naxos.com/CatalogueDetail/?id=…) in the text input of the alignment module. This will automatically fetch all preview audio files from that opera. Once done, a table with playable previews should appear in the application with a blue button on the right side of the module named “Start Alignment”. By clicking on it, the alignment process will start and the table will be updated with the detected start times and end times for each preview within the main audio track.

By clicking on a row of the table, the red preview line will be teleported to the corresponding position in the main timeline at the bottom of the application window, allowing for quick audio comparison between the aligned preview and the main audio track.

To add one or multiple detected transitions to the final timeline, right-click on a row of the table and select “Add this transition” or “Add all transitions” respectively.

Parameters Overview

All algorithms presented use specific parameters that control their behavior and performance. This section provides a comprehensive reference for understanding and tuning these parameters.

Common Parameters

  • Frame Length (4410 - 88200 samples): Analysis window size controlling the fundamental time-frequency resolution trade-off in spectral analysis.
    Impact: Larger frames provide better frequency resolution but worse temporal localization of transitions. Smaller frames enable precise timing but may miss subtle harmonic changes.
  • Hop Length (2205 - 44100 samples): Step size between consecutive analysis windows, determining temporal resolution and computational cost.
    Impact: Smaller hop lengths create smoother novelty curves with finer temporal detail but increase computation time. Larger values reduce processing overhead but may miss brief transitions.
  • Threshold (0.0 - 1.0): Peak detection sensitivity parameter controlling the minimum novelty value required to identify structural boundaries.
    Impact: Higher thresholds reduce false positives but may miss subtle transitions. Lower values increase detection sensitivity but produce more spurious boundaries.
  • Min Distance Between Peaks (1.0 - 60.0 seconds): Minimum temporal separation required between detected transitions to prevent multiple detections of the same boundary.
    Impact: Larger distances prevent over-segmentation but may merge distinct nearby transitions. Smaller values allow detection of rapid structural changes but increase false positive clustering.


Silence Curve Parameters

  • Silence Type (amplitude, spectral): Detection method for low-energy regions. Amplitude uses simple energy calculation, spectral uses frequency-domain analysis for enhanced sensitivity.
    Impact: Amplitude detection responds primarily to overall volume drops but may miss sustained quiet notes. Spectral detection captures subtle timbral changes and quiet orchestral textures that amplitude methods overlook.
  • Min Silence Duration (0.0 - 5.0 seconds): Minimum duration threshold for silence regions to be considered valid structural boundaries. Filters out brief pauses and breathing.
    Impact: Shorter durations increase sensitivity to brief structural pauses but introduce false positives from performer breathing and brief rests. Longer durations ensure only significant silences are detected but may miss rapid movement transitions.


Applause Curve Parameters (HRPS)

  • L_h_frames (1-501 frames): Temporal median filter length applied to the harmonic component spectrogram for smoothing across time frames.
    Impact: Larger values create smoother harmonic components by averaging over longer time periods but blur rapid tonal changes. Smaller values preserve temporal detail in harmonic content but may retain noise and transient artifacts in the harmonic component.
  • L_p_bins (1-501 bins): Frequency median filter length applied to the percussive component spectrogram for smoothing across frequency bins.
    Impact: Larger values create broader frequency smoothing of percussive elements, emphasizing truly broadband events like applause but potentially blurring frequency-specific transients. Smaller values preserve frequency detail in percussive content but may retain tonal artifacts and narrow-band noise.
  • Beta (β) (1.1-5.0): Soft masking threshold ratio determining the size of the residual component.
    Impact: Higher values create a broader residual component by allowing more audio content to be classified as neither harmonic nor percussive. Lower values tend to misclassify the applause.


Novelty Curve Module Parameters

  • Normalization (1, 2, max, z): Feature vector normalization method applied to audio features for comparable magnitudes across different feature dimensions. L2 norm (Euclidean): square root of the sum of squares. L1 norm: sum of absolute values. Max norm: largest absolute value. Z-score: subtract the mean, divide by the standard deviation.
    Impact: Improper normalization can cause certain feature dimensions to dominate similarity calculations, leading to biased structural boundary detection.
  • Smoothing Filter Length (1-51 frames, odd): Temporal smoothing filter kernel size applied to feature sequences for noise reduction and pattern enhancement. Must be odd-valued for symmetric filtering.
    Impact: Insufficient smoothing preserves noise that creates false positive boundary detections, while excessive smoothing destroys the temporal precision necessary for accurate transition localization. Optimal values depend on audio quality and structural granularity requirements.
  • Downsampling Factor (1-50): Temporal resolution reduction factor applied to feature sequences. Balances computational efficiency with temporal precision for structural analysis.
    Impact: Aggressive downsampling reduces computational cost but may cause brief transitions to become temporally indistinguishable from longer sections. Conservative downsampling preserves all structural detail but increases processing time quadratically in the self-similarity matrix computation.
  • Log Compression (0.0-50.0): Logarithmic compression parameter γ applied to feature magnitudes via log(1 + γ×x). Enhances dynamic range by emphasizing low-amplitude features while compressing high-amplitude values.
    Impact: Insufficient compression allows dominant features to mask subtle but structurally significant patterns, reducing boundary detection sensitivity. Excessive compression equalizes all feature magnitudes, destroying important amplitude relationships that indicate structural importance.
  • SSM Smoothing Length (1-51 frames, odd): Gaussian smoothing kernel size applied to self-similarity matrices for noise reduction and diagonal structure enhancement. Must be odd-valued for symmetric filtering.
    Impact: Similar to the feature smoothing filter length, but applied to the SSM.
  • SSM Smoothing Direction (forward, backward, both): Temporal direction for Gaussian smoothing filter application on self-similarity matrices. Controls how temporal context influences structural pattern enhancement.
    Impact: Unidirectional smoothing can introduce temporal bias, causing asymmetric boundary detection where transition strength depends on approach direction. Bidirectional smoothing provides symmetric structural analysis but increases computational cost and may over-smooth rapid structural changes.
  • SSM Threshold (0.0-1.0): Similarity threshold τ applied to self-similarity matrices for structural contrast enhancement and sparsity control. Values below τ are set to zero.
    Impact: Conservative thresholds preserve weak similarity patterns that may indicate gradual transitions but introduce noise sensitivity. Aggressive thresholds create sparse matrices with enhanced contrast but may eliminate subtle structural relationships crucial for detecting nuanced musical boundaries.
  • Binarize SSM (boolean): Binary thresholding operation converting continuous similarity values to a discrete {0,1} representation. Creates sharp structural delineation by eliminating gradient information.
    Impact: Binarization enhances detection of abrupt structural transitions by creating stark similarity contrasts but destroys gradient information essential for identifying gradual modulations and progressive harmonic changes common in opera transitions.
  • Kernel Size (1-150 frames): Gaussian checkerboard kernel dimensions for novelty detection via SSM convolution. Controls the temporal integration window for structural change detection.
    Impact: Small kernels provide high temporal resolution for detecting brief transitions but increase sensitivity to local similarity fluctuations and noise. Large kernels integrate over extended temporal windows, reducing false positives but potentially merging distinct nearby boundaries into single detected transitions.
  • Variance (0.1-3.0): Gaussian kernel variance parameter σ² controlling the spatial spread characteristics of the novelty detection kernel. Controls how much the kernel emphasizes the immediate neighborhood.
    Impact: Low variance creates focused kernels that provide precise temporal localization but may miss gradual structural changes spanning multiple frames. High variance produces broad kernels that capture extended transition patterns but reduce temporal precision and may merge distinct adjacent boundaries.
  • Smoothing Sigma (0.0-40.0): Post-processing Gaussian smoothing parameter σ applied to computed novelty curves for noise reduction and peak enhancement. Final temporal regularization stage before peak detection.
    Impact: Minimal smoothing preserves all detected structural variations but retains noise artifacts that create false positive peaks during threshold-based boundary detection. Excessive smoothing eliminates spurious peaks but may merge closely-spaced genuine transitions or reduce peak prominence below detection thresholds.


Combination Method Parameters

  • Chromagram Weight (0.0-1.0): Relative importance of harmonic transitions (key changes, chord progressions, tonal shifts) in the combined novelty curve computation.
  • MFCC Weight (0.0-1.0): Relative importance of timbral changes (instrumentation, vocal texture, recording characteristics) in the combined analysis.
  • Tempogram Weight (0.0-1.0): Relative importance of rhythmic shifts (tempo changes, meter variations, applause transitions) in the fusion process.
  • Combination Method (mean, max): Mathematical fusion approach for integrating multiple feature-based novelty curves into unified segmentation results.
    Impact: Mean combination requires consensus across features, reducing false positives through conservative boundary detection but potentially missing transitions detected by only one feature type. Max combination emphasizes strongest individual signals, increasing recall by detecting any significant transition but raising false positive rates when single features generate spurious peaks.


▶ Application Usage

All modules in the application allow these parameters to be modified and the output curves to be recomputed on the fly. You can access them by clicking on the gear icon in the bottom right corner of each module. The parameters can be reset to their default values by clicking on the Reset To Defaults button.

Evaluation and Optimization

Metrics Used

In order to assess the efficacy of the presented methods, a proper evaluation is crucial. The first step is to select a metric that faithfully indicates whether a segmentation is good or not. The F1 score is well suited for this purpose (Wiki). Its formula is the following:

F_1=2\cdot\frac{Precision \cdot Recall}{Precision + Recall} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}

  • TP=True Positives,
  • FP=False Positives,
  • FN=False Negatives.

with the Precision and Recall being:

Precision = \frac{TP}{TP + FP} Recall = \frac{TP}{TP + FN}

In our case, precision indicates how reliable an algorithm is when it reports a transition: if the precision is 75%, then 75% of the predicted transitions are correct.

Recall measures the algorithm’s coverage: if the recall is 75%, then 75% of all real transitions in the audio have been found by the algorithm.

Taken separately, neither metric is a good indicator. Maximizing only the precision can lead to a program that essentially refuses to report transitions for fear of being wrong, while maximizing only the recall encourages the algorithm to report transitions everywhere so as not to miss any. The F1 score combines both metrics and ensures that the program performs well on each.

Additionally, more weight can be put on one metric to favor a certain behaviour. In our case, for instance, it could be relevant to weight precision more heavily so that the algorithm is more reliable, at the cost of missing a few more transitions.

To favor one metric (e.g., recall), a weighted F1 score can be used:

F_{1,\beta} = (1 + \beta^2) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}

where \beta controls the relative weight given to recall versus precision:

  • \beta = 1.0, both metrics weigh the same which is the standard F1 score
  • \beta < 1.0, the precision is prioritized over the recall
  • \beta > 1.0, the recall is prioritized over the precision
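For concreteness, here is a small sketch of how such metrics can be computed for transition timestamps with a matching tolerance; it is not the project’s TS_Evaluator, and the greedy matching strategy is an assumption.

def evaluate_transitions(ground_truth, predictions, tolerance_s=15.0, beta=1.0):
    """Greedy one-to-one matching within a tolerance, then precision, recall and F-beta."""
    unmatched = sorted(ground_truth)
    tp = 0
    for p in sorted(predictions):
        hits = [g for g in unmatched if abs(g - p) <= tolerance_s]
        if hits:
            unmatched.remove(min(hits, key=lambda g: abs(g - p)))  # match the closest ground truth
            tp += 1
    fp, fn = len(predictions) - tp, len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta

# Report sample: ground truth at 22 s, 41 s and 66 s
# evaluate_transitions([22, 41, 66], [21.5, 40.0, 80.0], tolerance_s=5)  ->  (0.67, 0.67, 0.67)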

Optimization using Optuna

Now that a suitable metric has been chosen, some of the previous methods can be optimized by maximizing it. Several techniques have multiple parameters, and finding the optimal ones is not a trivial task. The proposed solution to tune them intelligently is Optuna [9], a hyperparameter optimization framework that makes it easy to define an optimization problem and find the best parameters using various algorithms. In our case, the Tree-structured Parzen Estimator (TPE) [10] algorithm has been used to optimize the parameters of the novelty-based segmentation methods. The optimization runs multiple trials, each trial corresponding to a different set of parameters. After a predefined number of trials, the best parameters are selected based on the highest F1 score obtained.
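A hedged sketch of what such a study can look like: run_segmentation and ground_truth are placeholders for the feature → SSM → novelty curve → peak-picking pipeline above and for the score-based annotations, evaluate_transitions comes from the previous sketch, and the search ranges only mirror the parameter tables.

import optuna

def objective(trial: optuna.Trial) -> float:
    params = {
        "kernel_size": trial.suggest_int("kernel_size", 10, 150),
        "threshold": trial.suggest_float("threshold", 0.1, 0.9),
        "distance_seconds": trial.suggest_float("distance_seconds", 5.0, 60.0),
        "smoothing_sigma": trial.suggest_float("smoothing_sigma", 0.0, 40.0),
    }
    predictions = run_segmentation(signal, **params)     # placeholder segmentation pipeline
    _, _, f1 = evaluate_transitions(ground_truth, predictions, tolerance_s=15)
    return f1

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=250)
print(study.best_params, study.best_value)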

Example plot evaluating the segmentation predictions of the chromagram novelty curve approach.

Results

Audio-Only Segmentation Results

The table below summarizes the F1 scores obtained by each audio-only segmentation method after optimization. The dataset used for this evaluation consists of three opera excerpts from the Ehrenreich collection, each with structural boundaries annotated from scores. The best precision, recall, and F1 score for each method are highlighted in bold, while the second-best scores are underlined. Each method was optimized 5 times with 250 trials each; the mean and standard deviation of the metrics across these runs are reported.

| Method | Precision¹ | Recall¹ | F1 Score¹ |
|---|---|---|---|
| Audio-Only Segmentation | | | |
| Silence Curve | 0.32 | 0.81 | 0.45 |
| Applause Curve (HRPS)² | 1.00 | 0.34 | 0.51 |
| Novelty Curve (Chromagram) | 0.51 | 0.65 | 0.54 |
| Novelty Curve (MFCC) | 0.53 | 0.69 | 0.56 |
| Novelty Curve (Tempogram) | 0.44 | 0.58 | 0.46 |
| Novelty Curve (Combined) | 0.51 | 0.80 | 0.60 |

¹ These metrics are mean values across all evaluation windows, not overall metrics computed from cumulative TP, FP, and FN counts. Each value can be interpreted as: “on an average window (15 minutes), what would the Precision, Recall, or F1 Score be?”. Because of this averaging across windows, the reported F1 Score values are not exactly equal to \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}.

The reported F1 score is then: F_1 = W^{-1} \cdot \sum_{w=1}^{W}\frac{2 \cdot \text{Precision}_w \cdot \text{Recall}_w}{\text{Precision}_w + \text{Recall}_w}

with W being the number of windows and \text{Precision}_w, \text{Recall}_w being the precision and recall for window w. A minimal sketch of this window averaging is given after the notes below.

² HRPS applause curve results are not directly comparable to other methods since only one of the three evaluation excerpts contains applause segments. While the F1 score appears reasonable, applying this algorithm to excerpts without applause would be meaningless. However, these results demonstrate that HRPS-based segmentation can be highly effective when properly calibrated for recordings containing applause.
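
To make the window averaging from note ¹ concrete, here is a minimal sketch using invented per-window counts (not data from the actual evaluation):

def window_f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw counts for a single evaluation window."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical (TP, FP, FN) counts for three 15-minute evaluation windows.
window_counts = [(5, 2, 1), (3, 1, 4), (6, 3, 2)]

per_window_f1 = [window_f1(*counts) for counts in window_counts]
mean_f1 = sum(per_window_f1) / len(per_window_f1)  # the value reported in the table
print(per_window_f1, mean_f1)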

Alignment Segmentation Results

Regarding the segment alignment method, and similarly to the HRPS applause curve, the results are not directly comparable to the audio-only segmentation methods, since not all excerpts were accessible or relevant in the naxos.com database. The following table therefore shows results for one excerpt only, namely Giulio Cesare by Handel (BAR103, track 2, channel 2).

The alignment results are compared against a simple baseline method that estimates segment start times using only the duration ratio between preview segments and the full opera recording, without any chromagram-based refinement.
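
One plausible reading of this duration-ratio baseline is sketched below; the segment durations, the recording length, and the estimate_starts helper are invented for illustration and are not the project's implementation:

def estimate_starts(segment_durations: list[float], full_duration: float) -> list[float]:
    """Scale cumulative segment durations to the full recording's length
    to obtain an estimated start time (in seconds) for each segment."""
    total = sum(segment_durations)
    starts, elapsed = [], 0.0
    for duration in segment_durations:
        starts.append(elapsed / total * full_duration)
        elapsed += duration
    return starts

# Hypothetical preview-derived durations (s) and a 38-minute full recording.
print(estimate_starts([180.0, 420.0, 300.0, 240.0], 2280.0))
# -> [0.0, 360.0, 1200.0, 1800.0]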

Results are presented both with and without a post-processing correction step (+ Correction). This correction step addresses a data mismatch issue: the Naxos database contained fewer preview segments than were annotated in the musical score. For example, Naxos might list 60 opera segments while the score annotations marked 70 structural boundaries. This discrepancy led to degraded recall during evaluation, as the algorithm could only detect segments that existed in the Naxos data, missing valid structural boundaries that were annotated in the score but had no corresponding Naxos preview. The correction step accounts for these missing segments to provide a fairer comparison against the available score-based ground truth.

| Method | Precision | Recall | F1 Score |
|---|---|---|---|
| Segment Alignment | | | |
| Baseline Alignment | 0.38 | 0.32 | 0.35 |
| + Correction | 0.38 | 0.38 | 0.38 |
| Feature-Based Alignment | 0.70 | 0.55 | 0.62 |
| + Correction | 0.70 | 0.65 | 0.67 |

Discussion

Performance Analysis and Method Comparison

The evaluation results reveal distinct performance characteristics across different segmentation approaches, each with specific strengths and limitations that have important implications for practical opera analysis.

Audio-Only Segmentation Methods

The combined novelty curve approach emerges as the most effective audio-only method, achieving the highest F1 score (0.60) by successfully balancing precision and recall. This superior performance validates the core hypothesis that different audio features capture complementary aspects of musical structure—harmonic transitions (chromagram), timbral changes (MFCC), and rhythmic variations (tempogram).

Interestingly, the silence curve method demonstrates the highest recall (0.81) among all approaches, indicating exceptional capability in detecting actual structural boundaries. However, its low precision (0.32) reveals a tendency toward over-segmentation, detecting many non-structural silences. This behavior is particularly relevant for opera recordings, where brief pauses for breathing, dramatic effect, or applause transitions may not correspond to true movement boundaries. For musicologists prioritizing comprehensive boundary detection over precision, this method could serve as an initial screening tool.

The individual novelty curve methods show moderate but consistent performance across different feature types. MFCC-based segmentation achieves the highest precision (0.53), likely due to its sensitivity to orchestral and vocal texture changes that often accompany major structural transitions. Chromagram-based methods perform well in detecting harmonic transitions, while tempogram-based approaches show the lowest overall performance, possibly because opera transitions often maintain rhythmic continuity despite significant harmonic and timbral changes, and because tempo can vary considerably within individual movements without indicating structural boundaries.

Segment Alignment Approach

The feature-based alignment method demonstrates promising results when compared to naive baseline approaches, improving F1 scores from 0.35 to 0.67 with correction. This substantial improvement validates the chromagram-based Dynamic Time Warping (DTW) approach for aligning preview segments with full recordings. The correction mechanism proves crucial, improving recall from 0.55 to 0.65 by accounting for data mismatch issues between Naxos preview availability and score annotations.
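
For context, the gist of a chromagram-based DTW alignment between a preview segment and a full recording can be sketched as follows; this is an independent illustration built on librosa, with invented file paths, rather than the project's implementation:

import librosa

SR = 22050
HOP = 2048  # ~93 ms per chroma frame at 22.05 kHz

# Invented paths: a short preview excerpt and the full opera recording.
y_prev, _ = librosa.load("preview_segment.wav", sr=SR)
y_full, _ = librosa.load("full_recording.wav", sr=SR)

chroma_prev = librosa.feature.chroma_stft(y=y_prev, sr=SR, hop_length=HOP)
chroma_full = librosa.feature.chroma_stft(y=y_full, sr=SR, hop_length=HOP)

# Subsequence DTW: align the preview against any portion of the full recording.
D, wp = librosa.sequence.dtw(X=chroma_prev, Y=chroma_full,
                             metric="cosine", subseq=True)

# librosa returns the warping path in reverse order, so wp[-1] is the first
# aligned pair; its second entry is the frame of the full recording where
# the preview starts, i.e. an estimated segment boundary.
start_time = wp[-1, 1] * HOP / SR
print(f"Estimated segment start: {start_time:.1f} s")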

The method achieves a precision of 0.70, which indicates that the identified boundaries contain relatively few false positives and makes the approach suitable for applications requiring confident structural boundary detection.

However, the alignment approach faces inherent scalability limitations. The requirement for external segment data (Naxos previews) restricts its applicability to well-documented operas with available commercial segments. This dependency makes the method less suitable for the broader Ehrenreich Collection, which contains many rare and bootleg recordings lacking corresponding preview segments.

Methodological Limitations and Evaluation Challenges

Several factors limit the generalizability of these results. The evaluation dataset consists of only three opera excerpts, potentially insufficient for capturing the full diversity of structural patterns across different composers, periods, and performance styles. The ground truth annotations, derived from musical scores, may not always reflect the acoustic reality of live performances, particularly in bootleg recordings where interpretive decisions and performance conditions can alter structural boundaries.

The data mismatch issues highlighted in the alignment evaluation underscore a broader challenge in computational musicology: the discrepancy between theoretical musical structure (score annotations) and practical segmentation needs (available preview segments). This challenge is particularly acute for historical recordings where comprehensive metadata may be incomplete or unavailable.

Conclusion and Future Work

Summary of Key Findings

This project successfully demonstrates that computational segmentation of opera recordings is feasible using established music information retrieval techniques, with several methods showing promising performance for practical musicological applications.

The combined novelty curve approach represents the most significant contribution, achieving an F1 score of 0.60 by intelligently fusing chromagram, MFCC, and tempogram features. This method consistently outperformed individual feature-based approaches, validating the hypothesis that different audio characteristics capture complementary aspects of musical structure. The Optuna-optimized feature weights confirm that harmonic (chromagram) and timbral (MFCC) information provide more reliable structural indicators than rhythmic patterns for opera segmentation.

The interactive PyQt6 application provides researchers with immediate access to these methods, enabling real-time parameter adjustment and comparative analysis. This tool bridges the gap between algorithmic research and practical musicological workflows, allowing domain experts to fine-tune segmentation approaches for specific recordings or research questions.

Future Research Directions

Machine Learning and Self-Supervised Approaches

Current methods rely on hand-crafted features and traditional signal processing approaches, which, while effective, may not capture the full complexity of musical structure. During this project, initial exploration was conducted into self-supervised deep learning approaches that could leverage the vast unannotated audio content of the Ehrenreich Collection for pretraining, followed by fine-tuning on smaller annotated datasets.

The proposed approach centers on contrastive learning between audio segments, where the model learns meaningful representations by comparing spectral features of different audio windows. Unlike raw audio processing, this method would operate on audio features (e.g. spectrograms), which provide a more stable and less noisy representation suitable for structural analysis tasks. The self-supervised pretraining phase would utilize the thousands of hours available in the Ehrenreich Collection, exposing the model to diverse opera styles, composers, and performance conditions to learn general musical structural patterns.

For practical implementation, the approach envisions sliding window classification using 5-second audio windows, where each window receives a continuous probability score (0-1) indicating the likelihood of containing a structural boundary. This granular approach would enable smooth, probabilistic segmentation by iterating over entire recordings and aggregating window-level predictions. The trained model could be directly integrated into the PyQt6 application, providing users with an additional machine learning module alongside existing signal processing methods.
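
Purely as a conceptual sketch of this envisioned sliding-window scoring (nothing here was implemented in the project), the boundary_probability function below stands in for a trained model and returns dummy values so the example runs:

import numpy as np

def boundary_probability(window: np.ndarray) -> float:
    # Stand-in for the envisioned trained model: it would score a 5-second
    # window (e.g. its spectrogram) with a boundary probability in [0, 1].
    # A dummy statistic keeps the sketch runnable.
    return float(np.clip(np.std(window), 0.0, 1.0))

def sliding_boundary_scores(signal: np.ndarray, sr: int,
                            window_s: float = 5.0, hop_s: float = 1.0) -> np.ndarray:
    """Score each window of the recording with a boundary probability."""
    win, hop = int(window_s * sr), int(hop_s * sr)
    scores = [boundary_probability(signal[start:start + win])
              for start in range(0, len(signal) - win + 1, hop)]
    return np.asarray(scores)

# Peaks in the resulting score curve would then be proposed as boundaries.
sr = 44100
scores = sliding_boundary_scores(np.random.randn(60 * sr).astype(np.float32), sr)
print(scores.shape)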

Initial conceptual work focused on binary classification frameworks where the model distinguishes between transition and non-transition windows, but preliminary exploration suggested that continuous probability outputs offer more nuanced and interpretable results for musicological analysis. However, the substantial computational requirements for self-supervised pretraining, combined with the time-intensive process of acquiring diverse annotated opera datasets for fine-tuning, led to this approach being deferred in favor of the implemented traditional methods and to prioritize the development and refinement of the application.

Future work could build upon this foundation by exploring modern self-supervised architectures like TS-TCC [11] or music-specific models like MERT [12], adapted for the specific challenges of opera segmentation. The integration of such approaches would significantly enhance the application’s capabilities, potentially achieving superior performance through learned representations rather than hand-crafted features.

Acknowledgements

This work owes much to Prof. Meinard Müller’s important contributions to music information retrieval. Prof. Müller’s foundational research [13] provides the main theoretical framework that guides the approach to analyzing audio features, detecting structural changes, and understanding opera recordings used throughout this project.

Prof. Müller has made his research widely accessible through open science initiatives. His comprehensive collection of Python notebooks [14], [15], freely available online, demonstrates practical implementations of music analysis techniques with clear explanations. These educational resources have served as invaluable learning tools and implementation guides, facilitating a deeper understanding of the theoretical concepts applied in this work.

This project makes extensive use of the libfmp library [16], Prof. Müller’s open source Python package containing robust implementations of music information retrieval algorithms. The libfmp library has proven particularly valuable for core signal processing operations, audio feature extraction methods, and similarity matrix computations that form the foundation of the segmentation approaches presented here. The library’s clear documentation and comprehensive functionality significantly accelerated the development process while ensuring adherence to established best practices in computational musicology.

Finally, this project was conducted under the supervision of Dr. Yannis Rammos at the Digital and Cognitive Musicology Laboratory (DCML) at EPFL. Dr. Rammos provided invaluable guidance throughout the research process, offering deep musicological expertise and helping shape the analytical approach to opera segmentation.

References

[1]
Bern University of Applied Sciences, “The ehrenreich collection online platform.” https://web.ehrenreich.bfh.science/, 2024.
[2]
Hochschule der Künste Bern, “Ehrenreich collection — identity, voice, sound.” https://www.hkb-interpretation.ch/projekte/ehrenreich-collection, 2021.
[3]
Riverbank Computing Limited, “PyQt6: Python bindings for the Qt cross-platform application toolkit.” https://pypi.org/project/PyQt6/, 2021.
[4]
[5]
J. Hisaishi, “4th movement Mononoke Hime.” https://www.youtube.com/watch?v=T8NFYQ8O8Dg, 2020.
[6]
D. Fitzgerald, “Harmonic/percussive separation using median filtering,” 2010.
[7]
J. Driedger, M. Müller, and S. Disch, “Extending harmonic-percussive separation of audio signals,” in Proceedings of the international conference on music information retrieval (ISMIR), 2014, pp. 611–616.
[8]
J. Paulus, M. Müller, and A. Klapuri, “Audio-based music structure analysis,” in Proceedings of the international conference on music information retrieval (ISMIR), Utrecht, The Netherlands, 2010, pp. 625–636.
[9]
T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework.” https://optuna.org/, 2019.
[10]
S. Watanabe, “Tree-structured parzen estimator: Understanding its algorithm components and their roles for better empirical performance,” arXiv preprint arXiv:2304.11127, 2023.
[11]
E. Eldele et al., “Time-series representation learning via temporal and contextual contrasting,” arXiv preprint arXiv:2106.14112, 2021.
[12]
L. Yizhi et al., “MERT: Acoustic music understanding model with large-scale self-supervised training,” in The twelfth international conference on learning representations, 2023.
[13]
M. Müller, Fundamentals of music processing: Audio, analysis, algorithms, applications. Cham: Springer, 2015. doi: 10.1007/978-3-319-21945-5.
[14]
M. Müller, Fundamentals of music processing – using python and jupyter notebooks, 2nd ed. Springer Verlag, 2021, pp. 1–495. doi: 10.1007/978-3-030-69808-9.
[15]
M. Müller and F. Zalkow, “FMP notebooks: Educational material for teaching and learning fundamentals of music processing,” in Proceedings of the international conference on music information retrieval (ISMIR), Delft, The Netherlands, 2019.
[16]
M. Müller and F. Zalkow, “libfmp: A Python package for fundamentals of music processing,” Journal of Open Source Software (JOSS), vol. 6, no. 63, pp. 3326:1–5, 2021, doi: 10.21105/joss.03326.