
In the last post about my vocal synthesis project, I talked about implementing the Wide-Band Voice Pulse Modeling algorithm. Since then, I've done some original research of my own and devised what I believe to be three improvements to the algorithm.

I implemented the Wide-Band Voice Pulse Modeling algorithm (from Dr. Jordi Bonada's PhD thesis: https://www.tdx.cat/bitstream/handle/10803/7555/tjbs.pdf) via the upsampling method (specifically, upsampling via a natural cubic spline). The thesis actually proposes two methods, the other being periodization. There is a patent that pertains to WBVPM, but it only covers the periodization version (which is what was used for the thesis's results), so I implemented the upsampling method instead. I have been able to validate the main results in the thesis; specifically, its shape-invariance and lower residual compared to other methods. Furthermore, I have devised three significant improvements to the algorithm, two of which are only possible because I used the spline approach, so in a sense it was good that I had to do it that way.

Of the three improvements, I have implemented the first two and shown their advantage over the original WBVPM algorithm. The resulting score was obtained by taking the mean of the relative residual level (i.e. the level of the difference between the original and reconstructed signals, relative to the level of the original signal). I did so on an audio sample that deliberately exhibits traits that were noted as negatively affecting the WBVPM algorithm's resulting quality. Notably: a low-pitched voice with rapid and deep vibrato, transients, strong amplitude modulation, and a large portion of the sample spanning a voiced/unvoiced/voiced transition.
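For reference, the scoring works roughly like this (a simplified sketch; the function name, frame size, and averaging details here are illustrative, not necessarily exactly what my implementation does):

```python
import numpy as np

def relative_residual_db(original, reconstructed, frame=1024):
    # Frame both signals, measure the RMS of the reconstruction error
    # relative to the RMS of the original in each frame, and average
    # the per-frame levels in dB.
    n = min(len(original), len(reconstructed)) // frame * frame
    orig = np.reshape(original[:n], (-1, frame))
    resid = np.reshape(original[:n] - reconstructed[:n], (-1, frame))
    eps = 1e-12  # guards against log(0) on silent frames
    ratio = (np.sqrt(np.mean(resid ** 2, axis=1)) + eps) \
          / (np.sqrt(np.mean(orig ** 2, axis=1)) + eps)
    return float(np.mean(20.0 * np.log10(ratio)))
```

A perfect reconstruction scores very negative; a reconstruction scaled by 0.999 scores about -60 dB, since the error is 0.1% of the signal everywhere.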

First I should note that my WBVPM implementation is currently far from optimal. The pitch estimation system (via the modified TWM algorithm) has not undergone testing and tuning of its parameters, and there are many variations of the TWM algorithm to consider. Additionally, I have not implemented unvoiced/voiced detection (because, as far as I can tell, it is not mentioned in Bonada's thesis; presumably it's in prior literature, but I have not researched it yet), so all the algorithms act as if they are always processing a voiced signal even when they are not.

RESILIENT BORDER INTERPOLATION IN SYNTHESIS - When I first implemented the synthesis step for WBVPM, it was late at night and I was tired. I wanted a quick result before I went to bed and didn't understand the wording of the description of the synthesis step in WBVPM, so my original implementation differed significantly. Instead of using overlap-and-add, it found, for each sample, the closest voice pulse and evaluated its value at that time, taking advantage of the spline generated for downsampling and using the periodic nature of the pulse to extend it when the sample was beyond its domain (i.e. the opposite of overlapping). This approach led to high-frequency crackling artifacts due to discontinuities at the voice pulse boundaries.

The following day, I properly understood the synthesis approach and rewrote the synthesis code. Interestingly, this actually gave worse overall results. While the high-frequency artifacts were gone, there were now large low-frequency artifacts that appeared as large modulations in the time domain. I eventually tracked this down to a bug in my implementation of the MFPA algorithm that sometimes resulted in massive errors of up to 1.5 radians. I fixed this bug and the reconstruction synthesis no longer had significant artifacts, but I thought it was interesting that my approach, despite having the discontinuity issue, was more resilient to errors in the MFPA estimation. I began to wonder whether the two approaches could be combined to create an even better one.

I then considered why the modulation occurred with the overlap-and-add method. When the fundamental frequency is stationary and the MFPA onsets are perfect, the trapezoidal window function is equivalent to a weighted average between two adjacent voice pulses over a duration of twice the border interpolation size. However, when the MFPA onsets are inaccurate, or even just when the fundamental frequency is non-stationary, this is no longer true. Even worse, from the weighted-average point of view, the weights no longer necessarily sum to one everywhere, hence the modulation.
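A small sketch makes this concrete (toy numbers, not from my implementation): overlap-added trapezoidal windows sum to exactly one when the onsets are exact, but not when they are jittered.

```python
import numpy as np

def trapezoid(period, ramp):
    # Trapezoidal synthesis window: `ramp`-sample linear edges chosen so
    # that an up-ramp and a down-ramp overlapping exactly sum to one.
    up = (np.arange(ramp) + 0.5) / ramp
    return np.concatenate([up, np.ones(period - ramp), up[::-1]])

def window_sum(onsets, period, ramp, length):
    # Overlap-and-add one window per voice pulse onset.
    total = np.zeros(length)
    w = trapezoid(period, ramp)
    for onset in onsets:
        total[onset:onset + len(w)] += w
    return total

# Exact onsets, stationary f0: the windows sum to one in the interior.
exact = window_sum([0, 100, 200, 300], period=100, ramp=20, length=440)
# Jittered onsets (e.g. MFPA errors): the sum deviates from one, which
# is precisely the low-frequency amplitude modulation described above.
jittered = window_sum([0, 104, 197, 302], period=100, ramp=20, length=440)
```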

I then devised a method that does not result in modulation. It works by first synthesizing the 'inner' portion of each pulse (by 'inner', I mean starting at the end of the border interpolation at the start of the pulse, and ending before the start of the next border interpolation towards the end of the pulse). Then, for the gap between each pair of pulses, we calculate each sample value as a weighted average of two values: the values of each adjacent voice pulse at that time. Since the gap extends beyond the boundaries of each voice pulse, we use the periodic nature of the pulses to compute the effective position within the voice pulse by taking the position modulo the period of the fundamental frequency at that pulse. The fundamental frequencies of the two voice pulses may differ, so we actually vary the step in time linearly. At the end of the gap, the step size for the second voice pulse (the one it is next to) is one sample, while the step for the first voice pulse is the equivalent of one sample in the second pulse relative to the first's fundamental frequency (e.g. if the second voice pulse has twice the fundamental frequency of the first, the step size for the first would be 2 and the step size for the second would be 1 at the end of the gap). At the start of the gap, it is the same except relative to the first pulse having a step of 1. In between, we interpolate the step sizes linearly.
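The gap-filling step might be sketched like this (the pulse splines are stood in for by plain callables, and the linear crossfade weights are an assumption on my part; what matters is that the weights sum to one by construction):

```python
import numpy as np

def fill_gap(pulse_a, pulse_b, period_a, period_b, gap_len):
    # pulse_a / pulse_b sample each pulse's downsampling spline at a
    # fractional position; the modulo below extends them periodically.
    i = np.arange(gap_len)
    frac = (i + 1.0) / (gap_len + 1.0)   # 0 near pulse A -> 1 near pulse B
    # Step sizes: one sample next to a pulse's own side of the gap,
    # scaled by the period ratio next to the other pulse, linear between.
    step_a = (1.0 - frac) + frac * (period_a / period_b)
    step_b = (1.0 - frac) * (period_b / period_a) + frac
    # Pulse A is walked forward past its end; pulse B is walked backward
    # from its start (accumulate reversed steps, then negate).
    pos_a = np.cumsum(step_a)
    pos_b = -np.cumsum(step_b[::-1])[::-1]
    val_a = pulse_a(pos_a % period_a)
    val_b = pulse_b(pos_b % period_b)
    # Crossfade with weights that sum to one, so the weighting itself
    # can never modulate the amplitude.
    return (1.0 - frac) * val_a + frac * val_b
```

If both pulses have constant value 1, the output is exactly 1 everywhere regardless of the periods, which is the no-modulation property that the trapezoidal window loses under onset errors.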

It is worth noting that in the ideal case where the onsets are exactly correct and the fundamental frequency is stationary, the result of this approach is the same as using the trapezoidal window.

FREQUENCY WARP-CORRECTION - As noted in Bonada's thesis, WBVPM assumes that the fundamental frequency is stationary within each pulse. This is not actually true, and the resulting artifacts are particularly apparent for low fundamental frequency voice signals, because each period of the signal is longer in time and thus the internal state of the system has more time to change.

One of the changes that can happen over time is modulation of the fundamental frequency. This can actually be thought of as a time-domain remapping function that distorts each voice pulse according to a continuous fundamental frequency trajectory.

I discovered a way of correcting this, largely by accident, while thinking about solving the modulation issue discussed in the previous section. I had proposed changing the step size linearly in the gaps between the 'inner' pulses, and we have a discrete sequence of fundamental frequencies. So, what if instead of changing the step size linearly, we created a spline from the fundamental frequencies and changed the step size based on that? Then I realized that we could also use this for the whole voice pulses and simply sample everything with a step size based on the fundamental frequency trajectory. I then realized that this would act like the distortion from changing parameters within each voice pulse, at least in the synthesis stage. Furthermore, since we are already computing splines for each voice pulse in order to downsample it, this comes at very little additional computational cost.

However, the voice pulses in analysis are already distorted. So I realized we can do the inverse resampling in the upsampling stage of WBVPM analysis to correct for the non-stationary frequency; each pulse is then redistorted according to the transformed fundamental frequency trajectory in the synthesis stage. This makes the method effectively invariant to modulations in fundamental frequency, so long as the modulation is less than the fundamental frequency and is modeled well by the spline, which should be the case when the modulation period spans at least several voice pulses.
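Generating the warped sample positions might look roughly like this (a sketch; the nominal-f0 reference `f0_ref` and the point at which the trajectory is evaluated are my assumptions, and the real implementation works on the per-pulse downsampling splines):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def warp_positions(f0_times, f0_values, start, n, f0_ref):
    # Positions for resampling a pulse with a step that follows the f0
    # trajectory spline: regions where the pitch runs above the pulse's
    # nominal f0 (f0_ref) are traversed faster.  The analysis-side
    # inverse warp would use f0_ref / f0(t) as the step instead.
    f0 = CubicSpline(f0_times, f0_values)
    pos = np.empty(n)
    t = float(start)
    for k in range(n):
        pos[k] = t
        t += f0(start + k) / f0_ref  # step driven by the f0 spline
    return pos
```

When the trajectory is flat at f0_ref, the step is identically one sample and the warp is the identity, matching the stationary case WBVPM already assumes.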

PITCHED/UNPITCHED DECOMPOSITION - As mentioned in Bonada's thesis, WBVPM only models sinusoids, and thus any residual in the input signal is encoded as fluctuations between the spectra of voice pulses. I devised a post-processing technique to separate the voice pulses into sinusoidal and residual components.

I have not actually implemented and tested this yet because, as it is a post-processing step, it will not improve the residual level and will probably make it worse. Its benefit is in the transformation stage, which is much harder to quantify and which I have not finished implementing. However, I believe the approach should work.

The technique works as follows:
a) First, for each voice pulse, and then for each harmonic of its spectrum, we compute a spline based on the amplitude of that harmonic in the voice pulse as well as in a fixed number of surrounding voice pulses.
b) Since the time delta between voice pulses can vary, we then resample each local harmonic spline with fixed steps in time.
c) We compute the Fourier transform of these resampled local harmonic trajectories.
d) We apply a low-pass and a high-pass filter to separate each trajectory into low-frequency and high-frequency components.
e) We then apply the inverse Fourier transform to each of these. We can then sample the low-pass trajectory at the time of the voice pulse to get the amplitude of the denoised harmonic for that pulse. The same can be done for the high-pass trajectory to obtain a pseudo-pulse representing the residual. These residual voice pulses can then be synthesized using the WBVPM synthesis method to obtain a time-domain residual signal, which can be processed separately from the main harmonic signal.
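For a single harmonic's amplitude trajectory, steps a)-e) might be sketched as follows (the time step and cutoff are illustrative values, and I use a brick-wall spectral split here where a real implementation could use proper filters):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def split_harmonic_track(pulse_times, amps, dt=0.001, cutoff_hz=20.0):
    # pulse_times: (irregular) voice pulse times; amps: the harmonic's
    # amplitude at each pulse.
    spline = CubicSpline(pulse_times, amps)           # a) local spline
    t = np.arange(pulse_times[0], pulse_times[-1], dt)
    track = spline(t)                                 # b) uniform resample
    spec = np.fft.rfft(track)                         # c) Fourier transform
    freqs = np.fft.rfftfreq(len(track), dt)
    low = np.where(freqs <= cutoff_hz, spec, 0)       # d) low/high split
    high = spec - low
    low_t = np.fft.irfft(low, len(track))             # e) back to time
    high_t = np.fft.irfft(high, len(track))
    # Sample each component back at the voice pulse times.
    denoised = np.interp(pulse_times, t, low_t)
    residual = np.interp(pulse_times, t, high_t)
    return denoised, residual
```

A slowly varying trajectory should come back essentially unchanged in the denoised component, with a near-zero residual.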

A significant source of error in this process would presumably come from the resampling step. This can be decreased by using a smaller time step, at an increased computational cost. However, the error could probably be greatly reduced another way: before applying the band filters, calculate the difference between the original amplitudes and the amplitudes at the same times in a spline computed from the resampled harmonic trajectory; this difference can later be added back to the low-pass amplitude trajectory.

The denoised harmonic phase can also be computed via the same method, using Bonada's method for unwrapping phase across both frequency and time. The residual phase can be calculated by taking the difference of the original phase from the denoised phase and dividing it by the residual amplitude.

RESULTS:

I have tested these improvements and obtained the following results for the aforementioned audio sample:

Original WBVPM: -36.355dB
Warp-correction improvement only: -36.74595dB
Warp-correction & Resilient border interpolation in synthesis: -37.41177dB

More research is needed to properly evaluate these improvements across more samples with more variety, and to see if these techniques still result in improvements with more accurate pitch and MFPA estimation and with proper handling of unvoiced/voiced frames.

