
In the last post about my vocal synthesis project, I talked about implementing the Wide-Band Voice Pulse Modeling algorithm. Since then, I've done some original research of my own and devised what I believe to be three improvements to the algorithm.

I implemented the Wide-Band Voice Pulse Modeling algorithm (from Dr. Jordi Bonada's PhD thesis: https://www.tdx.cat/bitstream/handle/10803/7555/tjbs.pdf) via the upsampling method (specifically, upsampling via a natural cubic spline). The thesis actually proposes two methods, the other being periodization. There is a patent that pertains to WBVPM, but it only covers the periodization version (which is what was used for the thesis's results), so I implemented the upsampling method instead. I have been able to validate the main results in the thesis; specifically, its shape-invariance and lower residual compared to other methods. Furthermore, I have devised three significant improvements to the algorithm, two of which are only possible because I used the spline approach, so in a sense it was good that I had to do it that way.

Of the three improvements, I have implemented the first two and shown their advantage over the original WBVPM algorithm. The resulting score was obtained by taking the mean of the relative residual level (i.e. the level of the difference between the original and reconstructed signals, relative to the level of the original signal). I did so on an audio sample that deliberately exhibits traits that were noted as negatively affecting the WBVPM algorithm's resulting quality. Notably: a low-pitched voice with rapid and deep vibrato, transients, strong amplitude modulation, and a large portion of the sample spanning a voiced/unvoiced/voiced transition.
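For reference, the scoring works roughly like this (a simplified sketch; the function name, frame size, and averaging details here are illustrative, not necessarily exactly what my implementation does):

```python
import numpy as np

def relative_residual_db(original, reconstructed, frame=1024):
    # Frame both signals, measure the RMS of the reconstruction error
    # relative to the RMS of the original in each frame, and average
    # the per-frame levels in dB.
    n = min(len(original), len(reconstructed)) // frame * frame
    orig = np.reshape(original[:n], (-1, frame))
    resid = np.reshape(original[:n] - reconstructed[:n], (-1, frame))
    eps = 1e-12  # guards against log(0) on silent frames
    ratio = (np.sqrt(np.mean(resid ** 2, axis=1)) + eps) \
          / (np.sqrt(np.mean(orig ** 2, axis=1)) + eps)
    return float(np.mean(20.0 * np.log10(ratio)))
```

A perfect reconstruction scores very negative; a reconstruction scaled by 0.999 scores about -60 dB, since the error is 0.1% of the signal everywhere.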

First I should note that my WBVPM implementation is currently far from optimal. The pitch estimation system (via the modified TWM algorithm) has not undergone testing and tuning of its parameters, and there are many variations of the TWM algorithm to consider. Additionally, I have not implemented unvoiced/voiced detection (because, as far as I can tell, it is not mentioned in Bonada's thesis; presumably it's in prior literature, but I have not researched it yet), so all the algorithms act as if they are always processing a voiced signal even when they are not.

RESILIENT BORDER INTERPOLATION IN SYNTHESIS - When I first implemented the synthesis step for WBVPM, it was late at night and I was tired. I wanted a quick result before I went to bed and didn't understand the wording of the description of the synthesis step in WBVPM, so my original implementation differed significantly. Instead of using overlap-and-add, it found, for each sample, the closest voice pulse and evaluated its value at that time, taking advantage of the spline generated for downsampling and using the periodic nature of the pulse to extend it when the sample was beyond its domain (i.e. the opposite of overlapping). This approach led to high-frequency crackling artifacts due to discontinuities at the voice pulse boundaries.

The following day, I properly understood the synthesis approach and rewrote the synthesis code. Interestingly, this actually gave worse overall results. While the high-frequency artifacts were gone, there were now large low-frequency artifacts that appeared as large modulations in the time domain. I eventually tracked this down to a bug in my implementation of the MFPA algorithm that sometimes resulted in massive errors of up to 1.5 radians. I fixed this bug and the reconstruction synthesis no longer had significant artifacts, but I thought it was interesting that my approach, despite having the discontinuity issue, was more resilient to errors in the MFPA estimation. I began to wonder whether the two approaches could be combined to create an even better one.

I then considered why the modulation occurred with the overlap-and-add method. When the fundamental frequency is stationary and the MFPA onsets are perfect, the trapezoidal window function is equivalent to a weighted average between two adjacent voice pulses over a duration of twice the border interpolation size. However, when the MFPA onsets are inaccurate, or even just when the fundamental frequency is non-stationary, this is no longer true. Even worse, from the weighted-average point of view, the weights no longer necessarily sum to one everywhere, hence the modulation.
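A small sketch makes this concrete (toy numbers, not from my implementation): overlap-added trapezoidal windows sum to exactly one when the onsets are exact, but not when they are jittered.

```python
import numpy as np

def trapezoid(period, ramp):
    # Trapezoidal synthesis window: `ramp`-sample linear edges chosen so
    # that an up-ramp and a down-ramp overlapping exactly sum to one.
    up = (np.arange(ramp) + 0.5) / ramp
    return np.concatenate([up, np.ones(period - ramp), up[::-1]])

def window_sum(onsets, period, ramp, length):
    # Overlap-and-add one window per voice pulse onset.
    total = np.zeros(length)
    w = trapezoid(period, ramp)
    for onset in onsets:
        total[onset:onset + len(w)] += w
    return total

# Exact onsets, stationary f0: the windows sum to one in the interior.
exact = window_sum([0, 100, 200, 300], period=100, ramp=20, length=440)
# Jittered onsets (e.g. MFPA errors): the sum deviates from one, which
# is precisely the low-frequency amplitude modulation described above.
jittered = window_sum([0, 104, 197, 302], period=100, ramp=20, length=440)
```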

I then devised a method that does not result in modulation. It works by first synthesizing the 'inner' portion of each pulse (by 'inner', I mean starting at the end of the border interpolation at the start of the pulse, and ending before the start of the next border interpolation towards the end of the pulse). Then, for the gap between each pair of pulses, we calculate each sample value as a weighted average of two values: the values of each adjacent voice pulse at that time. Since the gap extends beyond the boundaries of each voice pulse, we use the periodic nature of the pulses to compute the effective position within the voice pulse by taking the position modulo the period of the fundamental frequency at that pulse. The fundamental frequencies of the two voice pulses may differ, so we actually vary the step in time linearly. At the end of the gap, the step size for the second voice pulse (the one it is next to) is one sample, while the step for the first voice pulse is the equivalent of one sample in the second pulse relative to the first's fundamental frequency (e.g. if the second voice pulse has twice the fundamental frequency of the first, the step size for the first would be 2 and the step size for the second would be 1 at the end of the gap). At the start of the gap, it is the same except relative to the first pulse having a step of 1. In between, we interpolate the step sizes linearly.
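The gap-filling step might be sketched like this (the pulse splines are stood in for by plain callables, and the linear crossfade weights are an assumption on my part; what matters is that the weights sum to one by construction):

```python
import numpy as np

def fill_gap(pulse_a, pulse_b, period_a, period_b, gap_len):
    # pulse_a / pulse_b sample each pulse's downsampling spline at a
    # fractional position; the modulo below extends them periodically.
    i = np.arange(gap_len)
    frac = (i + 1.0) / (gap_len + 1.0)   # 0 near pulse A -> 1 near pulse B
    # Step sizes: one sample next to a pulse's own side of the gap,
    # scaled by the period ratio next to the other pulse, linear between.
    step_a = (1.0 - frac) + frac * (period_a / period_b)
    step_b = (1.0 - frac) * (period_b / period_a) + frac
    # Pulse A is walked forward past its end; pulse B is walked backward
    # from its start (accumulate reversed steps, then negate).
    pos_a = np.cumsum(step_a)
    pos_b = -np.cumsum(step_b[::-1])[::-1]
    val_a = pulse_a(pos_a % period_a)
    val_b = pulse_b(pos_b % period_b)
    # Crossfade with weights that sum to one, so the weighting itself
    # can never modulate the amplitude.
    return (1.0 - frac) * val_a + frac * val_b
```

If both pulses have constant value 1, the output is exactly 1 everywhere regardless of the periods, which is the no-modulation property that the trapezoidal window loses under onset errors.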

It is worth noting that in the ideal case where the onsets are exactly correct and the fundamental frequency is stationary, the result of this approach is the same as using the trapezoidal window.

FREQUENCY WARP-CORRECTION - As noted in Bonada's thesis, WBVPM assumes that the fundamental frequency is stationary within each pulse. This is not actually true, and the resulting artifacts are particularly apparent for low fundamental frequency voice signals, because each period of the signal is longer in time and thus the internal state of the system has more time to change.

One of the changes that can happen over time is modulation of the fundamental frequency. This can actually be thought of as a time-domain remapping function that distorts each voice pulse according to a continuous fundamental frequency trajectory.

I discovered a way of correcting this, largely by accident, while thinking about solving the modulation issue discussed in the previous section. I had proposed changing the step size linearly in the gaps between the 'inner' pulses, and we have a discrete sequence of fundamental frequencies. So, what if instead of changing the step size linearly, we created a spline from the fundamental frequencies and changed the step size based on that? Then I realized that we could also use this for the whole voice pulses and simply sample everything with a step size based on the fundamental frequency trajectory. I then realized that this would act like the distortion from changing parameters within each voice pulse, at least in the synthesis stage. Furthermore, since we are already computing splines for each voice pulse in order to downsample it, this comes at very little additional computational cost.

However, the voice pulses in analysis are already distorted. So I realized we can do the inverse resampling in the upsampling stage of WBVPM analysis to correct for the non-stationary frequency; each pulse is then redistorted according to the transformed fundamental frequency trajectory in the synthesis stage. This makes the method effectively invariant to modulations in fundamental frequency, so long as the modulation is less than the fundamental frequency and is modeled well by the spline, which should be the case when the modulation period spans at least several voice pulses.
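Generating the warped sample positions might look roughly like this (a sketch; the nominal-f0 reference `f0_ref` and the point at which the trajectory is evaluated are my assumptions, and the real implementation works on the per-pulse downsampling splines):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def warp_positions(f0_times, f0_values, start, n, f0_ref):
    # Positions for resampling a pulse with a step that follows the f0
    # trajectory spline: regions where the pitch runs above the pulse's
    # nominal f0 (f0_ref) are traversed faster.  The analysis-side
    # inverse warp would use f0_ref / f0(t) as the step instead.
    f0 = CubicSpline(f0_times, f0_values)
    pos = np.empty(n)
    t = float(start)
    for k in range(n):
        pos[k] = t
        t += f0(start + k) / f0_ref  # step driven by the f0 spline
    return pos
```

When the trajectory is flat at f0_ref, the step is identically one sample and the warp is the identity, matching the stationary case WBVPM already assumes.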

PITCHED/UNPITCHED DECOMPOSITION - As mentioned in Bonada's thesis, WBVPM only models sinusoids, and thus any residual in the input signal is encoded as fluctuations between the spectra of voice pulses. I devised a post-processing technique to separate the voice pulses into sinusoidal and residual components.

I have not actually implemented and tested this yet because, as it is a post-processing step, it will not improve the residual level and will probably make it worse. Its benefit is in the transformation stage, which is much harder to quantify and which I have not finished implementing. However, I believe the approach should work.

The technique works as follows:
a) First, for each voice pulse, and then for each harmonic of its spectrum, we compute a spline based on the amplitude of that harmonic in the voice pulse as well as in a fixed number of surrounding voice pulses.
b) Since the time delta between voice pulses can vary, we then resample each local harmonic spline with fixed steps in time.
c) We compute the Fourier transform of these resampled local harmonic trajectories.
d) We apply a low-pass and a high-pass filter to separate each trajectory into low-frequency and high-frequency components.
e) We then apply the inverse Fourier transform to each of these. We can then sample the low-pass trajectory at the time of the voice pulse to get the amplitude of the denoised harmonic for that pulse. The same can be done for the high-pass trajectory to obtain a pseudo-pulse representing the residual. These residual voice pulses can then be synthesized using the WBVPM synthesis method to obtain a time-domain residual signal, which can be processed separately from the main harmonic signal.
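For a single harmonic's amplitude trajectory, steps a)-e) might be sketched as follows (the time step and cutoff are illustrative values, and I use a brick-wall spectral split here where a real implementation could use proper filters):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def split_harmonic_track(pulse_times, amps, dt=0.001, cutoff_hz=20.0):
    # pulse_times: (irregular) voice pulse times; amps: the harmonic's
    # amplitude at each pulse.
    spline = CubicSpline(pulse_times, amps)           # a) local spline
    t = np.arange(pulse_times[0], pulse_times[-1], dt)
    track = spline(t)                                 # b) uniform resample
    spec = np.fft.rfft(track)                         # c) Fourier transform
    freqs = np.fft.rfftfreq(len(track), dt)
    low = np.where(freqs <= cutoff_hz, spec, 0)       # d) low/high split
    high = spec - low
    low_t = np.fft.irfft(low, len(track))             # e) back to time
    high_t = np.fft.irfft(high, len(track))
    # Sample each component back at the voice pulse times.
    denoised = np.interp(pulse_times, t, low_t)
    residual = np.interp(pulse_times, t, high_t)
    return denoised, residual
```

A slowly varying trajectory should come back essentially unchanged in the denoised component, with a near-zero residual.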

A significant source of error in this process would presumably come from the resampling step. This can be decreased by using a smaller time step, at an increased computational cost. However, the error could probably be greatly reduced another way: before applying the band filters, calculate the difference between the original amplitudes and the amplitudes at the same times in a spline computed from the resampled harmonic trajectory; this difference can later be added back to the low-pass amplitude trajectory.

The denoised harmonic phase can also be computed via the same method, using Bonada's method for unwrapping phase across both frequency and time. The residual phase can be calculated by taking the difference of the original phase from the denoised phase and dividing it by the residual amplitude.

RESULTS:

I have tested these improvements and obtained the following results for the aforementioned audio sample:

Original WBVPM: -36.355dB
Warp-correction improvement only: -36.74595dB
Warp-correction & Resilient border interpolation in synthesis: -37.41177dB

More research is needed to properly evaluate these improvements across more samples with more variety, and to see if these techniques still result in improvements with more accurate pitch and MFPA estimation and with proper handling of unvoiced/voiced frames.

