Off-Topic@Heyuri
it's the place to be!
File: Screen Shot 2026-03-30 at 4.51.32 AM.png (92 KB, 1132x839)
Shape-invariant transforms using Wide-Band Voice Pulse Modeling
QueueSevenM ◆Tnq5UWtkfs 2026/03/30 (Mon) 11:59:39 No.178285 (Yeah x2)
Hello, I'm back and I have a major update to my VOCALOID project! I have successfully achieved a shape-invariant pitch transposition!
Here it is.
First the original audio:
https://files.catbox.moe/zmt3rr.wav
Now my version with WBVPM (pitched down by an octave):
https://files.catbox.moe/kho97n.wav
And a version using a naive pitch shift:
https://files.catbox.moe/xs39bq.wav
Notice that my version, while having more noise, sounds more natural and has less phasiness. This is particularly noticeable if you play both at very low volume. One sounds much more 'human' than the other.
Also note that this is an extreme example with an octave shift (or 1200 cents) - in practice, shifts would typically be far less. Also, this doesn't implement several other parts of the system (more on that later).
I'll explain all of this in a moment, but first, I'd like to correct some major biographical errors. Since this is a long post, I've divided it into sections.
BIOGRAPHICAL CORRECTIONS
In the last post, I claimed that VOCALOID1 used Narrow-Band Voice Pulse Modeling while VOCALOID2 and onwards used Wide-Band Voice Pulse Modeling. This was incorrect, and additionally it was the source of most of my confusion surrounding the paper.
What actually happened is that the research technology that would later become VOCALOID1 started out as work to improve the existing Spectral Modeling Synthesis system, which had been developed in the early 1990s; this improvement work began in the late 1990s. Importantly, this system evolved, and its techniques were combined with techniques from another system under development called a Phase-Locked Vocoder; the result was released as VOCALOID1. In the mid-2000s, work began on combining the techniques learned from improving SMS and the PLVC-based system with the much older and well-known TD-PSOLA system. Importantly, TD-PSOLA (Time-Domain Pitch Synchronous OverLap and Add) is a time-domain system while SMS is a frequency-domain system (and TD-PSOLA is pitch synchronous - hence the name - while SMS uses a constant hop size). The first technique they developed was Narrow-Band Voice Pulse Modeling, followed later by Wide-Band Voice Pulse Modeling, which ended up being used in VOCALOID2.
Now that I understand this, I also understand the major mistake I made when reading the paper: I was reading it from the perspective of an implementer, treating the sections as steps for implementing the system rather than as research. I had thought that section 2.2 described the core processing algorithms, when it was actually about SMS - and specifically about *the improvements they made to SMS*, not a complete description of SMS, since SMS was already an established technique. Hence my confusion about why some things were seemingly vaguely explained: *the paper wasn't about them*. At the same time, much of that section is still very useful, because much of that research was also incorporated into the later techniques.
RESULTS
I have successfully implemented Wide-Band Voice Pulse Modeling, synthesis, and the pitch transposition, time stretching, and timbre scaling algorithms. Additionally, I have finished implementing the full version of the pitch estimation module, changed the code to work using overlapping windows, implemented the window adaptation system, and fixed countless bugs.
Importantly, I have been able to experimentally replicate a very important property - one of the main reasons WBVPM was developed, in fact: shape-invariance. An important property of the human voice is that, all else being equal, the shape of each pulse in the waveform stays roughly the same regardless of frequency. The reason for this property is phase coherence. At the start of each voice pulse (when the glottis closes), the phases of all the harmonics within each formant (where each 'formant' is a spectral region affected differently by the vocal tract) are roughly the same. Since phase advances proportionally to frequency, the different harmonics drift apart from that point over time and soon become very different from one another. Because the phases differ greatly relative to their frequencies at times other than the voice pulse onset, the harmonics interfere constructively and destructively in the time domain. However, if all the harmonics are scaled equally, their phases all advance at a slower or faster rate, but that rate scales the same way for every harmonic. This gives rise to shape-invariance: the pattern of interference stays the same, just at a different scale.
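To make the phase-coherence argument concrete, here's a toy Python sketch (not my actual code - just a synthetic pulse train): harmonics whose phases all align at the pulse onset produce a pulse whose shape is unchanged when every harmonic is scaled by the same factor.

```python
import numpy as np

def pulse_train(f0, sr=44100, n=2048, harmonics=20):
    # Sum of harmonics whose phases all coincide at t=0 (a 'glottal closure').
    # This phase coherence is what makes the pulse shape frequency-invariant.
    t = np.arange(n) / sr
    return sum(np.cos(2 * np.pi * f0 * k * t) / k for k in range(1, harmonics + 1))

hi = pulse_train(220.0)  # original pitch
lo = pulse_train(110.0)  # every harmonic scaled by the same factor (octave down)

# The low version is just the high version stretched 2x in time:
err = np.max(np.abs(hi[:1024] - lo[:2048:2]))
print(err)  # ~0: identical pulse shape, only the period differs
```

If the harmonic phases were randomized instead of aligned, the two waveforms would no longer be stretched copies of each other - that is exactly the phasiness a naive shift introduces.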
Importantly, if you apply a transform relative to a point that is not a voice pulse onset, the phases will not be flat. The transform can shift the phases *from* the point it started at, but it does NOT account for the inherent phase offsets that exist away from a voice pulse onset. If no transform is applied, there is no issue. But if one is - say a pitch transposition - then the initial phases the signal would have had if it had actually been produced at the target pitch differ considerably from the observed ones, since the observed phases were measured *at a different pitch*. This breaks shape-invariance, producing a noticeable 'phasiness' and an inhuman sound.
Here is an image of 500 samples from the original signal:
https://files.catbox.moe/223l7p.png
Now here are 1000 samples from a one-octave-down pitch transposition using a naive approach (a fixed window and hop size, with a 1024-point Hann window):
https://files.catbox.moe/jxgtg0.png
Notice that not only is the waveform unrecognizable compared to the original, it even varies considerably between individual voice pulses!
Now compare to 1000 samples from the WBVPM approach:
https://files.catbox.moe/6hpf8l.png
Notice how the waveform is almost identical, only scaled up two times in period, and it varies much less.
You may be wondering: couldn't we just downsample or upsample the signal and play it back at the same sample rate to get the same result? Importantly, no - we have independent control over pitch and time. In the example, I shifted the voice down by a factor of two but kept the timing the same, and the result contains the same number of samples as the original. Additionally, the analysis and synthesis reconstruction separates the signal into individual voice pulses. It isn't just scaling them; it generates new voice pulses in the frequency domain and inserts them at positions that are also generated.
Here is an amplitude envelope of the latter half of the original audio:
https://files.catbox.moe/6zw5v5.png
Now here is an amplitude envelope of the latter half of the pitch-transposed audio:
https://files.catbox.moe/2a3utu.png
Notice how they are roughly the same. If the audio had just been downsampled, the second would be stretched out by a factor of two - but it is not.
I also implemented timbre scaling, although I have not tested it. Fun fact: I actually implemented it by accident. I was trying to implement the pitch transposition, got a bit confused, and realized I had accidentally implemented timbre scaling as well.
All these transforms are currently implemented as linear transforms. However, they are all implemented by sampling a spline at a regular interval, so they could trivially be made to accept a non-linear parameter, a sequence of points, or a spline instead.
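For example (a minimal sketch using SciPy, not my actual transform code - the control points below are made up), the "sample a spline at a regular interval" idea means a constant transform and an arbitrary curve go through exactly the same machinery:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def transform_curve(times, factors, frame_times):
    # A transform is just "scale factor as a function of time", stored as a
    # spline through control points and sampled once per analysis frame.
    return CubicSpline(times, factors)(frame_times)

frame_times = np.linspace(0.0, 1.0, 100)
# constant octave-down shift (the flat/linear special case)
flat = transform_curve([0.0, 0.5, 1.0], [0.5, 0.5, 0.5], frame_times)
# a non-linear pitch dip drops in with no extra code
bend = transform_curve([0.0, 0.5, 1.0], [1.0, 0.5, 1.0], frame_times)
```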
Although this current implementation is far from perfect, I think it works reasonably well as a demonstration of the techniques and their properties. Keep in mind that I have done nearly no adjustment of the constant parameters ('tuning'). In fact, there are several parameters whose corresponding feature is effectively disabled because I wasn't sure what value to pick. This implementation could probably be improved considerably just by adjusting a few constants. A (hopefully) efficient and accurate way of 'tuning' automatically is discussed later in this post.
Notably, the pitch-transposed spectrum varies significantly, with some areas showing little residual and straight lines:
https://files.catbox.moe/04naz0.png
(the straight lines in some areas around the center of the spectrum are probably the aliasing artifacts Bonada 2008 mentions for WBVPM when upscaling; I will implement the method for avoiding them at some point)
Additionally, I have also tested reconstructing the sound with no transforms. This seems to yield little residual, although what remains is concentrated at higher frequencies, so maybe that can be fixed. It could also be caused by aliasing. Here is an audio file that was reconstructed through the WBVPM synthesis procedure using downsampling:
https://files.catbox.moe/bhxjpw.wav
And a spectrogram:
https://files.catbox.moe/kvrnkq.png
Compare to the spectrogram of the original:
https://files.catbox.moe/d1vo16.png
PITCH ESTIMATION
Throughout this project, the most finicky part has been - and continues to be - the pitch estimation, specifically the Two-Way Mismatch algorithm for monophonic pitch detection. I have compiled several variations of the TWM algorithm. I tested one change that works by scaling a term by the amplitude (I had actually thought of this idea myself, and it happened to be the term I mentioned causing me trouble last time), and it led to considerable improvement, so I kept it.
QueueSevenM ◆Tnq5UWtkfs 2026/03/30 (Mon) 12:00:01 No.178286
There's also the adaptive window procedure that wraps the TWM f0 estimation. One thing I had noticed for a long time was that Kaiser-Bessel beta values about 10% higher than the recommended values in Cano 1998 seemed to perform much better. I had assumed this was just due to issues with my code or the audio samples I was testing on. Much later, I was experimenting in Python when I noticed a function called kaiser_beta which converts something abbreviated as 'a' to the equivalent beta value. Previously, in Cano 1998 and elsewhere, I had seen the Kaiser-Bessel parameter written as alpha instead of beta. Up until this point, I had either not paid attention to this or assumed they referred to the same thing. I did some research and found that kaiser_beta converts between attenuation and the beta value of the Kaiser-Bessel window. Then I found that there is indeed an alpha form of the parameter and it is not the same as beta. Confusingly, it is not the attenuation either, but both abbreviate to the same letter. The Kaiser-Bessel beta can be obtained by simply multiplying the alpha value by pi. Interestingly, this is a much larger increase than the 10% I had tested, but it seemed to perform better (or at least not worse) anyway. A possible explanation for this discrepancy is that the adaptive window is larger than the window I originally used to test the adjustment.
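To summarize the three parameterizations in code (using SciPy's kaiser_beta; the alpha value below is just an illustrative number, not one from Cano 1998):

```python
import math
from scipy.signal import kaiser_beta

# Three ways the same Kaiser-Bessel window gets parameterized:
#   beta  - what numpy.kaiser / scipy window functions take directly
#   alpha - an older convention, with beta = pi * alpha
#   'a'   - stopband attenuation in dB, which scipy.signal.kaiser_beta
#           converts to beta (confusingly, alpha also abbreviates to 'a')
alpha = 3.0
beta_from_alpha = math.pi * alpha                    # ~9.42

attenuation_db = 90.0
beta_from_attenuation = kaiser_beta(attenuation_db)  # 0.1102*(90-8.7) ~ 8.96

print(beta_from_alpha, beta_from_attenuation)
```

So mistaking alpha for beta (or for attenuation) silently gives a window roughly pi times too narrow, which matches the "recommended values seemed too low" symptom.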
Another window-related improvement concerns the window used for the harmonics that are fed into MFPA. Originally, I used the same Kaiser-Bessel window for both. I later switched to a Blackman-Harris -92dB window, which I had seen mentioned in the paper; this resulted in a significant improvement. Another improvement I tried was adapting the window size to a value relative to the period of the estimated fundamental frequency - using the same number of periods as for the Kaiser-Bessel window used for TWM - and I noted a substantial improvement, even larger than the one from switching to the Blackman-Harris window in the first place. Indeed, this matches the results in the study: in the WBVPM section, they observe a considerable improvement (up to -10dB) when using an adaptive window size compared to a fixed window for narrow-band analysis. In that same section, they also found 2 to be the ideal number of periods for minimizing noise, and experimented with a Hann window. Perhaps experimenting with these ideas could lead to improvements, although that section is about getting an accurate spectrum and reconstruction, whose needs may differ somewhat from those of MFPA. Another idea is to use a separate function for determining the adaptive number of periods, as opposed to reusing the value from the Kaiser-Bessel window as I currently do. Perhaps always using an integer number of periods could be beneficial. Yet another idea is using only one or two periods, which would provide better time resolution and could be better suited to the wide-band analysis we are doing.
Another potential improvement I have thought of, but not yet tested, is modifying the constant coefficients of the Blackman-Harris window according to the fundamental frequency, in a manner similar to the method Cano 1998 describes for the Kaiser-Bessel beta (which I already use for that window).
Another potential improvement to the MFPA results could be the use of a peak selection algorithm. I had previously used a very simple one I found in another resource by the UPF Music Technology Group, but it did not seem to show an improvement; I later removed it and saw no observable detriment. The paper does not provide details on this specifically - I now understand why - so I should research how this was tackled in SMS. One idea I've thought of myself is to calculate the estimated harmonic frequencies and then search the surrounding area for peaks. We then select the peak that best trades off distance and amplitude; one formula I have thought of for the scoring but not tested is amplitude / distance^2. We want to search far enough to always catch the best candidate, but not so far as to be computationally inefficient or to run into floating-point error and instability in the scoring function. A potential refinement: instead of deriving the initial estimate of a harmonic's frequency by multiplying the fundamental frequency by the harmonic index, we could add the fundamental frequency to the peak chosen as the previous harmonic. This would account for drift caused by inaccuracies in the f0 estimate as well as distortion in the harmonics. However, it also runs the risk of drifting away from the harmonics. A possible solution is to blend this estimate with the one obtained by multiplying by the fundamental frequency; this would act as a gradual course correction while keeping the benefits of basing the estimate on the previously selected harmonic peak.
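Here's that peak-selection idea as a runnable sketch (untested in the real system; the 50/50 blend weight, the search range, and the scoring floor are all guesses, not tuned values):

```python
def select_harmonic_peaks(peak_freqs, peak_amps, f0, n_harmonics, search=0.3):
    # For each harmonic k, predict its frequency as a blend of
    # (previous selected peak + f0) and (k * f0), then pick the candidate
    # within +-search*f0 that maximizes amplitude / distance^2.
    chosen, prev = [], 0.0
    for k in range(1, n_harmonics + 1):
        predicted = 0.5 * (prev + f0) + 0.5 * (k * f0)  # blend = course correction
        best, best_score = None, -1.0
        for f, a in zip(peak_freqs, peak_amps):
            if abs(f - predicted) <= search * f0:
                score = a / max(abs(f - predicted), 1e-6) ** 2  # floor avoids div/0
                if score > best_score:
                    best, best_score = f, score
        chosen.append(best if best is not None else predicted)  # fall back to estimate
        prev = chosen[-1]
    return chosen
```

With peaks at 101, 205, and 299 Hz and f0 = 100 Hz, this follows the slightly drifting harmonics instead of snapping to exact multiples of f0.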
Another potential improvement could come from fixing sudden jumps in fundamental frequency that last only a few analysis frames and then return to roughly the pre-jump value. Cano 1998 calls for a "hystheresis cycle" [sic] - though I am not sure exactly what that means. I have implemented a simple system that discards large relative jumps lasting only a single frame. However, this has two major issues. The first is that these jumps often last for more than just one frame. The second is that if a legitimate jump in f0 occurs and persists, this introduces a frame of lag.
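My single-frame version generalizes easily; here's a runnable sketch of a multi-frame variant (max_len and the relative threshold are placeholder values, and this is only my guess at what the "hystheresis cycle" might mean):

```python
def remove_short_jumps(f0, max_len=3, rel_thresh=0.3):
    # A run of up to max_len frames that jumps away from the preceding f0
    # and then returns is replaced by the pre-jump value. A legitimate jump
    # that persists longer than max_len frames is kept (at the cost of up
    # to max_len frames of lag).
    out = list(f0)
    i = 1
    while i < len(out):
        base = out[i - 1]
        if base > 0 and abs(out[i] - base) / base > rel_thresh:
            j = i
            while j < len(out) and abs(out[j] - base) / base > rel_thresh:
                j += 1                       # find where the deviating run ends
            if j - i <= max_len and j < len(out):
                for k in range(i, j):        # short run that returned: discard it
                    out[k] = base
            i = j
        else:
            i += 1
    return out
```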
MAXIMALLY FLAT PHASE ALIGNMENT
My last post was about MFPA; since then, I have made a number of improvements to this part of the system. I don't believe I have changed the core MFPA function itself, but I have made a lot of improvements to the MFPA refinement algorithm as well as the code surrounding MFPA.
One major improvement came only recently, and the previous issue stemmed from what I now believe was a misunderstanding. The MFPA algorithm gives a phase shift for each frame, which can be converted into a time offset. However, unless the frame rate is exactly the same as the fundamental frequency (in the instantaneous sense), this yields more or fewer pulse onsets than actually exist. At the time, I was testing on a high-pitched sample whose f0 was much faster than the analysis hop size of 256 samples (~172 frames per second at 44.1kHz), so there was usually more than one pulse between each detected pulse onset. I had thought that recovering all the pulse onsets was the purpose of the MFPA refinement algorithm, which is why I was confused that it was described as choosing a *subset* of the pulses and not a superset. I had implemented the refinement algorithm, but it was buggy and either didn't work or did nothing. Later, I started devising my own way of getting the in-between onsets: adding increments of the f0 period until the next pulse was reached.
I eventually realized that the purpose of the MFPA refinement algorithm is not interpolation, but to take a list of pulse onsets that may include multiple close estimates of the same pulse and narrow it down so there is only one per pulse, choosing the best one (it actually considers a few additional candidates, which tripped me up into thinking it was about interpolation for a long time). For this to work, the analysis frame rate needs to be greater than the fundamental frequency (if they were equal, it would likely drift slowly and eventually miss an onset). I realized the hop size in the paper could be so high (and thus the maximum trackable frequency so low) because they were using low-frequency audio samples in the 50-100Hz range, while mine were around 300Hz. I adjusted the hop size to 96 and got great results. I think I had tried this before and it had not worked - and it couldn't have, because it is only possible within the overlapping-window framework, which lets the hop size shrink without shrinking the analysis window, and I had not implemented that yet at the time.
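In code, the frame-phase-to-onset conversion plus a crude stand-in for the refinement step looks roughly like this (a sketch, not the paper's MFPA refinement - the real refinement scores several candidates instead of just thresholding spacing; the example frame times and phases are made up):

```python
import numpy as np

def onset_candidates(frame_times, phase_shifts, f0s):
    # MFPA gives, per frame, the phase of the nearest pulse onset; convert
    # that phase to a time offset from the frame center. When the frame rate
    # exceeds f0, consecutive frames yield near-duplicate candidates for the
    # same pulse, which refinement must narrow down to one per pulse.
    return [t + phi / (2 * np.pi * f0)
            for t, phi, f0 in zip(frame_times, phase_shifts, f0s)]

def dedupe(candidates, f0):
    # stand-in refinement: keep a candidate only if it lies at least half a
    # period after the last kept onset
    kept = []
    for c in sorted(candidates):
        if not kept or c - kept[-1] > 0.5 / f0:
            kept.append(c)
    return kept
```

For a 300 Hz voice analyzed every 2 ms, frames at 2 ms and 4 ms both point at the pulse near 3.33 ms, and dedupe collapses them to one onset.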
However, this low hop size is relatively computationally expensive - so much so that f0 estimation and MFPA take up most of the execution time. A possible improvement would be to use a lower analysis rate together with the interpolation method, and then feed the interpolated pulses into the MFPA refinement algorithm, which would likely give better results.
I have fixed numerous bugs in the MFPA refinement implementation. A noteworthy one: I was not accounting for the fact that the analysis window's timestamp refers to its center, not its start. Because of that fix, the new onsets are offset compared to the old ones, but I believe they are now correct.
The pulse onset selection is now quite good:
https://files.catbox.moe/ik27fw.png
Close up:
https://files.catbox.moe/urby2w.png
However, there are still deviations. Here is one at around 20k samples in one of my test audio samples:
https://files.catbox.moe/76nlgo.png
So there is still some work to be done.
Another potential improvement could be the introduction of a system for detecting formants and weighting them less in the MFPA calculation. Recall that phase is roughly constant within a formant, but not between formants.
QueueSevenM ◆Tnq5UWtkfs 2026/03/30 (Mon) 12:00:29 No.178287
START FRAMES
In the audio samples I have provided so far, I have cut off the first part of the audio. The issue is pitch estimation for early frames. Remember that our analysis window spans multiple f0 periods. Because of this, it can't fit at the start of the signal and has to be shrunk to a much smaller size. This is much more of an issue now that I have decreased the hop size substantially. I currently skip the first few frames, because otherwise the forced, extremely small window causes the whole pitch estimation system to irreversibly destabilize. I've been thinking about solutions. One could be to let the analysis window take its full size and pad the area before the start with zeros (or maybe something else); this could also be used for the end. Possibly the most promising solution I have come up with - though I have not tested any of these - is to backfill the earlier pulses using the first good estimated pulse onset minus integer multiples of the period from the first good fundamental frequency estimate. This should work provided the first pulse and fundamental frequency estimates are good, the fundamental frequency stays relatively constant over the start section, and the start section contains only a few pulses. Luckily, the last criterion is always satisfied: the start section is half the window in size, so the number of pulses is (window_size / period) / 2, and since the window size in the adaptive framework is just a small number of periods, we are left with the (mostly) constant adaptive_period_count / 2 pulses.
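The backfill idea is simple enough to sketch directly (untested, as I said; assumes the first good onset and f0 estimate are trustworthy):

```python
def backfill_onsets(first_good_onset, first_good_f0, start_time=0.0):
    # Extend pulse onsets backwards from the first trustworthy estimate in
    # steps of one period, assuming f0 is roughly constant over the (short)
    # start section.
    period = 1.0 / first_good_f0
    onsets, t = [], first_good_onset - period
    while t >= start_time:
        onsets.append(t)
        t -= period
    return list(reversed(onsets))

# first good onset at 25 ms with f0 = 100 Hz -> onsets near 5 ms and 15 ms
print(backfill_onsets(0.025, 100.0))
```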
WIDE-BAND VOICE PULSE MODELING
Regarding the patent issue, I have determined that it applies only to the specific technique in Bonada 2008's WBVPM of using periodization to achieve a non-integer-size discrete Fourier transform. However, that section also offers another option: interpolation. I have implemented it and found it to work well. In a test, I measured a noise level of about -140dB (for reference, 1 ulp for a single-precision float is about -145dB), which is negligible and comparable to the study's results for the periodization technique. I have also added the ability to use a few extra samples on each side to improve the spline, although I have not tested the consequences of this variation, and I don't know whether the original implementation did something similar.
Text in the patent: "generating for each pulse a sequence of repetitions of said audio pulse, said audio pulse being repeated according to its own characteristic frequency; deriving frequency domain information associated with at least some of the sequences of repetitions of said audio pulses, each said sequences of repetitions of said audio pulse being represented as a vector of sinusoids based on the derived frequency, said vector of sinusoids corresponds to a sinusoidal series expansion of the specific audio pulse;"
Bonada 2008, WBVPM, NON-INTEGER SIZE FFT: "PERIODIZATION: one period of the input signal is windowed with wR (n) , and repeated several times at the rate defined by T so that the FFT buffer of length M covers in the end several periods. The repetition implies interpolating both the signal samples and the window function. Then the resulting signal sr (n) is windowed by an analysis window function wA (n) , and the spectrum obtained is actually the convolution of such analysis window response WA (f ) by the spectrum of Sr (f ) sampled at harmonic frequencies"
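Here's a toy sketch of the interpolation option (not my C implementation - a SciPy spline instead of my resampler, and no extra guard samples): resample exactly one non-integer-length period onto an integer-length buffer, so each FFT bin lands exactly on a harmonic.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def period_spectrum(signal, start, period, n_fft=1024):
    # Resample one (non-integer-length) period onto n_fft points with a
    # spline, then FFT: bin k then sits exactly on harmonic k of the pulse.
    x = np.arange(len(signal))
    spline = CubicSpline(x, signal)
    pts = start + np.arange(n_fft) * (period / n_fft)  # one period, n_fft samples
    return np.fft.rfft(spline(pts))

# toy check: a 100.7 Hz tone, whose period (~437.9 samples) is non-integer
sr, f, n = 44100.0, 100.7, 4096
sig = np.cos(2 * np.pi * f * np.arange(n) / sr)
spec = np.abs(period_spectrum(sig, 100.0, sr / f))
print(np.argmax(spec))  # expect 1: the energy lands in harmonic bin 1
```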
TUNING
I have come up with two techniques for tuning that apply in different ways.
AUTOMATIC TUNING - The idea is to use a stochastic statistical algorithm that minimizes a cost function by adjusting a set of parameters (one I looked into that seems promising is global-optimization SPSA). The parameters in this case would be constants used in the C code. A Python script would substitute the values picked by the minimization algorithm into placeholders, then compile and run the C program. Another algorithm/program would compare the results to a reference, and the comparisons would be summed into a cost value. One program I plan to research for this comparison is AudioVMAF. I believe it was originally designed to test audio compression, but I hope it could also be useful here.
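A minimal SPSA sketch (plain SPSA with textbook gain schedules, not the global-optimization variant; the quadratic stand-in cost replaces the compile-run-and-score loop):

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_minimize(cost, theta, iters=500, a=0.2, c=0.1):
    # SPSA: each iteration perturbs ALL parameters at once with a random
    # +-1 vector and estimates the gradient from just two cost evaluations,
    # which is cheap when each evaluation means compile-and-run the C program.
    for k in range(1, iters + 1):
        ak, ck = a / k ** 0.602, c / k ** 0.101   # standard decaying gains
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        g = (cost(theta + ck * delta) - cost(theta - ck * delta)) / (2 * ck) * delta
        theta = theta - ak * g
    return theta

# stand-in cost: in practice this would template the constants into the C
# source, rebuild, run, and score the output against a reference
cost = lambda th: np.sum((th - np.array([2.0, -1.0])) ** 2)
theta_hat = spsa_minimize(cost, np.array([0.0, 0.0]))
print(theta_hat)  # converges toward [2, -1]
```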
MANUAL TUNING - In this method, we add instrumentation to various intermediate values calculated in the program. Then, for one very small snippet of audio, we use automatic tuning to determine ideal values. A programmer then tries to write code that better matches these desired values. If successful, the change can be tested in general over the whole dataset. If it is not an improvement overall, the most negatively affected audio snippets can be selected and a similar process applied to reduce the regression for them while keeping the gains for the snippets that benefit.
Both of these methods would work best when matching against another vocal synthesizer, since the timings and parameters can match exactly. However, they may also be adaptable to optimizing parameters for real-world (and thus more realistic) voices. It would have to work somewhat in reverse, though: someone would sing first, and then a note sequence would have to be created that matches the performance almost exactly.
OTHER CONSIDERATIONS
There are many more potential tweaks and improvements - I have dozens accumulated and plenty more to research, test, and implement. One widely applicable variation is using logarithm-based scales.
I still don't have an answer to the voiced/unvoiced frame decision issue, but I will look for SMS research about it and check older Bonada papers. One heuristic I thought of is noise / amplitude^2 > threshold.
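As a trivial sketch of that heuristic (the threshold value is a pure placeholder):

```python
def is_voiced(noise_energy, amplitude, threshold=0.05):
    # Proposed heuristic: mark a frame unvoiced when the residual noise is
    # large relative to the squared amplitude. The threshold is a guess and
    # would itself be a candidate for the tuning process described earlier.
    if amplitude <= 0.0:
        return False  # silent frame: treat as unvoiced
    return noise_energy / amplitude ** 2 <= threshold
```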
QueueSevenM ◆Tnq5UWtkfs 2026/03/30 (Mon) 12:08:06 No.178288
>Now my version with WBVPM (pitched down by an octave):
https://files.catbox.moe/kho97n.wav
For some reason, this link doesn't work. Try
https://voca.ro/1mJ5qljrp9hD
Anonymous 2026/03/30 (Mon) 12:41:04 No.178292 (Yeah x1)
uwah... long text... (;´Д`)
speaking as a layman (reading through your post scares me with the amount of stuff i don't know (;´Д`) ) this is really cool to see!
keep up the good werk ヽ(´∇`)ノ
QueueSevenM ◆Tnq5UWtkfs 2026/03/30 (Mon) 12:43:29 No.178293
ADDENDUM, because I just realized I forgot a bunch of things I meant to put in this post.
This is still a simplified model. It does not incorporate the Excitation plus Resonance model or the Spectral Voice Model, and it uses a linear transform rather than generated trajectories.
One thing I was thinking about was the part of the WBVPM section where they say one of the disadvantages of WBVPM is not being able to separate harmonic and non-harmonic components. I also read that the noise is embedded as fluctuations in each voice pulse's spectrum and over time, which is what I had presumed, since the information has to go somewhere.
I was thinking: what if you took each harmonic's values at the pulse onset times as points of a spline, interpolated at regular intervals, applied the Fourier transform, separated the highest frequencies from the rest, applied the inverse Fourier transform to the rest, then rebuilt a spline from the result and interpolated the values back at the onsets? I wonder if that would work.
There would be loss because of the resampling steps, though. This could be decreased by taking more samples. You could also apply a correction: resample there and back *without* removing the high-frequency modulations to measure the resampling loss itself, then add this difference back to the main pulse information after the separation.
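Here's a rough runnable sketch of that separation idea for a single harmonic's amplitude track (SciPy spline for the resampling; the oversample factor and cutoff fraction are guesses):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def split_slow_fast(onset_times, harmonic_amps, keep_fraction=0.1, oversample=8):
    # Treat one harmonic's value at each pulse onset as samples of a curve:
    # resample it onto a regular grid, FFT, keep only the lowest-frequency
    # bins as the 'harmonic' trend (the discarded fast modulation is the
    # embedded noise), then read the trend back at the onset times.
    n = len(onset_times) * oversample
    grid = np.linspace(onset_times[0], onset_times[-1], n)
    regular = CubicSpline(onset_times, harmonic_amps)(grid)
    spec = np.fft.rfft(regular)
    spec[int(len(spec) * keep_fraction):] = 0.0   # drop fast modulation
    smooth = np.fft.irfft(spec, n)
    return CubicSpline(grid, smooth)(onset_times)
```

As the post notes, the two resampling steps lose some accuracy (and the FFT assumes the track is roughly periodic, so edges suffer), but on a slow trend plus per-pulse jitter it recovers mostly the trend.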
Anonymous 2026/03/30 (Mon) 16:19:11 No.178316 (Yeah x2)
A Heyuriloid?
Anonymous 2026/03/30 (Mon) 19:55:16 No.178339
>>178288
Catbox has been *incredibly* flaky lately. Anyone hosting files there should probably look for an alternative.
QueueSevenM ◆Tnq5UWtkfs 2026/03/30 (Mon) 22:15:05 No.178361
>>178339
What should I use as an alternative? I know Vocaroo, but they delete files after a while unless they get many views.
hachikuji ◆naRqMpZoTs ## lost snail 2026/03/30 (Mon) 22:31:39 No.178368
Uploader@Heyuri accepts audio files and large file sizes, so you could use that
Hanekawa ◆uDXYYKdt5A 2026/04/01 (Wed) 08:38:53 No.178483
>>178285
wow... are you trying to re-implement OG vocaloid engine?
so much work was put into our silly vocaloid voices we really should be grateful it even exists huh...
Sasazuka ◆nsbFqtfIsJ 2026/04/01 (Wed) 17:58:35 No.178589
I wish I was as interested in anything as OP is in whatever he's talking about
Imouto-san ◆vnY2R+856E 2026/04/01 (Wed) 20:39:37 No.178623
>>178483
Well actually I'm implementing the techniques that were used for VOCALOID2. But a VOCALOID1-like engine could be an interesting future project.
>so much work was put into our silly vocaloid voices we really should be grateful it even exists huh...
https://www.tdx.cat/bitstream/handle/10803/7555/tjbs.pdf?sequence=1&isAllowed=y
Imouto-san ◆vnY2R+856E 2026/04/01 (Wed) 20:41:42 No.178626
>>178483
Wait what happened to my tripcode??
Rei ◆7SKICFe/rT 2026/04/01 (Wed) 20:42:12 No.178627
>>178626
sorry
i ate it
QueueSevenM ◆Tnq5UWtkfs 2026/04/03 (Fri) 00:35:47 No.178806
Hello, I'm back with another update to my VOCALOID project. It's not as big an improvement as last time - in fact, there are no new features - but I felt it was worth posting. I've been trying to rectify the major issues before I move on to implementing the Excitation plus Resonance model.
The first thing I attempted to tackle was all the added noise at high frequencies.
Here's the original spectrum:
https://files.catbox.moe/fq55bo.png
And here's the reconstructed spectrum (with no transforms applied):
https://files.catbox.moe/gq7jff.png
You can clearly see the high frequency artifacts. The first thing I tried was something mentioned in the paper, specifically the WBVPM section: there are two approaches to a non-integer-size discrete Fourier transform, where the first repeats the signal and the second upsamples it. I went with the second, as the first is patented and the second is also easier to implement. It is mentioned that increasing the repetition count (or, for upsampling, the upsampling factor) and then discarding the higher frequencies can improve the estimation by reducing artifacts. For repetition, it is also mentioned that quadratic interpolation can be applied to the resulting spectrum; I am not sure whether this can be done for upsampling, so I have not tried to implement it for now.
Here's the result after applying an upsampling factor of 3:
https://files.catbox.moe/qcgnzq.png
Here's the original audio:
https://files.catbox.moe/f7g8ta.wav
The original reconstruction:
https://files.catbox.moe/da0m1i.wav
And now with the improved reconstruction:
https://files.catbox.moe/513ycn.wav
You can see an improvement, especially at lower frequencies; however, the high frequency artifacts largely persist, so they must be arising elsewhere. I realized the source was the reconstruction of the signal (AKA the "synthesis"). I had previously implemented a synthesis method quite different from the one used in the study, because I did not understand the study's method at first. My method worked by taking each voice pulse and, for each sample to which that voice pulse is the closest, setting the sample's value to the interpolated value of a spline representing a time-domain version of the upsampled voice pulse, stepping by the ratio between a sample in the regular time domain and one in the upsampled time domain. In some cases, estimation inaccuracies and differences introduced by whatever transformations were applied result in these regions of samples being bigger than the pulse itself. In those cases, we take advantage of the periodic nature of the voice pulse and repeat it (i.e. sampling before the start is equivalent to sampling that offset from the end, and sampling after the end is the same as that offset from the start). However, this method produces discontinuities in some cases.
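The periodic continuation part amounts to wrapping the sample index back into the pulse, something like this one-liner (my own sketch, not the project's actual code):

```python
def periodic_sample(pulse, idx):
    # Out-of-range indices wrap around: sampling before the start is
    # the same as sampling that offset from the end, and sampling past
    # the end is the same as that offset from the start.
    return pulse[idx % len(pulse)]
```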
Here is an example of such a discontinuity:
https://files.catbox.moe/jnnxfj.png
I began to try to implement an interpolation system. In this system, we could calculate the gap between pulses - or, in the case of inaccuracies in the other direction (i.e. overlapping pulses), the overlapping area - and interpolate linearly from one pulse to the other. However, this approach was complicated significantly by the non-integer (and potentially differing) sizes of the pulses, as well as numerous edge cases. I struggled with it and spent over an hour trying to figure out how to do it correctly. About halfway through, I decided to check the paper again, and this time I understood the actual synthesis method properly, largely thanks to a diagram I had missed the first time.
In the actual method, each pulse is expanded in a manner similar to the border interpolation technique used in WBVPM analysis, except roughly in reverse. For each voice pulse, we generate extensions on both sides, each extension being the border interpolation ratio times the size of the voice pulse. Then we apply a trapezoidal window which starts at zero at each end of the extended voice pulse and reaches 1 after a protrusion of twice the border interpolation size on each side. Finally, we overlap-add the voice pulses.
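My reading of that synthesis step, as a sketch (the function name, parameter names, and the default border ratio are my own assumptions):

```python
import numpy as np

def synthesize(pulses, onsets, border_ratio=0.25, length=None):
    # Overlap-add each voice pulse after extending it periodically on
    # both sides and applying a trapezoidal window.
    if length is None:
        length = int(round(onsets[-1])) + len(pulses[-1])
    out = np.zeros(length)
    for pulse, onset in zip(pulses, onsets):
        ext = int(round(border_ratio * len(pulse)))
        # Periodic extension: prepend the tail, append the head.
        extended = np.concatenate([pulse[-ext:], pulse, pulse[:ext]])
        # Trapezoid: ramps over 2*ext samples at each end, flat middle.
        # Half-sample offsets make overlapping ramps sum exactly to 1.
        win = np.ones(len(extended))
        ramp = (np.arange(2 * ext) + 0.5) / (2 * ext)
        win[:2 * ext] = ramp
        win[-2 * ext:] = ramp[::-1]
        seg = extended * win
        start = int(round(onset)) - ext
        lo, hi = max(start, 0), min(start + len(seg), length)
        out[lo:hi] += seg[lo - start:hi - start]
    return out
```

With identical pulses spaced exactly one period apart, the overlapping ramps cross-fade to a constant, which is exactly the discontinuity-free behavior described above.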
This technique fixes the discontinuity issue because it effectively results in each border-interpolation-length side of each voice pulse being linearly interpolated with the corresponding section of the neighboring voice pulse over a span of twice the border interpolation size. However, this only holds perfectly when the fundamental frequency is the same for both voice pulses (and thus they are the same size) and their onsets are spaced exactly one period of the fundamental frequency apart. When this is not the case, some amount of modulation occurs, attenuating some voice pulses while accentuating others. This is especially noticeable when there are large inaccuracies in the fundamental frequency estimation and/or the voice pulse onset sequence.
Here's the same section from before. Notice how now it does not have a discontinuity:
https://files.catbox.moe/p26914.png
Now here's a zoomed-out version:
https://files.catbox.moe/zacw8w.png
Now here's a section with large inaccuracies in the MFPA estimation that clearly shows large modulation artifacting:
https://files.catbox.moe/efk1vx.png
Here's the new spectrum:
https://files.catbox.moe/f94zse.png
You can see that while the high frequency artifacts are gone, there are now more low frequency artifacts. In fact, the overall amount of artifacting is actually higher than before.
Here's the reconstructed audio:
https://files.catbox.moe/ympfi0.wav
While I ended up solving this issue by fixing large inaccuracies in the MFPA system, it is interesting to note that my approach is more resilient to estimation inaccuracies. Perhaps for a future improved vocal synthesizer, it would be worth exploring a variant of my periodic continuation technique, adapted with an interpolation method that can handle changes in pulse onset and f0.
The first thing I tried was switching to a magnitude-limited logarithmic scale for the amplitude in the MFPA function instead of a linear one. However, this had little to no effect. Next, I tried adjusting the size in periods of the window used for the peaks that are fed into MFPA; again, little to no effect. Then I tried implementing the harmonic peak selection algorithm I proposed in the previous post, but once more it had little to no effect.
Finally, I began looking at the MFPA refinement algorithm instead, and I found something quite interesting:
In this section, these are the per-frame detected onsets:
https://files.catbox.moe/f36gno.png
Now here's the onsets chosen by the MFPA refinement algorithm:
https://files.catbox.moe/4b7hc0.png
Notice that while one of the detected onsets is wrong, there is also a correct one for that voice pulse; additionally, the incorrect onset that was chosen actually belongs to the next pulse. Furthermore, that incorrect chosen onset was not even a detected one - the one detected for that frame was correct - so it must have been one of the additional onset candidates considered by the MFPA refinement algorithm. I realized shortly after what the issue was: when I first wrote the MFPA refinement algorithm, I was under the false assumption that its primary purpose was to compute a superset, rather than a subset, of the detected onsets. Because of this, I thought I could simplify the algorithm. The paper says to calculate the MFPA error by finding the closest MFPA onset to the frame. But since I assumed there should be at most one onset per pulse in the detected onsets, I could do this by just taking the onset time at that frame index (where the frame index is obtained by rounding the time). I believe I had actually even written the code originally to use a search, but then simplified it.
But since now there can be (and usually are) multiple detections per pulse, that assumption no longer holds, and by doing that we may choose a pulse which is not actually the closest. In the case shown above, what probably happened is that the wrong detection in the previous pulse was chosen, which in turn resulted in choosing the wrong one for the next pulse. I fixed the issue by making it use a search (after sorting the detected onsets first), and it fixed that section:
https://files.catbox.moe/0oryig.png
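The closest-onset search over sorted onsets can be done with a binary search, roughly like this (my own sketch; names are mine, not the project's):

```python
import numpy as np

def nearest_onset_error(frame_time, onsets):
    # `onsets` must be sorted ascending (sort once beforehand).
    # Binary search replaces the old round-to-frame-index lookup,
    # which could pick an onset that is not actually the closest.
    onsets = np.asarray(onsets)
    i = np.searchsorted(onsets, frame_time)
    # Only the onsets immediately before and after can be closest.
    candidates = onsets[max(i - 1, 0):i + 1]
    return float(np.min(np.abs(candidates - frame_time)))
```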
Here's the section that was heavily modulated before:
https://files.catbox.moe/r0oq2w.png
And here's the spectrum:
https://files.catbox.moe/pgfqfh.png
Notice the low frequency artifacts are mostly gone.
And here's the reconstructed audio:
https://files.catbox.moe/98zbd1.wav
Now here's the pitch transposed audio with the fixes applied:
https://voca.ro/1izsfK1EwXD3
Compare to before:
https://voca.ro/1mJ5qljrp9hD