




File: m1772488142904.png
(444 KB, 2003x1640)
So I've been working on a project to re-implement the VOCALOID1 engine.
I'm basing it on the description in Jordi Bonada's PhD thesis "Voice Processing and Synthesis by Performance Sampling and Spectral Models" and not the original papers as the former is more detailed, easier to follow, and also describes the VOCALOID2 engine.

After a lot of trouble with getting TWM f0 estimation to work, I've finally gotten to implementing MFPA. And amazingly, it seems to have worked first try.

Compare my results:
https://i.ibb.co/dsvgv0fd/Screen-Shot-2026-03-02-at-3-54-48-PM.png

To the results in the study:
https://i.ibb.co/C3fjdWVd/Screen-Shot-2026-03-02-at-3-55-09-PM.png
>>
File: m1772489166399.png
(282 KB, 1754x1278)
Also here's the graph of the f0 estimate from the TWM. It's still somewhat flawed (see the jump down at frame 37), and it required unusual parameters (Kaiser-Bessel beta 2.2 instead of the 1.95 recommended by the study, and only 6 harmonics instead of 11) to avoid instabilities even in relatively trivial scenarios. Actually, I think I've finally figured out what's been wrong with the TWM the whole time: I forgot to convert the frequencies from bins to hertz. I think this was originally intentional, because I only realized much later that the error formula is neither linear nor even relative, so its result depends on the frequency unit. I haven't tested the fix yet, but I'd imagine it should finally solve the problems I've had with TWM.
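
To illustrate why the unit matters, here's a rough sketch of a two-way mismatch error. It assumes the commonly quoted Maher and Beauchamp form and constants (p=0.5, q=1.4, r=0.5, rho=0.33); the thesis's exact variant may differ. The 44100 Hz sample rate and the peak list are made up for the demo, and `twm_error` is just an illustrative name.

```python
def twm_error(f0, peak_freqs, peak_amps, n_harmonics=6,
              p=0.5, q=1.4, r=0.5, rho=0.33):
    """Two-way mismatch error of a candidate f0 against measured peaks.

    f0 and peak_freqs must share one unit (Hz or bins).
    Constants are the commonly quoted defaults (an assumption here).
    """
    a_max = max(peak_amps)
    harmonics = [f0 * n for n in range(1, n_harmonics + 1)]

    def term(delta, freq, amp):
        # Mismatch weighted by freq**-p, plus an amplitude-weighted part
        # whose constant -r rewards matches to strong peaks.
        scaled = delta * freq ** (-p)
        return scaled + (amp / a_max) * (q * scaled - r)

    # Predicted -> measured: each harmonic against its nearest peak.
    err_pm = 0.0
    for h in harmonics:
        nearest = min(range(len(peak_freqs)),
                      key=lambda i: abs(peak_freqs[i] - h))
        err_pm += term(abs(peak_freqs[nearest] - h), h, peak_amps[nearest])

    # Measured -> predicted: each peak against its nearest harmonic.
    err_mp = 0.0
    for f, a in zip(peak_freqs, peak_amps):
        err_mp += term(min(abs(f - h) for h in harmonics), f, a)

    return err_pm / n_harmonics + rho * err_mp / len(peak_freqs)


SAMPLE_RATE = 44100.0  # assumption, not stated in the thread
FRAME_SIZE = 256       # frame length from the post
HZ_PER_BIN = SAMPLE_RATE / FRAME_SIZE

# Made-up, slightly mistuned peaks near the harmonics of E4 (~329.6 Hz).
peaks_hz = [331.0, 658.0, 990.0, 1320.0, 1647.0, 1979.0]
amps = [1.0, 0.8, 0.5, 0.3, 0.2, 0.1]
peaks_bins = [f / HZ_PER_BIN for f in peaks_hz]

candidate_hz = 329.6
err_hz = twm_error(candidate_hz, peaks_hz, amps)
err_bins = twm_error(candidate_hz / HZ_PER_BIN, peaks_bins, amps)
print(err_hz, err_bins)  # same data, different error values
```

The `delta * freq ** (-p)` part scales with the square root of the unit change when p=0.5, while the `-r` reward stays fixed, so feeding bins instead of hertz changes how mismatch is weighed against that reward.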

This graph specifically shows the estimated fundamental frequency for each 256-point frame of an E4 /e/ phoneme.
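
For reference, the bin/hertz conversion for a frame like this can be sketched as below. The thread never states the sample rate, so 44100 Hz is an assumption, as is an FFT size equal to the 256-point frame (no zero-padding). The helper names are illustrative.

```python
SAMPLE_RATE = 44100.0  # assumption: not stated in the thread
FRAME_SIZE = 256       # frame length mentioned in the post


def bin_to_hz(k, sample_rate=SAMPLE_RATE, n_fft=FRAME_SIZE):
    """Convert a (possibly fractional) FFT bin index to hertz."""
    return k * sample_rate / n_fft


def hz_to_bin(freq_hz, sample_rate=SAMPLE_RATE, n_fft=FRAME_SIZE):
    """Convert a frequency in hertz to a fractional FFT bin index."""
    return freq_hz * n_fft / sample_rate


# E4 in equal temperament with A4 = 440 Hz (five semitones below A4).
E4_HZ = 440.0 * 2.0 ** (-5.0 / 12.0)

print(bin_to_hz(1))      # width of one bin in Hz
print(hz_to_bin(E4_HZ))  # where E4 lands, in fractional bins
```

Under those assumptions one bin is about 172 Hz wide, so E4 sits just below bin 2.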
>>
Say it in Japanese, ok
>>
>>175062
What do you mean?
>>
Actually I meant IPA /i/ not /e/.
>>
I don't understand it well myself but i admire your efforts ヽ(´ー`)ノ
>>
>as the former is more detailed, easier to follow, and also describes the VOCALOID2 engine.
you went with the harder to program paper? what made you choose this one?

also, wat are you planning to do with it afterwards? is it just a programming exercise? ヽ(゚ρ゚)ノ
>>
>>175073
>you went with the harder to program paper? what made you choose this one?
No, "former" means first in the sentence, not chronologically.
>also, wat are you planning to do with it afterwards? is it just a programming exercise? ヽ(゚ρ゚)ノ
It was just that originally, but I now plan to eventually release it as an open-source library.
>>
>>175070
You'd be surprised. Try just reading the paper from the start. You may find that you can actually understand it.

https://www.tdx.cat/bitstream/handle/10803/7555/tjbs.pdf?sequence=1&isAllowed=y

