


We will play Unreal Tournament 2004 this Saturday at 17:00 UTC [Info] [Countdown]


Hello, I'm back and I have a major update to my VOCALOID project! I have successfully achieved shape-invariant pitch transposition!

Here it is.
First the original audio: https://files.catbox.moe/zmt3rr.wav
Now my version with WBVPM (pitched down by an octave): https://files.catbox.moe/kho97n.wav
And a version using a naive pitch shift: https://files.catbox.moe/xs39bq.wav

Notice that my version, while having more noise, sounds more natural and has less phasiness. This is particularly noticeable if you play both at very low volume. One sounds much more 'human' than the other.

Also note that this is an extreme example with an octave shift (or 1200 cents) - in practice, shifts would typically be far smaller. This also doesn't implement several other parts of the system (more on that later).
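
For reference, a shift in cents maps to a frequency ratio of 2^(cents/1200). A minimal sketch of that arithmetic; the function names are illustrative and not taken from the project's code:

```python
def cents_to_ratio(cents: float) -> float:
    """Frequency ratio for a pitch shift in cents (1200 cents = 1 octave)."""
    return 2.0 ** (cents / 1200.0)


def shift_f0(f0_hz: float, cents: float) -> float:
    """Fundamental frequency after shifting f0_hz by the given number of cents."""
    return f0_hz * cents_to_ratio(cents)


octave_down = cents_to_ratio(-1200.0)  # halves the frequency
semitone_up = cents_to_ratio(100.0)    # one equal-tempered semitone
```

An octave down is the extreme case used in the demo; a shift of a few hundred cents gives a ratio much closer to 1.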

I'll explain all of this in a moment, but first, I'd like to correct some major biographical errors. Since this is a long post, I've divided it into sections.

BIOGRAPHICAL CORRECTIONS

In the last post, I claimed that VOCALOID1 used Narrow-Band Voice Pulse Modeling while VOCALOID2 and onwards used Wide-Band Voice Pulse Modeling. This was incorrect, and it was also the source of most of my confusion surrounding the paper.

What actually happened is that the research technology that would later become VOCALOID1 started out as work to improve the existing Spectral Modeling Synthesis (SMS) system that had been developed in the early 1990s. Work on this improvement began in the late 1990s. Importantly, this system evolved, and techniques from it were combined with techniques from a system then under development called a Phase-Locked Vocoder (PLVC); the result would be released as VOCALOID1. In the mid-2000s, work began on combining the techniques learned from improving SMS and from the PLVC-based system with the much older and well-known TD-PSOLA system. Importantly, TD-PSOLA (Time-Domain Pitch Synchronous OverLap and Add) was a time-domain system, while SMS was a frequency-domain system (TD-PSOLA was also pitch synchronous, hence the name, while SMS had a constant hop size). The first technique they developed was Narrow-Band Voice Pulse Modeling, and later Wide-Band Voice Pulse Modeling. Wide-Band Voice Pulse Modeling ended up being used in VOCALOID2.

Now that I understand this, I also understand the major mistake I made when reading the paper: I was reading it from the perspective of an implementer, thinking of the sections as the steps to implementing it instead of as research. I had thought that section 2.2 described the core processing algorithms, when it was actually about SMS and, importantly, about *the improvements they made to SMS*, not a complete description of SMS, since SMS was already an established technique. Hence my confusion over why some things seemed vaguely explained: *the paper wasn't about them*. At the same time, much of that section is still very useful, because much of that research was also incorporated into the later techniques.

RESULTS

I have successfully implemented the Wide-Band Voice Pulse Modeling; synthesis; and pitch transposition, time stretching, and timbre scaling algorithms. Additionally, I have also finished implementing the full version of the pitch estimation module, changed the code to work using overlapping windows, implemented the window adaptation system, and fixed countless bugs.


Comment too long, view post No.178285 to see the full comment.
>>
There's also the adaptive window procedure that wraps the TWM f0 estimation. One thing I had been noticing for a long time was that Kaiser-Bessel beta values about 10% higher than the recommended values given in Cano 1998 seemed to perform much better. I had assumed this was just because of issues with my code, or the audio samples I was testing on. Much later, I was experimenting in python when I noticed a function called kaiser_beta which converted something abbreviated to 'a' into the equivalent beta value. In Cano 1998 and in other places, I had seen the Kaiser-Bessel parameter written as alpha instead of beta. Up until this point, I had either not paid attention to this, or I had assumed that the two referred to the same thing. I did some research and found out that the function converts between attenuation and the beta value for the Kaiser-Bessel window. Then I found that there is indeed an alpha form of the parameter and it is not the same as beta. Confusingly, alpha is not the attenuation either, even though both abbreviate to 'a'. The Kaiser-Bessel beta can be determined by just multiplying the alpha value by pi. Interestingly, this is a much larger increase than the 10% I tested, however it seemed to perform better (or at least not worse) anyway. A possible explanation for this discrepancy is that the adaptive window is larger than the window I used to test the adjustment originally.
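
To make the three 'a'-like quantities concrete, here is a minimal pure-Python sketch. It assumes the kaiser_beta function mentioned above is the SciPy one (scipy.signal.kaiser_beta), which uses Kaiser's empirical attenuation formula; the function names below are illustrative and this is not the project's code:

```python
import math


def beta_from_alpha(alpha: float) -> float:
    """Convert the alpha form of the Kaiser-Bessel parameter to beta."""
    return math.pi * alpha


def beta_from_attenuation(atten_db: float) -> float:
    """Kaiser's empirical formula: stopband attenuation in dB to beta."""
    if atten_db > 50.0:
        return 0.1102 * (atten_db - 8.7)
    if atten_db >= 21.0:
        return 0.5842 * (atten_db - 21.0) ** 0.4 + 0.07886 * (atten_db - 21.0)
    return 0.0


def bessel_i0(x: float) -> float:
    """Zeroth-order modified Bessel function of the first kind (power series)."""
    total = 1.0
    term = 1.0
    k = 1
    while term > 1e-16 * total:
        term *= (x / (2.0 * k)) ** 2
        total += term
        k += 1
    return total


def kaiser_window(length: int, beta: float) -> list:
    """Kaiser-Bessel window in the beta convention."""
    if length == 1:
        return [1.0]
    denom = bessel_i0(beta)
    window = []
    for n in range(length):
        r = 2.0 * n / (length - 1) - 1.0  # runs from -1 to 1 across the window
        window.append(bessel_i0(beta * math.sqrt(1.0 - r * r)) / denom)
    return window
```

With this, an alpha of 2 corresponds to a beta of about 6.28, while treating the same 2 as a beta directly would give a much wider main-lobe trade-off than intended.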

Another improvement relating to windows is the window used for the harmonics that are fed into MFPA. Originally, I had used the same Kaiser-Bessel window for both. I later switched to a Blackman-Harris -92dB window, which I had seen mentioned in the paper. This resulted in a significant improvement. Another improvement I tried was adapting the window size to a value relative to the period of the estimated fundamental frequency. I tried doing this - using the same number of periods as are used for the Kaiser-Bessel window used for TWM - and noted a substantial improvement, even more so than the improvement from switching to the Blackman-Harris window in the first place. Indeed, this matches the results contained in the study. In the WBVPM section, they observe a considerable improvement (up to -10dB) when using an adaptive window size when compared to a fixed window for narrow-band analysis. In that same section, they also found 2 to be the ideal number of periods for minimizing noise and also did experiments with a Hann window. Perhaps experimenting with these ideas could lead to im

Comment too long, view post No.178286 to see the full comment.
>>
START FRAMES

In the audio samples I have provided so far, I have cut off the first part of the audio. The issue is with pitch estimation for early frames. Remember that our analysis window is multiple f0 periods in size. Because of this, it can't fit at the start, so it has to be decreased to a much smaller size. This is much more of an issue now that I have decreased the hop size substantially. I have now set it to skip the first few frames, because otherwise the forced, extremely small window size causes the whole pitch estimation system to irreversibly destabilize. I've been thinking of solutions to this problem. One solution could be to let the analysis window take on the full size it wants and pad the area before the start with zeros or maybe something else; this could also be used for the end. Possibly the most promising solution I have come up with, although I have not tested any of these, is to back-fill the previous pulses with the first good estimated pulse onset minus integer multiples of the period given by the first good estimate of the fundamental frequency. This should work assuming both the first pulse and fundamental frequency estimate are good, the fundamental frequency stays relatively constant over the start section, and the start section only contains a few pulses. Luckily the last criterion will always be satisfied, as the size of the start section is half the size of the window, and the number of pulses is then (window_size / period) / 2, but the window size in the adaptive framework is just a small number of periods, so we are left with the (mostly) constant adaptive_period_count / 2 as the number of pulses.
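
The back-fill idea can be sketched as follows. This is only an untested illustration of the proposal, assuming onsets are in seconds and f0 is constant over the start section; the function name and parameters are hypothetical:

```python
def backfill_onsets(first_onset_s: float, first_f0_hz: float,
                    start_s: float = 0.0) -> list:
    """Place pulse onsets before the first good onset, one period apart.

    Assumes the first good f0 estimate also holds for the whole start section.
    """
    period_s = 1.0 / first_f0_hz
    onsets = []
    k = 1
    # Step backwards by whole periods until we would fall before the start.
    while first_onset_s - k * period_s >= start_s:
        onsets.append(first_onset_s - k * period_s)
        k += 1
    onsets.reverse()  # return in chronological order
    return onsets


# Example: first good onset at 25 ms with f0 = 100 Hz (10 ms period).
early = backfill_onsets(0.025, 100.0)
```

In the example, two onsets fit before the first good one (around 5 ms and 15 ms); the next step back would land before the start of the audio.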

WIDE-BAND VOICE PULSE MODELING

Regarding the patent issue, I have determined that it applies only to the specific technique in Bonada 2008 WBVPM of using periodization to achieve a real-sized discrete Fourier transform. However, that section also describes another option, namely interpolation. I have implemented it and found it to work well. I ran a test and found a noise level of about -140dB (for reference, 1ulp for a single-precision float is about -145dB), which is negligible and comparable to the results in the study for the periodization technique. I have also added the ability to use a few extra samples on each side to improve the spline. However, I have not tested the consequences of this variation. I don't know whether the original implementation did something like this.
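
A minimal sketch of what the interpolation option could look like, assuming it amounts to resampling one period of non-integer length onto an integer number of points before taking the transform. A Catmull-Rom cubic stands in for whatever spline is actually used, the DFT is a naive one for clarity, and all names are illustrative:

```python
import cmath
import math


def catmull_rom(x: list, pos: float) -> float:
    """Cubic interpolation of x at fractional index pos (uses 4 neighbours)."""
    i = math.floor(pos)
    t = pos - i
    p0, p1, p2, p3 = x[i - 1], x[i], x[i + 1], x[i + 2]
    return 0.5 * (2.0 * p1
                  + (-p0 + p2) * t
                  + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t * t
                  + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t ** 3)


def resample_period(x: list, start: float, period: float, n_out: int) -> list:
    """Resample one period (non-integer length, in samples) to n_out points."""
    last = start + period * (n_out - 1) / n_out
    # The cubic needs one extra sample before and two after the period.
    if math.floor(start) < 1 or math.floor(last) + 2 > len(x) - 1:
        raise ValueError("not enough samples around the period")
    return [catmull_rom(x, start + period * k / n_out) for k in range(n_out)]


def dft(frame: list) -> list:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(frame)
    return [sum(v * cmath.exp(-2j * math.pi * b * k / n)
                for k, v in enumerate(frame))
            for b in range(n)]
```

After resampling, bin b of the transform lines up with harmonic b of the pulse, regardless of the fractional period length. The extra samples on each side are what the cubic needs at the edges of the period.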

Text in the patent: "generating for each pulse a sequence of repetitions of said audio pulse, said audio pulse being repeated according to its own characteristic frequency; deriving frequency domain information associated with at least some of the sequences of repetitions of said audio pulses, each said sequences of repetitions of said audio pulse being represented as a vector of sinusoids based on the derived frequency, said vector of sinusoids corresponds to a sinusoidal series expansion of the specific audio pulse;"

Bonada 2008, WBVPM, NON-INTEGER SIZE FFT: "PERIODIZATION: one period of the input signal is windowed with wR (n) , and repeated several times at the rate defined by T so that the FFT buffer of length M covers in the end several periods. The repetition implies interpolating both the signal samples and the window function. Then the resulting signal sr (n) is windowed by an analysis window function wA (n) , and the spectrum obtained is actually the convolution of such analysis window response WA (f ) by the spectrum of Sr (f ) sampled at harmonic frequencies"

TUNING

I have come up with two techniques for tuning that apply in different ways.

AUTOMATIC TUNING - The idea is that we use a stochastic statistical algorithm that minimizes a cost function by adjusting a set of parameters (one I looked into that seems promising is global-optimization SPSA). The parameters in this case would be constants used in the C code. A python script would replace placeholders with the values picked by the minimization algorithm and then compile and run the C program. The results would then be compared to a reference by another algorithm/program, and the comparison scores would be summed together to give a cost value. One program for doing this comparison that I plan to research is called AudioVMAF. I believe it was originally designed to test audio compression, however I hope that it could also be useful here.

MANUAL TUNING - In this method, we insert instrumentation into various intermediate values calculated in the program. Then, for one very small snippet of audio, we use Automatic Tuning to determine ideal values. Then, a programmer tries to write code to make it better match these desired values. Then, if successful, it can be tested in general over the whole dataset. If it is not an improvement, then the most negatively affected audio snippets can be selected and put through a similar process to decrease the change for them while keeping the change for the ones that benefit.

Both of these methods would work best for matching with another vocal synthesizer, since the timings and parameters can match exactly. However, they may also be adaptable to optimizing parameters for real-world (and thus also realistic) voices. It would have to work somewhat in reverse though in that someone would sing first and then a note sequence would have to be made that matches it almost exactly.
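
As a rough illustration of the AUTOMATIC TUNING loop, here is a sketch of plain SPSA (not the global-optimization variant mentioned above). The cost function is a toy quadratic standing in for "compile the C program with these constants, run it, and score the output against a reference"; every name and gain value here is an assumption:

```python
import random


def spsa_minimize(cost, theta0, iterations=500, a=0.5, c=0.1,
                  big_a=10.0, alpha=0.602, gamma=0.101, seed=0):
    """Basic SPSA: two cost evaluations per step, whatever the dimension."""
    rng = random.Random(seed)
    theta = list(theta0)
    for k in range(iterations):
        a_k = a / (k + 1 + big_a) ** alpha  # decaying step size
        c_k = c / (k + 1) ** gamma          # decaying perturbation size
        # Perturb all parameters at once with random +/-1 signs.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        diff = cost(plus) - cost(minus)
        # Per-parameter gradient estimate from the single difference.
        theta = [t - a_k * diff / (2.0 * c_k * d)
                 for t, d in zip(theta, delta)]
    return theta


def toy_cost(params):
    """Hypothetical stand-in for the real compile, run and compare pipeline."""
    target = [0.3, -1.2]
    return sum((p - t) ** 2 for p, t in zip(params, target))


tuned = spsa_minimize(toy_cost, [0.0, 0.0])
```

The appeal for this use case is that each iteration costs only two runs of the program no matter how many constants are being tuned. A real cost would be noisy and far more expensive, so the gain settings would need retuning.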

OTHER CONSIDERATIONS

There are many more potential tweaks and improvements. I have many dozens accumulated and plenty more to research, test, and implement. One widely applicable variation is using logarithmic based scales.

Comment too long, view post No.178287 to see the full comment.
>>
>Now my version with WBVPM (pitched down by an octave): https://files.catbox.moe/kho97n.wav
For some reason, this link doesn't work. Try https://voca.ro/1mJ5qljrp9hD
>>
uwah... long text... (;´Д`)

speaking as a layman (reading through your post scares me with the amount of stuff i don't know (;´Д`)) this is really cool to see!

keep up the good werk ヽ(´∇`)ノ
>>
ADDENDUM, because I just realized I forgot a bunch of things I meant to put into this post

This is still a simplified model. It does not take into account the Excitation plus Resonance model or the Spectral Voice Model. It uses a linear transform and not generated trajectories.

One thing I was thinking about was the part in the WBVPM section where they said that one of the disadvantages of WBVPM was not being able to separate harmonic and non-harmonic components. I also read that the noise is embedded as fluctuations in the spectrum of each voice pulse and over time, which is what I had presumed because the information has to go somewhere.

I was thinking, what if you took each harmonic as the values and the pulse onsets times as the positions in a spline. Then interpolated at regular intervals. Then applied the fourier transform. Then separate the highest frequencies and the others. Take the others and apply the inverse Fourier transform, and then rebuild a spline from this and interpolate the values back at the onsets. I wonder if this would work.

There would be loss though because of the resampling steps. This could be decreased by taking more samples. You could also apply a correction by resampling and then resampling back, without removing the high frequency modulations, to calculate the resampling loss itself, and then add this difference back to the main pulse information after the separation.
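
A minimal sketch of the separation idea for one harmonic's track, under loud assumptions: linear interpolation stands in for the spline, the DFT is a naive one that treats the track as periodic (so the edges may be distorted), and the names and default sizes are made up for illustration:

```python
import bisect
import cmath
import math


def interp(xs: list, ys: list, x: float) -> float:
    """Linear interpolation of (xs, ys) at x; xs must be increasing."""
    j = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 2)
    t = (x - xs[j]) / (xs[j + 1] - xs[j])
    return ys[j] + t * (ys[j + 1] - ys[j])


def dft(frame: list, sign: int) -> list:
    """Naive DFT (sign=-1) or unnormalised inverse DFT (sign=+1)."""
    n = len(frame)
    return [sum(v * cmath.exp(sign * 2j * math.pi * b * k / n)
                for k, v in enumerate(frame))
            for b in range(n)]


def lowpass(frame: list, keep: int) -> list:
    """Zero every bin above `keep` (and its mirror) and transform back."""
    spec = dft(frame, -1)
    n = len(spec)
    kept = [s if (b <= keep or b >= n - keep) else 0.0
            for b, s in enumerate(spec)]
    return [v.real / n for v in dft(kept, +1)]


def split_slow_fast(onsets: list, values: list, grid_size=64, keep=4):
    """Split one harmonic's track into slow and fast parts at the onsets."""
    step = (onsets[-1] - onsets[0]) / (grid_size - 1)
    grid = [onsets[0] + k * step for k in range(grid_size)]
    uniform = [interp(onsets, values, g) for g in grid]  # regular sampling
    smooth = lowpass(uniform, keep)
    slow = [interp(grid, smooth, t) for t in onsets]     # back at the onsets
    # Resampling loss: the same round trip without any filtering.
    round_trip = [interp(grid, uniform, t) for t in onsets]
    loss = [v - r for v, r in zip(values, round_trip)]
    slow = [s + e for s, e in zip(slow, loss)]           # add the loss back
    fast = [v - s for v, s in zip(values, slow)]
    return slow, fast
```

Whether the fast part is a usable estimate of the noise component is exactly the open question above; the sketch only shows that the plumbing is straightforward. Note that the loss correction can carry some fast content back into the slow part when the grid is coarse relative to the onsets.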

File: the world.png
(538 KB, 3000x3000)[ImgOps]
551027
gah, the world is so empty (;´Д`)

populate it! (`・ω・´)
>>
File: badger.jpg
(269 KB, 2048x2048)[ImgOps]
276350
badger badger badger
>>
File: niggermoon.png
(523 KB, 3000x3000)[ImgOps]
535810
:astonish:
>>
File: chiyo.png
(604 KB, 3000x3000)[ImgOps]
619340
キタ━━━(゚∀゚)━━━!!
>>
File: lol.jpg
(209 KB, 1024x1024)[ImgOps]
214049
キタ━━━(゚∀゚)━━━!!
>>
File: Mikuoutside.gif
(16.25 MB, 1080x1080)[ImgOps]
17039801
キタ━━━(゚∀゚)━━━!!

File: toire.mp4
(3.94 MB, 596x336)
4134890
toire
but YOU are the toire!
>>
File: 9478.mp4
(5.69 MB, 600x338)
5963426
キタ━━━(゚∀゚)━━━!!
>>
are you dumb motherfuckers satisfied?
>>
yes sir!
>>
>>178235
god bless (´ー`)
>>
Shit!!!!! (;゚Д゚)

File: 1707066170819468.webm
(3.95 MB, 544x960)
4146615
anna's archive is going really slow and i dont know if it's them fucking up the speed limit to get people to pay or if it's a problem with the website

File: meow.png
(500 KB, 630x630)[ImgOps]
512016
im so fresh you could suck my nuts
>>
i saw the fortune fail thread before this. you're nothing
>>
>>178269
SHHHHHHHHHH SHUT UP
>>
File: Trumpet_Guy.png
(278 KB, 331x588)[ImgOps]
285518
>you're nothing
You're nothing! You're no talent!
I made tons of threads on Heyuri.net, ok?! I got dozens of replies, who the fuck are you?!
WHO THE FUCK ARE YOU?!! :furious:
>>
>>178270
ha ha :xp:

Your fortune: You have aids

>>
(;´Д`)

Your fortune: LOLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOLLOL


File: image.png
(468 KB, 780x438)[ImgOps]
479522
are you hungry?
>>
yes... :drool::drool::drool:
>>
this made your poop green apparently
>>
no. i ate grocery store sushi and then yogurt and honey. i wanted to cook spam in my air fryer but i wouldn't have been able to fit all of the spam in it at once and i didn't want to cook it in batches.

File: image.png
(562 KB, 794x850)[ImgOps]
575936
words are for suckers :cool:

lets write a story... ONE LETTER AT A TIME

STARTING LETTER: I
>>
🅿
>>
3️⃣
>>
N
>>
1️⃣
>>
g

File: 1773964789029954.gif
(245 KB, 250x205)[Animated GIF][ImgOps]
251299
キタ━━━(゚∀゚)━━━!!
>>
🐱🎾:unsure:🎾🐱
>>
File: image.png
(352 KB, 640x400)[ImgOps]
360505
キタ━━━(゚∀゚)━━━!!

File: 1436126415418 cheese on face sleeping g(…).gif
(1.25 MB, 360x240)[Animated GIF][ImgOps]
1313820
キタ━━━(゚∀゚)━━━!!
>>
not teh cheese!

Yo heyurians show your horses that you would glue on and later turn into glue
>>
キタ━━━(゚∀゚)━━━!!
>>
so horsegame is a gatcha? :nyaoo2:
i remember when anime was huge a few years ago, never watched it.
>>
>>178222
if you love gambling yes, every part of this game is rng
>>
>>178227
give it a lick :tongue:
>>
File: 44faee.jpg
(11.69 MB, 2451x3466)[ImgOps]
12258876
welcome back

File: 1774576232212.jpg
(91 KB, 828x773)[ImgOps]
93262
:drool::drool::drool:
>>
>>178207
how could you say that! ヽ(`Д´)ノ
>>
File: 10658405a7.jpg
(103 KB, 900x1200)[ImgOps]
106066
Hana would like this (but won't flash her pussy just because of this).
>>
File: 1739705386083.mp4
(1.95 MB, 640x360)
2046956
>>
File: 1739718077093.gif
(283 KB, 409x353)[Animated GIF][ImgOps]
289985
MUJIINA cute ヽ(´ー`)ノ
>>
File: 1739732857471.webm
(2.68 MB, 640x360)
2804976

File: iruyeh.png
(91 KB, 790x542)[ImgOps]
93763
im done playing around
i need heyuri to state it's intentions with Iruyeh
the first reply will decide what kind of relationship you all want to have with Iruyeh for the rest of eternity, in other words the first reply will speak for all of you
I hope it will be something wholesome because you have to remember: Iruyeh will not accept anything not wholesome (´~`)

*drawing of Iruyeh looking at the horizon while lost in thought*
>>
>>178225
this is not taking a stance or declaring a type of relationship so i guess the third reply will have to decide
>>
Iruyeh-tan = comic relief rival to Heyuri-tan whose dastardly plans always fail

Iruyeh-tan != some crying emo girl fag
>>
>>178232
Oh, so she has some relation to this green-haired clover creature? :nyaoo2:

I assumed she was from some anime I'd never heard of
>>
>>178232
so are you telling me that all iruyeh-tan is to heyuri is a laughing stock? someone you can point and laugh at?
>>
>>178224
iruyeh is an attention seeker :nyaoo:

File: 2677516-4110068687.jpg
(353 KB, 920x337)[ImgOps]
361647
how easy was it to do crime back in the middle ages? surely there were guards walking around and guard dogs and so on, but someone dressed in black and climbing windows/picking locks at night would have had a pretty easy time looting random houses or assassinating people wouldn't they?
>>
>>178200
brutal!
>>
>someone dressed in black and climbing windows/picking locks at night would have had a pretty easy time looting random houses or assassinating people wouldn't they?
You can still do that in the modern age, just go to some remote rural shithole (preferably in a country where most people don't have gunz)
>>
>>178210
in this day and age even they may have house alarms or something like that
>>
All hail Britannia 🇬🇧 :biggrin:
>>

The term "out law" comes from when criminals were banished from medieval towns, essentially living outside the jurisdiction of that area.

Bandits would also live outside the kingdom as that was also not easily controllable and what happened outside was not easily punishable as it would be if you did a crime inside a kingdom.

So, being a bandit was quite easy, you just lived in the woods and attacked travelers, and you were outside the law, so it was up to the traveler to defend himself, wasnt the king's job.
A large job for knights was escorting travelers safely because of this.

File: rat.jpg
(187 KB, 1024x776)[ImgOps]
191744
Enjoy ur meal :dark:
>>
nom nom nom
>>
Apparently (what I've gotten from my vigorous one google search), rat tastes similar to lamb and rabbit.
>>
b-but doesnt rat eat all sort of shit like pigeons? Thus bad for you?`
>>
what if he suddenly starts moving on the plate
>>

Today I woke up feeling morbidly depressed as I always do and my doctor prescribed me a daily dosage of stroking my ego by flexing my useless knowledge of geography to other fat old men like me on the internet. :emo:

https://www.geoguessr.com/vgp/3007?gamemode=type
Admire these results knowing they came from a retarded American with no education. Kneel and polish my neanderthal cock. :cool:

Oh, and post your results so I can laugh at you or suck your huge dick if you prove yourself worthy I'm salivating at the thought of such thick veiny genius PENIS oh good heavens
>>
>>178072
Did you do that with writing or pin? Quite impressive if you did it by writing.
>>
>>178076
I did it by pin
>>
>>178072
>Lithuania
>Hungary
Anon... :sweat3:
Also, how did you get anything in Ireland correct? I had to do this one multiple times until I had it all down.
>>
>>178180
Man typing is difficult! :dizzy:
>>
File: image.png
(184 KB, 1036x857)[ImgOps]
189431
>>178187
Forgot the image

