
File: GnhnsopaAAAbj_R.jpg (238 KB, 1226x745)
fc2web and a whole bunch of other free Japanese hosts are to be shut down at the end of June

it has a page on archiveteam but it seems nothing is being scraped right now
https://wiki.archiveteam.org/index.php/Fc2web

a lot of old sites may disappear sad
>>
Are web.fc2.com sites staying up?

E.g. http://strangewalker.web.fc2.com/
>>
Yes, nothing indicating those ones are shutting down. It's probably worth having our own Ayashii Walker anyways, because I think that's the only one left up
>>
>>136918
used >>136945 and tried to grab all the domains with fc2web & fc2.kt. first time doing something like this so i hope i didnt fuck it up
https://up.heyuri.net/src/4343.txt

will try to follow up with other domains but i gotta find out all the subdomains that are dying...
>>
>>136844

I promise you, the UK doesn't have that much sway, nor have our incompetent government's expensive attempts to mediate the Internet ever actually done anything. It's all just WORDSWORDSWORDS intended to placate people who spend all day on Mumsnet complaining about how Pornhub needs to be banned because their impressionable shota saw one porn video and turned ghay (´~`)
>>
File: image.png (47 KB, 784x313)
http://futabajinro.fc2web.com/
Futaba's werewolf game's logs
we should salvage teh icons here for Heyuri's usage
>>
>japs themselves think about it?
You could read through Twitter
https://xcancel.com/search?f=tweets&q=fc2web
>>
if there are 10K sites disappearing in total, and we say they average 0.1GB each (this may even be generous):
10,000 × 0.1GB = 1,000GB = 1TB
I think it wouldn't be impossible to archive everything? unsure
>>
>>136972
some of these are like
com.fc2web.toukei135.html.comic
com.fc2web.toukei135.txt.robots
which looks like they should be toukei135.fc2web.com/robots.txt and /comic.html
>>
>toukei135
gives 404 unsure
>>
>>136995
you got that right. my mistake, i made an updated one. also has the other domains that are in the script in >>136918
https://up.heyuri.net/src/4344.txt

>>136990
theres about 13600 lines in this file. at 100mb per site that's roughly 1.36tb total, or about 900gb if we assume a third are dead. still somewhat in reach for anyone with a spare drive

there is a considerable amount of 404s and 403s, so i think it could be a lot less if we figure out how to skip the ones that'll error out
>>
This probably needs a python script (thxfully we're in the age of chatgpt)

It should use the links.txt 137011-san posted. If a page redirects anywhere on error.fc2.com from the index page, it shouldn't be saved and its link should be appended to error.txt on a new line. Successful ones should be written to something like done.txt after completing (that will be used to generate the link index if we ever put it on Heyuri etc). It should also have a "resuming" mechanism using the two files: check the last link in both of them, compare which one is further down in links.txt, and continue from the one after that.
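In case it helps whoever writes it, here's a rough Python sketch of that redirect check + resume logic. The file names and the error.fc2.com test are from this post; the NoRedirect handler and function names are just made up for illustration, and it's untested against real FC2 behavior:

```python
import urllib.request
import urllib.error

ERROR_HOST = "error.fc2.com"

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes urllib raise HTTPError on a 3xx
    # instead of silently following the redirect.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def is_dead(location):
    """A site counts as dead if its index redirects anywhere on error.fc2.com."""
    return location is not None and ERROR_HOST in location

def resume_index(links, done, errors):
    """Pick the position in links.txt to continue from: one past whichever
    of done.txt / error.txt got furthest down the list."""
    last = [links.index(f[-1]) for f in (done, errors) if f and f[-1] in links]
    return max(last) + 1 if last else 0

def check(url, timeout=10):
    """Return the Location header of the index page, or None if no redirect."""
    opener = urllib.request.build_opener(NoRedirect())
    try:
        opener.open(url, timeout=timeout)
        return None
    except urllib.error.HTTPError as e:
        return e.headers.get("Location")
```

The main loop would then be: for each link from `resume_index()` onwards, call `check()`, and append to error.txt if `is_dead()` says so, else to done.txt.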

After downloading all pages, we should
1- convert downloaded .htm and .html files from SHIFT_JIS to UTF-8
there is a linux command noted on Museum@Heyuri for this:
find . -type f \( -name '*.htm' -o -name '*.html' \) -exec sh -c 'iconv -f SHIFT_JIS -t UTF-8//TRANSLIT "{}" > "{}.tmp" && mv "{}.tmp" "{}"' \;
or, for saving yen symbols:
find . -type f \( -name '*.htm' -o -name '*.html' \) -exec sh -c 'iconv -f SHIFT_JIS -t UTF-8 "{}" | sed -r "s/‾/~/g;s/¥/\\\\/g" > "{}.tmp" && mv "{}.tmp" "{}"' \;
Or use teh powershell script: https://up.heyuri.net/src/3485.ps1 (Convert-Encoding.ps1)
2- convert <html> at the beginning of each file to <html lang="ja"> so browsers can display intended fonts.
also a command for this noted on Museum@Heyuri: find . -type f \( -name "*.html" -o -name "*.htm" \) -exec sed -i 's/<html\( lang="[^"]*"\)\?>/<html lang="ja">/Ig' {} +
3- remove the end of service notification popup thing
It should detect starting from "<div id="popup-container">" to "</script>" and remove in between
I didn't test but ChatGPT suggested this command: find . -type f \( -iname "*.html" -o -iname "*.htm" \) -exec perl -0777 -i -pe 's|<div id="popup-container">.*?</script>\s*||gs' {} +
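If the whole cleanup ends up as one Python script instead of shell one-liners, step 3 could look like this. It targets the same span as the perl command (from the popup div to the first closing script tag); the real popup markup on FC2 pages might differ, so treat it as a sketch:

```python
import re

# Non-greedy match from the popup div to the first </script> after it.
# re.S lets "." cross newlines, like perl's -0777 slurp mode with /s.
POPUP = re.compile(r'<div id="popup-container">.*?</script>\s*', re.S)

def strip_popup(html):
    """Remove the end-of-service notification block, if present."""
    return POPUP.sub("", html)
```

Pages without the popup pass through unchanged, so it's safe to run over the whole tree.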

Then all that's left is sharing with internets somehow biggrin
Depending how much it actually takes (I doubt the average gets anywhere close to 100MB), we can host it ourselves.

There will be some sites like http://uguisuhp.fc2web.com whose index doesn't link to anything; the content is only reachable through http://uguisuhp.fc2web.com/newpage2.htm
I will remember to save this one specifically, but there are probably others with hidden pages that will get lost...
>>
>1- convert downloaded .htm and .html files from SHIFT_JIS to UTF-8
>2- convert <html> at the beginning of each file to <html lang="ja"> so browsers can display intended fonts.
The potential collateral damage that this could cause (corruption, wrong characters, general b0rkage) isn't worth it IMHO - I think it'd be better to keep the pages in their original form sweat2
>>
*Not to mention that some pages may not even be Shift-JIS, but EUC-JP or even UTF-8
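One cheap way to hedge against that: sniff each file before converting, and only touch the ones that decode cleanly. A Python sketch (heuristic only - SHIFT_JIS and EUC-JP byte ranges overlap, so the candidate order matters and it can still guess wrong; checking for a <meta charset=...> tag first would be safer):

```python
def sniff_encoding(data, candidates=("utf_8", "euc_jp", "shift_jis")):
    """Return the first candidate codec that decodes the bytes without
    error, or None if none do. This is a guess, not a guarantee:
    some SHIFT_JIS kana sequences are also valid EUC-JP and vice versa."""
    for codec in candidates:
        try:
            data.decode(codec)
            return codec
        except UnicodeDecodeError:
            continue
    return None
```

Files already in UTF-8 (or plain ASCII) report as utf_8 and can be skipped entirely, which avoids the double-conversion corruption.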
>>
>The potential collateral damage that this could cause (corruption, wrong characters, general b0rkage) isn't worth it IMHO
I recall running into issues with serving SJIS files over web servers, or maybe it was Cloudflare. I think <html lang="ja"> was necessary too, but maybe it could use some kind of check to see if it should do dizzy

If someone cares about just having them not disappear, they could just do Step 3 first and distribute that with torrent too
>>
File: 77.jpg (16 KB, 320x240)
キタ━━━(゚∀゚)━━━!!
>>
File: Solo Leveling - S02E04.jpg (785 KB, 1920x1080)
>>
Why not use wget or wget2 for archiving teh sites? It has a recursive mode, an option to prevent it from blindly following external domains, and an option to convert links.
>>
And all thanks to our Pater RMS!
>>
>>136855
What website? (I want to know)
>>
File: y609kjivmpse1.png (131 KB, 1794x1397)
Speaking of storage space... it is going to make you envious (yes, a reddit link dark) https://reddit.com/r/StableDiffusion/comments/1jqej32/vram_is_not_everything_today/
I wonder if they buy it with their parents' money. Or are computer parts not that expensive in the US, relative to monthly income?
>>
File: hyh.jpg (12 KB, 320x240)
キタ━━━(゚∀゚)━━━!!
>>
>>137022
What is he doing with his hand? That's not a peace sign. Looks like he's trying to symbolize a gun
>>
>>137056
Hard drives aren't that expensive. I make $15 an hour and I can afford as much space as I want 🐵
>>
>>137060
I don't even have 100GB of free space currently dark And the number of SATA ports on ur motherboard is limited, what do?
>>
>>137056
computer parts are definitely not that expensive in the US
>>
>>137065
I just delete stuff that I'm probably not going to use again. Anime I've already watched that wasn't great, obsolete AI models, old versions of software, video games I probably won't replay, all of it can go. Someone who isn't a NEET can spend their money on HDDs to archive that stuff.
>>
File: dd.jpg (65 KB, 600x450)
キタ━━━(゚∀゚)━━━!!
>>
>>137043
i tried wget to grab all the 55street sites since fc2 was too large
55street has 178 sites according to the host-level record, and the downloaded data adds up to a total of 600mb, giving an average of 3.3mb per site. this COULD mean the total archive adds up to 52.8gb, assuming the 16k-site figure maintains a 3.3mb average. dont know how many sites just 404d, but lets just stick with the 3.3mb average

the compressed size of the 55street archive (450mb, 7z) is too big to put on heyuri uploader, so if anyone has a file uploader or a better compression algorithm do let me know

note that i havent done any of the cleaning mentioned in >>137014, just the raw site data. terminal says it took 23min to download everything, and i think i can scale it up a bit more so it'll be faster. i only used 8 threads and limited it to 512k & had a wait of .3s, so theres probably a lot of room for improvement
>>
>>137075
Just put it in parts! ヽ(´ー`)ノ And it's better to make a separate user board.
>>
I am trying to write a script that checks if links redirect to error.fc2.com (putting the links that don't redirect into a werks.txt file, and those that do into error.txt) but I can't get it to work with 100% certainty yet (;´Д`)
Once I have a clean list of subdomains, using wget or HTTrack or watever should be easy

>this COULD mean that the total archive could add up to 52.8g
That's nothing... I'd put all on Museum@Heyuri biggrin
>>
>>137076
ooo good point. forgot i can do that. it's below:
https://up.heyuri.net/user/boards/jparc/index.php

i'll probably try put some more site data on there so hopefully it doesnt fill up too quick (´~`) or uh, that the data doesnt just become redundant ┐(゚~゚)┌
>>
>>137077
I think wget can do it too. ヽ(´ー`)ノ

wget --spider --max-redirect=0 -a wget.log --tries=1 --wait=0.4 -i list.txt


then you'll need just a script that parses its log
--2025-04-04 21:32:16--  http://toukei135.fc2web.com/
Resolving toukei135.fc2web.com (toukei135.fc2web.com)... 199.48.208.133
Connecting to toukei135.fc2web.com (toukei135.fc2web.com)|199.48.208.133|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://error.fc2.com/web/ [following]
0 redirections exceeded.


>should be easy
Oh, it'll require so much man reading and debugging once it werks dark
>>
File: thxgpt.png (23 KB, 690x506)
THX, I think I already solved it - it was a false assumption of mine that they would all be index.html sweat2
I'm using curl for now but once I have a good list I should probably use wget or HTTrack. I'm not quite l33t enough (´ー`)
>>
>>137011
how complete is this?
I noticed it at least doesn't have http://njtown.fc2web.com/ unsure
>>
looks like the Common Crawl list of 293 million websites still isn't the entire internet
>>
>>137082
Whoa, C code and even having a GUI is pretty cool dizzy I only can into console scripts and just started to experiment with Tkinter closed-eyes2
>>
i hope you guyz can save teh japanese internets from the demonic forces trying to maek the web all boring. ヽ(´∇`)ノ
>>
File: 2025-04-05_09-55.jpg (66 KB, 709x402)
>>137087
sure doesnt (´~`) we might need to find other crawler lists. or maybe the older Common Crawl data sets have domains that arent included in the one sent by >>136945
looks like the 16k site figure might go up Σ(;゚Д゚)
>>
File: image.png (3 KB, 323x155)
I think there is no land.to domain that doesn't return 404 at all anymoar? unsure
The way it returns 404 confuses my script, so I'll just halt it and give teh result list. Less than 10K sites to archive if it werked right biggrin
>>
links that return an error - from 350wuen.fc2web.com onwards, they timeout instead of sending to error.fc2: https://up.heyuri.net/src/4347.txt
links that are ready to archive https://up.heyuri.net/src/4348.txt

Seems it got them correct but I didn't check thoroughly

>>137113
Would be bettar to include moar of course, but I think the 10K sites that made it to that crawl list is still sumthing biggrin
>>
There may be leaked h4ck3d DNS zone lists somewhere on the net. Dunno if google can find it, only real haxxors know where to get them now probably.
https://zonefiles.io/

https://stackoverflow.com/questions/131989/how-do-i-get-a-list-of-all-subdomains-of-a-domain
>>
>>136855
WHAT SITE
I never knew it existed and now I never will cry
>>
File: 55sites.png (2 KB, 266x54)
55 sites, average is 4MB... in theory dark
probably doesn't help my only available drive is an encrypted one intended for saving big videos (exfat), and it haets small icons
I should find a smaller drive to maek NTFS...
>>
imagine using teh proprietary dark
>>
>>137185
exfat genuinely does suck in general

all my external drives are NTFS, particularly since I do remember a time when exfat support was a lot iffier on Linux
although really, my really lazy bastard with a hammer instead of a drill solution would be a veracrypt file volume of 300MB and a lazy password lolol
>>
>>137120
i tried to use the list here but wget came up with weird errors when i tried to use it (´人`) just used the old one and let it 404 everything that wasnt on it. since the dead sites redirect to the error page, wget seems to pick up on that and doesnt download them.

the userboard now has ojiji.net scraped: 152 sites attempted, 115 successfully scraped. it took like 2 hours to download the files from this http://shima.ojiji.net/ fucker.. but it was done! ヽ(´ー`)ノ their site alone took up like 100mb... was i archiving lhqhq?

https://up.heyuri.net/user/boards/jparc/

commenting on how fast it was: it seems xargs is really good at downloading lots of domains concurrently, but it handles each domain with a single thread. turning the wait times down and giving it a random wait, alongside a little bump in dl speed, seems to have made it a little faster, but it's still overall slow. ChatGPT-san will have to answer how to make it faster...! (゚血゚#)
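One possible alternative to xargs: drive wget from a Python thread pool, so each worker grabs a new domain as soon as it finishes its current one. The wget flags loosely mirror the settings mentioned in the thread (8 workers, 512k limit, ~0.3s wait); everything else is a hypothetical sketch, not a tested pipeline:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def mirror(url):
    """One wget mirror job per domain; returns wget's exit code."""
    return subprocess.run(
        ["wget", "--mirror", "--convert-links", "--no-parent",
         "--wait=0.3", "--random-wait", "--limit-rate=512k", url],
    ).returncode

def run_all(urls, job=mirror, workers=8):
    # Like xargs -P8: up to `workers` domains download at once, and a
    # slow site like shima.ojiji.net only ties up one slot while the
    # other workers keep churning through the list.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, urls))
```

Calling run_all() with the real link list would start downloading; the `job` parameter is injectable mainly so the driver logic can be tested without touching the network.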
>>
>>137193
General-purpose filesystems do suck when you need to store millions of read-only, often small blobs.

>>137120
Fed it into my piece-of-shit crawler. Maybe it'll manage to save something.
>>
86 GiB
2M files
2.2M urls crawled
1.8M urls in queue

why am I doing it
>>
>>136956
They use archive.org

