
File: GnhnsopaAAAbj_R.jpg (238 KB, 1226x745)
fc2web and a whole bunch of other free Japanese hosts are to be shut down at the end of June

it has a page on archiveteam but it seems nothing is being scraped right now
https://wiki.archiveteam.org/index.php/Fc2web

a lot of old sites may disappear sad
>>
Are web.fc2.com sites staying up?

E.g. http://strangewalker.web.fc2.com/
>>
Yes, nothing indicating those ones are shutting down. It's probably worth having our own Ayashii Walker anyways, because I think that's the only one left up
>>
>>136918
used >>136945 and tried to grab all the domains with fc2web & fc2.kt. first time doing something like this so i hope i didnt fuck it up
https://up.heyuri.net/src/4343.txt

will try to follow up with other domains but i gotta find out all the subdomains that are dying...
>>
>>136844

I promise you, the UK doesn't have that much sway, nor have our incompetent government's expensive attempts to mediate the Internet ever actually done anything. It's all just WORDSWORDSWORDS intended to placate people who spend all day on Mumsnet complaining about how Pornhub needs to be banned because their impressionable shota saw one porn video and turned ghay (´~`)
>>
File: image.png (47 KB, 784x313)
http://futabajinro.fc2web.com/
Futaba's werewolf game's logs
we should salvage teh icons here for Heyuri's usage
>>
>japs themselves think about it?
You could read through Twitter
https://xcancel.com/search?f=tweets&q=fc2web
>>
if there are 10K sites disappearing in total, and we say they average 0.1GB each (this may even be generous):
10,000 × 0.1GB = 1,000GB = 1TB
I think it wouldn't be impossible to archive everything? unsure
>>
>>136972
some of these are like
com.fc2web.toukei135.html.comic
com.fc2web.toukei135.txt.robots
which looks like they should be toukei135.fc2web.com/robots.txt and /comic.html
>>
>toukei135
gives 404 unsure
>>
>>136995
you got that right. my mistake, i made an updated one. also has the other domains that are in the script in >>136918
https://up.heyuri.net/src/4344.txt

>>136990
theres about 13600 lines in this file. at 100mb per site that's roughly 1.36tb total, or about 900gb if we assume a third are dead. still somewhat in reach for anyone with a spare drive

there is a considerable amount of 404s and 403s, so i think it could be a lot less if we figure out how to skip the ones that'll error out
>>
This probably needs a python script (thxfully we're in the age of chatgpt)

It should use the links.txt 137011-san posted. If a page redirects anywhere on error.fc2.com from the index page, it shouldn't be saved and its link should be appended to error.txt on a new line. Successful ones should be written to something like done.txt after completing (that will be used to generate the link index if we ever put it on Heyuri etc). It should also have a "resuming" mechanism using the two files: check the last link in both of them, compare which one is further down in links.txt, and continue from the one after that.
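In case it helps whoever writes it, here's a rough Python sketch of that redirect check + resume logic. The file names and the error.fc2.com test are from this post; the NoRedirect handler and function names are just made up for illustration, and it's untested against real FC2 behavior:

```python
import urllib.request
import urllib.error

ERROR_HOST = "error.fc2.com"

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes urllib raise HTTPError on a 3xx
    # instead of silently following the redirect.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def is_dead(location):
    """A site counts as dead if its index redirects anywhere on error.fc2.com."""
    return location is not None and ERROR_HOST in location

def resume_index(links, done, errors):
    """Pick the position in links.txt to continue from: one past whichever
    of done.txt / error.txt got furthest down the list."""
    last = [links.index(f[-1]) for f in (done, errors) if f and f[-1] in links]
    return max(last) + 1 if last else 0

def check(url, timeout=10):
    """Return the Location header of the index page, or None if no redirect."""
    opener = urllib.request.build_opener(NoRedirect())
    try:
        opener.open(url, timeout=timeout)
        return None
    except urllib.error.HTTPError as e:
        return e.headers.get("Location")
```

The main loop would then be: for each link from `resume_index()` onwards, call `check()`, and append to error.txt if `is_dead()` says so, else to done.txt.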

After downloading all pages, we should
1- convert downloaded .htm and .html files from SHIFT_JIS to UTF-8
there is a linux command noted on Museum@Heyuri for this:
find . -type f \( -name '*.htm' -o -name '*.html' \) -exec sh -c 'iconv -f SHIFT_JIS -t UTF-8//TRANSLIT "{}" > "{}.tmp" && mv "{}.tmp" "{}"' \;
or, for saving yen symbols:
find . -type f \( -name '*.htm' -o -name '*.html' \) -exec sh -c 'iconv -f SHIFT_JIS -t UTF-8 "{}" | sed -r "s/‾/~/g;s/¥/\\\\/g" > "{}.tmp" && mv "{}.tmp" "{}"' \;
Or use teh powershell script: https://up.heyuri.net/src/3485.ps1 (Convert-Encoding.ps1)
2- convert <html> at the beginning of each file to <html lang="ja"> so browsers can display intended fonts.
also a command for this noted on Museum@Heyuri: find . -type f \( -name "*.html" -o -name "*.htm" \) -exec sed -i 's/<html\( lang="[^"]*"\)\?>/<html lang="ja">/Ig' {} +
3- remove the end of service notification popup thing
It should detect starting from "<div id="popup-container">" to "</script>" and remove in between
I didn't test but ChatGPT suggested this command: find . -type f \( -iname "*.html" -o -iname "*.htm" \) -exec perl -0777 -i -pe 's|<div id="popup-container">.*?</script>\s*||gs' {} +
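If the whole cleanup ends up as one Python script instead of shell one-liners, step 3 could look like this. It targets the same span as the perl command (from the popup div to the first closing script tag); the real popup markup on FC2 pages might differ, so treat it as a sketch:

```python
import re

# Non-greedy match from the popup div to the first </script> after it.
# re.S lets "." cross newlines, like perl's -0777 slurp mode with /s.
POPUP = re.compile(r'<div id="popup-container">.*?</script>\s*', re.S)

def strip_popup(html):
    """Remove the end-of-service notification block, if present."""
    return POPUP.sub("", html)
```

Pages without the popup pass through unchanged, so it's safe to run over the whole tree.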

Then all that's left is sharing with internets somehow biggrin
Depending how much it actually takes (I doubt the average gets anywhere close to 100MB), we can host it ourselves.

There will be some sites like http://uguisuhp.fc2web.com whose index doesn't link to anything; the content is only reachable through http://uguisuhp.fc2web.com/newpage2.htm
I will remember to save this one specifically, but there are probably others with hidden pages that will get lost...
>>
>1- convert downloaded .htm and .html files from SHIFT_JIS to UTF-8
>2- convert <html> at the beginning of each file to <html lang="ja"> so browsers can display intended fonts.
The potential collateral damage that this could cause (corruption, wrong characters, general b0rkage) isn't worth it IMHO - I think it'd be better to keep the pages in their original form sweat2
>>
*Not to mention that some pages may not even be Shift-JIS, but EUC-JP or even UTF-8
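One cheap way to hedge against that: sniff each file before converting, and only touch the ones that decode cleanly. A Python sketch (heuristic only - SHIFT_JIS and EUC-JP byte ranges overlap, so the candidate order matters and it can still guess wrong; checking for a <meta charset=...> tag first would be safer):

```python
def sniff_encoding(data, candidates=("utf_8", "euc_jp", "shift_jis")):
    """Return the first candidate codec that decodes the bytes without
    error, or None if none do. This is a guess, not a guarantee:
    some SHIFT_JIS kana sequences are also valid EUC-JP and vice versa."""
    for codec in candidates:
        try:
            data.decode(codec)
            return codec
        except UnicodeDecodeError:
            continue
    return None
```

Files already in UTF-8 (or plain ASCII) report as utf_8 and can be skipped entirely, which avoids the double-conversion corruption.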
>>
>The potential collateral damage that this could cause (corruption, wrong characters, general b0rkage) isn't worth it IMHO
I recall running into issues with serving SJIS files over web servers, or maybe it was Cloudflare. I think <html lang="ja"> was necessary too, but maybe it could use some kind of check to see if it should do dizzy

If someone cares about just having them not disappear, they could just do Step 3 first and distribute that with torrent too
>>
File: 77.jpg (16 KB, 320x240)
キタ━━━(゚∀゚)━━━!!
>>
File: Solo Leveling - S02E04.jpg (785 KB, 1920x1080)
>>
Why not use wget or wget2 for archiving teh sites? It has a recursive mode, an option to prevent it from blindly following external domains, and an option to convert links.
>>
And all thanks to our Pater RMS!
>>
>>136855
What website? (I want to know)
>>
File: y609kjivmpse1.png (131 KB, 1794x1397)
Speaking of storage space... it is going to make you envious (yes, a reddit link dark) https://reddit.com/r/StableDiffusion/comments/1jqej32/vram_is_not_everything_today/
I wonder if they buy it with their parents' money. Or are computer parts not that expensive in the US, relative to monthly income?
>>
File: hyh.jpg (12 KB, 320x240)
キタ━━━(゚∀゚)━━━!!
>>
>>137022
What is he doing with his hand? That's not a peace sign. Looks like he's trying to symbolize a gun
>>
>>137056
Hard drives aren't that expensive. I make $15 an hour and I can afford as much space as I want 🐵
>>
>>137060
I don't even have 100GB of free space currently dark And the number of SATA ports on ur motherboard is limited, what do?
>>
>>137056
computer parts are definitely not that expensive in the US
>>
>>137065
I just delete stuff that I'm probably not going to use again. Anime I've already watched that wasn't great, obsolete AI models, old versions of software, video games I probably won't replay, all of it can go. Someone who isn't a NEET can spend their money on HDDs to archive that stuff.
>>
File: dd.jpg (65 KB, 600x450)
キタ━━━(゚∀゚)━━━!!
>>
>>137043
i tried wget to grab all the 55street sites since fc2 was too large
55street has 178 sites according to the host-level record, and the downloaded data adds up to a total of 600mb, giving an average of 3.3mb per site. this COULD mean the total archive adds up to 52.8gb, assuming the 16k-site figure maintains a 3.3mb average. dont know how many sites just 404d, but lets just stick with the 3.3mb average

the compressed size of the 55street archive (450mb, 7z) is too big to put on heyuri uploader, so if anyone has a file uploader or a better compression algorithm do let me know

note that i havent done any of the cleaning mentioned in >>137014, just the raw site data. terminal says it took 23min to download everything, and i think i can scale it up a bit more so it'll be faster. i only used 8 threads and limited it to 512k & had a wait of .3s, so theres probably a lot of room for improvement
>>
>>137075
Just put it in parts! ヽ(´ー`)ノ And it's better to make a separate user board.
>>
I am trying to write a script that checks if links redirect to error.fc2.com (putting the links that don't redirect into a werks.txt file, and those that do into error.txt) but I can't get it to work with 100% certainty yet (;´Д`)
Once I have a clean list of subdomains, using wget or HTTrack or watever should be easy

>this COULD mean that the total archive could add up to 52.8g
That's nothing... I'd put all on Museum@Heyuri biggrin
>>
>>137076
ooo good point. forgot i can do that. it's below:
https://up.heyuri.net/user/boards/jparc/index.php

i'll probably try put some more site data on there so hopefully it doesnt fill up too quick (´~`) or uh, that the data doesnt just become redundant ┐(゚~゚)┌
>>
>>137077
I think wget can do it too. ヽ(´ー`)ノ

wget --spider --max-redirect=0 -a wget.log --tries=1 --wait=0.4 -i list.txt


then you'll need just a script that parses its log
--2025-04-04 21:32:16--  http://toukei135.fc2web.com/
Resolving toukei135.fc2web.com (toukei135.fc2web.com)... 199.48.208.133
Connecting to toukei135.fc2web.com (toukei135.fc2web.com)|199.48.208.133|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://error.fc2.com/web/ [following]
0 redirections exceeded.


>should be easy
Oh, it'll require so much man reading and debugging once it werks dark
>>
File: thxgpt.png (23 KB, 690x506)
THX, I think I already solved it - it was a false assumption of mine that they would all be index.html sweat2
I'm using curl for now but once I have a good list I should probably use wget or HTTrack. I'm not quite l33t enough (´ー`)
>>
>>137011
how complete is this?
I noticed it at least doesn't have http://njtown.fc2web.com/ unsure
>>
looks like the Common Crawl list of 293 million websites still isn't the entire internet
>>
>>137082
Whoa, C code and even having a GUI is pretty cool dizzy I only can into console scripts and just started to experiment with Tkinter closed-eyes2
>>
i hope you guyz can save teh japanese internets from the demonic forces trying to maek the web all boring. ヽ(´∇`)ノ
>>
File: 2025-04-05_09-55.jpg (66 KB, 709x402)
>>137087
sure doesnt (´~`) we might need to find other crawler lists. or maybe the older Common Crawl data sets have domains that arent included in the one sent by >>136945
looks like the 16k site figure might go up Σ(;゚Д゚)
>>
File: image.png (3 KB, 323x155)
I think there is no land.to domain that doesn't return 404 at all anymoar? unsure
The way it returns 404 confuses my script, so I'll just halt it and give teh result list. Less than 10K sites to archive if it werked right biggrin
>>
links that return an error - from 350wuen.fc2web.com onwards, they timeout instead of sending to error.fc2: https://up.heyuri.net/src/4347.txt
links that are ready to archive https://up.heyuri.net/src/4348.txt

Seems it got them correct but I didn't check thoroughly

>>137113
Would be bettar to include moar of course, but I think the 10K sites that made it to that crawl list is still sumthing biggrin
>>
There may be leaked h4ck3d DNS zone lists somewhere on the net. Dunno if google can find it, only real haxxors know where to get them now probably.
https://zonefiles.io/

https://stackoverflow.com/questions/131989/how-do-i-get-a-list-of-all-subdomains-of-a-domain
>>
>>136855
WHAT SITE
I never knew it existed and now I never will cry
>>
File: 55sites.png (2 KB, 266x54)
55 sites, average is 4MB... in theory dark
probably doesn't help my only available drive is an encrypted one intended for saving big videos (exfat), and it haets small icons
I should find a smaller drive to maek NTFS...
>>
imagine using teh proprietary dark
>>
>>137185
exfat genuinely does suck in general

all my external drives are NTFS, particularly since I do remember a time when exfat support was a lot iffier on Linux
although really, my really lazy bastard with a hammer instead of a drill solution would be a veracrypt file volume of 300MB and a lazy password lolol
>>
>>137120
i tried to use the list here but wget came up with weird errors when i tried to use it (´人`) just used the old one and let it 404 everything that wasnt on it. since the dead sites redirect to the error page, wget seems to pick up on that and doesnt download them.

the userboard now has ojiji.net scraped: 152 sites attempted, 115 successfully scraped. it took like 2 hours to download the files from this http://shima.ojiji.net/ fucker.. but it was done! ヽ(´ー`)ノ their site alone took up like 100mb... was i archiving lhqhq?

https://up.heyuri.net/user/boards/jparc/

commenting on how fast it was: it seems xargs is really good at downloading lots of domains concurrently, but it handles each domain with a single thread. turning the wait times down and giving it a random wait, alongside a little bump in dl speed, seems to have made it a little faster, but it's still overall slow. ChatGPT-san will have to answer how to make it faster...! (゚血゚#)
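One possible alternative to xargs: drive wget from a Python thread pool, so each worker grabs a new domain as soon as it finishes its current one. The wget flags loosely mirror the settings mentioned in the thread (8 workers, 512k limit, ~0.3s wait); everything else is a hypothetical sketch, not a tested pipeline:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def mirror(url):
    """One wget mirror job per domain; returns wget's exit code."""
    return subprocess.run(
        ["wget", "--mirror", "--convert-links", "--no-parent",
         "--wait=0.3", "--random-wait", "--limit-rate=512k", url],
    ).returncode

def run_all(urls, job=mirror, workers=8):
    # Like xargs -P8: up to `workers` domains download at once, and a
    # slow site like shima.ojiji.net only ties up one slot while the
    # other workers keep churning through the list.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, urls))
```

Calling run_all() with the real link list would start downloading; the `job` parameter is injectable mainly so the driver logic can be tested without touching the network.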
>>
>>137193
General-purpose filesystems do suck when you need to store millions of read-only, often small blobs.

>>137120
Fed it into my piece-of-shit crawler. Maybe it'll manage to save something.
>>
86 GiB
2M files
2.2M urls crawled
1.8M urls in queue

why am I doing it
>>
>>136956
They use archive.org

