File: GnhnsopaAAAbj_R.jpg (238 KB, 1226x745)
fc2web and a whole bunch of other free japanese hosts are to be shut down at the end of june

it has a page on archiveteam but it seems nothing is being scraped right now
https://wiki.archiveteam.org/index.php/Fc2web

a lot of old sites may disappear sad
>>
why are they being shut down?
>>
>>136840
they dont align culturally with the NWO emo
>>
>>136840
>The FC2WEB system has been in operation for over 20 years,
>The system and servers are aging, and it is difficult to maintain them.

if you visit a site you get a popup
http://saomix.fc2web.com/

if theyre abandoned, they will be gone. but if they can migrate, the old visitor counts will prob reset
( ´,_ゝ`)
>>
File: image.png (13 KB, 296x223)
I will try to scrape some stuff later on, but I hope there will be a way to remove this warning from teh archives dark
>>
My guess would be that it's related to the UK's "no fun allowed" Online Safety Act, which goes into full effect this summer dark

It applies to any large-ish "platform" that serves UK users, and they're trying to bully the entire world into kowtowing to it. Perhaps those Japanese services decided to just shut down instead of receiving a fine or geoblocking the UK - especially since their services have likely been mostly dead for the past 15 years

Sage for serious business
>>
File: usada.jpg (19 KB, 297x221)
Let's try saving at least Futaba/2channel/2D etc related fc2web sites to Museum@Heyuri happy

I will attempt to scrape these sites soon:
http://saomix.fc2web.com/
http://convenies.fc2web.com/
http://tuneari.fc2web.com/
http://uguisuhp.fc2web.com/newpage2.htm
http://meganekko2.fc2web.com/
http://nijineta.fc2web.com/
http://mypacekame.fc2web.com/ (there is a notice of moving the site, but the new one is empty - noting down here, will only put up if he gives up)
http://rva139.fc2web.com/
http://ichinoa.fc2web.com/
http://riceballman.fc2web.com/

I found these sites with a google search of "虹裏" site:fc2web.com, u may search with similar keywords and suggest the sites (or just post them ITT for the sake of sharing)
>>
File: 1724642291260865.jpg (69 KB, 700x848)
someone quick save that japanese nudist website kuma6
>>
>>136855
I will save it with my photographic memory and i'll describe it to you whenever you need it biggrin.
>>
>>136847
ganbatte dance2
>>
>archiveteam
These fucking queermoes do jack diddly shit. They let geocities.jp, teacup, etc. die unarchived. Fuck them.
>>
>>136873
why did they do that? :(
>>
>>136844
Why are lawmakers so mad? nyaoo
>>
File: 050516-3.jpg (10 KB, 320x240)
Post some cool things you find from random fc2 pages you find
>>136874
Not trying is easier than trying biggrin
>>
File: 95473.jpg (10 KB, 320x240)
キタ━━━(゚∀゚)━━━!!
>>
File: 20030713_01.jpg (57 KB, 640x480)
キタ━━━(゚∀゚)━━━!!
>>
File: yy.jpg (12 KB, 246x500)
キタ━━━(゚∀゚)━━━!!
>>
Public archives are dead.
I used to maintain a few pages dedicated to archiving the history of certain boards, but these days archive.ph just won't cooperate. I can't even use it to archive the index of a slow imageboard without it spitting a bullshit error at me.
I still feel really bad about not saving anything from geocities.jp beyond a screencap of a page or two, and a few flash files. To be fair, I actually trusted archiveteam would pull through and scrape everything, so I wasn't really worried at the time closed-eyes2
Whenever anyone goes on a "copypaste some kanji before:2015 in the search bar and find weird images" rabbithole hootenanny, the first couple thousand results you're going to receive are hosted on fc2, so it's certainly going to be weird to see what sites will surface once it's gone.
>>
File: f.jpg (5 KB, 259x194)
キタ━━━(゚∀゚)━━━!!
>>
>>136901
>hosted on fc2, so it's certainly going to be weird to see what sites will surface once it's gone
To my understanding sites hosted on the fc2 main site will survive, only fc2web (and others in OP) will b gone
>>

from urllib.parse import urlparse
from googlesearch import search  # from the googlesearch-python package

# free hosts that are shutting down - uncomment one at a time and rerun
urls = [
    # "55street.net",
    # "easter.ne.jp",
    # "finito-web.com",
    # "ojiji.net",
    # "zero-yen.com",
    "fc2web.com",
    # "k-free.net",
    # "gooside.com",
    # "ktplan.net",
    # "kt.fc2.com",
    # "zero-city.com",
    # "k-server.org",
    # "land.to"
]

results = search(f"site:*.{urls[0]}", num_results=10000, unique=True, safe=None, sleep_interval=5, region="ja")

# collapse every hit down to scheme://host/ so we end up with one entry per site
parsed_urls = []
for s in results:
    parsed_url = urlparse(s)
    domain = f"{parsed_url.scheme}://{parsed_url.netloc}/"
    parsed_urls.append(domain)
    print(domain, flush=True)

unique_urls = list(set(parsed_urls))

filename = f"{urls[0]}.txt"
with open(filename, 'w') as file:
    for url in unique_urls:
        file.write(url + '\n')

print(f"saved {filename}")


>>136847
i tried something similar with this py script but google doesnt have many results.
archive.org cdx api wont return just subdomains either.
>>
What do futaba posters, ie japs themselves, think about it?
>>
Here's a host-level list of 293 million websites (5.3 GB download, extracts to a 20 GB txt file):
https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-jan-feb-mar/host/cc-main-2025-jan-feb-mar-host-ranks.txt.gz
Some text editors like Notepad++ will need 24GB+ RAM to open the file, but I'm sure there are programs/scripts that can search through it without loading it all into RAM. The domains are backwards like "com.fc2web.nigger" instead of "nigger.fc2web.com".

8347 matches for com.fc2web
180 matches for net.55street
393 matches for com.fc2.kt
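
For searching through it without loading it all into RAM, a line-by-line scan like this should do (just a sketch - double-check the actual column layout of the ranks file and adjust the field handling; gzip.open() in "rt" mode works the same way if you'd rather not extract the 20 GB file first):

# Scan the extracted host-ranks file line by line so nothing big sits in RAM.
# Assumes whitespace-separated columns with the reversed host in one of them -
# check a couple of real lines first and adjust if the layout differs.
targets = ("com.fc2web", "net.55street", "com.fc2.kt")

with open("cc-main-2025-jan-feb-mar-host-ranks.txt", encoding="utf-8") as src, \
     open("matches.txt", "w", encoding="utf-8") as out:
    for line in src:
        for field in line.split():
            if any(field == t or field.startswith(t + ".") for t in targets):
                out.write(field + "\n")
                break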
>>
>8347 matches for com.fc2web
>180 matches for net.55street
>393 matches for com.fc2.kt
There should probably be a list only for these, and fix teh URLs with regex
Though it still takes someone willing with enuf storage/bandwidth to archive thousands of sites sweat2
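
Fixing the URLs is basically just reversing the dot-separated parts - a quick sketch, assuming the entry is a plain reversed host with no path bits mixed in:

# "com.fc2web.toukei135" -> "http://toukei135.fc2web.com/"
def reversed_host_to_url(rev_host: str) -> str:
    parts = rev_host.strip().split(".")
    return "http://" + ".".join(reversed(parts)) + "/"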
>>
>>136919
futaba posters dont talk. its all sentence-long bullshit and (AI) image posting.
>>
>>136952
Are they okay?
>>
In Japan, imageboards are for images 💁‍♀️
>>
File: image.png (99 KB, 711x625)
Futaba is on the chattier side
The img server doesn't even have image replies
They don't currently have a discussion about this topic, but that doesn't mean they never did - they don't have public archives like 4chan's archived.moe etc, so we can't just search for related discussions in the past.
When you visit img, you can always find toshiaki discussing very average ecchi pictures/screenshots. It's a place worth patterning after ( ´ω`)
>>
>fc2web and a whole bunch of other free japanese hosts are to be shut down at the end of june
LMAO
>>
Are web.fc2.com sites staying up?

E.g. http://strangewalker.web.fc2.com/
>>
Yes, nothing indicating those ones are shutting down. It's probably worth having our own Ayashii Walker anyways, because I think that's the only one left up
>>
>>136918
used >>136945 and tried to grab all the domains with fc2web & fc2.kt. first time doing something like this so i hope i didnt fuck it up
https://up.heyuri.net/src/4343.txt

will try to follow up with other domains but i gotta find out all the subdomains that are dying...
>>
>>136844

I promise you, the UK doesn't have that much sway, nor have our incompetent government's expensive attempts to mediate the Internet ever actually done anything. It's all just WORDSWORDSWORDS intended to placate people who spend all day on Mumsnet complaining about how Pornhub needs to be banned because their impressionable shota saw one porn video and turned ghay (´~`)
>>
File: image.png (47 KB, 784x313)
http://futabajinro.fc2web.com/
Futaba's werewolf game's logs
we should salvage teh icons here for Heyuri's usage
>>
>japs themselves think about it?
You could read through Twitter
https://xcancel.com/search?f=tweets&q=fc2web
>>
if there are 10K sites disappearing in total, and we say they are 0.1GB on average (this may even be generous)
10,000 * 0.1GB = 1,000GB = 1TB
I think it wouldn't be too impossible to archive everything? unsure
>>
>>136972
some of these are like
com.fc2web.toukei135.html.comic
com.fc2web.toukei135.txt.robots
which looks like they should be toukei135.fc2web.com/robots.txt and /comic.html
>>
>toukei135
gives 404 unsure
>>
>>136995
you got that right. my mistake, i made an updated one. also has the other domains that are in the script in >>136918
https://up.heyuri.net/src/4344.txt

>>136990
theres about 13600 lines in this file, and if we assume a third are dead, and that every site also takes up 100mb, that would mean roughly 900gb of storage. or uh... 1.36tb i guess. still somewhat in reach for anyone with a spare drive

there is a considerable amount of 404s and 403s, so i think it could be alot less if we figure how to just not download the ones that'll error out
>>
This probably needs a python script (thxfully we're in the age of chatgpt)

It should use the links.txt 137011-san posted: if a page redirects to anywhere on error.fc2.com from the index page, it shouldn't save it and should instead append the link to error.txt on a new line. Successful ones should be written to something like done.txt after completing (that will be used to generate the link index if we ever put it on Heyuri etc). Also it should have a "resuming" mechanism somehow using the two files, checking the last link in both of them, comparing which one is further down in links.txt, and continuing from the one after that.
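
Something like this is roughly what I have in mind (untested sketch using the third-party requests library; instead of comparing last lines it just skips anything already recorded in done.txt/error.txt, which is simpler but resumes the same way):

import requests

def already_checked():
    # crude resume: collect every URL already sorted into either file
    seen = set()
    for name in ("done.txt", "error.txt"):
        try:
            with open(name, encoding="utf-8") as f:
                seen.update(line.strip() for line in f)
        except FileNotFoundError:
            pass
    return seen

seen = already_checked()
with open("links.txt", encoding="utf-8") as links:
    for url in (line.strip() for line in links):
        if not url or url in seen:
            continue
        try:
            r = requests.get(url, timeout=20)  # follows redirects by default
            dead = "error.fc2.com" in r.url or r.status_code >= 400
        except requests.RequestException:
            dead = True
        with open("error.txt" if dead else "done.txt", "a", encoding="utf-8") as out:
            out.write(url + "\n")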

After downloading all pages, we should
1- convert downloaded .htm and .html files from SHIFT_JIS to UTF-8
there is a linux command noted on Museum@Heyuri for this:
find . -type f \( -name '*.htm' -o -name '*.html' \) -exec sh -c 'iconv -f SHIFT_JIS -t UTF-8//TRANSLIT "{}" > "{}.tmp" && mv "{}.tmp" "{}"' \;
or this variant, which maps the ‾ and ¥ characters back to ~ and \:
find . -type f \( -name '*.htm' -o -name '*.html' \) -exec sh -c 'iconv -f SHIFT_JIS -t UTF-8 "{}" | sed -r "s/‾/~/g;s/¥/\\\\/g" > "{}.tmp" && mv "{}.tmp" "{}"' \;
Or use teh powershell script: https://up.heyuri.net/src/3485.ps1 (Convert-Encoding.ps1)
2- convert <html> at the beginning of each file to <html lang="ja"> so browsers can display the intended fonts.
a command for this is also noted on Museum@Heyuri:
find . -type f \( -name "*.html" -o -name "*.htm" \) -exec sed -i 's/<html\( lang="[^"]*"\)\?>/<html lang="ja">/Ig' {} +
3- remove the end of service notification popup thing
It should find the span starting from "<div id="popup-container">" up to the next "</script>" and remove everything in between
I didn't test it, but ChatGPT suggested this command:
find . -type f \( -iname "*.html" -o -iname "*.htm" \) -exec perl -0777 -i -pe 's|<div id="popup-container">.*?</script>\s*||gs' {} +

Then all that's left is sharing with internets somehow biggrin
Depending on how much space it actually takes (I doubt the average gets anywhere close to 100MB), we can host it ourselves.

There will be some sites like http://uguisuhp.fc2web.com whose index doesn't link to anything, so they need to be reached via http://uguisuhp.fc2web.com/newpage2.htm
I will remember to save this one specifically, but there are probably some others with hidden pages that will get lost...
>>
>1- convert downloaded .htm and .html files from SHIFT_JIS to UTF-8
>2- convert <html> at the beginning of each file to <html lang="ja"> so browsers can display intended fonts.
The potential collateral damage that this could cause (corruption, wrong characters, general b0rkage) isn't worth it IMHO - I think it'd be better to keep the pages in their original form sweat2
>>
*Not to mention that some pages may not even be Shift-JIS, but EUC-JP or even UTF-8
>>
>The potential collateral damage that this could cause (corruption, wrong characters, general b0rkage) isn't worth it IMHO
I recall running into issues with serving SJIS files over web servers, or maybe it was Cloudflare. I think <html lang="ja"> was necessary too, but maybe it could use some kind of check to see whether it's actually needed dizzy

If someone cares about just having them not disappear, they could just do Step 3 first and distribute that with torrent too
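
If the conversion does happen, a check like this could at least avoid mangling non-SJIS pages (untested sketch - it only trusts an explicit charset declaration near the top of each file and leaves everything else alone; "mirror" is just a placeholder for wherever the scrape lands):

import re
from pathlib import Path

CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([-\w]+)', re.IGNORECASE)

def declared_charset(path):
    head = path.read_bytes()[:2048]  # charset declarations sit near the top
    m = CHARSET_RE.search(head)
    return m.group(1).decode("ascii", "ignore").lower() if m else None

for page in Path("mirror").rglob("*.htm*"):  # "mirror" = wherever the scrape lives
    if declared_charset(page) in ("shift_jis", "shift-jis", "x-sjis", "sjis"):
        text = page.read_bytes().decode("cp932", errors="replace")
        page.write_text(text, encoding="utf-8")
        # NOTE: the charset= declaration inside the file should be rewritten to
        # utf-8 as well, or browsers will still try to decode it as Shift_JIS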
>>
File: 77.jpg (16 KB, 320x240)
キタ━━━(゚∀゚)━━━!!
>>
File: Solo Leveling - S02E04.jpg (785 KB, 1920x1080)
>>
Why not use wget or wget2 for archiving teh sites? It has a recursive mode, an option preventing it from blindly following external domains, and an option to convert links.
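
Something like this per site should cover all of that (untested, flags straight from the GNU wget manual - recursive retrieval stays on the starting host by default, so external domains aren't followed):
wget --recursive --level=inf --page-requisites --convert-links --adjust-extension --no-parent --wait=0.3 --random-wait -e robots=off http://saomix.fc2web.com/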
>>
And all thanks to our Pater RMS!
>>
>>136855
What website? (I want to know)
>>
File: y609kjivmpse1.png (131 KB, 1794x1397)
Speaking of storage space... It is going to make you envious (yes, a reddit link dark) https://reddit.com/r/StableDiffusion/comments/1jqej32/vram_is_not_everything_today/
I wonder if they buy it with their parents' money. Or is it not so expensive, relative to the monthly income, to buy computer parts in the US?
>>
File: hyh.jpg (12 KB, 320x240)
キタ━━━(゚∀゚)━━━!!
>>
>>137022
What is he doing with his hand? That's not a peace sign. Looks like he's trying to symbolize a gun
>>
>>137056
Harddrives aren't that expensive. I make 15 an hour and I can afford as much space as I want 🐵
>>
>>137060
I do not even have 100GB of free space currently dark And the number of SATA ports on ur motherboard is limited, what do?
>>
>>137056
computer parts are definitely not that expensive in the US
>>
>>137065
I just delete stuff that I'm probably not going to use again. Anime I've already watched that wasn't great, obsolete AI models, old versions of software, video games I probably won't replay, all of it can go. Someone who isn't a NEET can spend their money on HDDs to archive that stuff.
>>
File: dd.jpg (65 KB, 600x450)
キタ━━━(゚∀゚)━━━!!
>>
>>137043
i tried wget to take all 55street sites since fc2 was too large
55street has 178 sites according to the host lvl record, and the downloaded data adds up to a total of 600mb, giving an average of 3.3mb per site. this COULD mean that the total archive could add up to 52.8gb, assuming the 16k site figure maintains a 3.3mb average. dont know how many sites just 404d, but lets just keep with the 3.3mb average

the compressed size of the 55street archive (450mb, 7z) is too big to put on heyuri uploader, so if anyone has a file uploader or a better compression algorithm do let me know

note that i havent done any of the cleaning mentioned in >>137014, just the raw site data. terminal says it was 23min to download everything, and i think i can scale it up a bit more too so it'll be faster. i only used 8threads and limited it to 512k & had a wait of .3s, so theres probably alot of room for improvement
>>
>>137075
Just put it in parts! ヽ(´ー`)ノ And it's better to make a separate user board.
>>
I am trying to write a script that checks whether links redirect to error.fc2.com (putting the links that don't redirect into a werks.txt file, and those that do into error.txt) but I can't get it to work with 100% certainty yet (;´Д`)
Once I have a clean list of subdomains, using wget or HTTrack or watever should be easy

>this COULD mean that the total archive could add up to 52.8g
That's nothing... I'd put it all on Museum@Heyuri biggrin
>>
>>137076
ooo good point. forgot i can do that. it's below:
https://up.heyuri.net/user/boards/jparc/index.php

i'll probably try put some more site data on there so hopefully it doesnt fill up too quick (´~`) or uh, that the data doesnt just become redundant ┐(゚~゚)┌
>>
>>137077
I think wget can do it too. ヽ(´ー`)ノ

wget --spider --max-redirect=0 -a wget.log --tries=1 --wait=0.4 -i list.txt


then you'll need just a script that parses its log
--2025-04-04 21:32:16--  http://toukei135.fc2web.com/
Resolving toukei135.fc2web.com (toukei135.fc2web.com)... 199.48.208.133
Connecting to toukei135.fc2web.com (toukei135.fc2web.com)|199.48.208.133|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://error.fc2.com/web/ [following]
0 redirections exceeded.
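
A rough sketch of such a parser, assuming the log keeps the shape of the excerpt above:

# split wget.log into live/dead lists based on the spider output
current = None
with open("wget.log", encoding="utf-8", errors="replace") as log, \
     open("werks.txt", "w") as ok, open("error.txt", "w") as bad:
    for line in log:
        if line.startswith("--") and "http" in line:
            current = line.split()[-1]          # the "--<time>--  <url>" line
        elif "awaiting response..." in line and current:
            if line.rsplit("...", 1)[-1].strip().startswith("200"):
                ok.write(current + "\n")        # 200 OK -> alive
                current = None
        elif line.startswith("Location:") and "error.fc2.com" in line and current:
            bad.write(current + "\n")           # redirected to the fc2 error page
            current = None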


>should be easy
Oh, it'll require so much man reading and debugging once it werks dark
>>
File: thxgpt.png (23 KB, 690x506)
THX, I think I already solved it - it was a false assumption of mine that they would all be index.html sweat2
I'm using curl for now but once I have a good list I should probably use wget or HTTrack. I'm not quite l33t enough (´ー`)
>>
>>137011
how complete is this?
I noticed it at least doesn't have http://njtown.fc2web.com/ unsure
>>
looks like the Common Crawl list of 293 million websites still isn't the entire internet
>>
>>137082
Whoa, C code and even having a GUI is pretty cool dizzy I only can into console scripts and just started to experiment with Tkinter closed-eyes2
>>
i hope you guyz can save teh japanese internets from the demonic forces trying to maek the web all boring. ヽ(´∇`)ノ
>>
File: 2025-04-05_09-55.jpg (66 KB, 709x402)
>>137087
sure doesnt (´~`) we might need to find other crawler lists. or maybe the older Common Crawl data sets have domains that arent included in the one sent by >>136945
looks like the 16k site figure might go up Σ(;゚Д゚)
>>
File: image.png (3 KB, 323x155)
I think there is no land.to domain that doesn't return 404 at all anymoar? unsure
The way it returns 404 confuses my script, so I'll just halt it and give teh result list. Less than 10K sites to archive if it werked right biggrin
>>
links that return an error - from 350wuen.fc2web.com onwards, they time out instead of redirecting to error.fc2: https://up.heyuri.net/src/4347.txt
links that are ready to archive https://up.heyuri.net/src/4348.txt

Seems it got them correct but I didn't check thoroughly

>>137113
Would be bettar to include moar of course, but I think 10K sites that made it to that crawl list is still sumthing biggrin
>>
There may be leaked h4ck3d DNS zone lists somewhere on the net. Dunno if google can find it, only real haxxors know where to get them now probably.
https://zonefiles.io/

https://stackoverflow.com/questions/131989/how-do-i-get-a-list-of-all-subdomains-of-a-domain
>>
>>136855
WHAT SITE
I never knew it existed and now I never will cry
>>
File: 55sites.png (2 KB, 266x54)
55 sites, average is 4MB... in theory dark
probably doesn't help that my only available drive is an encrypted one intended for saving big videos (exfat), and it haets small icons
I should find a smaller drive to maek NTFS...
>>
imagine using teh proprietary dark
>>
>>137185
exfat genuinely does suck in general

all my external drives are NTFS, particularly since I do remember a time when exfat support was a lot iffier on Linux
although really, my really lazy bastard with a hammer instead of a drill solution would be a veracrypt file volume of 300MB and a lazy password lolol
>>
>>137120
i tried to use the list here but wget came up with weird errors when i tried to use it (´人`) just used the old one and let it 404 everything that wasn't live. since the dead sites just redirect to fc2's error page, wget seems to pick up on that and doesn't download them accordingly.

the userboard now has ojiji.net scraped: 152 sites attempted, 115 successfully scraped. it took like 2 hours to download the files from this http://shima.ojiji.net/ fucker.. but it was done! ヽ(´ー`)ノ their site alone took up like 100mb... was i archiving lhqhq?

https://up.heyuri.net/user/boards/jparc/

commenting on how fast it was, it seems xargs is really good at downloading lots of domains concurrently, but it handles each domain with a single wget process. turning the wait times down and giving it a random wait, alongside a little bump in dl speed, seems to have made it a little faster, but it's still overall slow. ChatGPT-san will have to answer how to make it faster...! (゚血゚#)
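
for reference, the shape of the command i mean (untested; GNU xargs, -P 8 = eight wget processes in parallel, -n 1 = one site per wget):
xargs -a list.txt -P 8 -n 1 wget --recursive --level=inf --page-requisites --convert-links --adjust-extension --no-parent --wait=0.3 --random-wait --limit-rate=512k -e robots=off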
>>
>>137193
General purpose fs'es do suck when you need to store millions of read-only, often small blobs.

>>137120
Fed it into my piece-of-shit crawler. Maybe it'll manage to save something.
>>
145 KB
86 GiB
2M files
2.2M urls crawled
1.8M urls in queue

why am I doing it
>>
>>136956
They use archive.org

