BugmenotEncore
08/23/25 07:50PM
HHub Archiver Development Log
As of yesterday, I believe the existence of this software requires no justification.

HHub may be a janky, broken pile of rubble, but it is our pile, and I'll be f'd before I let anyone take it away from us.

The HHub Archiver is intended to be intuitive to any user. Time is of the essence, so I will not bother with a graphical user interface - text will have to do. It is and will remain in plain English, though. So long as you can read, you can use it.

I'm determined to work on the project daily until it is in at least usable condition. This thread is meant to hold me to that and keep me honest. I'll also accept feature requests.

We'll call the current state of the script pre-alpha 1.

UI: Barebones. Currently lacks a main menu. It will have to be done for the public beta.
Media/MediaTags/Comments: Functional, but buggy. Some of the solutions already exist on the wiki's side of things, though; they just need to be moved over.
Wiki: Fully functional as of today.
Tags: Fully functional.
Artists: Not implemented.
Pools: Not implemented. Development starts today, hope to have good news tomorrow.
Forum: Not implemented.


Progress report:

- Found and implemented a workaround for PowerShell's aliasing system preventing certain types of quotation marks from being displayed or stored correctly. It runs at the same time as the adjustments for HTML's little idiosyncrasies.
The workaround is, by definition, not something PowerShell officially supports. As a result, it looks very, very cursed. But, as long as it works...
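For the curious, the shape of it is something like this - a simplified sketch, not the actual code (the real thing is uglier): build the curly quotes from their Unicode code points so the parser never sees them as literals, then normalize them in the same pass as the entity fixes.

```powershell
# Build the "smart" quotes from code points instead of typing them literally,
# so PowerShell's parser can't alias them into plain quotes behind our back.
$leftDouble  = [char]0x201C
$rightDouble = [char]0x201D
$leftSingle  = [char]0x2018
$rightSingle = [char]0x2019

function Repair-Quotes([string]$Text) {
    $Text = $Text.Replace($leftDouble, '"').Replace($rightDouble, '"')
    $Text = $Text.Replace($leftSingle, "'").Replace($rightSingle, "'")
    return $Text
}
```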

- The wiki's archival step now supports rare and non-Latin characters. A small issue in itself, as you'd rarely see Japanese characters and such there (misty's wiki page being an example), but user names and comments do have them with some frequency, so this will be much more useful later.
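The gist of the fix, sketched (simplified - the real code differs; in the script this wraps the bytes coming off Invoke-WebRequest's RawContentStream):

```powershell
function ConvertTo-Utf8Text([byte[]]$RawBytes) {
    # Reinterpret the raw response bytes as UTF-8, instead of trusting
    # whatever charset PowerShell guessed for the response.
    return [System.Text.Encoding]::UTF8.GetString($RawBytes)
}

# Demonstrated on a hardcoded sample instead of a live page:
$bytes = [System.Text.Encoding]::UTF8.GetBytes('みすちー nguyễn')
$text  = ConvertTo-Utf8Text $bytes
# Write explicitly as UTF-8 too, not the console default:
$text | Out-File -FilePath "$env:TEMP\wiki_sample.txt" -Encoding utf8
```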

- Increased the efficiency a little. A budget connection should handle the whole wiki in a few minutes, and it all fits into half a MB of plain text files.
BugmenotEncore
08/24/25 04:01PM
pre-alpha 2
UI: Barebones, though slightly less so. No menu yet.
Media/MediaTags/Comments: Functional, but buggy. No change since last time.
Wiki: Fully functional.
Tags: Fully functional. Will tighten up some loose ends here tomorrow.
Artists: Not implemented.
Pools: Functional. Hopefully not too buggy. Testing in progress.
Forum: Not implemented.

Progress report:

- Pool archival now possible.

This has been trickier than previous sections for several reasons. One is that pool data is actually stored on multiple pages per pool. Another is wrestling with regular expressions.

Mostly, the issue is that empty, deleted, edited and unedited pools all behave slightly differently. I'll do a lot more testing to ensure all the logic gates handle them as intended.

Interesting to note that ports of old pools from a previous version of the site did not properly sanitize HTML entities the way my script does.
Fortunately, both their pipeline and mine parse the entities as Unicode, so fixing their mistake is easy on my part - I just have to run my own fix twice.
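In script form, the fix really is just the same decoder run twice. A sketch (function name and sample text are illustrative):

```powershell
# Old pool imports were entity-encoded twice (e.g. &amp;quot; where &quot;
# should be), so running the same decoder twice flattens both layers.
function Repair-Entities([string]$Text) {
    return [System.Net.WebUtility]::HtmlDecode($Text)
}

$once  = Repair-Entities '&amp;quot;example&amp;quot;'   # -> &quot;example&quot;
$twice = Repair-Entities $once                           # -> "example"
```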

- Wiki text files are now named after their title, followed by their ID.

Previously, only the ID was used. This was convenient for debugging, but I realized in hindsight it's not user friendly. Now the file names can be easily parsed by the reader, or ordered alphabetically.
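Roughly, the naming logic (simplified sketch, not the exact code):

```powershell
function Get-WikiFileName([string]$Title, [int]$Id) {
    # Strip characters the filesystem refuses in file names (\, /, :, ?, etc.).
    $invalid   = [System.IO.Path]::GetInvalidFileNameChars() -join ''
    $safeTitle = $Title -replace "[$([regex]::Escape($invalid))]", '_'
    return "$safeTitle ($Id).txt"
}

Get-WikiFileName 'hypnosis' 123   # -> hypnosis (123).txt
```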

I was paranoid that introducing this feature would also come with a lot more problems, but it just forced me to tighten up some poor code. It ended up cleaner in the end, very happy with that.
R_of_Tetra
08/25/25 05:27PM
God speed, you magnificent bastard *Salute*

We will be waiting for your stable release with trepidation, mate.
Detour
08/25/25 06:31PM
BugmenotEncore said:
- Pool archival now possible.

This has been trickier than previous sections for several reasons. One is that pool data is actually stored on multiple pages per pool. Another is wrestling with regular expressions.


I thought pools display their full contents on a single page, are you talking about the API or favorites?
BugmenotEncore
08/25/25 06:39PM
pre-alpha 3
UI: Barebones. No change since last time.
Media/MediaTags/Comments: Functional, but buggy. No change since last time.
Wiki: Fully functional.
Tags: Fully functional. Updated, and will be again tomorrow.
Artists: Not implemented.
Pools: Functional. Testing in progress. Hope to implement full automation by tomorrow.
Forum: Not implemented.

Progress report:

- Tag archiver now properly line breaks after each processed page.

Nothing to say, a minor bugfix.

- Tag archiver now processes more rare characters.

When I started this one, I was naively under the impression that I only needed to process the occasional broken tag, so I'd be done in twenty minutes.
That naivety has been shattered by the number of letters in the Vietnamese alphabet, and the insistence of artists on being quirky. I've been doing this for three hours and I'm still not completely done. Help.
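For anyone fighting the same battle: .NET's regex engine (which -match uses) can match whole Unicode categories, which beats enumerating alphabets by hand. A sketch - the exact tag pattern here is illustrative, not the script's:

```powershell
# \p{L} = any Unicode letter, \p{M} = combining marks (diacritics stored as
# separate code points), \p{N} = digits. Covers Vietnamese and quirky artist
# names in one character class.
$tagPattern = '^[\p{L}\p{M}\p{N}_\-()]+$'

'nguyễn_thị' -match $tagPattern   # -> True
'Ōkami-san'  -match $tagPattern   # -> True
```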

Speaking of...

- anonlv000 may be joining the team

We're in talks currently. I cannot say anything definitive until we figure things out, but hopefully they can join up. Another hand on deck would make things go much faster.
BugmenotEncore
08/25/25 07:03PM
R_of_Tetra said:
God speed, you magnificent bastard *Salute*

We will be waiting for your stable release with trepidation, mate.


I shall not disappoint < - >7

Detour said:

I thought pools display their full contents on a single page.


Oh, if you are talking about just the media in them, sure. Pools also contain the title, the description, the edit history, the people who created and edited them... if you want all that, it will take 2-3 pages per pool.

Also, there is no integration between the pool archiving and image archiving script at the moment. All these sections are in their own neat little box for now, to make testing easier.

The current goal is to back everything up, so at least we will Have all the media and data archived, even if we cannot get it done on time for users to archive their desired stuffs.

Backing up images from specific pools (something already handled by your userscript) would be a distraction from this first goal. We want to be able to get All the media files. We can worry about range requests later.
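For context, each pool boils down to a small fixed set of pages. A sketch (the URL shapes are illustrative placeholders, not the site's actual routes):

```powershell
# The 2-3 pages that make up one pool's data. Each URL then goes through
# Invoke-WebRequest and the usual parsing.
function Get-PoolPageUrls([string]$BaseUrl, [int]$PoolId) {
    @(
        "$BaseUrl/pool/show/$PoolId"       # title, description, media list
        "$BaseUrl/pool/history/$PoolId"    # who created and edited it
    )
}

Get-PoolPageUrls 'https://example.org' 42
```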


Edit:
are you talking about the API or favorites?


If the API grants access to pools, the documentation sure does not say< w >7 I'd be happy to know how to get to it if you know.
R_of_Tetra
08/25/25 08:00PM
BugmenotEncore said:
The current goal is to back everything up, so at least we will Have all the media and data archived, even if we cannot get it done on time for users to archive their desired stuffs.


I am already moving to support this idea. I'm asking some contacts if they have spare old HDDs, possibly big ones, or if I can consider making a multi-HDD system for my old, rustbucket PC. Once that is set, thanks to my decent connection and your system, I can start backing up everything en masse to physical storage, which I can reupload in the future if the ship goes down and is floating again whoknowswhen. A choom of mine is already explaining how to use FreeNAS with a Raspberry as a central hub, so... good so far.

Will your system be able to estimate the actual space needed to save all, or at least most, of the media in here?
BugmenotEncore
08/25/25 09:03PM
R_of_Tetra said:
Will your system be able to estimate the actual space needed to save all, or at least most, of the media in here?


:thinking:

Let's do some rough maths for an upper estimate. Properly encoded pixelart, lineart, greyscale etc. would improve these numbers considerably, but let's assume all the encoding is garbage. We will also be rounding up for the final calculation.

There are four types of media files you can download from hypnohub: jpegs, gifs, pngs and mp4s.

the MP4s are the easiest to estimate, since HHub insists, against all common sense, on reencoding every video you upload to it. That means we know the maximum size of a video: roughly 55 MBs, recompressed from a 100 MB original, which is the upload limit.

Jpegs used to be the standard image format simply because the internet was much slower when the boorus got started. These images are 1 MB each at most.
These days they are usually used for massive absurdres images to improve loading times. There is no universe where a jpeg of this type is much larger than 10 MBs, so let's use that number as our ceiling.
There are also jpegs that should honestly be pngs but aren't, we will fold those in with the pngs further down.

Gifs are a tricky one. Your standard gif is capped at 256 colors per frame, and most stay at modest resolutions, but it is certainly Possible to push it. Let's assume all the absurdres gifs are 40 MBs. The others we assume are 15 MBs.


Pngs are the hardest to estimate because some people just refuse to properly compress their exports (I'm looking at you thehguy). The majority of them (along with the average jpeg that is a jpeg for no good reason) will be around 2-3 MBs, but let's say 5 MB just to be safe. Stacked image sequences and wallpapers are up to 30 MBs. And we will say the outliers are 70 MBs.

At time of writing, HHub has 246,952 media IDs. 27,436 are deleted, according to the API. So we have 219,516 files.

3709 are videos, or 200 GBs.
7427 are animated gifs, about 100 of those are absurdres. 112 GBs.

We have 208,380 images left. Let's say it's 20% classic jpegs, 15% modern jpegs, 50% normal pngs, 10% absurdres pngs, and the remaining 5% is no-compression png. That seems like a reasonable distribution.

~41,676 jpeg classic, for 41 GBs.
~31,257 jpeg modern, for 306 GBs.
~104,190 png average, for 509 GBs.
~20,838 png wallpaper, for 611 GBs.
~10,419 png uncompressed, for 713 GBs.

That adds up to ~2.4 terabytes.

Keeping in mind this is a very pessimistic estimation, since I've rounded everything up. It Should be lower than this.
But, even assuming the worst, as long as you have three terabytes of storage around, you should be fine.
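The same numbers in script form, to keep myself honest (MBs throughout, dividing by 1024 at the end):

```powershell
# Back-of-envelope storage ceiling, using the per-file sizes above.
$mb = (3709 * 55) +                 # videos at the reencode ceiling
      (7327 * 15) + (100 * 40) +    # gifs, regular + absurdres
      (41676 * 1) +                 # classic jpegs
      (31257 * 10) +                # modern absurdres jpegs
      (104190 * 5) +                # average pngs
      (20838 * 30) +                # wallpapers / stacked sequences
      (10419 * 70)                  # uncompressed outliers

$tb = [math]::Round($mb / 1024 / 1024, 2)
$tb   # -> 2.43
```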


R_of_Tetra
08/25/25 09:53PM
So far, I've pulled out:
- 4 external storage systems of 1 Terabyte each (well, not a precise Tera, because you know how they sell them, but it's still easily 3.50 Tera total). One currently needs to be moved somewhere else, because it keeps all my AI stuff (particularly WAN stuff and LLMs, which are chonky), in case free space is needed, but...
- 3 internal HDDs of 1 TB each, that need to be placed inside a rig, but I can easily arrange 2 into my old PC (the Orange Box <3).
- My main internal archive in my primary Rig still has 1.98 Terabytes of free space: I prefer to have a backup, in case, well, shit happens, but that one alone is colossal, and I don't use it that much anyway these days (STLs and stuff; I may consider moving some AI models in there and connecting the path to the main SSD to free up space, but... we will see).


So, in total, I have a potential of 7.9 Tera of free available space.

By your estimation, I believe we have a MAJOR margin of error tolerance. Kinda makes me less worried about this.
Detour
08/25/25 09:59PM
BugmenotEncore said:
If the API grants access to pools, the documentation sure does not say< w >7 I'd be happy to know how to get to it if you know.


No, I don't think it does; I was just confused about what you were referring to. But the rest of your post does explain it.
EdgeOfTheMoon
08/26/25 10:48AM
I remember playing around with the API a while back and it is frustratingly limited. Guessing Encore is using good old fashioned webpage scraping?

How tricky is HTML parsing in PowerShell? Last time I had to do HTML manipulation it was in JS, and I used DOMParser, which made life easier. Downside being, it's JS. And the only time I've used PowerShell, it was entirely to run Python scripts and UE builds.
BugmenotEncore
08/26/25 11:41AM
EdgeOfTheMoon said:
I remember playing around with the API a while back and it is frustratingly limited. Guessing Encore is using good old fashioned webpage scraping?


Indeed I am.

How tricky is HTML parsing in PowerShell?


Depends on what you use it for. PowerShell is built for automation. It's very good at telling other parts of your computer to perform simple, repetitive tasks. Whether those tasks happen to involve parsing HTML does not really make a difference.

Since this project needs exactly that, it works like a charm, really. The hardest part is using -replace to handle incidental hiccups.
If you tried doing something more involved, you'd have a mental breakdown several times before you finished, and it would run like a snail.
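To illustrate the kind of hiccup -replace mops up - the patterns here are made-up examples, not the script's actual rules:

```powershell
# Turn <br> back into line breaks, strip leftover tags, collapse the
# whitespace the page layout leaves behind.
$raw   = 'one<br/>two   <span>three</span>'
$clean = $raw -replace '<br\s*/?>', "`n" -replace '<[^>]+>', '' -replace '[ \t]{2,}', ' '
$clean   # "one", then a line break, then "two three"
```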
BugmenotEncore
08/26/25 09:10PM
pre-alpha 4
UI: Barebones, but slightly less so.
Media/MediaTags/Comments: Functional, but buggy. No change since last time.
Wiki: Fully functional. No change since last time.
Tags: Fully functional. Updated.
Artists: Not implemented.
Pools: Functional. Automation is now online.
Forum: Not implemented.

Progress report:

- Tag archiver now processes yet more rare characters.

A combined five hours of work, but it is done.
The conga line of nonsense has now been joined by a couple of kanji.

Pretty sure using syllabary characters in the tags is against the rules, but, fuck it. Script can handle them now, whether they are allowed or not.

- Pool archiving is now fully automated.

All a user has to do is start it, and the script will handle the rest.

I hope to make all of the script automated to this extent. I am not sure if this is even possible for the wiki, but I'm holding out hope I'll get an idea for it later down the line.
EdgeOfTheMoon
08/26/25 09:13PM
BugmenotEncore said:
PowerShell is built for automation. It's very good at telling other functions of your computer to perform simple, repetitive tasks.


Yeah, that's what I thought. As for parsing, I meant: do you have a library or something like DOMParser that lets you search through the DOM for particular elements? Or are you dealing with the HTML as one big string and pulling stuff out with RegExs?

I'm a game dev and not a web programmer so don't really know too much about this kinda thing so curious.
BugmenotEncore
08/26/25 09:17PM
EdgeOfTheMoon said:
Yeah, that's what I thought. As for parsing, I meant: do you have a library or something like DOMParser that lets you search through the DOM for particular elements? Or are you dealing with the HTML as one big string and pulling stuff out with RegExs?

I'm a game dev and not a web programmer so don't really know too much about this kinda thing so curious.


Keeping in mind that I make no claim my method is the optimal one, what I'm doing is Almost your second idea. Only difference is I break it down into separate strings using -split whenever it is beneficial to do so.

It is a big difference, though - this would take ages without those artificial line breaks.
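To illustrate (the delimiter here is hypothetical - the real script splits on whatever marker each page type offers):

```powershell
# Carve the page into one chunk per list item, then regex each small chunk
# instead of scanning the whole document at once.
$html   = '<li><a href="/a">A</a></li><li><a href="/b">B</a></li>'
$chunks = $html -split '(?=<li>)' | Where-Object { $_ }

$links = foreach ($chunk in $chunks) {
    if ($chunk -match 'href="([^"]+)"') { $Matches[1] }
}
$links   # -> /a and /b
```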