Non paywalled link https://archive.is/VcoE1
It basically boils down to making the browser do some CPU-heavy calculations before allowing access. That's no problem for a single user, but for a bot farm it increases the compute they need by 100x or more.
Exactly. It’s called proof-of-work; it was originally invented to fight email spam and was later used by Bitcoin to regulate how quickly new blocks are mined.
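The hashcash-style idea the comments above describe can be sketched in a few lines (a simplified illustration of the scheme, not Anubis’s actual code; the function names are made up):

```python
import hashlib
import os

def solve_challenge(seed: str, difficulty: int) -> int:
    """Brute-force a nonce so sha256(seed + nonce) starts with
    `difficulty` hex zeros -- the expensive part the browser does."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{seed}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify(seed: str, difficulty: int, nonce: int) -> bool:
    """Checking an answer is a single hash -- cheap for the server."""
    digest = hashlib.sha256(f"{seed}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

seed = os.urandom(8).hex()  # server-issued, so answers can't be reused
nonce = solve_challenge(seed, difficulty=4)  # ~65k hashes on average
assert verify(seed, difficulty=4, nonce=nonce)
```

The asymmetry is the whole point: one visitor pays a fraction of a second once, while a scraper hammering millions of pages pays it millions of times.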
It’s funny that older captchas could be viewed as proof-of-work algorithms now, because image recognition is so good. (Speaking from experience using captchas.)
That’s actually a good idea. A very simple “click the frog” captcha might be solvable by an AI but it would work as a way to make it more expensive for crawlers without wasting compute resources (energy!) on the user or slowing down old devices to a crawl. So in some ways it could be a better alternative to Anubis.
Interesting stance. I have bought many tens of thousands of captcha solves for legitimate reasons, and I have now completely lost faith in them.
<Stupidquestion>
What advantage does this software provide over simply banning bots via robots.txt?
</Stupidquestion>
Robots.txt expects the client to respect the rules, for instance by identifying itself as a scraper.
AI scrapers don’t respect this trust, and thus robots.txt is meaningless.
The scrapers ignore robots.txt. It doesn’t really ban them; it just asks them not to access things, but they are programmed by assholes.
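For context, robots.txt is purely advisory. A hypothetical example shows there is no enforcement mechanism anywhere in it:

```text
# robots.txt is a polite request, not an access control.
# A compliant crawler reads this and stays out; a scraper
# that ignores it gets the content anyway.
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
```

Nothing stops a client from fetching `/` regardless, which is exactly the problem Anubis tries to solve.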
I’d like to use Anubis but the strange hentai character as a mascot is not too professional
Honestly, good. Getting sick of the “professional” world being so goddamn stiff and boring. Push back against sanitized corporate aesthetics.
Ooh can this work with Lemmy without affecting federation?
Yeah, it’s already deployed on slrpnk.net. I see it momentarily every time I load the site.
This is fantastic and I appreciate that it scales well on the server side.
Ai scraping is a scourge and I would love to know the collective amount of power wasted due to the necessity of countermeasures like this and add this to the total wasted by ai.
All this could be avoided by making people submit photo ID to log into an account.
I don’t think this would help:
By photo ID, I don’t mean just any photo; I mean a photo ID cryptographically signed by the state, certificates checked, database pinged, identity validated, the whole enchilada.
That would have the same effect as just taking the site offline…
No one is giving a random site their photo ID.
You’d be surprised; many humans simply have no backbone, common sense, or self-respect, so I think a large number very probably still would. Facebook and Palantir are proof.
I get that website admins are desperate for a solution, but Anubis is fundamentally flawed.
It is hostile to the user, because it is very slow on older hardware and forces you to use JavaScript.
It is bad for the environment, because it wastes energy on useless computations similar to mining crypto. If more websites start using this, that really adds up.
But most importantly, it won’t work in the end. These scraping tech companies have much deeper pockets and can use specialized hardware that is much more efficient at solving these challenges than a normal web browser.
I agree. When I run into a page that demands I turn on Javascript for whatever purpose, I usually just leave. I wish there was some way to just not even see links to sites that require this.
She’s working on a non-cryptographic challenge so it taxes users’ CPUs less, and is also thinking about a version that doesn’t require JavaScript.
Sounds like the developer of Anubis is aware and working on these shortcomings.
Still, IMO these are minor short term issues compared to the scope of the AI problem it’s addressing.
To be clear, I am not minimizing the problems of scrapers. I am merely pointing out that this strategy of proof-of-work has nasty side effects and we need something better.
These issues are not short term. PoW means you are entering into an arms race against an adversary with bottomless pockets that inherently requires a ton of useless computations in the browser.
When it comes to moving towards something based on heuristics, which is what the developer was talking about there, that is much better. But that is basically what many others are already doing (like the “I am not a robot” checkmark) and fundamentally different from the PoW that I argue against.
Go do heuristics, not PoW.
Youre more than welcome to try and implement something better.
“You criticize society yet you participate in it. Curious.”
A JavaScript-less check was released recently; I just read about it. It uses a meta refresh HTML tag and a delay. It’s not the default though, since it’s new.
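A rough sketch of how such a delay-based check could work server-side (my own illustration under assumptions, not Anubis’s actual implementation; `SECRET`, `DELAY`, and the token format are all hypothetical):

```python
import hashlib
import hmac
import time

SECRET = b"per-deployment secret"  # hypothetical server-side key
DELAY = 2                          # seconds the client must wait

def make_challenge(now=None):
    """Issue a signed timestamp token to embed in the challenge page."""
    ts = str(int(now if now is not None else time.time()))
    sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}.{sig}"

def challenge_page(token):
    # The browser just sits on this page; after DELAY seconds the
    # meta refresh sends it back to the real URL with the token.
    return (f'<meta http-equiv="refresh" '
            f'content="{DELAY}; url=/?token={token}">')

def verify(token, now=None):
    """Accept only if the signature matches and the delay has elapsed."""
    ts, _, sig = token.partition(".")
    expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    current = now if now is not None else time.time()
    return current - int(ts) >= DELAY
```

No JavaScript is needed: even text-mode browsers honor meta refresh, while a naive scraper that requests pages back-to-back fails the elapsed-time check.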
I don’t like it either, because my preferred way to use the web is either through the terminal or a very stripped-down browser. I HATE tracking and JS.
I don’t understand how/why this got so popular out of nowhere… the same solution has already existed for years in the form of haproxy-protection and a couple others… but nobody seems to care about those.
Probably because the creator had a blog post that got shared around at a point in time where this exact problem was resonating with users.
It’s not always about being first but about marketing.
And one has a cute catgirl mascot, the other a website that looks like a blockchain techbro startup.
I’m even willing to bet the number of people who set up Anubis just to get the cute splash screen isn’t insignificant. Compare and contrast.
High-performance traffic management and next-gen security with multi-cloud management and observability. Built for the enterprise — open source at heart.
Sounds like some over priced, vacuous, do-everything solution. Looks and sounds like every other tech website. Looks like it is meant to appeal to the people who still say “cyber”. Looks and sounds like fauxpen source.
Weigh the soul of incoming HTTP requests to protect your website!
Cute. Adorable. Baby girl. Protect my website. Looks fun. Has one clear goal.
Just recently there was a guy on the NANOG List ranting about Anubis being the wrong approach and people should just cache properly then their servers would handle thousands of users and the bots wouldn’t matter. Anyone who puts git online has no-one to blame but themselves, e-commerce should just be made cacheable etc. Seemed a bit idealistic, a bit detached from the current reality.
Someone making an argument like that clearly does not understand the situation. Just 4 years ago, a robots.txt was enough to keep most bots away, and hosting personal git on the web required very little resources. With AI companies actively profiting off stealing everything, a robots.txt doesn’t mean anything. Now, even a relatively small git web host takes an insane amount of resources. I’d know - I host a Forgejo instance. Caching doesn’t matter, because diffs between two random commits are likely unique. Ratelimiting doesn’t matter, because they will use different IPs (or IP ranges) and user agents. It would also heavily impact actual users, “because the site is busy”.
A proof-of-work solution like Anubis is the best we have currently. The least possible impact to end users, while keeping most (if not all) AI scrapers off the site.
This would not be a problem if one bot scraped once and the result were then mirrored to all on Big Tech’s dime (Cloudflare, Tailscale), but since they are all competing now, each thinks its edge will be its own better scraper setup, and they won’t share.
Maybe there should just be a web-to-torrent bridge, so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.
No, it’d still be a problem; every diff between commits is expensive to render to web, even if “only one company” is scraping it, “only one time”. Many of these applications are designed for humans, not scrapers.
If rendering data for scrapers were really the problem, then the solution is simple: just offer downloadable dumps of the publicly available information. That would be extremely efficient, cost fractions of a penny in monthly bandwidth, and the data would be far more usable for whatever they are using it for.
The problem is trying to have freely available data, but for the host to maintain the ability to leverage this data later.
I don’t think we can have both of these.















