Non paywalled link https://archive.is/VcoE1
It basically boils down to making the browser do some CPU-heavy calculations before allowing access. That's no problem for a single user, but for a bot farm it increases the compute they need by 100x or more.
Exactly. It’s called proof-of-work; it was originally invented to fight email spam and was later used by Bitcoin to regulate how quickly new blocks are mined.
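The hashcash-style idea the comments above describe can be sketched in a few lines (a simplified illustration of the scheme, not Anubis’s actual code; the function names are made up):

```python
import hashlib
import os

def solve_challenge(seed: str, difficulty: int) -> int:
    """Brute-force a nonce so sha256(seed + nonce) starts with
    `difficulty` hex zeros -- the expensive part the browser does."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{seed}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify(seed: str, difficulty: int, nonce: int) -> bool:
    """Checking an answer is a single hash -- cheap for the server."""
    digest = hashlib.sha256(f"{seed}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

seed = os.urandom(8).hex()  # server-issued, so answers can't be reused
nonce = solve_challenge(seed, difficulty=4)  # ~65k hashes on average
assert verify(seed, difficulty=4, nonce=nonce)
```

The asymmetry is the whole point: one visitor pays a fraction of a second once, while a scraper hammering millions of pages pays it millions of times.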
It’s funny that older captchas could be viewed as proof-of-work algorithms now, because image recognition is so good. (Speaking from experience using captchas.)
That’s actually a good idea. A very simple “click the frog” captcha might be solvable by an AI but it would work as a way to make it more expensive for crawlers without wasting compute resources (energy!) on the user or slowing down old devices to a crawl. So in some ways it could be a better alternative to Anubis.
Interesting stance. I have bought many tens of thousands of captcha solves for legitimate reasons, and I have now completely lost faith in them.
<Stupidquestion>
What advantage does this software provide over simply banning bots via robots.txt?
</Stupidquestion>
Robots.txt expects the client to respect the rules, for instance by identifying itself as a scraper.
AI scrapers don’t respect this trust, and thus robots.txt is meaningless.
The scrapers ignore robots.txt. It doesn’t really ban them; it just asks them not to access things, but they are programmed by assholes.
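For context, robots.txt is purely advisory. A hypothetical example shows there is no enforcement mechanism anywhere in it:

```text
# robots.txt is a polite request, not an access control.
# A compliant crawler reads this and stays out; a scraper
# that ignores it gets the content anyway.
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
```

Nothing stops a client from fetching `/` regardless, which is exactly the problem Anubis tries to solve.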
I’d like to use Anubis but the strange hentai character as a mascot is not too professional
Honestly, good. Getting sick of the “professional” world being so goddamn stiff and boring. Push back against sanitized corporate aesthetics.
Ooh can this work with Lemmy without affecting federation?
Yeah, it’s already deployed on slrpnk.net. I see it momentarily every time I load the site.
This is fantastic and I appreciate that it scales well on the server side.
Ai scraping is a scourge and I would love to know the collective amount of power wasted due to the necessity of countermeasures like this and add this to the total wasted by ai.
All this could be avoided by making people submit photo ID to log into an account.
I don’t think this would help:
By photo ID, I don’t mean just any photo; I mean a photo ID cryptographically signed by the state, certificates checked, database pinged, identity validated, the whole enchilada.
That would have the same effect as just taking the site offline…
No one is giving a random site their photo ID.
You’d be surprised; many humans simply have no backbone, common sense, or self-respect, so I think a large number very probably still would. Facebook and Palantir are proof.
I get that website admins are desperate for a solution, but Anubis is fundamentally flawed.
It is hostile to the user, because it is very slow on older hardware and forces you to use JavaScript.
It is bad for the environment, because it wastes energy on useless computations similar to mining crypto. If more websites start using this, that really adds up.
But most importantly, it won’t work in the end. These scraping tech companies have much deeper pockets and can use specialized hardware that is much more efficient at solving these challenges than a normal web browser.
I agree. When I run into a page that demands I turn on Javascript for whatever purpose, I usually just leave. I wish there was some way to just not even see links to sites that require this.
She’s working on a non-cryptographic challenge so it taxes users’ CPUs less, and is also thinking about a version that doesn’t require JavaScript.
Sounds like the developer of Anubis is aware and working on these shortcomings.
Still, IMO these are minor short term issues compared to the scope of the AI problem it’s addressing.
To be clear, I am not minimizing the problems of scrapers. I am merely pointing out that this strategy of proof-of-work has nasty side effects and we need something better.
These issues are not short term. PoW means you are entering into an arms race against an adversary with bottomless pockets that inherently requires a ton of useless computations in the browser.
When it comes to moving towards something based on heuristics, which is what the developer was talking about there, that is much better. But that is basically what many others are already doing (like the “I am not a robot” checkmark) and fundamentally different from the PoW that I argue against.
Go do heuristics, not PoW.
Youre more than welcome to try and implement something better.
“You criticize society yet you participate in it. Curious.”
A JavaScript-less check was released recently; I just read about it. It uses a meta refresh HTML tag and a delay. It’s not the default though, since it’s new.
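A rough sketch of how such a delay-based check could work server-side (my own illustration under assumptions, not Anubis’s actual implementation; `SECRET`, `DELAY`, and the token format are all hypothetical):

```python
import hashlib
import hmac
import time

SECRET = b"per-deployment secret"  # hypothetical server-side key
DELAY = 2                          # seconds the client must wait

def make_challenge(now=None):
    """Issue a signed timestamp token to embed in the challenge page."""
    ts = str(int(now if now is not None else time.time()))
    sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}.{sig}"

def challenge_page(token):
    # The browser just sits on this page; after DELAY seconds the
    # meta refresh sends it back to the real URL with the token.
    return (f'<meta http-equiv="refresh" '
            f'content="{DELAY}; url=/?token={token}">')

def verify(token, now=None):
    """Accept only if the signature matches and the delay has elapsed."""
    ts, _, sig = token.partition(".")
    expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    current = now if now is not None else time.time()
    return current - int(ts) >= DELAY
```

No JavaScript is needed: even text-mode browsers honor meta refresh, while a naive scraper that requests pages back-to-back fails the elapsed-time check.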
I don’t like it either, because my preferred way to use the web is either through the terminal or a very stripped-down browser. I HATE tracking and JS.
I don’t understand how/why this got so popular out of nowhere… the same solution has already existed for years in the form of haproxy-protection and a couple others… but nobody seems to care about those.
Probably because the creator had a blog post that got shared around at a point in time where this exact problem was resonating with users.
It’s not always about being first but about marketing.
And one has a cute catgirl mascot, the other a website that looks like a blockchain techbro startup.
I’m even willing to bet the number of people who set up Anubis just to get the cute splash screen isn’t insignificant. Compare and contrast.
High-performance traffic management and next-gen security with multi-cloud management and observability. Built for the enterprise — open source at heart.
Sounds like some over priced, vacuous, do-everything solution. Looks and sounds like every other tech website. Looks like it is meant to appeal to the people who still say “cyber”. Looks and sounds like fauxpen source.
Weigh the soul of incoming HTTP requests to protect your website!
Cute. Adorable. Baby girl. Protect my website. Looks fun. Has one clear goal.
Just recently there was a guy on the NANOG List ranting about Anubis being the wrong approach and people should just cache properly then their servers would handle thousands of users and the bots wouldn’t matter. Anyone who puts git online has no-one to blame but themselves, e-commerce should just be made cacheable etc. Seemed a bit idealistic, a bit detached from the current reality.
Someone making an argument like that clearly does not understand the situation. Just 4 years ago, a robots.txt was enough to keep most bots away, and hosting personal git on the web required very little resources. With AI companies actively profiting off stealing everything, a robots.txt doesn’t mean anything. Now, even a relatively small git web host takes an insane amount of resources. I’d know - I host a Forgejo instance. Caching doesn’t matter, because diffs between two random commits are likely unique. Ratelimiting doesn’t matter, because they will use different IPs (or IP ranges) and user agents. It would also heavily impact actual users, “because the site is busy”.
A proof-of-work solution like Anubis is the best we have currently. The least possible impact to end users, while keeping most (if not all) AI scrapers off the site.
This would not be a problem if one bot scraped once and the result were then mirrored to all on Big Tech’s dime (Cloudflare, Tailscale), but since they are all competing now, each thinks its edge will be its own better scraper setup, and they won’t share.
Maybe there should just be a web-to-torrent bridge, so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.
No, it’d still be a problem; every diff between commits is expensive to render to web, even if “only one company” is scraping it, “only one time”. Many of these applications are designed for humans, not scrapers.
If rendering data for scrapers were really the problem, then the solution is simple: just offer downloadable dumps of the publicly available information. That would be extremely efficient, cost fractions of a penny in monthly bandwidth, and the data would be far more usable for whatever they are using it for.
The problem is trying to have freely available data, but for the host to maintain the ability to leverage this data later.
I don’t think we can have both of these.















