• unexposedhazard@discuss.tchncs.de · 9 months ago (edited)

    Non paywalled link https://archive.is/VcoE1

    It basically boils down to making the browser do some CPU-heavy calculations before allowing access. This is no problem for a single user, but for a bot farm it would increase the amount of compute power they need 100x or more.
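    The scheme can be sketched roughly like this (a hashcash-style toy for illustration, not Anubis's actual implementation; the function names are made up):

```python
import hashlib
import secrets

def solve_challenge(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce so that sha256(challenge + nonce) starts with
    `difficulty` zero hex digits (hashcash-style proof of work)."""
    nonce = 0
    target = "0" * difficulty
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Checking a solution costs a single hash -- cheap for the server --
    while finding it cost the client ~16**difficulty hashes on average."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

challenge = secrets.token_hex(8)  # server-issued random challenge
nonce = solve_challenge(challenge, difficulty=4)
assert verify(challenge, nonce, difficulty=4)
```

    One human loading one page barely notices the delay; a scraper fetching millions of pages pays it millions of times over.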

    • Mubelotix@jlai.lu · 9 months ago

      Exactly. It’s called proof-of-work. It was originally invented to reduce email spam, and was later adopted by Bitcoin to regulate the rate at which new blocks are created.

      • JackbyDev@programming.dev · 9 months ago

        It’s funny that older captchas could be viewed as proof-of-work algorithms now, because image recognition has gotten so good.

        • Ferk@lemmy.ml · 9 months ago (edited)

          That’s actually a good idea. A very simple “click the frog” captcha might be solvable by an AI, but it would still make crawling more expensive without wasting compute resources (energy!) on the user or slowing old devices to a crawl. So in some ways it could be a better alternative to Anubis.

        • Mubelotix@jlai.lu · 9 months ago (edited)

          Interesting stance. I have bought many tens of thousands of captcha solves for legitimate reasons, and I have now completely lost faith in them.

  • medem@lemmy.wtf · 9 months ago

    <Stupidquestion>

    What advantage does this software provide over simply banning bots via robots.txt?

    </Stupidquestion>

    • kcweller@feddit.nl · 9 months ago

      robots.txt expects the client to respect the rules, for instance by identifying itself as a scraper.

      AI scrapers don’t respect this trust, and thus robots.txt is meaningless.
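      For context, robots.txt is purely advisory; there is nothing enforcing it. A hypothetical example:

```
# robots.txt -- a polite request, not an enforcement mechanism.
# A compliant crawler identifying as GPTBot stays out entirely:
User-agent: GPTBot
Disallow: /

# Everyone else is asked (not forced) to slow down:
User-agent: *
Crawl-delay: 10
```

      A scraper that simply lies about its user agent, or never fetches robots.txt at all, is completely unaffected.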

  • fox2263@lemmy.world · 9 months ago

    I’d like to use Anubis, but the strange hentai character as a mascot isn’t very professional.

  • fuzzy_tinker@lemmy.world · 9 months ago

    This is fantastic and I appreciate that it scales well on the server side.

    AI scraping is a scourge. I would love to know the collective amount of power wasted on countermeasures like this, so it could be added to the total wasted by AI.

  • koper@feddit.nl · 9 months ago

    I get that website admins are desperate for a solution, but Anubis is fundamentally flawed.

    It is hostile to the user, because it is very slow on older hardware and forces you to use JavaScript.

    It is bad for the environment, because it wastes energy on useless computations similar to mining crypto. If more websites start using this, that really adds up.

    But most importantly, it won’t work in the end. These scraping tech companies have much deeper pockets and can use specialized hardware that is much more efficient at solving these challenges than a normal web browser.

    • swelter_spark@reddthat.com · 9 months ago

      I agree. When I run into a page that demands I turn on JavaScript for whatever purpose, I usually just leave. I wish there were some way to not even see links to sites that require this.

    • Luke@lemmy.ml · 9 months ago

      she’s working on a non cryptographic challenge so it taxes users’ CPUs less, and also thinking about a version that doesn’t require JavaScript

      Sounds like the developer of Anubis is aware and working on these shortcomings.

      Still, IMO these are minor short term issues compared to the scope of the AI problem it’s addressing.

      • koper@feddit.nl · 9 months ago (edited)

        To be clear, I am not minimizing the problems of scrapers. I am merely pointing out that this strategy of proof-of-work has nasty side effects and we need something better.

        These issues are not short term. PoW means you are entering into an arms race against an adversary with bottomless pockets that inherently requires a ton of useless computations in the browser.

        Moving towards something based on heuristics, which is what the developer was talking about there, would be much better. But that is basically what many others are already doing (like the “I am not a robot” checkbox), and it is fundamentally different from the PoW that I argue against.

        Go do heuristics, not PoW.

    • seang96@spgrn.com · 9 months ago

      A JavaScript-less check was released recently; I just read about it. It uses a meta refresh HTML tag and a delay. It’s not the default though, since it’s new.
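      I haven’t dug into how Anubis actually implements it, but the general shape of a meta-refresh challenge is something like this (the token and URL are illustrative):

```html
<!-- Interstitial page served instead of the real content.
     A real browser waits the 5 seconds and then follows the refresh,
     presenting the server-issued token back; the server rejects any
     client that returns the token faster than the delay allows. -->
<meta http-equiv="refresh" content="5; url=/?challenge=TOKEN">
```

      Scripted scrapers that ignore meta refresh never get past the interstitial, and no JavaScript is needed on the client.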

    • Spice Hoarder@lemmy.zip · 9 months ago

      I don’t like it either, because my preferred way to use the web is either through the terminal or a very stripped-down browser. I HATE tracking and JS.

  • refalo@programming.dev · 9 months ago (edited)

    I don’t understand how/why this got so popular out of nowhere… the same solution has existed for years in the form of haproxy-protection and a couple of others… but nobody seems to care about those.

    • Flipper@feddit.org · 9 months ago

      Probably because the creator had a blog post that got shared around at a point in time where this exact problem was resonating with users.

      It’s not always about being first but about marketing.

        • JackbyDev@programming.dev · 9 months ago

          Compare and contrast.

          High-performance traffic management and next-gen security with multi-cloud management and observability. Built for the enterprise — open source at heart.

          Sounds like some overpriced, vacuous, do-everything solution. Looks and sounds like every other tech website. Looks like it is meant to appeal to the people who still say “cyber”. Looks and sounds like fauxpen source.

          Weigh the soul of incoming HTTP requests to protect your website!

          Cute. Adorable. Baby girl. Protect my website. Looks fun. Has one clear goal.

  • Kazumara@discuss.tchncs.de · 9 months ago

    Just recently there was a guy on the NANOG list ranting that Anubis is the wrong approach: if people just cached properly, their servers would handle thousands of users and the bots wouldn’t matter; anyone who puts git online has no one to blame but themselves; e-commerce should just be made cacheable; etc. It seemed a bit idealistic, a bit detached from current reality.

    Ah found it, here

    • deadcade@lemmy.deadca.de · 9 months ago

      Someone making an argument like that clearly does not understand the situation. Just 4 years ago, a robots.txt was enough to keep most bots away, and hosting personal git on the web required very few resources. With AI companies actively profiting off stealing everything, a robots.txt doesn’t mean anything. Now even a relatively small git web host takes an insane amount of resources. I’d know - I host a Forgejo instance. Caching doesn’t matter, because diffs between two random commits are likely unique. Ratelimiting doesn’t matter, because they will use different IP (ranges) and user agents, and it would heavily impact actual users “because the site is busy”.
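      To put a rough number on why caching can’t keep up: the set of possible diff views grows quadratically with the number of commits (a back-of-the-envelope sketch, not Forgejo’s actual internals):

```python
from math import comb

def diff_pages(commits: int) -> int:
    """Number of distinct two-commit diff views a scraper could request."""
    return comb(commits, 2)  # n choose 2 unordered commit pairs

# A modest 10,000-commit repository already exposes ~50 million
# unique diff pages -- far too many to pre-render or cache.
print(diff_pages(10_000))  # 49995000
```

      And scrapers really do crawl these long-tail URLs, so nearly every request is a cache miss that has to be rendered from scratch.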

      A proof-of-work solution like Anubis is the best we have currently. The least possible impact to end users, while keeping most (if not all) AI scrapers off the site.

      • interdimensionalmeme@lemmy.ml · 9 months ago

        This would not be a problem if one bot scraped once and the result were then mirrored to everyone on Big Tech’s dime (Cloudflare, Tailscale). But since they are all competing now, each thinks its edge will be its own better scraper setup, and they won’t share.

        Maybe there should just be a web-to-torrent bridge, so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.

        • deadcade@lemmy.deadca.de · 9 months ago

          No, it’d still be a problem; every diff between commits is expensive to render to web, even if “only one company” is scraping it, “only one time”. Many of these applications are designed for humans, not scrapers.

          • interdimensionalmeme@lemmy.ml · 9 months ago

            If rendering data for scrapers were really the problem, then the solution is simple: just offer downloadable dumps of the publicly available information. That would be extremely efficient and cost fractions of a penny in monthly bandwidth. Plus, the data would be far more usable for whatever they are using it for.

            The problem is trying to have freely available data, but for the host to maintain the ability to leverage this data later.

            I don’t think we can have both of these.