• daniskarma@lemmy.dbzer0.com
    1 day ago

    Why would they request the same data so many times a day if the objective were AI model training? It makes zero sense.

    Also, Google's bots obey robots.txt, so they are easy to manage.
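As a rough illustration of what "obeying robots.txt" means for a well-behaved crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules, the `/search/` path, and the `example.org` URLs are made up for the example; they are not anyone's real configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out of /search/, allow everything else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /search/

User-agent: *
Disallow:
"""


def can_fetch(agent: str, url: str) -> bool:
    """Return True if the rules above let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)
```

A compliant crawler would run a check like `can_fetch("Googlebot", "https://example.org/search/results")` before requesting the page and skip it when the answer is `False`. Nothing forces a bot to do this, which is the catch with crawlers that ignore the file.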

    There may be tons of reasons Google is crawling your website, from ad research to any other kind of research. The only AI-related use I can think of is RAG. But that would take some user requests away, because if users got the info through Google's AI response they would not visit the website. I suppose that would suck for the website owner, but it won’t drastically increase the number of requests.

    But for training I don’t see it. There’s no need at all to keep constantly scraping the same website for model training.

    • grysbok@lemmy.sdf.org
      1 day ago

      Like I said, [edit: at one point] Facebook requested my robots.txt multiple times a second. You’ve not convinced me that bot writers care about efficiency.

      [edit: they’ve since stopped, possibly because I now give a 404 to anything claiming to be from Facebook]
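For readers wondering what "give a 404 to anything claiming to be from Facebook" could look like, here is a minimal sketch with Python's standard `http.server`. It is not the poster's actual setup (that is not described in the thread); the marker list, `status_for`, `Handler`, and `serve` are names chosen for this example, and the marker strings are assumptions about which User-Agent substrings to match.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional

# Assumed substrings that identify Facebook/Meta crawlers in the User-Agent header.
BLOCKED_AGENT_MARKERS = ("facebookexternalhit", "meta-externalagent")


def status_for(user_agent: Optional[str]) -> int:
    """Return 404 for user agents matching a blocked marker, else 200."""
    ua = (user_agent or "").lower()
    return 404 if any(marker in ua for marker in BLOCKED_AGENT_MARKERS) else 200


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Decide the response purely from the self-reported User-Agent.
        status = status_for(self.headers.get("User-Agent"))
        body = b"Not Found\n" if status == 404 else b"OK\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the demo server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` starts the demo on port 8000. In practice this kind of rule usually lives in the web server or reverse proxy config rather than in application code, and it only catches bots that identify themselves honestly.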

      • The Quuuuuill@slrpnk.net
        1 day ago

        You’ve not convinced me that bot writers care about efficiency.

        And why should bot writers care about efficiency when what they really care about is time? They’ll burn all your resources without regard, simply because they’re not the ones paying.

        • grysbok@lemmy.sdf.org
          1 day ago

          Yep, they’ll just burn taxpayer resources (me and my poor servers) because it’s not like they pay taxes anyway (assuming they are either a corporation or not based in the same locality as I am).

          There’s only one of me, and if I’m working on keeping the servers bare-minimum functional today, I’m not working on making something more awesome for tomorrow. “Linux sysadmin” is only supposed to be up to 30% of my job.

          • grysbok@lemmy.sdf.org
            1 day ago

            I mean, I enjoy Linux sysadmining, but fighting bots takes time, experimentation, and research, and there’s other stuff I should be doing, such as accessibility updates to our websites. But accessibility doesn’t matter a lick if you can’t access the website anyway due to timeouts.