This is pretty hilarious. Here is a link to the actual benchmark paper, where they gave several LLM agents an ongoing virtual vending machine business to run. Everything is simulated, but the LLMs had to search the web, decide which products to buy, order them, keep costs and profit in mind, and basically manage the business. Their results were also compared to those of actual humans. Also, here is the leaderboard showing how the different LLMs did, and you can try a shortened version if you want to manage the vending machine business yourself.

  • 🍪CRUMBGRABBER🍪@lemm.ee (OP)
    2 days ago

    To be fair, some of the LLMs, like Claude, were more profitable than the humans. The average human made 800 bucks in this business and one of the latest Claude models made 2700, so it searched, picked its inventory well, and achieved results. The only thing is that humans were profitable 100 percent of the time, and if they experienced existential dread, they were good at hiding it from HR.

    • SpikesOtherDog@ani.social
      2 days ago

      To be clear, that was the BEST run, when it wasn’t attempting to email the FBI or the fundamental laws of the universe.

    • Brkdncr@lemmy.world
      2 days ago

      In a few years, matching tiles containing crosswalks will be replaced with how well you can fake being content.