Hi, I'm Anya Jacoby! I'm currently working as a Compete Solutions Architect and AI/ML strategist. Here I write about cloud architecture, generative AI, and the evolving competitive landscape (or anything that really catches my eye).
I'm a seasoned AWS Solutions Architect specializing in AI/ML and competitive cloud strategy, with a proven track record of multi-cloud expertise and cross-functional leadership. My work sits at the intersection of deep technical architecture and strategic communication.
I bring deep expertise across AWS, Azure, OpenAI, Google Cloud Platform, NVIDIA, Databricks, and Snowflake. From Amazon Bedrock and agentic workflows to GPU-accelerated deep learning and competitive deal advisory — I've worked across the stack and across the competitive landscape.
Before joining AWS, I was an Embedded Software Engineer turned Data Scientist at Carrier Global Corporation. I hold dual master's degrees, one in Applied Data Science and Machine Learning with a focus on human-centered Artificial Intelligence and an MBA, as well as dual bachelor's degrees in Applied Mathematics and Economics, all from Syracuse University.
This site is where I publish my writing — long-form thinking on AI strategy, cloud architecture, and the ideas I find most worth sharing.
Whether you want to collaborate, discuss AI strategy, or just say hi, I'd love to hear from you. Shoot me a message!
Selected projects, case studies, and technical work.
Projects and portfolio pieces are being prepared. In the meantime, feel free to get in touch or read more about my background.
FEBRUARY 2026
I love an analogy so I’ll start that way: Let’s say you’re hiring a surgeon, and the decision rests entirely on their performance on a written multiple-choice exam. They ace it (maybe they memorized it overnight?), and you hire them. In the operating room, things go not-so-great fast, because cutting people open turns out to involve a different skill set than filling in bubbles.
This is, more or less, what the AI industry is doing right now with model evaluation.
Two recent pieces of work make the case, from different angles, that the problem is worse than most people realize.
The first is a February 2026 academic paper, “The Necessity of a Unified Framework for LLM-Based Agent Evaluation,” from researchers at SUNY and the University of Illinois (arxiv.org/abs/2602.03238). Its core argument is simple and damning: we have no reliable way to compare AI agents against each other, because every research team tests them differently. Different prompts, different tools, different environments; what looks like a model getting smarter might just be a researcher writing a better setup. When you can’t isolate the thing you’re trying to measure, the number you get back isn’t a measurement. It’s a guess-timate.
The second is Artificial Analysis, an independent benchmarking organization that tests every major AI model itself, on its own hardware, under the same conditions for everyone (x.com/ArtificialAnlys). No submitted scores from the labs, no cherry-picking. This matters because in late 2025 it came out that major AI labs had been submitting only their best results to public leaderboards, which (as you can probably “guess-timate”) inflated rankings by up to 100 points. Artificial Analysis’s approach, never letting anyone grade their own homework, turns out to be the bare minimum standard for honesty.
In the end, both pieces of work land on the same uncomfortable truth. Even careful, independent testing can’t fully solve the problem the academic paper identifies, because the issue isn’t just who runs the tests, but what the tests are actually measuring. An AI agent isn’t a calculator that spits out the same answer every time. It’s more like an employee whose performance depends on how you manage them, what tools you give them, and what you ask them to do. Change the instructions slightly and you get wildly different results from the exact same model. The score doesn’t travel.
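That variance is easy to show in miniature. The sketch below uses entirely made-up scores for a single model under five slightly different prompt phrasings; the point is only that an honest report would include the spread, not a single point estimate.

```python
import statistics

# Hypothetical scores for the SAME model on the SAME task,
# under five slightly different prompt phrasings (made-up numbers).
scores_by_prompt = [0.82, 0.61, 0.74, 0.55, 0.79]

mean = statistics.mean(scores_by_prompt)
spread = statistics.stdev(scores_by_prompt)

# A single leaderboard number hides this: report the range, not a point.
print(f"score: {mean:.2f} ± {spread:.2f} "
      f"(min {min(scores_by_prompt):.2f}, max {max(scores_by_prompt):.2f})")
```

A leaderboard that prints "0.82" for this model and stops there isn’t lying, exactly, but it’s reporting the best day, not the job.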
This is something I keep coming back to in my own work: the “best” model on a leaderboard is rarely the best model for your specific job. A smaller, faster, cheaper model with the right setup will outperform a frontier model with the wrong one almost every time. The benchmark tells you who won the standardized test. It says nothing about fit.
Despite the rambling, all of this means something practical. Every AI benchmark score you see, including the careful independent ones, is a screening tool, not a prediction! It tells you which models aren’t worth your time. It does not tell you which model will work for you.
What the field actually needs is something medicine figured out a long time ago (and what we like to scare ourselves with when we start feeling something... not good... cough, cough, WebMD): published protocols. Every AI evaluation should come with a full description of exactly how it was run: the prompt, the tools, the environment. That way, results can be compared and trusted. Without that, the leaderboards are just marketing with better fonts.
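As a sketch of what a published protocol could mean in practice (every field name here is my own hypothetical, not from either piece of work): record the full setup alongside the score, and only compare scores whose setups match.

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class EvalProtocol:
    """Everything needed to reproduce one agent evaluation run."""
    model: str
    system_prompt: str
    tools: tuple        # names of tools the agent could call
    environment: str    # sandbox image / simulator version
    temperature: float
    seed: int

    def fingerprint(self) -> str:
        """Stable hash of the full setup: two scores are only
        comparable when their fingerprints match."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]
```

Two runs with identical fingerprints are apples-to-apples; change even the prompt phrasing and the fingerprint changes, flagging the comparison as apples-to-oranges. That’s the whole discipline the leaderboards are missing.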
The scores will keep climbing, the gap between the number and reality will keep widening, and people deploying these systems will keep learning the hard way that the report card and the job are two completely different things.