Hi, I'm Anya Jacoby! I'm currently working as a Compete Solutions Architect and AI/ML strategist. Here I write about cloud architecture, generative AI, and the evolving competitive landscape (or anything that really catches my eye).
I'm a seasoned AWS Solutions Architect specializing in AI/ML and competitive cloud strategy, with a proven track record of multi-cloud expertise and cross-functional leadership. My work sits at the intersection of deep technical architecture and strategic communication.
I bring deep expertise across AWS, Azure, OpenAI, Google Cloud Platform, NVIDIA, Databricks, and Snowflake. From Amazon Bedrock and agentic workflows to GPU-accelerated deep learning and competitive deal advisory — I've worked across the stack and across the competitive landscape.
Before joining AWS, I was an Embedded Software Engineer turned Data Scientist at Carrier Global Corporation. I hold dual master's degrees, one in Applied Data Science and Machine Learning with a focus on human-centered Artificial Intelligence and an MBA, as well as dual bachelor's degrees in Applied Mathematics and Economics, all from Syracuse University.
This site is where I publish my writing — long-form thinking on AI strategy, cloud architecture, and the ideas I find most worth sharing.
Whether you want to collaborate, discuss AI strategy, or just say hi, I'd love to hear from you. Shoot me a message!
Selected projects, case studies, and technical work.
Projects and portfolio pieces are being prepared. In the meantime, feel free to get in touch or read more about my background.
FEBRUARY 2026
I love an analogy so I’ll start that way: Let’s say you’re hiring a surgeon, and the decision rests entirely on their performance on a written multiple-choice exam. They ace it (maybe they memorized it overnight?), and you hire them. In the operating room, things go not-so-great fast, because cutting people open turns out to involve a different skill set than filling in bubbles.
This is, more or less, what the AI industry is doing right now with model evaluation.
Two recent pieces of work make the case, from different angles, that the problem is worse than most people realize.
The first is a February 2026 academic paper, “The Necessity of a Unified Framework for LLM-Based Agent Evaluation,” from researchers at SUNY and the University of Illinois (arxiv.org/abs/2602.03238). Its core argument is simple and damning: we have no reliable way to compare AI agents against each other, because every research team tests them differently. Different prompts, different tools, different environments; what looks like a model getting smarter might just be a researcher writing a better setup. When you can’t isolate the thing you’re trying to measure, the number you get back isn’t a measurement. It’s a guess-timate.
The second is Artificial Analysis, an independent benchmarking organization that tests every major AI model itself, on its own hardware, under the same conditions for everyone (x.com/ArtificialAnlys). No submitted scores from the labs, no cherry-picking. This matters because in late 2025 it came out that major AI labs had been submitting only their best results to public leaderboards, which (as you can probably “guess-timate”) inflated rankings by up to 100 points. Artificial Analysis’s approach, never letting anyone grade their own homework, turns out to be the bare minimum standard for honesty.
In the end, both pieces of work land on the same uncomfortable truth. Even careful, independent testing can’t fully solve the problem the academic paper identifies, because the issue isn’t just who runs the tests, but what the tests are actually measuring. An AI agent isn’t a calculator that spits out the same answer every time. It’s more like an employee whose performance depends on how you manage them, what tools you give them, and what you ask them to do. Change the instructions slightly and you get wildly different results from the exact same model. The score doesn’t travel.
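That variance is easy to show in miniature. The sketch below uses entirely made-up scores for a single model under five slightly different prompt phrasings; the point is only that an honest report would include the spread, not a single point estimate.

```python
import statistics

# Hypothetical scores for the SAME model on the SAME task,
# under five slightly different prompt phrasings (made-up numbers).
scores_by_prompt = [0.82, 0.61, 0.74, 0.55, 0.79]

mean = statistics.mean(scores_by_prompt)
spread = statistics.stdev(scores_by_prompt)

# A single leaderboard number hides this: report the range, not a point.
print(f"score: {mean:.2f} ± {spread:.2f} "
      f"(min {min(scores_by_prompt):.2f}, max {max(scores_by_prompt):.2f})")
```

A leaderboard that prints "0.82" for this model and stops there isn’t lying, exactly, but it’s reporting the best day, not the job.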
This is something I keep coming back to in my own work: the “best” model on a leaderboard is rarely the best model for your specific job. A smaller, faster, cheaper model with the right setup will outperform a frontier model with the wrong one almost every time. The benchmark tells you who won the standardized test. It says nothing about fit.
Despite the rambling, all of this means something practical. Every AI benchmark score you see, including the careful independent ones, is a screening tool, not a prediction! It tells you which models aren’t worth your time. It does not tell you which model will work for you.
What the field actually needs is something medicine figured out a long time ago (and what we like to scare ourselves with when we start feeling something... not good... cough, cough, WebMD): published protocols. Every AI evaluation should come with a full description of exactly how it was run: the prompt, the tools, the environment. That way, results can be compared and trusted. Without that, the leaderboards are just marketing with better fonts.
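As a sketch of what a published protocol could mean in practice (every field name here is my own hypothetical, not from either piece of work): record the full setup alongside the score, and only compare scores whose setups match.

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class EvalProtocol:
    """Everything needed to reproduce one agent evaluation run."""
    model: str
    system_prompt: str
    tools: tuple        # names of tools the agent could call
    environment: str    # sandbox image / simulator version
    temperature: float
    seed: int

    def fingerprint(self) -> str:
        """Stable hash of the full setup: two scores are only
        comparable when their fingerprints match."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]
```

Two runs with identical fingerprints are apples-to-apples; change even the prompt phrasing and the fingerprint changes, flagging the comparison as apples-to-oranges. That’s the whole discipline the leaderboards are missing.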
The scores will keep climbing, the gap between the number and reality will keep widening, and people deploying these systems will keep learning the hard way that the report card and the job are two completely different things.