How I went from SDET to AI Test Engineer — the honest timeline

I spent my first year in tech as an SDET at a mid-size product company in Bangalore. Test cases, regression suites, Selenium frameworks, bug reports filed into Jira. Good work. Honest work. The kind of work where you know at the end of the day whether you did it well or not.

Then AI started showing up in the product roadmap.

Not AI as a feature — AI as the feature. “We’re integrating an LLM here.” “The chatbot will be powered by GPT.” “We want to test the AI assistant before we ship it to customers.” That last sentence landed on my desk.

I had no idea how to do it.

The First Three Months: Mostly Flailing

My instinct was to Google “how to test AI” and find some tutorial. The tutorials were… fine. They talked about bias testing, fairness metrics, adversarial inputs. All real things. None of them told me what I actually needed to know: how do you test a system that doesn’t return the same answer twice?

I tried the obvious thing first — wrote some Selenium-style tests. The test passes a question to the chatbot and checks if the response contains certain keywords. This works until you realize that “contains the word ‘refund’” is not the same as “correctly handles a refund query.” And LLMs are inconsistent enough that the test fails one run in ten for no apparent reason.
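To make that gap concrete, here is a minimal sketch of the kind of keyword check I mean. The function name `keyword_check` and both sample responses are invented for illustration, and the example assumes the real policy does allow refunds.

```python
def keyword_check(response: str, keywords: list[str]) -> bool:
    """Pass if every keyword appears in the response (case-insensitive)."""
    text = response.lower()
    return all(k.lower() in text for k in keywords)


# Hypothetical chatbot responses to "Can I get my money back?"
# Assumption: the actual policy allows refunds.
wrong_but_passes = "We do not offer a refund under any circumstances."
right_but_fails = "Yes, we will return your payment within 5 days."

print(keyword_check(wrong_but_passes, ["refund"]))  # True
print(keyword_check(right_but_fails, ["refund"]))   # False
```

The check rewards the wrong answer for using the right word and punishes the right answer for paraphrasing, which is the mismatch between "contains a keyword" and "handles the query".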

The first tests I wrote were largely useless. I knew it while I was writing them. I shipped them anyway because something felt better than nothing.

Looking back: this is normal. Everyone’s first eval suite is useless. The important thing is noticing it’s useless and asking why.

The Question That Changed Things

About four months into the AI testing project, a product manager asked me: “Is the chatbot getting better or worse?”

I didn’t have an answer.

Not “I don’t know the exact number” — I genuinely had no systematic way to answer the question. The tests I’d written could tell me if specific things broke. They couldn’t tell me about overall quality trajectory.

This is the moment I started taking eval seriously as a discipline. Not testing in the traditional sense — evaluation as measurement of quality across a distribution of inputs. The difference sounds subtle. It isn’t.

To answer “is it getting better or worse,” you need:

A representative sample of inputs
A definition of “better” that’s measurable
A way to apply that measurement consistently

This is the core problem of LLM evaluation. I had none of these things, and I hadn’t noticed because I was doing testing, not evaluation.
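Those three pieces map fairly directly onto code. The sketch below is one possible shape, not the pipeline I actually built: `EvalCase`, `token_overlap`, `run_eval`, and `fake_system` are all hypothetical names, and the word-overlap metric is a deliberately crude placeholder for whatever definition of "better" fits your product.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class EvalCase:
    question: str   # one input from the representative sample
    reference: str  # what a good answer must convey


def token_overlap(answer: str, reference: str) -> float:
    """Placeholder metric: share of reference words found in the answer."""
    ref = set(reference.lower().split())
    ans = set(answer.lower().split())
    return len(ref & ans) / len(ref) if ref else 0.0


def run_eval(cases: list[EvalCase],
             system: Callable[[str], str],
             metric: Callable[[str, str], float]) -> float:
    """Apply the same metric to every case and return the mean score."""
    return mean(metric(system(c.question), c.reference) for c in cases)


cases = [
    EvalCase("How long do refunds take?", "refunds take 5 business days"),
    EvalCase("Can I change my plan?", "plans can be changed in settings"),
]


def fake_system(question: str) -> str:
    """Stand-in for the real chatbot; always gives the same answer."""
    return "refunds take 5 business days"


score = run_eval(cases, fake_system, token_overlap)
print(score)  # 0.5
```

The point is the structure: a fixed set of cases, one metric, one number per run. Once that exists, "better or worse" becomes a comparison between two runs instead of a guess.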

Learning the Actual Skills

Once I had the right framing, the learning path clarified.

I needed to understand: what are the actual failure modes of LLM systems? Hallucination, yes — but specifically what? When a RAG system makes something up, it’s because the generated text wasn’t grounded in the retrieved context. That’s a measurable property: faithfulness. When a response doesn’t address the actual question, that’s relevancy. When a retrieval step returns documents that aren’t useful for the query, that’s poor context precision.
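As a rough illustration of faithfulness as a number, the sketch below scores an answer as the fraction of its claims that the retrieved context supports. The claims and their verdicts are hand-labeled and hypothetical; frameworks typically use an LLM to extract and check the claims, and that part is not shown here.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # no claims to check; scoring this as 0 is an assumption
    return sum(claim_supported) / len(claim_supported)


# Hypothetical answer split into claims, each hand-labeled against the context.
verdicts = {
    "Refunds are processed in 5 business days": True,
    "Refunds go back to the original payment method": True,
    "Refunds include a 10% bonus credit": False,  # not in the context
}

score = faithfulness(list(verdicts.values()))
print(round(score, 2))  # 0.67
```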

These aren’t entirely new concepts. The retrieval-side metrics in particular come from information retrieval research that predates LLMs by decades: precision, recall, F1, repurposed for a new domain.
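For anyone who hasn’t worked with them, here is a small worked example of precision, recall, and F1 applied to a retrieval step. It uses the classic set-based definitions; the document IDs are made up, and rank-aware variants used by some eval frameworks will give different numbers.

```python
def precision_recall_f1(retrieved: set[str],
                        relevant: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical query: 4 documents retrieved, 3 documents actually relevant.
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc1", "doc3", "doc7"}

p, r, f1 = precision_recall_f1(retrieved, relevant)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.5 0.67 0.57
```

Two of the four retrieved documents were useful (precision 0.5), and two of the three useful documents were found (recall 0.67).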

I spent three weeks reading papers I didn’t fully understand and building toy examples where I did. The RAGAS paper was the most clarifying — not because of the framework but because of how it decomposed the RAG evaluation problem. Once I understood what I was actually trying to measure, the tooling made sense.

The Python Gap

I need to be honest about this: my automation background was Java and Selenium. Python fluency was not in my skill set.

I spent a month learning Python properly — not “I know the syntax” but “I can debug an async function and write clean, readable scripts that other people can run.” This took longer than I expected and was more important than I expected.

The entire AI tooling ecosystem is Python. RAGAS, DeepEval, LangChain, the Hugging Face libraries — all Python. If you’re coming from a non-Python automation background, budget real time for this. It’s not optional.

I’d recommend: pick one real project and build it entirely in Python, even if you could do it faster in your familiar language. The learning sticks when you’re doing something real.

Six Months In: Something Clicked

Around the six-month mark, something shifted. I started thinking about problems differently.

When someone described a new AI feature, my first question stopped being “how do I write a test for this” and started being “what does failure look like here?” Then: “how would I know if it was failing silently?” Then: “what’s the minimum viable eval dataset to catch the most important failures?”

This is the QA mindset applied to a probabilistic system. It took me a while to make the translation because the surface-level language is so different, but the underlying instinct is the same.

By month eight, I had built a working eval pipeline in CI. Fifty golden test cases. Faithfulness and relevancy metrics running on every PR. Regressions blocked from merging. It wasn’t perfect. It was real infrastructure that caught real problems before they reached users.
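The regression gate itself can be very small. What follows is a sketch of the idea rather than my actual pipeline: the `gate` function, the `max_drop` tolerance of 0.05, and all the scores are assumptions for illustration.

```python
from statistics import mean


def gate(scores: dict[str, list[float]],
         baseline: dict[str, float],
         max_drop: float = 0.05) -> list[str]:
    """Return one message per metric that fell more than max_drop below baseline."""
    failures = []
    for name, values in scores.items():
        current = mean(values)
        if current < baseline[name] - max_drop:
            failures.append(
                f"{name}: {current:.2f} vs baseline {baseline[name]:.2f}"
            )
    return failures


# Hypothetical per-case scores from one PR run, and the stored baseline means.
pr_scores = {
    "faithfulness": [0.9, 0.6, 0.6],
    "relevancy": [0.9, 0.8, 0.85],
}
baseline = {"faithfulness": 0.82, "relevancy": 0.86}

failures = gate(pr_scores, baseline)
exit_code = 1 if failures else 0  # a CI script would pass this to sys.exit()
print(failures)  # ['faithfulness: 0.70 vs baseline 0.82']
```

A nonzero exit code is what lets the CI system block the merge. The tolerance exists because scores wobble from run to run, so a gate with zero tolerance fails on noise.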

What the Transition Actually Costs

I want to be honest about the parts that don’t fit neatly into a “learning journey” narrative.

Time. The first six months, I was doing my day job plus learning eval engineering on evenings and weekends. That’s around 10 extra hours a week for 26 weeks, or roughly 260 hours in total. It’s a real investment, and it affects everything else in your life.

Uncertainty tolerance. Traditional QA has clear answers: the test passes or it fails. LLM evaluation is rarely that clean. A faithfulness score of 0.72 — is that good? It depends on the domain, the threshold you set, the baseline you’re comparing against. Learning to be comfortable with “probably okay, here’s why” instead of “definitely right” is a genuine cognitive adjustment.
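One way to turn "is 0.72 good?" into something answerable is to compare against a baseline while accounting for run-to-run noise. The sketch below is a rough heuristic, not a formal statistical test: `compare_to_baseline`, the multiplier `k`, and every score are hypothetical.

```python
from statistics import mean, stdev


def compare_to_baseline(candidate_runs: list[float],
                        baseline_runs: list[float],
                        k: float = 2.0) -> str:
    """Label the candidate relative to the baseline.

    Differences within k standard deviations of the baseline's own
    run-to-run variation are reported as "no clear change".
    """
    diff = mean(candidate_runs) - mean(baseline_runs)
    noise = k * stdev(baseline_runs)
    if diff < -noise:
        return "worse"
    if diff > noise:
        return "better"
    return "no clear change"


# Hypothetical faithfulness scores from repeated runs of the same eval set.
baseline_runs = [0.74, 0.70, 0.73, 0.71]
same_quality = [0.72, 0.71, 0.73, 0.72]
regressed = [0.62, 0.60, 0.63, 0.61]

print(compare_to_baseline(same_quality, baseline_runs))  # no clear change
print(compare_to_baseline(regressed, baseline_runs))     # worse
```

The useful habit is running the baseline several times before judging anything against it. A single 0.72 tells you very little until you know how much the number moves on its own.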

Explaining your work. When I caught the first real regression — a prompt change that caused faithfulness to drop 12% — the product manager asked “how do you know it’s worse?” I had to explain faithfulness scores to someone who had never heard of them, while convincing them to delay a release. That conversation was harder than any technical problem I solved that year.

What I’d Tell Someone Starting Now

The shortest path I can see:

Get fluent in Python first. It isn’t a strict prerequisite for the other steps, but removing that friction early makes everything else easier.
Read the RAGAS paper — not to use RAGAS specifically, but to understand the measurement problem clearly. It’s 16 pages.
Build one eval suite, end to end, for a toy system. Document your criteria before you touch any framework. The criteria are the hard part.
Run it in CI. This step forces you to build something real instead of something demo-able.
Write about what you find. Not polished articles — working notes. The AI eval community is small and pays attention. Being a known voice in a narrow domain opens more doors than credentials.

It took me about a year to feel confident in the new role. That feels about right. It’s not a short path, but it’s a real one, and the skills you bring from QA are genuinely valuable in a way that most people on the AI side don’t fully appreciate yet.

If you’re somewhere in this transition and want to compare notes, I’m on X at @actual_amit.