A/B testing when building a fintech app
Saves us from gut-feel decisions and endless opinion wars
I have lost count of how many product debates end with someone saying, "Let's just ship it and see what happens." I prefer to learn in a more disciplined way, so I lean on A/B testing whenever I tweak the user experience of a trading or investing app. I do not do it alone: whenever the statistics get gnarly I rope in our data-science partner who handles the heavy quantitative lifting. Below I walk through exactly how we plan, run, and interpret an A/B test, along with the pitfalls I sidestep as a UX researcher working hand-in-glove with product, design, engineering, and a statistician.
When I decide an A/B test is the right tool
I reach for an A/B test when I need clarity on incremental changes that could nudge metrics such as first-trade success, fund-transfer completion, or watch-list engagement. Small copy updates, button placements, and micro-interactions shine here because the signal-to-noise ratio is high enough for us to detect meaningful lifts.
If I am planning a wholesale redesign or introducing a brand-new feature like crypto derivatives, I do not jump into A/B testing straight away. I start with qualitative interviews, concept tests, or diary studies because users need context before they can react. Only once I have converged on two viable design directions do I use an A/B test to pick the stronger one.
How we frame a strong hypothesis
Every experiment we run starts with a pair of hypotheses.
Null hypothesis (H0): The new variant does not change the chosen metric relative to the current experience.
Alternative hypothesis (H1): The new variant leads to a measurable improvement (or decline) in the metric.
To keep ourselves honest, we use the PICOT checklist:
Population: First-time funders on iOS in the United Kingdom
Intervention: A contextual tooltip that appears above the "Confirm Deposit" button
Comparison: The existing flow with no tooltip
Outcome: Increase in successful deposits within ten minutes of starting the flow
Time: Two full trading weeks
This level of specificity stops us from moving the goalposts once data starts to arrive.
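For readers who like to see this pinned down in code, here is a minimal Python sketch of how such a spec could be frozen before launch; every name and value below is illustrative rather than a real production configuration.

```python
# A minimal sketch of freezing the hypothesis and PICOT spec before launch.
# All field names and values are illustrative, not a real production config.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the spec cannot be edited once the test is live
class ExperimentSpec:
    population: str
    intervention: str
    comparison: str
    outcome_metric: str
    duration_days: int
    minimum_detectable_lift: float  # absolute lift, e.g. 0.05 = 5 percentage points
    alpha: float = 0.05             # type I error budget
    power: float = 0.80             # chance of detecting the lift if it is real

deposit_tooltip_test = ExperimentSpec(
    population="First-time funders on iOS in the United Kingdom",
    intervention="Contextual tooltip above the 'Confirm Deposit' button",
    comparison="Existing flow with no tooltip",
    outcome_metric="successful_deposit_within_10_minutes",
    duration_days=14,               # two full trading weeks
    minimum_detectable_lift=0.05,
)
print(deposit_tooltip_test)
```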
Implementation and my role
Most A/B-testing tools simply give us a few lines of JavaScript (or an SDK call). Our engineers paste that into the app, and just like that, we can show two versions to different users without pushing a new release. When we want to test new images or animations, the designers add those files first, and we switch them on or off with a feature-flag toggle.
We randomise at the user level, so each person sees the same version for the entire experiment, across every session. For financial tools, we never randomise at the screen level; switching variants mid-journey would be reckless given the money at stake.
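To illustrate what user-level assignment can look like under the hood, here is a minimal Python sketch that hashes a stable user ID into a bucket. The experiment key and the 50/50 split are made up, and in practice the testing tool's SDK does this for us.

```python
# A minimal sketch of deterministic, user-level assignment. Hashing a stable
# user ID means the same person lands in the same bucket on every session and
# device, so nobody switches variants mid-journey. The key and split are made up.
import hashlib

def assign_variant(user_id: str, experiment_key: str = "deposit_tooltip_v1") -> str:
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100                    # stable bucket in 0..99
    return "variant" if bucket < 50 else "control"    # 50/50 split

print(assign_variant("user-12345"))                   # same answer every time for this user
```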
From where I sit as a UX researcher, once I've drafted the hypothesis and defined what we're testing, say a new call-to-action button or different wording on a tooltip, I start by sketching out the two versions of the experience. Sometimes that's a simple wireframe; sometimes I partner with a designer to create polished mocks.
Then comes the setup. I don't handle the code myself; I hand that snippet or SDK package to our engineers, who add it into the app. It is what lets the testing tool show one version to one group of users and the other version to another group, all without releasing a new app update, so the test runs quietly in the background.
If the change involves any images, animations, or new interface components, I work with the designer to make sure those assets are uploaded into our system ahead of time. These assets are then linked to a feature flag: a simple switch we can turn on or off in the test platform or build environment. The flag makes it easy to control which users see which experience.
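As a rough illustration of what that switch amounts to, here is a hypothetical Python sketch; show_deposit_tooltip, the experiment key, and the rollout argument are all invented names, not a real feature-flag platform's API.

```python
# A hypothetical sketch of a feature flag sitting on top of the bucketing above;
# show_deposit_tooltip and the rollout argument are invented, not a real API.
import hashlib

def show_deposit_tooltip(user_id: str, rollout: bool = True) -> bool:
    """True means this user sees the tooltip variant; False means the control flow."""
    if not rollout:                      # the kill switch: flip off to end the test
        return False
    digest = hashlib.sha256(f"deposit_tooltip_v1:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < 50    # the half of users assigned to the variant

print(show_deposit_tooltip("user-12345"))                  # flag on: depends on bucket
print(show_deposit_tooltip("user-12345", rollout=False))   # flag off: everyone sees control
```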
This setup lets me stay close to the user journey and not get bogged down in the technical side. I don't need to wait for the next sprint release or app store approval. Instead, I can test real interactions with live users and get clean data in just a few days.
Once the experiment is live, I check the dashboards daily to make sure data is coming through correctly: impressions, click-throughs, drop-offs, and the primary metrics we defined in the hypothesis. This is a data-quality check, not a peek at the results. Once the test hits the required sample size, I sit down with our data scientist to dig into the results and understand what's actually happening from a user-behaviour point of view.
Sample size and statistical power
Because retail trading apps rarely see the traffic of social networks, our statistician runs a power analysis upfront. If we cannot detect at least a 5 percentage-point lift with 80 per cent power inside four weeks, we usually park the idea or broaden the audience. Chasing microscopic lifts with thin traffic is a quick path to false positives and wasted sprint cycles.
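For the curious, here is a sketch of that power calculation using statsmodels, assuming an illustrative 40 per cent baseline completion rate and the 5-percentage-point minimum lift mentioned above.

```python
# A sketch of the upfront power calculation, assuming an illustrative 40 per cent
# baseline completion rate and a 5-percentage-point minimum detectable lift.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.40                                     # illustrative control conversion rate
target = baseline + 0.05                            # smallest lift worth acting on
effect = proportion_effectsize(target, baseline)    # Cohen's h for two proportions

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(round(n_per_arm))  # users needed in EACH arm; if traffic cannot reach this
                         # inside roughly four weeks, we park or broaden the test
```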
Collecting and analysing the data
We set alpha at 0.05 to cap the risk of a type I error. Once the test reaches the predetermined sample size, my data-science colleague freezes the data and runs a two-proportion z-test. If the p-value is lower than alpha, we reject the null hypothesis and plan the rollout. Otherwise, we keep the existing experience.
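A minimal sketch of that readout, with made-up counts standing in for the frozen dataset, might look like this:

```python
# A minimal sketch of the final readout; the counts are made up, the real ones
# come from the frozen dataset our data scientist exports.
from statsmodels.stats.proportion import proportions_ztest

successes = [372, 315]    # completed deposits: [variant, control]
totals = [800, 800]       # users who entered the deposit flow in each arm

z_stat, p_value = proportions_ztest(count=successes, nobs=totals,
                                    alternative="two-sided")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# p < 0.05 -> reject H0 and plan the rollout; otherwise keep the current flow
```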
We resist the temptation to peek midway and stop early. Sequential looks inflate the chance of spurious wins, and we do not want to mislead stakeholders with half-baked evidence.
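If a stakeholder needs convincing, a quick simulation makes the point: under the null hypothesis, checking the p-value every day and stopping at the first "significant" result rejects far more often than the 5 per cent we budgeted for. The traffic figures below are purely illustrative.

```python
# Why peeking hurts: simulate many A/A tests (no real difference), check the
# p-value after every "day", and stop at the first value below 0.05. The
# false-positive rate lands well above the nominal 5 per cent.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
n_sims, daily_users, days, p_true = 2000, 100, 14, 0.40
false_positives = 0

for _ in range(n_sims):
    a = rng.binomial(1, p_true, daily_users * days)   # control arm
    b = rng.binomial(1, p_true, daily_users * days)   # "variant" arm, same true rate
    for day in range(1, days + 1):
        n = day * daily_users
        _, p = proportions_ztest([a[:n].sum(), b[:n].sum()], [n, n])
        if p < 0.05:              # an eager peeker stops here and declares a win
            false_positives += 1
            break

print(f"False-positive rate with daily peeking: {false_positives / n_sims:.1%}")
```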
Common mistakes we avoid
Vague hypotheses: If we cannot describe the expected outcome in one sentence, we refine the idea before writing a single line of code.
Testing too many things at once: Multivariate tests demand large sample sizes, which we seldom have. We test one lever at a time and stack the wins.
Ignoring external factors: We never compare data from earnings day against a sleepy bank holiday. Seasonality matters in trading behaviour.
Stopping for "trends": We run the full duration even when early numbers look promising because regression to the mean can and does happen.
How A/B testing fits into the broader research toolkit
A/B testing is only one voice in the product choir. We start with analytics to spot friction, back the insights with interviews to understand why, then sketch a solution, and finally test it with an experiment. When the change ships, we keep an eye on longitudinal metrics like churn and asset growth so we can catch any slow-burn side-effects.
Cheers