Real-World Data Is Noisy. Your Model Has to Know That.
Anyone who has worked with electronic health records knows an unspoken truth: the data is rarely as clean as the questions we ask of it. Notes are typed in a hurry, scales switch between pounds and kilograms, and simple typos are common. When you multiply these small errors by tens of thousands of patients over several years, you are left with a dataset that looks rich but is alarmingly noisy.
For data scientists in pharma and biotech, this is everyday terrain. Finding ways to see through this noise is where some of the most critical work happens.
The Challenge: Tracking Real Progress
In a recent observational study, we wanted to understand how different weight-loss approaches — such as medication, surgery, or lifestyle changes — affected people living with obesity and related conditions like sleep apnea or heart disease. The analysis hinged on a simple question: how much weight did each patient actually lose? If we get that number wrong, every other finding about which treatments work, and by how much, becomes unreliable.
The problem is that simple data cleaning isn't enough. If a patient who usually weighs 200 pounds is suddenly recorded at 90, it might be a typo, or it might be that the weight was recorded in kilograms (90 kg is about 198 lbs). Hard rules often throw away real information or keep errors. To solve this, we stopped looking at measurements in isolation and started looking at the patient's whole journey.
From Noisy Measurements to a Coherent Trajectory
We used a method called a Hidden Markov Model (HMM). Think of it as a detective. It assumes there is a "true" weight we can't see directly, and the numbers on the chart are just noisy clues. By looking at a sequence of visits, the model weighs the clues against each other to find the most plausible underlying trajectory.
Put in slightly more technical terms, the model separates a patient’s “state” (the true weight at a given time) from the observations (recorded measurements) it “emits”. The state model captures gradual, coherent trajectories; the emission model accounts for noise, unit errors, and outliers. Because the model is fitted jointly across the full time series, each observation is interpreted in context — suspicious values are explained, not merely discarded. We taught the model to recognize three ways a recorded value can deviate from the truth: ordinary measurement noise and daily fluctuation, unit mix-ups (pounds vs. kilograms), and random outliers. The result is a cleaned "trajectory" that represents what the patient likely actually weighed at each point in time.
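To make “explained, not merely discarded” concrete, here is a minimal sketch in Python. It is not our production code, and the noise scale, mixing weights, and plausible range are made-up values for illustration. Given the model's current estimate of a patient's true weight, it scores a single recorded value under each of the three mechanisms:

```python
import numpy as np
from scipy.stats import norm, uniform

C = 2.2                # pounds per kilogram
SIGMA = 5.0            # assumed measurement noise (lbs); illustrative only
LO, HI = 60.0, 500.0   # assumed plausible range of recorded weights (lbs)

def hypothesis_probs(observed, true_weight_est, mix=(0.05, 0.90, 0.05)):
    """Posterior probability that `observed` is a kg entry, a normal
    pounds measurement, or an outlier, given the estimated true weight."""
    lik = np.array([
        mix[0] * norm.pdf(observed, loc=true_weight_est / C, scale=SIGMA),  # kg entry
        mix[1] * norm.pdf(observed, loc=true_weight_est, scale=SIGMA),      # pounds
        mix[2] * uniform.pdf(observed, loc=LO, scale=HI - LO),              # outlier
    ])
    return lik / lik.sum()

# A recorded 90 for a patient whose trajectory sits near 200 lbs:
print(hypothesis_probs(90.0, 200.0))  # the kg-entry hypothesis dominates
```

In the full model this scoring happens jointly with estimating the trajectory itself, so the "current estimate" is informed by the surrounding visits; the technical appendix below spells out the details.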

Why This Matters
When the data is cleaner, the results are more honest. We can more accurately see which treatments help patients and by how much. This ensures that the insights we share are based on what was really happening to people, not just what was typed into a computer.
The Broader Lesson
What we like about this story isn't that hidden Markov models are exotic — they're not, in the right hands. The real challenge wasn't statistical; it was domain-driven. You can't design a model that distinguishes pounds from kilograms unless someone on the team has seen that failure mode firsthand. You can't trust the trajectories unless a clinical eye confirms that's how patients on this regimen actually behave.
That cross-pollination — statisticians who think in trajectories, clinicians who think in pathways, engineers who think in pipelines — is where the creative solutions live. The unglamorous truth of working with real-world medical data is that the clever model is a byproduct. The actual value comes from having the right people willing to take the noise seriously.
Technical Appendix
The HMM is composed of state transitions (the evolution of the true weight in pounds) and emissions (the observed weights).
The state transitions of patient $i$ at measurement $t$ are given according to

$$ w_{i,t} \mid w_{i,t-1} \sim \mathcal{N}\!\left(w_{i,t-1},\; \sigma_{\Delta}^{2}\,\Delta t_{i,t}\right), $$

where $\Delta t_{i,t}$ is the time elapsed between the two observations; scaling the variance by $\Delta t_{i,t}$ allows larger variation as the time between visits increases. The Markovian assumption becomes clear here: all information about measurement $t$ is provided by the observation at time $t-1$.
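In code, the transition density is essentially a one-liner; the value of $\sigma_{\Delta}$ below is an illustrative assumption, not a fitted estimate:

```python
import numpy as np
from scipy.stats import norm

SIGMA_DELTA = 0.5  # assumed sigma_Delta (lbs per sqrt(day)); illustrative only

def transition_logpdf(w_curr, w_prev, dt):
    """Log-density of the true weight moving from w_prev to w_curr over dt days."""
    # Variance scales with dt, so the standard deviation scales with sqrt(dt).
    return norm.logpdf(w_curr, loc=w_prev, scale=SIGMA_DELTA * np.sqrt(dt))

# Larger gaps between visits permit larger plausible changes:
print(transition_logpdf(195.0, 200.0, dt=7))   # one week: a 5 lb drop is unlikely
print(transition_logpdf(195.0, 200.0, dt=90))  # three months: far more plausible
```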
The emissions are provided by a mixture distribution,

$$ y_{i,t} \mid w_{i,t} \sim \pi_{1}\,\mathcal{N}\!\left(\tfrac{w_{i,t}}{c},\, \sigma^{2}\right) + \pi_{2}\,\mathcal{N}\!\left(w_{i,t},\, \sigma^{2}\right) + \pi_{3}\,\mathrm{Uniform}(a, b), $$

where $y_{i,t}$ is the observed weight, $c \approx 2.2$ is the pounds-per-kilogram conversion factor, and $(a, b)$ is the plausible range of observed weights. Each mixture component corresponds to a different recording mechanism. The first component represents the case where a patient’s weight is recorded in kilograms instead of pounds. The second component represents the standard case, in which the observed value is a noisy measurement of the true weight in pounds. The third component represents an outlier, modeled as a uniformly distributed value over the plausible range of observed weights. The mixing probabilities $\pi_{1}, \pi_{2}, \pi_{3}$ must sum to 1.
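The mixture above transcribes almost line for line into code; the mixing weights, noise scale, and range below are illustrative assumptions:

```python
from scipy.stats import norm, uniform

C = 2.2  # pounds per kilogram

def emission_pdf(y, w, pi=(0.05, 0.90, 0.05), sigma=5.0, a=60.0, b=500.0):
    """Mixture density of observing y given true weight w (in pounds).
    pi, sigma, and (a, b) are illustrative assumptions, not fitted values."""
    return (pi[0] * norm.pdf(y, loc=w / C, scale=sigma)     # kg-entry error
            + pi[1] * norm.pdf(y, loc=w, scale=sigma)       # standard pounds
            + pi[2] * uniform.pdf(y, loc=a, scale=b - a))   # outlier
```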
Parameter Estimation
The model parameters (the mixing probabilities $\pi_{k}$ and the variances $\sigma^{2}$ and $\sigma_{\Delta}^{2}$) can be estimated using the expectation-maximization (EM) algorithm. EM is appropriate here because the true weight trajectory $w_{i,t}$ is unobserved, and because each recorded value may arise from one of several latent emission mechanisms: a standard pounds measurement, a kilograms-entry error, or an outlier. At a high level, the expectation step uses the current parameter values to infer the likely latent weight trajectory and to assign each observation a probability of belonging to each emission component.
The maximization step then updates the model parameters using these inferred probabilities. Observations that are likely to be standard pounds measurements contribute more strongly to the standard measurement-error component, observations likely to be kilogram entries contribute more strongly to the kilogram-entry component, and observations likely to be outliers contribute to the outlier component. The transition variability is updated from the inferred changes in true weight over time. These steps are repeated until convergence, following the standard EM framework for latent-variable models and standard HMM inference methods (Rabiner, 1989).
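For readers who want to see the mechanics, here is a minimal, illustrative sketch in Python. It is not the production implementation: to keep it short, the continuous true weight is discretized onto a grid so that the standard forward-backward recursions apply, and only the mixing probabilities are re-estimated in the M-step (the variances are held fixed). Every numeric value is an assumption for illustration.

```python
import numpy as np
from scipy.stats import norm, uniform

C, SIGMA, SIGMA_DELTA = 2.2, 5.0, 0.5        # lbs/kg, noise sd, drift sd (assumed)
A_LO, A_HI = 60.0, 500.0                     # plausible observed range (lbs)
GRID = np.linspace(100.0, 300.0, 201)        # candidate true weights (lbs)

def component_liks(y):
    """Likelihood of observation y under each mechanism, at every grid state."""
    return np.stack([
        norm.pdf(y, loc=GRID / C, scale=SIGMA),                  # kg entry
        norm.pdf(y, loc=GRID, scale=SIGMA),                      # standard lbs
        np.full_like(GRID, uniform.pdf(y, A_LO, A_HI - A_LO)),   # outlier
    ])

def transition_matrix(dt):
    """Gaussian random-walk transition probabilities on the grid."""
    T = norm.pdf(GRID[None, :], loc=GRID[:, None],
                 scale=SIGMA_DELTA * np.sqrt(dt))
    return T / T.sum(axis=1, keepdims=True)

def em_step(ys, dts, pi):
    """One EM iteration over one patient's series; returns updated pi."""
    n = len(ys)
    comp = [component_liks(y) for y in ys]      # each (3, n_states)
    obs = [pi @ c for c in comp]                # mixture likelihood per state
    alpha = np.empty((n, len(GRID)))            # scaled forward pass
    alpha[0] = obs[0] / obs[0].sum()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ transition_matrix(dts[t - 1])) * obs[t]
        alpha[t] /= alpha[t].sum()
    beta = np.ones_like(alpha)                  # scaled backward pass
    for t in range(n - 2, -1, -1):
        beta[t] = transition_matrix(dts[t]) @ (obs[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta                        # posterior over true weights
    gamma /= gamma.sum(axis=1, keepdims=True)
    resp = np.zeros(3)                          # E-step: component responsibilities
    for t in range(n):
        post_k_given_w = (pi[:, None] * comp[t]) / obs[t][None, :]
        resp += post_k_given_w @ gamma[t]       # marginalize over grid states
    return resp / n                             # M-step for mixing probabilities

ys = np.array([201.0, 199.0, 91.0, 200.0, 340.0, 198.0])  # toy series (lbs)
dts = np.array([7.0, 7.0, 14.0, 7.0, 7.0])                # days between visits
pi = np.array([0.05, 0.90, 0.05])
for _ in range(10):
    pi = em_step(ys, dts, pi)
print(pi)  # the 91 (kg entry) and 340 (outlier) pull up the first and third weights
```

The same `gamma` posterior also yields the cleaned trajectory: its expectation over the grid at each visit is the model's estimate of what the patient actually weighed.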






