eTech Insight – Synthetic Data Advances Clinical Research

Personal Health Information Regulations Can Slow Clinical Research

The U.S. government enacted the Privacy Rule* to protect individually identifiable health information, known as Protected Health Information (PHI). The protected information consists of “the individual’s past, present, or future physical or mental health condition, the provision of health care to the individual, or the past, present, or future payment for the provision of health care to the individual, [and] any information that identifies the individual or for which there is a reasonable basis to believe it can be used to identify the individual.”

“A central aspect of the Privacy Rule is the principle of ‘minimum necessary’ use and disclosure. A covered entity must make reasonable efforts to use, disclose, and request only the minimum amount of protected health information needed to accomplish the intended purpose of the use, disclosure, or request.”** Policies and procedures to ensure compliance with the minimum necessary rule must be created and implemented.

Ensuring that a clinical research project uses patient data in a compliant manner can stall the project for months or longer. There is a clear need for a substantial repository of anonymized patient data that can support the data analytics driving clinical research.

An anonymized database of patients’ clinical and socioeconomic data that meets the needs of healthcare researchers can be expensive to create and maintain. If data from several organizations are used, the ability to ensure PHI compliance and meet the Institutional Review Board (IRB) requirements becomes more complex.

Synthetic Data Ensures Personal Health Information Rules Compliance

Synthetic data closely reproduces the statistical properties of real patient data while allowing clinical researchers to quickly create data sets to support focused analytics. Because synthetic records cannot be associated with real patients, their use does not violate PHI regulations. Washington University in St. Louis has created a process for producing synthetic data from its enterprise patient data using MDClone. The process takes enterprise patient data collected by the organization, adds research data from other sources, and loads the combined data into the MDClone data lake. The MDClone query tool then generates computationally derived synthetic data from the original patient data set.
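To make the idea of computationally derived synthetic data concrete, here is a minimal Python sketch. It is not MDClone's method, which this article does not describe; it assumes a deliberately simple technique (fitting a multivariate Gaussian to numeric columns and sampling new rows from it), and the "real" table, column meanings, and function names are all hypothetical stand-ins. Production tools handle categorical fields, rare values, and privacy safeguards that this sketch ignores.

```python
import numpy as np


def fit_gaussian_model(real):
    """Estimate column means and the covariance matrix of a numeric table."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov


def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw new rows from the fitted model; no row is copied from the source."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)


# Hypothetical stand-in for real patient data (assumption, not actual PHI):
# column 0 = age in years, column 1 = systolic blood pressure in mmHg.
rng = np.random.default_rng(42)
age = rng.normal(60, 12, size=5000)
sbp = 90 + 0.6 * age + rng.normal(0, 10, size=5000)
real = np.column_stack([age, sbp])

mean, cov = fit_gaussian_model(real)
synthetic = sample_synthetic(mean, cov, n_rows=5000, seed=7)

# Compare a summary statistic across the two tables.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synthetic_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print("real age/SBP correlation:     ", round(real_corr, 3))
print("synthetic age/SBP correlation:", round(synthetic_corr, 3))
```

The two printed correlations should be close, which is the sense in which a synthetic table can stand in for the original in aggregate analyses even though its rows correspond to no actual patient.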

MDClone has created a Global Network that includes some prestigious U.S. healthcare provider organizations that are creating a repository of synthetic data to facilitate more efficient data analytics to support research projects. The Global Network will focus on three pillars of research: health services, clinical medicine, and precision medicine. The synthetic database comprises data from 30 million patients. Healthcare organizations have used the database to manage populations with chronic kidney disease, improve sepsis monitoring, and understand the treatments and outcomes of patients with COVID-19.

The Global Network provides a foundation for expanding and extending synthetic data by incorporating data sets from multiple academic medical centers that will continue to improve insights into patient treatments and outcomes.

Creating Multiorganizational Data Sets Without PHI or Ownership Issues

Current data platforms for multicenter collaborations that share raw PHI across provider organizations may be hampered by conflicts over data ownership and PHI compliance. Converting the raw patient data into synthetic data removes these barriers and will likely produce more collaboration among global medical-research organizations. The use of larger synthetic data sets will drive more targeted and sophisticated analytics that will support breakthroughs in medical treatments, new medications, and precision medicine.

Synthetic data can also be used to improve bioinformatics education and training. Removing PHI and IRB components from data projects will support growth in informaticist education programs. Synthetic data can also be used to build test data sets for AI models.

Healthcare May or May Not Be a Focus

Emerging synthetic data companies apply their solutions to banking, insurance, and healthcare. Representative vendors include the following:

  • MDClone – an Israel-based company with an impressive U.S. healthcare client base
  • Hazy – a company that focuses on financial institutions
  • Syntho – a general-purpose synthetic platform with a focus on supporting AI

Vendors of synthetic data will become catalysts for higher levels of shared healthcare data for collaborative research.

Success Factors

  1. Healthcare organizations should consult with peer organizations before using synthetic data solutions or joining in collaborative synthetic data projects.
  2. Once synthetic data platforms have achieved acceptance from clinicians, they should be evaluated for informaticist education and training extensions.
  3. Institutions that are using or testing AI platforms to support their medical services should evaluate using synthetic data for the AI training models.


The use of patient data to support data analytics and clinical research projects can be stymied by PHI regulations and IRB processes. Synthetic data provides a new approach for ensuring patient data has been converted into a format that cannot be associated with any patient. Synthetic data will provide a platform for driving higher levels of clinical research collaboration as healthcare organizations share and synthesize their data.

Higher levels of research collaborations by healthcare organizations will deliver new insights for improving care delivery, medical devices, medication use and discovery, oncology protocols, health and wellness, and chronic disease management.

Using synthetic data to improve the training databases for AI will likely drive more timely and advanced progress for AI applications. As the availability and use of synthetic data solutions increase, the ability to establish a minimum data set for AI training models may emerge, which could improve clinician acceptance and adoption of AI guidance.

Synthetic data will emerge as a superior alternative to legacy approaches that rely on raw or anonymized patient data.


*45 C.F.R. § 160.103.
**45 C.F.R. §§ 164.502(b) and 164.514(d).

Photo Credit: Adobe Stock, ryzhi