Interview with Stéphanie Combes, HDH Director
[Emmanuel Mawet]: Can you quickly outline your background that led you to pilot the Health Data Hub?
[Stéphanie Combes]: After graduating from the École Polytechnique, ENSAE and the Paris School of Economics, I joined the French Treasury in 2010, where I worked on energy policy issues. I then coordinated a team in charge of producing short-term forecasts of French GDP, before joining INSEE in 2014, where I was in charge of creating the Big Data activity, which prefigured INSEE’s innovation laboratory. My involvement in digital health issues began with my position at DREES, where I was recruited to represent the ministerial administrator of health data. At that time, I had to develop innovative uses of the SNDS within DREES and the Ministry. Faced with the various obstacles to the simple use of this data, we launched the project to create a health data platform, the Health Data Hub, whose sole mission would be to facilitate access to health data while respecting the rights of individuals and security, by accompanying projects from start to finish. This project was in line with the recommendations made by the deputy Cédric Villani as part of his work on artificial intelligence. It was selected as one of the actions of the national strategy for artificial intelligence and by the Fonds de Transformation de l’Action Publique, which provided seed funding. I was then appointed rapporteur of the prefiguration mission of the Health Data Hub, led by Marc Cuggia, Dominique Polton and Gilles Wainrib, and was put in charge of setting up the structure once the Minister at the time, Agnès Buzyn, had entrusted DREES with implementing the roadmap thus proposed.
[EM]: Before addressing the issues that are less agreed upon, can you explain the motivations and goals of this project?
[SC]: France has one of the largest medico-administrative databases in the world, linked to reimbursements for procedures and care for beneficiaries of all compulsory health insurance schemes. For a long time, this management database has been under-utilized. Following several reports, the call for better accessibility to health data was heard in 2016. The “modernization of our health system” law created the National Institute of Health Data (INDS) in charge of fostering dialogue between actors and simplifying access procedures.
On March 29, 2018, in line with the recommendations of the report by Deputy Cédric Villani on artificial intelligence that I mentioned above, the President of the Republic announced the creation of a health data hub, a partnership structure whose purpose is to guarantee simplified access to health data, through a secure technological platform with state-of-the-art analysis tools, all with respect for citizens’ rights.
Throughout the summer of 2018, at the request of the Minister of Solidarity and Health, Agnès Buzyn, a mission led by three experts in the field of health data was conducted; it submitted its report on October 12. DREES was then entrusted with the implementation of the roadmap proposed in the report, and 8 months later, the law on the “organization and transformation of the healthcare system” (OTSS) was passed. Article 41 of the law expanded the National Health Data System (SNDS) and officially created the HDH.
The Health Data Hub is therefore constituted as a public interest grouping, whose constitutive agreement was approved by the General Assembly of the Hub and published through a ministerial order on November 29, 2019. The General Assembly of the Health Data Hub brings together its 56 members, divided into 9 colleges: the State; health insurance funds; complementary health insurance organizations; research organizations; health institutions; health professionals; agencies; operators and independent public authorities; representatives of users of the health system and industrialists.
What is it for? In the digital age, each act of care results in the creation of data. All this data is a precious and essential raw material for research. By gathering and analyzing it, researchers can answer concrete questions to improve the quality of care and treatment, such as studying the side effects of prescriptions. Companies can develop new solutions, such as algorithms for detecting heart failure. This data is particularly abundant in France, which can be a competitive advantage internationally for research and innovation.
Until now, obtaining access to this data has raised many difficulties for those who wish to use it in the context of projects of general interest:
- Data are scattered among multiple databases, not well known and not easily understandable from the outside;
- The procedures for accessing health data are complex, due to the very sensitivity of these data, and are governed by different, sometimes discretionary, rules;
- The tools and skills needed to process the data securely, as required, are expensive and often inaccessible to small research teams or start-ups.
As a result, some projects that could bring real benefits to patients take several years to get off the ground, if they get off the ground at all. Some French start-ups that want to develop new solutions are forced to partner with foreign players to collect data. Their innovations will not necessarily be adapted to French patients or even available to them.
[EM]: What projects based on the Health Data Platform allow us to measure its interest?
[SC]: The HDH service offer is complementary to that of data collections (administrative databases, registers, cohorts, or surveys) under the responsibility of public or private organizations in the health field.
It responds to two categories of projects:
- Projects that aim to cross-reference different data sources, when such linkage or matching is complicated to achieve.
- Projects that aim to build up large volumes of data by aggregating multicenter collections.
For these two types of projects, the HDH technology platform acts as a trusted third party to securely bring together the various sources of data and enable them to be used with state-of-the-art technologies.
The Health Data Hub supports 72 projects, some of which are pilot projects selected by calls for projects and some of which are related to the epidemic.
Of the 72 projects supported by the HDH:
- 62 are partner projects,
- More than a third are carried out with industry partners,
- 12 are related to the epidemic.
Of all the projects, 34 have received a favorable opinion from CESREES and 31 have received or are about to receive a response from the CNIL. 35 projects have been authorized by the CNIL out of 45 eligible projects. Data analysis projects can take several years to produce results, but to date 3 projects have been completed and 1 may be completed soon.
Of the 72 projects, 57 require more than one data source, 9 require up to three sources, and 34 require a cross-reference with the main SNDS database. Of the 34, 11 require probabilistic matching. Of the 72 projects, 57 are using the HDH platform and 30 will have arrived on it by the end of 2022.
Of the 72 projects, 15 require data preparation at the level of a health data warehouse prior to transferring to the Health Data Hub’s analysis platform: 7 have support from the HDH to do so, 6 have completed this data preparation phase or are about to complete it.
While it is generally too early to attribute any medically useful advances to these projects, important milestones have been reached that pave the way for such advances:
- Exploitation of emergency room passage data for analysis of care use and monitoring of the Covid-19 health crisis (DREES): this project aimed to analyze for study purposes the use of care (particularly of patients with pathologies other than Covid) during the health crisis.
- GLUCO (EMA, IQVIA, PeLyon): this Europe-wide project, commissioned by the EMA, has now been completed. It studies the use of systemic glucocorticoids in the treatment of COVID-19 and the risks of associated adverse events. The HDH transformed into OMOP-CDM format the CNAM extraction covering 300,000 patients with a hospital diagnosis of COVID-19. Before transfer, the HDH performed quality tests on the transformed data. This study is the first project conducted on SNDS data transformed by the HDH into an international and interoperable format. A scientific article presenting the results is currently being finalized. The HDH has participated in various conferences, such as Medical Informatics Europe 2022 and OHDSI Europe 2022, where a poster on alignments was presented. The HDH also plans to release soon, as open source, the scripts that convert the native SNDS into the OMOP version of the SNDS.
- BACTHUB (AP-HP and INSERM): this project helps to understand the link between antibiotic use and the development of antibiotic-resistant bacteria. The HDH consolidated data from 50,000 patients from 37 AP-HP hospitals over a 5-year period. An academic article will be submitted very soon to Eurosurveillance to present the richness of the database built to date from the AP-HP hospital databases. The data have been prepared and consolidated with the support of 2 data engineers provided by the HDH for more than 1 year.
- HYDRO (Implicity): the HDH contributed to the transfer of data from 27,000 pacemakers to Implicity’s heart failure platform. In doing so, it contributed to a 50% to 80% improvement in the matching rate with the SNDS. In the end, data from more than 1,000 tables in the SNDS were extracted and will be analyzed together with data from single cardiac implants, including biology, microbiology and drug prescription data. In total, more than 12 million lines of data have been subjected to automated quality control. The algorithm is currently under development; the first results have not yet been communicated but seem promising.
- ORDEI (ANSM): this tool reports the adverse effects of taking medications. The HDH produced a first mock-up of the tool using drug consumption data available as open data, and work is under way to replace these with SNDS data from the Assurance Maladie. The tool is intended to be put online.
- NHANCE (AP-HP): this tool improves the interpretation of ultrasound images of ventral organ lesions. The dedicated Health Data Hub team extracted 80,000 anonymous ultrasound images, whereas performing the extraction manually would have taken 2 years of work. The project team developed tools to pre-process the data for ultrasound studies, allowing both perfect de-identification of images and standardization of image content across different databases. This work was published at the IEEE International Symposium on Biomedical Imaging in 2021 (source: https://gitlab.inria.fr/hdadoun/pre-process-US) and was the subject of an article published by the Radiological Society of North America on March 2, 2022. A scientific article also acknowledges the HDH.
- INNERVE (Quantmetry): the purpose of the study is to develop software that integrates directly with the scanner in order to refine the diagnosis of small fiber neuropathies. Small fibers are small nerve cells that make it possible to feel pain and temperature; small fiber neuropathies can notably lead to pain and loss of sensitivity. Algorithms for detecting features of interest on images (membrane detection, fiber detection, fiber-membrane intersection detection) have been developed and tested, with an average accuracy of 70%. The intra- and inter-operator variance confirms that the developed model can automatically replace the analysis work done by the physician. A publication is being written for the journal “AI in medicine”.
- TARPON (University of Bordeaux): the project aims to develop an artificial intelligence (AI) that automatically analyzes the free text written by health professionals in the medical records of patients admitted to emergency departments. For the first time, information on traumatic events contained in patient records will be cross-referenced with data on patients’ pre-trauma medication consumption. The first phase of the project, which consisted of developing the methodology, produced an algorithm that identifies patients treated for trauma in the emergency room. Two scientific articles are available and will soon be submitted for publication in a scientific journal.
- HugoShare (Hugo Network) studies drug interactions that may cause adverse events, based on hospital drug prescriptions. Part of the HDH’s financial support went to the data preparation phase, now finalized, and more particularly to the implementation of data flows from the local EDS (health data warehouses), the interoperability of the HUGO EDS and, finally, the semantic quality of the biology data. The data are currently being ingested into the HDH platform and the study will be able to start in the coming days.
- Deep-Sarc (Centre Léon Bérard) is a study conducted on “real-life data” to identify the most suitable treatments for patients with sarcomas, nearly half of whom in France do not respond to standard treatments. This study will cross-reference patient data from the clinical reference network for sarcoma in France and the SNDS over 7 years (2010-2017). The HDH’s financial support has so far helped the project with preparing the Netsarc data and setting up the ingestion channel, matching the Netsarc data with the SNDS, and analyzing the data. The study is underway on the platform.
- Deep Piste (RCDC Occitania) is cross-referencing data from mammograms collected between 2004 and 2019 at the RCDC Occitania with corresponding SNDS data to improve breast cancer screening programs. These databases are currently being transferred to the HDH platform so that the project team can launch the study in the coming weeks. Researchers will then annotate the mammograms on the HDH platform using the Cytomine tool, which is integrated into the HDH platform’s technology offering, to help the cancer recognition algorithm learn and to decrease the rate of interval cancers. A GitLab repository is available.
- Rexetris (Limoges University Hospital, ABM) measures the long-term impact of exposure to immunosuppressive drugs in kidney transplant patients. The financial support of the HDH has allowed the University Hospital of Limoges to recruit an engineer who will soon work on the HDH technological platform to develop algorithms for data interpretation and risk models for loss of graft function (with or without taking into account exposure to immunosuppressive drugs). The matching between the databases is in progress, and some components will be released as open source to facilitate reuse.
[EM]: One of the biggest criticisms that has been levelled at the Health Data Platform is that it chose the Microsoft cloud. Do you understand the reactions? What are the answers you can provide?
[SC]: In parallel with the prefiguration mission, the target functionalities of the technological platform were defined within the framework of a technical working group involving representatives of the entire ecosystem (hospitals, startups, researchers, CNAM) and were publicly communicated on the website of the Directorate for Research, Studies, Evaluation and Statistics (DREES) in early 2019.
The Health Data Hub technology platform requires:
- An infrastructure that provides elastic storage and computing capacity for advanced data science processing that meets the need for progressive scaling inherent in the Health Data Hub offering;
- A foundation of integrated services providing the necessary functionalities for the technological platform such as data processing and governance, results visualization, identity management, traceability, security maintenance, project space management, automation of resource deployment according to the principles of “programmable infrastructure” or “infrastructure as code”, etc.
The technology platform is built to provide access to sensitive data in a highly secure environment, under pressure from an ecosystem that demands this access. To meet this requirement, it is essential that the hosting provider offers integrated services that enable end-to-end management of secure processing and operations on the technology platform while reducing integration time and costs. For example, it is essential that the platform’s technical components generate logs that can be received by a centralized storage and processing service; or that the identity management service can be integrated with all of the platform’s technical components to verify the access rights of people and machines to these different components. Most of the French hosting service providers studied offered a predominantly “infrastructure”-oriented service that required a significant amount of integration of the services that are essential to the platform, thus weakening security management.
In addition, the physical security of the data centers is also a primary criterion. When the hosting solution was chosen in February 2019, only a small number of players held the “Health Data Hosting” certification, which is obtained through a technical compliance audit carried out by an accredited certification body as defined by Article R.1111-10 of the Public Health Code. In this respect, it was also essential that the resources required for artificial intelligence processing, such as graphics processing units (GPUs), be included in an HDS-certified hosting package.
More than a dozen leading industrial and research players were consulted, first during the prefiguration mission, then at the end of 2018 (Thalès, Atos, Santeos, OVH, Docaposte, Orange, Teralab, Institut Pasteur, CASD, Genci, Outscale, Saagie, Amazon, Google, Microsoft). The possible options were assessed by the team in charge of setting up the project at the Ministry of Solidarity and Health in terms of their coverage of functional and security requirements, and the existence of an appropriate contractual vehicle to enable implementation within the required timeframe.
After analyzing the French players approached to develop the platform, it became clear that Microsoft’s Azure cloud solution was the only one to offer the necessary features and certifications in an integrated manner.
The choice of the Microsoft Azure solution to host the data for the Health Data Hub technology platform is reversible. The objective of reversibility was also included in the Health Data Hub’s first three-year strategic roadmap, which was voted on in January 2020 when the HDH was created…
Technically, the technological platform is developed according to a logic of “programmable infrastructure”, or “Infrastructure as Code (IaC)” using languages independent of the hosting solution chosen, allowing it to be easily redeployed on another solution of the same level of maturity.
[EM]: What is your response to the legal aspects of the extraterritorial laws to which American actors are subject?
[SC]: The invalidation of the Privacy Shield agreement by the Court of Justice of the European Union on July 16, 2020 (Schrems II ruling) has caused uncertainty about the framework for transfers of personal data between the European Union and the United States and, more generally, the use of U.S. providers to process personal data of European citizens.
The March 23, 2020 order allowing the Health Data Hub to collect and make available data related to the outbreak to support health crisis management was challenged for this reason by a group of stakeholders on September 28, 2020. The Conseil d’Etat issued an order on October 13, 2020, in which it recognized that the technical and contractual measures implemented by the Health Data Hub and Microsoft prevented any transfer of personal health data outside the European Union. The only data for which the transfer is useful is telemetry data, to monitor the proper functioning of the services offered by Microsoft, as well as billing data.
At the legal level, the Health Data Hub and Microsoft have significantly strengthened their contractual framework over time with the implementation of additional legal and technical measures. Several amendments were successively signed between the Health Data Hub and Microsoft to better define the terms of the subcontracting.
In parallel, the Health Data Hub conducted a detailed legal investigation with a law firm regarding the extraterritorial risks applicable to HDH and concluded that the Schrems II Decision should not apply in the HDH context. The Schrems II Decision initially applies to a case of data transfer between an EU company, Facebook Ireland, and a US company, Facebook US, whereas in the HDH case, the health data is hosted in France and cannot be transferred by HDH, in accordance with the prohibition set forth in the contract with Microsoft and the ministerial order of October 9, 2020. Specifically, this analysis shows that the conditions for the application of U.S. surveillance laws are not verified in the context of the processing performed by HDH. This memo has been made public.
[EM]: It is worth remembering that a consultancy firm on behalf of the Dutch government has carried out a legislative audit on these legal aspects and their conclusions are that initiatives like the Trusted Cloud are under extraterritorial laws. What is your feeling as a citizen and not as a director on this sensitive subject of our data?
[SC]: As a citizen, my feeling is that the societal debate that underlies hosting issues is not addressed. Today, we must answer the question “do we have to wait for a perfect solution to move forward on this or that issue?” The answer is not simple, but the question is not really asked in those terms either. It is legitimate to think that we should wait until we have a fully sovereign solution to move forward in the field of health data research, but the opposite position is also defensible, especially when international competition is fierce and we risk facing a similar debate in a few years, not about the cloud but about SaaS applications, especially in the field of health. What will we say to citizens if, in 5 years, our smartphones offer us mostly American or Chinese applications, about which we will have little information on how they were developed?
[EM]: Do you think it would be possible to change cloud provider? It seems to me that this was one of the commitments of the French government in the face of the outcry against the use of Microsoft. The horizon given at the time was 2020, where are we now?
[SC]: The choice of the Microsoft Azure solution to host the Health Data Hub’s technology platform is reversible. The goal of reversibility is also included in the Health Data Hub’s three-year strategic roadmaps for 2019-2022 and 2023-2025. The new roadmap, voted on June 9 by the HDH Board of Directors, also schedules the platform’s migration to a “trusted cloud” operator for 2025. However, this migration depends on the requirements that such an operator will have to meet on the one hand and on the existing offer on the other. We continue to monitor the market.
[EM]: You state that Microsoft was chosen through a call for tender. What is your position with regard to the ongoing action of Anticor?
[SC]: According to articles L. 2113-2, L. 2113-3, and L. 2113-4 of the French Code de la commande publique, DREES, like the Health Data Hub, can legitimately and legally consume services through central purchasing agencies, such as UGAP. By purchasing services through central purchasing bodies, an administration does not itself carry out a competitive bidding process but deals with the successful contractor following a procurement procedure organized by the central body and during which the competition is carried out. These approaches, which give public sector organizations greater agility, have also been encouraged: the French government has set up a contractual purchasing system, via the UGAP public purchasing center, which brings together “off-the-shelf” commercial offers from specialized cloud providers, in accordance with the so-called “circle 3” level in the 2018 circular…
[EM]: Similarly, the call for tenders concerning SOC outsourcing for the Health Data Hub is causing a new controversy. Is it justified and are you directing your call for tenders towards the SIEM use of Splunk, an American player?
[SC]: The Health Data Hub’s purpose is to gather and make available SNDS data which, since the OTSS law, corresponds to all data associated with a social security reimbursement. The security requirements for the management of this data are therefore very high and are set out in an order dated March 22, 2017, known as the “SNDS security guidelines”.
These guidelines include the requirement to trace all activities performed on the platform, whether they originate from users or operators. In this context, the Health Data Hub wishes to equip itself with a Security Operation Center (SOC) that will enable it to industrialize the collection of events from its components and to detect, from these events, abnormal, prohibited or risky behavior.
To manage events, the SIEM (“Security Information & Event Management”) is the SOC tool that collects and groups the log data generated across the whole infrastructure of the platform and the project spaces. All devices related to networks, security, platform access, operator access, and user access generate traces. These are identified, categorized, and analyzed to highlight attack paths and to generate alerts or trigger a response in order to contain incidents and events.
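As a rough illustration of the collect, categorize and alert cycle described here, the sketch below groups events by actor and raises an alert when a simple rule fires. It is not the HDH's or Splunk's implementation: the `Event` fields, the source names, the `detect_repeated_failures` function and the rule itself (three failed logins within five minutes) are all hypothetical, chosen only to make the idea concrete.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One trace emitted by a platform component (hypothetical schema)."""
    timestamp: datetime
    source: str   # hypothetical component name, e.g. "platform-access"
    actor: str    # user or operator identifier
    action: str   # hypothetical category, e.g. "login_failed"


def detect_repeated_failures(events, threshold=3, window=timedelta(minutes=5)):
    """Return one alert per actor with `threshold` failed logins inside `window`."""
    # Collect and categorize: keep failed logins, grouped by actor, in time order.
    failures = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.action == "login_failed":
            failures[event.actor].append(event.timestamp)

    # Analyze: look for `threshold` consecutive failures that fit in the window.
    alerts = []
    for actor, times in failures.items():
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                alerts.append({
                    "actor": actor,
                    "rule": "repeated_login_failure",
                    "first_seen": times[i],
                })
                break  # one alert per actor is enough for this sketch
    return alerts


# Usage with made-up events: operator-a fails three times in two minutes,
# user-b fails twice, ten minutes apart.
t0 = datetime(2022, 1, 1, 12, 0)
sample_events = [
    Event(t0, "operator-access", "operator-a", "login_failed"),
    Event(t0 + timedelta(minutes=1), "operator-access", "operator-a", "login_failed"),
    Event(t0 + timedelta(minutes=2), "operator-access", "operator-a", "login_failed"),
    Event(t0, "platform-access", "user-b", "login_failed"),
    Event(t0 + timedelta(minutes=10), "platform-access", "user-b", "login_failed"),
    Event(t0 + timedelta(minutes=11), "platform-access", "user-b", "login_ok"),
]
alerts = detect_repeated_failures(sample_events)
```

A production SIEM would express such rules in its own query language and correlate many more sources; the point here is only the shape of the pipeline, from raw traces to an alert.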
To build this future SOC, the strategy chosen by the Health Data Hub is to capitalize on the existing tool, Splunk, which was integrated into the technological platform in order to reduce dependence on Microsoft, and to develop it from an organizational, functional and technical point of view. The aim is to detect deviant behavior and to be as proactive as possible, so as to limit the occurrence of incidents or to limit their impact by dealing with them as soon as possible.
The Health Data Hub therefore launched a public tender on June 10, 2010, the purpose of which is to provide support in defining detection rules, incident management and response, forensic investigation (the search for digital evidence), security monitoring, and employee awareness. The consultation issued by the HDH takes place in two phases. The first, the application phase, is governed by the rules of the public call for applications, which are freely accessible. The second phase concerns the submission of offers. The contract is open to all applicants in the first phase, but their knowledge of Splunk integration is considered in the application analysis criteria.
Splunk was named one of the best SIEM solutions on the market by Gartner in 2021. The two most commonly used solutions are Splunk and Elasticsearch, neither of which is French, and no French solution is currently considered a reference in this field.
It should be noted that the Health Data Hub is sensitive to the use of French solutions whenever it can, but that in the field of cybersecurity, the best solutions on the market often remain foreign, even if this is not systematic.
[EM]: What were the reasons for the withdrawal of the request for authorization to the CNIL by the Plateforme des Données de Santé for the hosting of the SNDS (Système National des Données de Santé)?
[SC]: In agreement with the Ministry of Solidarity and Health, the HDH has temporarily withdrawn its request for authorization to host the main SNDS database and the catalog databases in the technological platform while awaiting the finalization of the CNIL’s examination of the order defining the composition of these databases.
In the meantime, the Health Data Hub is making data available to authorized projects one by one. This adds the timeframes for contracting and making data available to the regulatory timeframes for obtaining project authorizations, since for each project an extraction must be produced at the data producer level and then transmitted to the HDH. The HDH has a team that will ultimately make it possible to pool these efforts.
[EM]: What are the next major steps in your project?
[SC]: To build its new multi-year roadmap for 2023-2025, the Health Data Hub conducted a consultation involving 26 ecosystem stakeholders and 4 working groups, for a total of 46 organizations met. The roadmap was then presented to the Board of Directors and the General Assembly and was unanimously approved on June 9. The HDH’s new multi-year roadmap is composed of four main areas: continuing actions to reduce delays in accessing health data and increasing the number of impactful projects; making data available in the main database, enriching it and facilitating its reuse; strengthening the HDH’s connections to ecosystem players; listening to civil society and co-constructing a health data culture…
Thus, by 2025, the Health Data Hub aims to: reach a steady flow of 200 projects using the platform per year (starting in 2024); reduce the average time to access data to 7 months for projects requiring matching between several data sources; establish more than 40 new partnerships with local or shared infrastructures at the national and European levels; and develop its exchanges with civil society.
[EM]: What additional elements would you like to share with the project’s detractors?
[SC]: Health data will play a major role in tomorrow’s medicine, which is why the European Commission has made it one of its health priorities. Through the opportunities it represents for medical research, for the improvement of clinical trials or for a better understanding of care pathways, health data will soon become indispensable. All of these debates must be addressed with the beneficiaries of the research, and their expectations in terms of health, in mind.
[EM]: We are coming to the end of this interview. Thank you for your availability and the quality of your answers. What would be your concluding words?
[SC]: Let the HDH move forward – despite the occasional headwind! The Health Data Hub is a great partnership adventure, supported by a wide range of players – from manufacturers to patient associations and the State – some of whom have never worked together before. Today, these partners are delivering the first concrete results and building together new milestones that will eventually mark our use of health data.