Chemistry labs generate a lot of data. However, some of it still exists only on paper and is difficult to access in its entirety. Three EPFL scientists present a modular open science platform for managing the large amounts of data produced in chemical research. Their perspective, entitled “Making the collective knowledge of chemistry open and machine actionable”, was published in Nature Chemistry.
Data management in modern chemistry is difficult. Synthesizing a new compound, for example, involves a great deal of trial and error before the right reaction conditions are found, and this generates large amounts of raw data. This data is very important because, like humans, machine learning algorithms also learn from failed or partially successful experiments.
Currently, only the most successful experiments are published. Artificial intelligence, and machine learning in particular, can make use of data from failed experiments, as long as it is stored in a machine-readable format that anyone can use.
Professor Berend Smit, who heads the Molecular Simulation Laboratory at EPFL Valais Wallis, explains:
“For a long time, we had to compress data because of the page limits of printed journal articles. Today, many journals no longer have print editions. However, chemists still face reproducibility issues because journal articles omit important details. Researchers waste time and resources repeating authors' failed experiments. They find it difficult to build on published results because raw data is rarely published.”
EPFL's Berend Smit, Luc Patiny, and Kevin Jablonka have published a perspective that presents an open platform for the entire chemistry workflow, from project start to publication.
Machine-readable FAIR data
Their main thesis is that if we want to advance chemistry with data-intensive research and also solve reproducibility problems, we need to change the way we collect and report experimental data.
Three steps are essential: collecting, processing, and publishing data, all at minimal cost to researchers. The guiding principle is that data should be findable, accessible, interoperable, and reusable (FAIR).
Berend Smit says:
“At the time of data collection, the data will be automatically converted to a standard FAIR format, which will allow all failed or partially successful experiments to be published automatically, as well as the most successful experiment.”
The authors also propose that the data should be usable by machines, not only by humans.
Kevin Jablonka says:
“We are seeing more and more data science studies in chemistry. In fact, the latest machine learning results are beginning to address some of the problems that chemists believed could not be solved. For example, our group has made significant progress in predicting optimal reaction conditions using machine learning models. These models would be much more valuable if they could also learn from failed reaction conditions, but they remain biased because only successful conditions are published.”
To establish a FAIR data management plan, the researchers propose five measures:
- The chemistry community should adopt its own standards and solutions;
- Journals should mandate the deposition of reusable raw data where community standards exist;
- We must accept the publication of “failed” experiments;
- Electronic lab notebooks that do not allow all data to be exported in an open, machine-readable form should be avoided;
- Data-intensive research should be part of our curricula.
Luc Patiny says:
“We believe that there is no need to invent new file formats or technologies. In principle, we already have all the technology we need. We have to adopt it and make it interoperable.”
The authors point out that storing data in an electronic lab notebook, which is the current trend, does not by itself mean that humans and machines can reuse it. Structuring and publishing the data in a standardized format is the better alternative, provided it comes with sufficient context.
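As a rough illustration of what structured, machine-readable records with context could look like, here is a minimal Python sketch. It is not the authors' platform or any community standard: the `ReactionRecord` layout, its field names, and the three example entries are hypothetical placeholders invented for this illustration. The point is only that failed and partially successful runs are exported in the same open format (here JSON) as the successful one.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class ReactionRecord:
    """Hypothetical record layout; field names are illustrative only."""
    experiment_id: str
    reactants: List[str]
    solvent: str
    temperature_c: float
    outcome: str                      # "success", "partial", or "failed"
    yield_percent: Optional[float] = None
    notes: str = ""                   # free-text context for later reuse


def to_json(records: List[ReactionRecord]) -> str:
    """Serialize all records, including failed ones, to JSON text."""
    return json.dumps([asdict(r) for r in records], indent=2)


def load_outcomes(text: str) -> dict:
    """Read the exported JSON back and map experiment id to outcome."""
    return {row["experiment_id"]: row["outcome"] for row in json.loads(text)}


# Placeholder entries, not real experimental data.
records = [
    ReactionRecord("exp-001", ["A", "B"], "ethanol", 25.0, "failed",
                   None, "no product observed"),
    ReactionRecord("exp-002", ["A", "B"], "ethanol", 60.0, "partial", 12.0),
    ReactionRecord("exp-003", ["A", "B"], "DMF", 80.0, "success", 85.0),
]

exported = to_json(records)
outcomes = load_outcomes(exported)
```

Because every run carries its conditions and outcome in the same structure, a downstream script or model can read the failed attempts just as easily as the successful one.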
Berend Smit adds:
“Our perspective provides insight into what we believe is the key to bridging the gap between data and machine learning for the fundamental problems of chemistry. We also offer an open science solution where EPFL can lead by example.”
Kevin Maik Jablonka, Luc Patiny, Berend Smit. Making the collective knowledge of chemistry open and machine actionable. Nature Chemistry, April 4, 2022. DOI: 10.1038/s41557-022-00910-7