

PLOT MELTING POINT MEASURE VS LITERATURE VALUES MANUAL
We showed that models developed using data collected from patents had similar or better prediction accuracy compared to models built on the highly curated data used in previous publications. Data for chemicals that decomposed rather than melting were separated from data for compounds that underwent a normal melting transition, and models for both pyrolysis and melting points (MPs) were developed. The accuracy of the consensus MP models for molecules from the drug-like region of chemical space was similar to their estimated experimental accuracy, 32 °C. Last but not least, important structural features related to the pyrolysis of chemicals were identified, and a model to predict whether a compound will decompose instead of melting was developed. We have shown that automated tools for the analysis of chemical information have reached a mature stage allowing for the extraction and collection of high quality data to enable the development of structure–activity relationship models. The developed models and data are publicly available at.
The prediction of physicochemical properties is important in the pharmaceutical industry for structure design and for the purpose of optimizing ADME properties. Physicochemical parameters such as logP, pKa, logD, aqueous solubility and many others are relevant not only to drugs but also to environmental chemicals such as surfactants and wetting agents. The modeling of these properties is best facilitated by obtaining large, structurally diverse, high-quality datasets. The aggregation and curation of such datasets can be very exacting in terms of extraction of the data from the literature. Redrawing chemical compounds can be difficult, and in many cases they are not available as structure depictions but only in the form of chemical names. Validating the measured property in any meaningful way is difficult, but manual inspection can highlight obvious errors in the parameters as captured (vide infra).
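
The manual inspection mentioned above can be narrowed down with a simple automated pre-screen. The following is a minimal sketch, not the procedure used by the authors: the function name `flag_suspect_records`, the record layout and the plausible temperature range are assumptions made for illustration. Only the 32 °C figure comes from the text, where it is the estimated experimental accuracy; it is reused here as a hypothetical threshold for the spread between repeated measurements of the same compound.

```python
from collections import defaultdict


def flag_suspect_records(records, max_spread=32.0, plausible=(-200.0, 500.0)):
    """Return {compound_id: reason} for compounds that deserve a manual look.

    records    : iterable of (compound_id, melting_point_in_celsius) pairs
    max_spread : largest tolerated gap between repeated measurements
                 (32.0 is borrowed from the text; its use as a cutoff is an assumption)
    plausible  : assumed sanity range in degrees Celsius (hypothetical values)
    """
    # Collect all reported values per compound.
    by_compound = defaultdict(list)
    for compound_id, mp in records:
        by_compound[compound_id].append(mp)

    low, high = plausible
    flagged = {}
    for compound_id, values in by_compound.items():
        if any(v < low or v > high for v in values):
            # Likely a unit or transcription error (e.g. Kelvin or a misplaced digit).
            flagged[compound_id] = "out_of_range"
        elif max(values) - min(values) > max_spread:
            # Repeated measurements disagree by more than the tolerated spread.
            flagged[compound_id] = "discordant_replicates"
    return flagged


if __name__ == "__main__":
    demo = [("a", 120.0), ("a", 125.0), ("b", 80.0), ("b", 180.0), ("c", 1200.0)]
    for compound_id, reason in sorted(flag_suspect_records(demo).items()):
        print(compound_id, reason)
```

A screen of this kind only shortlists records; whether a flagged value is actually wrong still has to be decided by checking the original source.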
Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure–activity relationship studies. Success in this area of research critically depends on the availability of high quality MP data as well as accurate chemical structure representations in order to develop models. Currently available datasets for MP predictions have been limited to around 50k molecules, while far more data are routinely generated following the synthesis of novel materials. Significant amounts of MP data are freely available within the patent literature and, if they were available in the appropriate form, could potentially be used to develop predictive models. We have developed a pipeline for the automated extraction and annotation of chemical data from published patents. Almost 300,000 data points have been collected and used to develop models to predict melting and pyrolysis (decomposition) points using tools available on the OCHEM modeling platform ( ). A number of technical challenges were simultaneously solved to develop models based on these data. These included the handling of sparse data matrices with >200,000,000,000 entries and parallel calculations using 32 × 6 cores per task using 13 descriptor sets totaling more than 700,000 descriptors.
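
The size of the sparse data matrices mentioned above follows from the quoted figures: roughly 300,000 data points by 700,000 descriptors gives about 2.1 × 10^11 cells, in line with the ">200,000,000,000 entries" stated in the text. Such a matrix is practical only if zero cells are not stored. The sketch below illustrates the idea with plain Python dictionaries; it is an illustration under assumptions, not the storage scheme used on OCHEM, and the helper names `to_sparse_rows`, `sparse_dot` and `stored_fraction` are hypothetical.

```python
def to_sparse_rows(dense_rows):
    """Convert dense rows to a list of {column_index: value} dicts, dropping zeros."""
    return [{j: v for j, v in enumerate(row) if v != 0} for row in dense_rows]


def sparse_dot(sparse_row, weights):
    """Dot product of one sparse row with a dense weight vector.

    Only stored (non-zero) cells are visited, so the cost scales with the
    number of non-zero descriptors rather than with the total descriptor count.
    """
    return sum(value * weights[j] for j, value in sparse_row.items())


def stored_fraction(sparse_rows, n_columns):
    """Fraction of matrix cells that are actually stored."""
    stored = sum(len(row) for row in sparse_rows)
    return stored / (len(sparse_rows) * n_columns)


if __name__ == "__main__":
    # Toy descriptor matrix: 3 compounds, 4 descriptors, mostly zeros.
    dense = [
        [0, 0, 3, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    sparse = to_sparse_rows(dense)
    weights = [1.0, 2.0, 3.0, 4.0]
    print([sparse_dot(row, weights) for row in sparse])
    print(stored_fraction(sparse, n_columns=4))
```

In practice a compressed format from a numerical library would replace the dictionaries, but the principle is the same: memory and compute follow the number of non-zero descriptor values.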
