MechBERT: Language Models for Extracting Chemical and Property Relationships about Mechanical Stress and Strain Pankaj Kumar, Saurabh Kabra, and Jacqueline M. Cole* Cite This: J. Chem. Inf. Model. 2025, 65, 1873−1888 Read Online ACCESS Metrics & More Article Recommendations *sı Supporting Information ABSTRACT: Language models are transforming materials-aware natural- language processing by enabling the extraction of dynamic, context-rich information from unstructured text, thus, moving beyond the limitations of traditional information-extraction methods. Moreover, small language models are on the rise because some of them can perform better than large language models (LLMs) when given domain-specific question- answer tasks, especially about an application area that relies on a highly specialized vernacular, such as materials science. We therefore present a new class of MechBERT language models for understanding mechanical stress and strain in materials. These employ Bidirectional Encoder Representations for transformer (BERT) architectures. We showcase four MechBERT models, all of which were pretrained on a corpus of documents that are textually rich in chemicals and their stress−strain properties and were fine-tuned on question-answering tasks. We evaluated the level of performance of our models on domain-specific as well as general English-language question-answer tasks and also explored the influence of the size and type of BERT architectures on model performance. We find that our MechBERT models outperform BERT-based models of the same size and maintain relevancy better than much larger BERT-based models when tasked with domain-specific question-answering tasks within the stress−strain engineering sector. These small language models also enable much faster processing and require a much smaller fraction of data to pretrain them, affording them greater operational efficiency and energy sustainability than LLMs. ■ INTRODUCTION The automatic generation of material databases from the scientific literature has largely followed a step-by-step approach to data extraction. Thus far, information-extraction pipelines have been used to sequentially process an input through many different natural-language-processing (NLP) stages, including tokenization, chemical-named-entity recognition, text parsing, and relation extraction.1 With domain-specific adaptations, this approach has demonstrated capability in extracting reliable experimental data at large scales, which are efficiently organized into databases suitable for data-driven research.2 To this end, the ‘chemistry-aware’ NLP-based text-mining tool, ChemDataExtractor,3−6 has shown particular utility in autogenerating large repositories of experimental data on chemicals and their properties that suit various domains of materials science; see, for example, autogenerated databases of experimental measurements for thermoelectrics,7 refractive indices and dielectric constants,8 molecules exhibiting thermally activated delayed fluorescence,9 semiconductors,10 magnetism,11 batteries,12 photovoltaics,13 and photocatalysis for water-splitting applications.6 These methods have also been employed in our recent work on stress-engineering materials to curate a repository of related properties,14 where grain-size and yield-strength values are associated in the postprocessing stage of database autogenera- tion and used to statistically validate the Hall−Petch relation15 at a much larger scale compared to other work which is all manual. While these methods directly enable data-driven research, a sequential NLP-based approach to information extraction of experimental data can experience limitations and have significant shortcomings. Methods used for text parsing and relation extraction need to account for all the nuances in the text; a rule-based approach is time-consuming to construct, requires expert knowledge, and has limited adaptability when applied to varied sentences or domains; and machine-learning- based methods require a large amount of manually annotated experimental data for the specific task at hand, which are rarely available. Chemical-named-entity recognition and tokenization processes experience similar challenges, and the sequential Received: May 16, 2024 Revised: January 19, 2025 Accepted: January 20, 2025 Published: January 31, 2025 Articlepubs.acs.org/jcim © 2025 The Authors. Published by American Chemical Society 1873 https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 This article is licensed under CC-BY 4.0 https://pubs.acs.org/action/doSearch?field1=Contrib&text1="Pankaj+Kumar"&field2=AllField&text2=&publication=&accessType=allContent&Earliest=&ref=pdf https://pubs.acs.org/action/doSearch?field1=Contrib&text1="Saurabh+Kabra"&field2=AllField&text2=&publication=&accessType=allContent&Earliest=&ref=pdf https://pubs.acs.org/action/doSearch?field1=Contrib&text1="Jacqueline+M.+Cole"&field2=AllField&text2=&publication=&accessType=allContent&Earliest=&ref=pdf https://pubs.acs.org/action/showCitFormats?doi=10.1021/acs.jcim.4c00857&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?goto=articleMetrics&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?goto=recommendations&?ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?goto=supporting-info&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=tgr1&ref=pdf https://pubs.acs.org/toc/jcisd8/65/4?ref=pdf https://pubs.acs.org/toc/jcisd8/65/4?ref=pdf https://pubs.acs.org/toc/jcisd8/65/4?ref=pdf https://pubs.acs.org/toc/jcisd8/65/4?ref=pdf pubs.acs.org/jcim?ref=pdf https://pubs.acs.org?ref=pdf https://pubs.acs.org?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://pubs.acs.org/jcim?ref=pdf https://pubs.acs.org/jcim?ref=pdf https://acsopenscience.org/researchers/open-access/ https://creativecommons.org/licenses/by/4.0/ https://creativecommons.org/licenses/by/4.0/ https://creativecommons.org/licenses/by/4.0/ nature of conventional NLP-based information extraction means that errors, at any stage of the pipeline, propagate and reduce the overall quality of an eventual knowledge base. Another practical limitation of conventional NLP-based methods is that they focus on the extraction of independent- property information. Yet, if one is to properly extract material relations accurately with complete information, complex knowledge representations of the target properties and their connection to one another in the text need to be developed which is an exceedingly difficult task as is often reflected in the final precision and recall metrics of autogenerated databases. For example, Isazawa and Cole implemented a nested knowledge representation for the extraction of photocatalysis data; in which, the F-score for the primary property remained high, but nested relations containing experimental conditions exhibited much lower F-scores, dipping to around 15% in some cases.6 When designing a knowledge representation for stress− strain properties, not only do factors related to manufacturing, processing, experimental, and measuring conditions need to be considered; but also, the semantic interrelations between mechanical properties, such as yield strength, tensile strength, and Young’s modulus, which are equally complex and diverse. A lack of such knowledge representations can lead to false positives in the extracted data, especially when multiple properties are discussed within the same text; accurately associating extracted values with the correct property being referenced becomes increasingly challenging. Therefore, the task of conceptualizing and programming a complete knowl- edge representation for properties found on the stress−strain curve is crucial, yet it is nontrivial, and its inherent difficulty will be evident in the final precision and recall metrics of a resulting stress-engineering database. Fortunately, the field of NLP, and in particular the statistical modeling of language, has been rapidly progressing. Recent research has focused on large language models that are mostly built on the Transformer architecture,16 such as Bidirectional Encoder Representations from Transformers (BERT),17 the Generative Pretrained Transformer (GPT) series of language models,18,19 the Robustly optimized BERT pretraining approach (RoBERTa),20 and the Text-to-Text Transfer Transformer (T5).21 These have all demonstrated state-of- the-art performance on a variety of natural-language tasks, from question-answering to text generation. BERT-based models and others tackle natural-language tasks in two stages. First, the model is pretrained on a large set of unlabeled text to learn the general-purpose language representations, allowing the model to understand the context and meaning of language, even when supplied with unseen sentences. The pretrained model is then fine-tuned for a specific task using labeled data, allowing the embedded knowledge to be repurposed for different tasks depending on the use case. In the context of information extraction from the scientific literature, a pretrained language model can be fine-tuned for question-answering tasks where the embedded knowledge can be utilized for extractive purposes. This approach alleviates many of the aforementioned problems that are encountered Figure 1. Examples of a few different use cases of our MechBERT models. In this study, we fine-tune MechBERT for extractive question-answering, which helps facilitate information extraction. Other researchers may find successful application of our MechBERT models in a variety of NLP tasks, including named entity recognition, similarity measurement, and classification tasks. Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1874 https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig1&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig1&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig1&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig1&ref=pdf pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as with the conventional NLP-based information-extraction pipeline. In particular, stages such as tokenization, named entity recognition, and text parsing are learnt simultaneously when pretraining a language model, thereby precluding the need for individual methods or expertly crafted rules to be developed for each; furthermore, the semantic connection between properties is automatically learned from the unlabeled text during the pretraining process, removing the need to manually design complex knowledge representations and enabling more accurate extraction of material properties from the text. Thus, language models have the potential to address some of the shortcomings of sequential NLP extraction pipelines. However, general-purpose language models, such as BERT, may struggle to capture domain-specific characteristics. As such, specialized BERT models, tailored to the unique linguistic patterns of a particular field, have been developed and demonstrate superior performance on downstream, domain-specific, tasks. For instance, BioBERT,22 FinBERT,23 SciBERT,24 and ChemBERTa25 (or even CamemBERT26 for the French language) have all been pretrained on text from their respective fields and achieved high performance in later natural-language tasks. Notably, BatteryBERT27 and Optical- BERT28 have surpassed other BERT-based models and conventional information-extraction pipelines. In contrast to other domains, no such model has been reported that is specialized for stress−strain information. Therefore, it is natural to question whether a similar type of BERT model could be useful for information extraction of stress−strain- related properties. To this end, four BERT models are developed in this study using different pretraining corpora, with a focus on text that contains stress−strain information, which is sourced from the scientific literature. Due to the nature of the corpora, including a plethora of mechanical properties, the four models are aptly named as a class of MechBERT models. Once parameters have been optimized for fine-tuning, it will be demonstrated that the domain-specific models perform better than other BERT-based models for extractive tasks within the field of stress engineering while also maintaining performance on general English- language question-answering data sets. For evaluation, a domain-specific question-answering data set was curated with the help of an annotation tool that has been purpose built in this work. The results herein will cement the importance of corpus specificity in pretraining language models for improved performance on downstream tasks within that domain; with MechBERT models outperforming language models that are not only pretrained on many more data but which are also significantly larger in terms of their number of model parameters. As such, our MechBERT models have the potential to be integrated into information extraction systems to overcome the limitations of sequential NLP extraction methods. While this paper focuses on the development of MechBERT models and assessment of their effectiveness for stress−strain engineering information-extraction tasks, it is worth high- lighting the broader applicability and importance that these material-domain-specific MechBERT models have for the mechanical engineering community. Our MechBERT models provide a foundation that can be fine-tuned to enhance performance across a wide range of downstream tasks tailored to the specific needs of this field. Figure 1 illustrates example use cases in which a stress−strain engineering language model, further optimized through fine-tuning, could be applied. Therefore, our overarching development of domain-specific MechBERT models offers a vehicle for the materials- engineering research community to interact with carefully crafted language models about the mechanical properties of chemicals and tailor them for their bespoke needs. ■ METHODS To explore the benefits of domain-specific language models for information extraction of properties found on the stress−strain curve, four models were developed in this study: Pure- MechBERT-cased, PureMechBERT-uncased, MechBERT- cased, and MechBERT-uncased. These models were fine- tuned for question-answering tasks, as this enables the embedded knowledge to be used for extractive tasks. This section will detail the pretraining and fine-tuning processes of these language models alongside training-data preparation and a tool that we have built for creating question-answering data sets. Model Overview. The model architecture follows the original model that was discussed by Devlin et al.17 to ensure consistency and allow for fair comparison. Pretraining model parameters are chosen such that the total number of parameters matches that of the original BERT models. The HuggingFace implementation of BERT, found in their Transformer package,29 was used, and the relevant parameters are listed in Table 1. Each BERT model has a total of 110 million parameters, and the four BERT models were pretrained on different corpora; PureMechBERT models were pretrained on only the domain-specific corpus, while MechBERT models were further pretrained starting from the weights of the original BERTBASE model. Alternative model architectures were also considered for our domain-specific language models, including certain increasingly popular Generative Pretrained Transformer (GPT) language models.18,19 However, many downstream domain-specific tasks, such as extractive question answering or chemical entity recognition, require precise, factual answers. In such cases, other transformer-based language model architectures can be more suitable, as GPT models tend to be quite susceptible to producing hallucinations, or they can necessitate additional postprocessing to ensure accuracy. Therefore, we utilize the BERT architecture in this study. Nonetheless, due to the widespread popularity of tools such as ChatGPT, further discussion and comparison between ChatGPT and our MechBERT models are provided in the Supporting Informa- tion of this paper. Pretraining Corpus. In this study, the pretraining corpus is composed of scientific articles that have been published by Elsevier, Springer Nature, and Wiley. For the former two publishers, the previously cultivated corpus by Kumar et al.14 is Table 1. Summary of the Main Configuration Parameters That Define the Architecture of the BERT Model parameter value vocabulary size 30,522 hidden size 768 hidden layers 12 attention heads 12 dropout probability 0.1 transformer version 4.25.1 Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1875 https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.4c00857/suppl_file/ci4c00857_si_001.pdf https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.4c00857/suppl_file/ci4c00857_si_001.pdf pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as further extended to contain more general stress−strain information through the use of additional search queries for relevant mechanical properties, following the method discussed therein. The text from each Elsevier and Springer Nature article was extracted using the reader package of ChemDataEx- tractor3 and then combined into a single text file. Unique to this study, access to the Wiley API was granted and was integrated into the article retrieval pipeline. However, this API is limited and does not provide an interface to search the Wiley article repositories; as such, the DOI of relevant articles had to be searched for separately. While other APIs, such as CrossRefAPI,30 could be used for the search process, it was found that articles not relevant to the search query would often be returned via such an approach, which would negatively skew pretraining toward irrelevant text. Instead, the online Wiley search engine was used directly to gather a list of relevant DOIs; although, the number of relevant articles identified comprised less than 1% of the overall pretraining corpus. Additionally, the Wiley API provides full-text articles only as Portable Document Format (PDF) files, which often causes difficulties in retrieving the contained text. Nonetheless, the plain text of these PDF files was extracted using PDFDataEx- tractor31 and consolidated into a single text file, as the inclusion of any additional text data is useful for pretraining each language model. Overall, the domain-specific text corpus consists of approximately 1.2 billion tokens sourced from over 400,000 articles, and the relative contribution of papers from each publisher is shown in Figure 2. In contrast, the original BERTBASE model was pretrained on a corpus of 3.3 billion tokens that had been sourced from English Wikipedia and the Book Corpus.32 While the domain-specific corpus used in this study is smaller than this corpus of generic English text, it is still within the range of optimal parameter/training token allocation.33 A word-cloud visualization was generated to illustrate the vocabulary distribution of our domain-specific corpus. As shown in Figure 3, high-frequency terms are emphasized, including “stress”, “fracture”, “tensile”, “strength”, and even the commonly used units for tensile properties, “MPa”. These further highlight the uniqueness of the collected text compared to other general purpose corpora, enabling the language model to learn the distinct linguistic patterns that are prevalent in the scientific literature, especially those concerning stress−strain property information. Tokenization. To prepare the corpus for pretraining the language models, the text was tokenized using WordPiece embeddings,34 following the approach used by the original BERT model.17 These are subword representations that allow for the tokenization of a corpus using a smaller vocabulary; thus, they are more computationally efficient. For the MechBERT models, the original cased and uncased BERT WordPiece tokenizers were used, whereas new tokenizers were trained for the PureMechBERT models. The training made use of the HuggingFace libraries29 and the domain-specific corpus to generate a vocabulary of the same size for both cased and uncased variants. The entire corpus was tokenized using the respective WordPiece tokenizers of each model to convert the text into appropriate input embeddings for the BERT architecture.17 As the vocabularies were obtained directly from the corresponding corpus, they are valuable tools for the comparison of different training sets used in distinct language models. The vocabulary overlap between three distinct language models is shown in Figure 4. This highlights the level of uniqueness between PureMechBERT, BERT,17 and MatSciBERT,35 the latter being a BERT model that was trained on a corpus that encompasses all of material science. The MatSciBERT model was specifically chosen for compar- Figure 2. Percentage contribution of each publisher to the domain- specific text corpus curated in this study. Figure 3. Word cloud of the most frequent words contained within our domain-specific corpus. Figure 4. Overlap of the different vocabularies used for BERT,17 MatSciBERT,35 and PureMechBERT models. Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1876 https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig2&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig2&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig2&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig2&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig3&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig3&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig3&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig3&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig4&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig4&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig4&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig4&ref=pdf pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as ison, as its material-science-focused corpus includes alloys and ceramics, which are often the target of stress−strain studies. Consequently, the PureMechBERT vocabulary has a greater overlap with MatSciBERT than with BERT, with 53 and 39% overlapping word pieces, respectively. Qualitatively, the vocabularies of PureMechBERT and MatSciBERT models are more similar than they are to those of BERT owing to the scientific nature of their corpora. Still, their significant differences from the foundational language model suggest that specialized terms that are unique to the scientific corpora are prevalent. Alongside the word cloud in Figure 3, Figure 4 demonstrates the focus of the corpus toward stress−strain information, which in turn will be reflected in the pretrained language model. Pretraining. Following text processing and tokenization, the next stage involves pretraining the four language models: cased and uncased variants of both PureMechBERT and MechBERT. The PureMechBERT models were initialized from scratch and were trained solely on the stress−strain corpus that was curated in this work. In contrast, MechBERT models were initialized from the pretrained weights of the original cased and uncased BERTBASE models. This was followed by further training on the domain-specific corpus, meaning that knowledge from both general and scientific texts is embedded into those models. During the pretraining stage, the masked language modeling objective was used for all models. This involves masking 15% of tokens within the pretraining corpus at random and requires the model to correctly predict the masked words. The HuggingFace Transformers library29 provided the implemen- tation for this task. The original BERT paper also employed next-sentence prediction for pretraining;17 however, Liu et al. demonstrated that this does not correspond to a significant difference in the overall performance.20 As such, this task was removed from the pretraining objective in this study to reduce the overall computational cost. Pretraining was performed using NVIDIA A100 GPUs within the Polaris cluster at the Argonne Leadership Computing Facility (ALCF), Illinois, USA. A total of ten computing nodes were used, each containing four GPUs. Pretraining took between 15 and 20 h to complete for each variant. DeepSpeed36 was employed to distribute training across multiple nodes, resulting in more efficient processing and a reduction in the time taken for pretraining. Moreover, a critical factor in pretraining is the cumulative number of training sequences that are processed. The original BERT model and other domain-specific adaptations, such as BatteryBERT27 and OpticalBERT,28 pretrained their models for 1 million steps with a batch size of 512, resulting in the processing of 2.56 × 108 sequences in total.17 In this work, pretraining was optimized while maintaining the overall exposure to training sequences. To this end, a total batch size of 2048 was used, and the training steps were reduced to 125,000 and 187,500 for MechBERT and PureMechBERT models, respectively. This approach maintained model performance while reducing the total pretraining time. The increased exposure to training sequences for the Pure- MechBERT models accommodated their initialization from scratch. This also aligned with BatteryBERT models that were trained from scratch, which were trained for a total of 1.5 million steps with a batch size of 256.27 Fine-Tuning for Question Answering. During the fine- tuning stage of language-model constructions, the pretrained model architecture was kept largely the same with the learned weights being “frozen”. Only the parameters associated with task-specific input and output layers were adjusted using labeled data. For information extraction, the models were fine- tuned for the question-answering tasks. This involved adding a single output layer, often referred to as the “answer head”, as suggested in the original BERT paper.17 This layer receives questions and context pairs that have been encoded by the pretrained BERT model. The answer head predicts the location of an answer span within the context by identifying the token positions i and j (where i ≤ j) that exhibit the highest dot product scores with learned start and end vectors, i.e., maximizing S·Ti + E·Tj for j ≥ i, where Ti is the final hidden vector for the ith input token, and S and E are the start and end vectors, respectively. By comparing these predicted answer locations with the “ground-truth” labels provided in an annotated training data set, solely the parameters of the new answer head need to be optimized, meaning that a minimal number of parameters need to be learnt from scratch. This approach allows the model to use the knowledge learnt during pretraining without additional extensive training. Indeed, fine- tuning is much less computationally expensive than pretraining and can be completed within a few GPU node hours. All models were fine-tuned on the Stanford Question and Answering Data set (SQuAD) v2.37 This data set is used as a benchmark for question-answering systems and contains 150,000 question-context pairs that have been annotated with crowd-sourced answers. Of these, 50,000 questions are unanswerable and have been labeled as such; this is in contrast to its predecessor, SQuAD v1,38 which only includes answerable question-context pairs and has been employed in related studies, e.g., by Huang et al. in BatteryBERT27 and Zhao et al. in OpticalBERT.28 However, in real-world applications, information-extraction systems will frequently encounter passages of text that miss the target information. A suitable model must not only correctly find an answer span within the context but also identify instances where no answer exists. Therefore, our models were primarily fine-tuned on SQuAD version 2.0 for information extraction purposes. To allow for comparison with other studies that used SQuAD v1, an additional set of fine-tuned models on this data set was produced. Fine-tuning was carried out using the training scripts provided by HuggingFace.29 Determining the best hyperparameters for fine-tuning a model can be a time-consuming and inefficient process, particularly when using a random or grid-search manual approach, which can easily overlook the best set of parameters. To this end, the computationally inexpensive nature of fine- tuning was leveraged to find the optimal set of hyper- parameters by using Bayesian optimization to guide the search. The search space of parameters used in this study is listed in Table 2, with the goal of maximizing the exact-match evaluation metric which reflects the percentage of correctly Table 2. Search Space for Finding the Optimal Hyperparameters for Fine-Tuning of BERT-Based Models hyperparameter min max learning rate 1 × 10−8 1 × 10−4 number of epochs 1 20 batch size 4 256 Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1877 pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as answered questions. This was facilitated by the Weights & Biases (W&B) framework.39 An initial set of hyperparameter configurations was chosen at random to gather sample data to construct a surrogate model. In the W&B framework, this utilizes a Gaussian process regression model that captures the relationship between hyperparameter selection and the exact match score. Using the surrogate model and expected improvement, hyper- parameters are suggested that balance ones that are expected to perform well and the ones that explore new regions of the hyperparameter space. Iteratively, models are fine-tuned using the newly suggested configurations, and the resulting exact match score is used to update the surrogate model. This continually improves its ability to estimate performance given a set of hyperparameters, and therefore, it becomes more capable of guiding the search toward the most optimal set that maximizes the exact match score. After approximately 100 iterations, the “best” configuration that achieved the highest exact match score was selected for the final models. Domain-Specific Question-Answering Evaluation Data. To evaluate the performance of our language models on domain-specific question-answering tasks, a custom data set was created, following the format of the SQuAD data sets. A total of 411 questions and answers were manually generated, with context paragraphs sourced from articles that were not in the pretraining corpus. Of these, 65 questions are unanswer- able and were labeled as such. A sample entry from the evaluation set is shown in Figure 5 and consists of a question- context pair with the corresponding ground-truth answer. The “answer_start” field specifies the starting-character index of the answer, allowing for easy conversion to a starting-token index, irrespective of the tokenizer choice, to enable more versatility when assessing different models. This field is also useful in cases where the initial token of an answer appears multiple times in a sequence, as it allows for the answer span to be accurately identified. To account for unanswerable questions, a Boolean field named “is_impossible” was included; if false, the answer resides within the context, and the model should predict the answer span; if true, the question does not have a corresponding answer in the given context. The questions within our data set have been written to primarily focus on the extraction of material-property information. As such, an underlying understanding of the knowledge representations between properties within the context is required for accurate answers to be found. Approaching the task of creating such data sets without any tool assistance can be tedious and prone to error. Manually typing each field and its contents leads to a higher risk of mislabeling, which does not allow for fair evaluation. Therefore, to facilitate the creation of SQuAD-like data sets, a web tool was designed to streamline the process of annotating context with questions and answers. On the frontend, the JavaScript library for handling web interfaces, ReactJS,40 was implemented. The backend was written in Python and utilizes the Django framework.41 The interface and annotation steps are exemplified in Figure 6, and the web app has been made openly available on GitHub.42 ■ RESULTS The pretraining progress of each model is shown in Figure 7, which depicts the evolution of the loss value, a unitless measure of the relative error between the prediction of masked tokens made by the model and their actual value, over the number of pretraining steps. While not mathematically conclusive, a decreasing loss suggests that the models are continuously learning and improving during the pretraining stage. For all models, the loss curves trend downward and level off without completely stagnating, indicating that the number of pretraining steps is sufficient and balances learning while preventing overfitting. Interestingly, in Figure 7a, the loss curves for the PureMechBERT models initially begin to converge at a loss value of around 5.5. This indicates an initial difficulty in accurately performing the masked language modeling task, and thus, a difficulty in successfully learning the linguistic patterns within the domain-specific corpus, perhaps due to these models being initialized from scratch. However, after around 25,000 pretraining steps, these models overcome a local minima and proceed to converge to a loss value of around 1, which aligns with the loss of MechBERT models and similar BERT models that have been developed for the scientific domain.27,28 The loss values of cased BERT models are all lower than their uncased counterparts, highlighting that the case of tokens provides valuable information for predicting masked tokens, and this may be reflected in the performance of downstream tasks. For a more comprehensive evaluation, the relative performance of BERT models fine-tuned for question-answering tasks was evaluated. Fine-Tuning Optimization. The optimal hyperparameter configuration for fine-tuning each BERT model is detailed in Table 3. During the search, over 100 different configurations were tested for each model, with each iteration aiming to improve the exact match score on the SQuAD v1 and v2 data sets. This exploration revealed the significant impact of hyperparameter selection; across all models, the average standard deviation of the exact-match score was 6.1%, with highs of 18.4%. In other words, for a given data set and pretrained model, an information-extraction system built with carefully selected hyperparameters could achieve almost 20% fewer errors compared to a suboptimally configured system. The variability in exact-match scores across different fine- tuning configurations is further illustrated in Figure 8; this showcases the top 100 performing hyperparameter config- Figure 5. Example SQuAD-like entry in our domain-specific evaluation set. Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1878 https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig5&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig5&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig5&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig5&ref=pdf pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as urations for fine-tuning MechBERT cased models on SQuAD v1 and v2, with the best performing model parameters being highlighted in red. Similar figures for each model can be found in the Supporting Information. These further highlight the benefit of the extensive hyperparameter search performed herein. In particular, they demonstrate the advantage of Bayesian optimization over other methods, such as grid searching, where the optimal configuration may lie outside the predefined value ranges. For instance, grid-search approaches typically test a smaller set of parameter values, such as batch sizes of 16 and 32; learning rates of 5 × 10−5, 3 × 10−5, and 2 × 10−5; or number of epochs ranging from 1 to 4.27,28 If a similar method was employed in this study, the resulting performance of the fine-tuned models would be very different, and the best model configuration would be overlooked. In contrast, Bayesian optimization explores the hyperparameter space more efficiently and leads to the identification of configurations that significantly improve the model performance. With these optimized hyperparameters, our eight fine-tuned models were evaluated for question-answering tasks to demonstrate their capability for use in information extraction. This evaluation encompassed both general-language and domain-specific tasks, using both SQuAD and a data set custom-built for this study. For comparison, the best performing BERT models that have been fine-tuned in related studies of material properties are sourced and tested against the same criteria. Evaluating Question-Answering Performance. Per- formance was evaluated by tasking each model to identify the relevant answer to a given question within a context paragraph. These are provided in an evaluation data set that also contains the “ground-truth” answer to the question. If the predicted answer is identical to the ground-truth in its entirety, then it is considered to be correct and is assigned a score of 1. Conversely, if any part of the prediction deviates from the ground-truth, it is considered to be incorrect and assigned a score of 0. This is analogous to labeling the predicted answer as a “True Positive” or “False Positive” depending on its correctness. The exact-match score of a model is determined by calculating the percentage of predictions that have been correctly made. However, since an answer is correct or not, this metric does not consider answers that are mostly correct. For instance, the question “What is the Ultimate Tensile Strength (UTS) of Material X?” given a sentence “Material X has an UTS of 500 MPa” can be answered as either “500 MPa” or “UTS of Figure 6. Example usage of the QA Annotator Web Tool that was created for this work. Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1879 https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.4c00857/suppl_file/ci4c00857_si_001.pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig6&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig6&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig6&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig6&ref=pdf pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as 500 MPa” which are both correct, but only the one matching the ground-truth would be accepted, resulting in a potential false-positive labeling. To account for this, the general- language SQuAD data set owners crowd sourced their ground-truth answers, and as such, these data sets have multiple variations of answers to a single question. This allows the predictions to be compared to a greater variety of human- annotated answers, minimizing the possibility of factually correct predictions being disregarded. A byproduct of this approach is that the human-level performance can be deduced by comparing the crowd-sourced answers to each other; this demonstrated that humans performing question-answering tasks achieve an exact-match score ranging from 77 to 86% with little difference being observed between the SQuAD data set versions. The domain-specific data set does not contain multiple variations of answers, as it was constructed using annotations from a single person (the first author of this paper). Instead, the F1 score can be used as a more reliable evaluation metric where each individual token in the prediction is compared to the ground-truth answer to calculate precision and recall. The definitions of precision, recall, and F1 score follow closely to the ones used in related papers.27,28 In this case, precision describes the percentage of predicted tokens that are in the ground-truth and are correct, i.e., the accuracy, and recall captures the completeness of predictions by measuring the percentage of ground-truth tokens that appear in the predicted answer. As usual, the F1 score balances both precision and recall to determine the ability of the model to give accurate and complete answers and allows for evaluation that is not strictly “all-or-nothing”, unlike the exact-match score. These metrics are calculated using the following: =Precision Shared Tokens in Prediction and Ground truth Number of Tokens in Prediction (1) =Recall Shared Tokens in Prediction and Ground truth Number of Tokens in Ground truth (2) = × × + F1 2 Precision Recall Precision Recall (3) Evaluating General Text Question-Answering Perform- ance. Summaries of the evaluation metrics on the general English-language data sets, SQuAD v1 and v2, are listed in Tables 4 and 5, respectively. Interestingly, all cased models in this study outperform their uncased counterparts, reflecting the trend in the loss function observed in their pretraining. This stands to reason since the cased variants have been provided with additional information such as the capitalization of proper nouns to help perform the masked-language modeling task, which also seems to carry over for question-answering tasks. Both cased and uncased versions of the MechBERT models maintain a similar level of performance to the BERTBASE model, with the cased version performing marginally better when dealing with only answerable questions (SQuAD v1). At the same time, MechBERT models perform slightly better than the BERTBASE models with the inclusion of unanswerable questions (SQuAD v2). This result may be due to the more optimized approach used to fine-tune the MechBERT models. Despite the pretraining corpus in this study being largely focused on scientific texts that are “out-of-scope” for the general English language question-answering task at hand, the Figure 7. Evolution of loss during pretraining of each model. Table 3. Optimal Hyperparameter Configurations for Fine-Tuning on SQuAD v1 and v2 dataset model learning rate epochs batch size SQuAD v1 MechBERT Cased 5.75409064017 × 10−5 3 60 MechBERT Uncased 6.69796562345 × 10−5 3 44 PureMechBERT Cased 2.34149250752 × 10−5 4 24 PureMechBERT Uncased 6.65858503326 × 10−5 3 140 SQuAD v2 MechBERT Cased 6.40600319349 × 10−5 2 128 MechBERT Uncased 1.494572498569 × 10−4 2 23 PureMechBERT Cased 4.90973747641 × 10−5 3 76 PureMechBERT Uncased 3.80820446232 × 10−5 4 60 Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1880 https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig7&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig7&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig7&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig7&ref=pdf pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as further pretraining of MechBERT from the original BERT weights does not negatively impact the performance of our models on downstream general question-answering tasks. However, PureMechBERT models perform worse than MechBERT models since they were pretrained from scratch on purely the domain-specific corpus, which only shares 39% of the original BERT vocabulary. Thus, the general English- language knowledge that was already embedded into the original BERT weights is not present in PureMechBERT models, meaning that PureMechBERT models are not suited for downstream tasks that are primarily focused on general language. Moreover, the corpus used to pretrained Pure- MechBERT models is relatively small, meaning that these pretrained models were not supplied with enough samples of general English-language patterns to properly tackle such tasks. Figure 8. Summary of the top 100 hyperparameters found during the optimization of fine-tuned MechBERT cased models. All configurations are compared using the exact-match score and the best performing model is highlighted in red. Table 4. Model Performance on SQuAD v1 model exact-match (%) F1 score (%) MechBERT Cased 81.41 88.61 MechBERT Uncased 80.41 87.95 PureMechBERT Cased 76.81 84.79 PureMechBERT Uncased 75.18 83.85 BERTBASE 17 80.80 88.50 BatteryOnlyBERT Cased27 79.61 87.30 BatteryOnlyBERT Uncased27 79.53 87.22 MatSciBERTa35 77.42 85.73 aThe MatSciBERT model was fine-tuned following the guidelines of the BERTBASE model. Table 5. Model Performance on SQuAD v2 model exact-match (%) F1 score (%) MechBERT Cased 74.84 77.95 MechBERT Uncased 74.78 77.89 PureMechBERT Cased 71.77 74.84 PureMechBERT Uncased 71.06 74.52 BERTBASE Cased17,43 71.15 74.67 BERTBASE Uncased17,43 73.68 77.88 MatSciBERTa35 73.14 76.46 aThe MatSciBERT model was fine-tuned following the guidelines of the BERTBASE model. Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1881 https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig8&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig8&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig8&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig8&ref=pdf pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as In comparison, BatteryBERT pretrains models from scratch (BatteryOnlyBERT) using a corpus that is almost 40% larger than those of PureMechBERT models;27 hence, performing better on question-answering tasks from SQuAD v1. Overall, the results that evaluate the performance of our language models against general English-language question- answering tasks follow expectations. Across the board, models struggle with data sets containing unanswerable questions due to their added difficulty, and, as such, the exact-match scores are significantly lower for SQuAD v2 which is demonstrated in Figure 9. Extending the pretraining of a BERT model does improve the model's performance, with MechBERT perform- ing slightly better than BERTBASE models. However, the overall improvements are minimal, which is, in part, due to the relatively small pretraining corpus used in this work. Pure- MechBERT models perform worse for general English- language question-answering tasks. However, to properly assess the benefit of the pretraining of the BERT models developed in this study, evaluation on a domain-specific question-answering set is required, as is discussed in the following section. Evaluating Domain-Specific Question-Answering Per- formance. Our domain-specific data set consists of questions that are more relevant to stress−strain information and, as such, are better equipped to evaluate the ability of our models to perform extractive tasks in the scientific domain. Table 6 provides a summary of the evaluation metrics for each model, which are visualized in Figure 10. The metrics of all models that were tested can be found in the Supporting Information. Note that, while the exact-match score may seem to be relatively low, this is a result of the question-answering data set containing only a single variation of an answer, as opposed to the multiple variations of answers that are contained in the general English-language SQuAD data sets described pre- viously. Therefore, the F1 score gives a more complete description of model performance when assessing domain- specific question-answering on our evaluation data set. It is clear that the domain-specific pretraining of a BERT model massively improves its performance when given domain-specific question−answer tasks, with all MechBERT variations performing over 20% better than their BERTBASE counterparts on our question-answering data set. Pretraining on the scientific text that describes stress-related properties embeds more useful knowledge into the language model, which significantly improves performance on downstream tasks within the target domain. This itself is an expected result; however, it is noteworthy that the PureMechBERT models maintain a similar level of proficiency to the MechBERT models, given that they are pretrained on a fraction of the data. In fact, PureMechBERT cased models, fine-tuned on both SQuAD v1 and v2, perform best when they are subjected to domain-specific question-answering tasks with F1 scores of 83.50 and 78.78%, respectively. As such, the relevancy of the pretraining corpus to the target domain of downstream tasks may hold greater importance than purely the amount and variety of pretraining data. To determine whether the performance improvements of our language models are due to the corpus being generally scientific or if the topic specificity is of more importance, scientifically aligned BERT models were evaluated on the domain-specific question-answering data set. For this, we performed an evaluation on our models, once fine-tuned on question-answering tasks from SQuAD v1; this is so they could be compared directly to the best performing models found by Huang et al.27 and Zhao et al.28 which are specialized for the Figure 9. Model performance on general English-language question-answering from the SQuAD data sets. The best performing model on both tasks is the case variant of the MechBERT models. Table 6. Performance of Each Fine-Tuned Model on the Domain-Specific Question-Answering Dataseta model version model exact F1 SQuAD v1 MechBERT Cased 69.36 82.32 MechBERT Uncased 69.08 82.51 PureMechBERT Cased 69.36 83.50 PureMechBERT Uncased 67.92 81.75 BERTBASE Cased 40.75 60.91 BERTBASE Uncased 43.06 63.09 SQuAD v2 MechBERT Cased 63.99 75.46 MechBERT Uncased 66.42 77.92 PureMechBERT Cased 67.88 78.58 PureMechBERT Uncased 63.75 75.12 BERTBASE Cased 40.15 53.19 BERTBASE Uncased 46.72 58.67 aThe models fine-tuned on SQuAD v1 are evaluated only on the answerable questions in the dataset. Models fine-tuned on SQuAD v2 are evaluated on the entire dataset. Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1882 https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.4c00857/suppl_file/ci4c00857_si_001.pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig9&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig9&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig9&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig9&ref=pdf pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as scientific domains of batteries and optical materials, respectively. Also included in our evaluation was the MatSciBERT35 model, which was pretrained on scientific language from the overarching domain of material science. A categorized scatter plot of the resulting performance metrics on the domain-specific evaluation data set is illustrated in Figure 11. Overall, BERT models that were pretrained on more scientific data performed better than the BERTBASE models that were pretrained on only the general English language. With increased domain specificity of the pretraining corpus, MechBERT models were able to outperform other scientific BERT models; for example, exact-match scores for Pure- MechBERT and MechBERT also afford the highest F1 scores, as shown in Figure 11. This result further highlights the impact of specialized pretraining, which can significantly improve the performance of extractive tasks within the domain of interest. This is particularly useful when incorporating language models into information-extraction systems, where the encountered context will be specific to the target domain; as such, a model with a depth of knowledge regarding language patterns in a particular domain is preferred. Evaluating the Influence of the Size and Type of BERT Architecture on Domain-Specific Question-Answering Per- formance. During the course of this study, improvements to the BERT architecture were being developed, owing to the fast-moving nature of the field. Larger foundational BERT models that had been pretrained with many more data and with modified architectures were showcasing state-of-the-art results in question-answering tasks. To this end, the SQuAD v2 variants of the fine-tuned MechBERT models were compared to updated foundational language models that are primarily based on the BERT architecture, such as RoBERTa20 and DeBERTa,44,45 using fine-tuned versions of these models produced by the Deepset team.43 Figure 12a presents a performance comparison of BERT models that had been pretrained with a similar number of parameters employed in this study (approximately 110 million). Figure 12b presents the performance levels of BERT models that had been pretrained with many more parameters (exceeding 340 million parameters) relative to that of our MechBERT models. Figure 11a shows that the newer foundational BERT-model architectures demonstrate enhanced performance over the original BERT models, with F1 scores of 63.46 and 74.82% being achieved on the domain-specific evaluation set for RoBERTa and DeBERTa v3, respectively. Indeed, the up-to- date foundational BERT architecture and improved pretraining objectives employed in the general English-language model, DeBERTa v345 even outperforms SciBERT which was pretrained on the scientific literature and is thus theoretically more suited to domain-specific question-answering tasks. The MatSciBERT model, with a pretraining corpus that is closely related to MechBERT models, also demonstrates good performance. Nonetheless, MechBERT models maintain the best performance, exceeding the next best model by 7.78% in an exact-match score and 3.76% in the F1 score. This superior performance is due to the specialized pretraining corpus and reinforces our conclusions made above. To reiterate, the domain-specificity of the corpus is crucial for the performance Figure 10. Exact-match and F1 scores on the domain-specific data sets that contain 411 questions related to stress−strain information. The “SQuAD v1-Like” results were derived from models fine-tuned on SQuAD v1 and were only evaluated on the answerable questions in the data set. The “SQuAD v2-Like” results were from models fine-tuned on SQuAD v2 that were evaluated for all questions, including unanswerable ones. Figure 11. Scatter plot depicting the domain-specific question- answering performance metrics of various cased and uncased BERT models fine-tuned on question-answering tasks from the SQuAD v1 data set. MechBERT includes all models created in this study, the BatteryBERT pretrained on only text pertaining to Battery by Huang et al.,27 OpticalBERT includes models trained by Zhao et al.,28 BERT represents the original base models by Devlin et al.17 and MatSciBERT from Gupta et al.35 Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1883 https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig10&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig10&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig10&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig10&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig11&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig11&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig11&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig11&ref=pdf pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as of BERT-based language models in downstream tasks within the target domain, as opposed to the size of the pretraining corpus alone. Figure 12b presents the level of performance of BERT models that had been pretrained on many more parameters (exceeding 340 million) relative to that of our MechBERT models. There is a prevailing assumption that larger founda- tional language models are more capable, given that they have been constructed using many more parameters and have been pretrained on a far more extensive corpus (of the order of 160 GB) and that they should therefore perform better on question-answering tasks than smaller language models. For example, the large variants, RoBERTaLARGE and DeBER- TaLARGE v3, achieve significantly improved F1 scores of 84.03 and 90.75% when fine-tuned with question-answering tasks from the general English-language SQuAD v2 data sets, respectively.20,45 Notably, all MechBERT language models are at least three times smaller in terms of the number of parameters, and PureMechBERT models were pretrained with only 7.4 GB of text data. Nonetheless, all variants of our MechBERT models can compete with these larger BERT- based models. For example, the PureMechBERT cased model outperforms RoBERTaLARGE and BERTLARGE on domain- specific question-answering by 7.63 and 8.22% in F1 score, and 7.54 and 9.24% in exact-match score, respectively. In doing so, the smaller models are also able to process the samples at six times the speed of the larger models, making them more appealing for an information-extraction system that needs to balance precision with performance. This finding is also important from an energy sustainability perspective; smaller language models could offer a better economy through more modest energy consumption. Figure 12. Performance metrics of various BERT-based models when evaluated on domain-specific question-answering tasks. MatSciBERT,35 DeBERTa v3,45 BERT,17 RoBERTa,20 and SciBERT24 have approximately the same model size as our MechBERT models. DeBERTaLARGE v3, RoBERTaLARGE, and BERTLARGE are three times larger than our MechBERT model architecture. Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1884 https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig12&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig12&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig12&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig12&ref=pdf pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as The only foundational language model that was deemed to be superior to the ones developed in this study is DeBERTaLARGE v3, as illustrated in Figure 12b. This is a direct result of the introduction of disentangled attention and a modified pretraining objective in the development of DeBERTaLARGE v3. In the original BERT models, a standard self-attention mechanism is used, and the pretraining objective is masked language modeling. In contrast, DeBERTa v3 disentangles this attention mechanism into two, one for word content and one for position, and uses the Replaced Token Detection pretraining objective.45 These architectural differ- ences result in improved model performance for downstream tasks and present a potential avenue for further improvements for our MechBERT models. Nevertheless, the difference between PureMechBERT and DeBERTa v3 (2.19% in exact- match score and 3.9% in F1 score) is smaller than the difference between the MechBERT models and other large BERT-based models. Our models outperform BERT-based models of the same size and maintain relevancy when compared to larger BERT models for use in downstream tasks within stress−strain-related domains. We have therefore demonstrated that our MechBERT models are powerful tools within the stress-engineering materials domain for extractive use cases. Moreover, they can compete with other state-of-the- art language models while being smaller in size, enabling faster processing and requiring a relatively small fraction of data to Figure 13. Visualization of attention46 for BERTBASE and PureMechBERT models on the example sentence: “The corresponding tensile properties displayed low strength (570 MPa yield strength (YS) and 1011 MPa UTS) and high ductility (35.6% TEL)”. Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1885 https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig13&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig13&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig13&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?fig=fig13&ref=pdf pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as pretrain them. Our small language models therefore stand to offer a “win−win” formula of greater operational efficiency by more energy-sustainable means. BERT Viz. The BERT architecture contains multiple attention heads that are used to process different aspects of a text sequence and learn the interrelations between words. A visualization of these attention patterns provides an intuitive view of how a language model views linguistic patterns and relationships. The attention patterns produced by the 12 attention heads in BERT and PureMechBERT models were obtained using the BertViz software46 on an example sentence sourced from the materials-engineering literature. For instance, Figure 13 portrays a side-by-side view of how the BERT and PureMechBERT models relate the word “tensile” to other tokens in the sentence and provides a visual comparison of the understanding of scientific terminology by each model. The self-attention in this visualization is represented by a line connecting each token in the sequence, with a different color being used to distinguish the different attention heads. The weight of the connection reflects the attention score; more opaque colors represent stronger relations between the respective tokens. For the BERTBASE model, which was primarily pretrained on general English-language text, the relation of the word “tensile” is mostly focused on adjacent words, with no significant connections to mentions of material properties. In stark contrast, the attention heads in the PureMechBERT models display a high attention score between “tensile” and mentions of tensile properties, such as “UTS”, “YS”, and “ductility”. As such, there is evidence that domain-specific pretraining allows the PureMechBERT models to capture the unique contextual relations between stress−strain properties which a general purpose language model cannot see. Qualitatively, this suggests that the PureMechBERT models, and MechBERT models to a lesser extent, have embedded knowledge of tensile properties and their representations within text that can be utilized for downstream tasks; from this, we presume that the same is true of other stress−strain properties. Moreover, it is worth remembering that these models have been learned from unlabeled text automatically, without the laborious task of manually designing such complex representations, thereby overcoming the limitations present in conventional informa- tion-extraction systems. ■ CONCLUSIONS In summary, four language models based on the BERT architecture have been pretrained using text sourced from the stress−strain-related scientific literature: MechBERT cased and uncased models were initialized from the original BERT weights, while PureMechBERT cased and uncased models were initialized from scratch. These models were fine-tuned for general English-language question-answering tasks using the SQuAD v1 and v2 data sets, resulting in eight fine-tuned models ready for extractive use cases. Bayesian optimization was employed to discover the best hyperparameter combina- tion for fine-tuning, which proved to be useful as the exact- match score could deviate by 6.1% on average depending on the model configuration. Evaluation was conducted using the general English-language SQuAD data sets, and it was found that the further pretraining did not negatively affect the performance of MechBERT models and, in most cases, offered improvements when compared to the original BERT models. An additional evaluation set was constructed using an annotation tool that was custom-built for this work which followed the guidelines of the SQuAD data sets. This contained questions and answers that specifically relate to stress−strain properties, with the context being sourced from articles that are not in the pretraining corpus; as such, the data set is better equipped to evaluate the performance of the language models on domain-specific downstream tasks. It was found that the MechBERT models outperform others on question-answering tasks within the materials-engineering domain. The best-performing model, for both SQuAD v1 and v2 versions, was the PureMechBERT cased variant. This outperformed other models, even those that were larger and pretrained on more data, demonstrating the significant benefit of focused pretraining for use cases within a specific field. To the best of our knowledge, these models are the first Transformer-based language models that are specialized in stress−strain information and showcase elevated performance for in-domain extractive tasks. This paper has exemplified an alternative approach to information extraction in domains of research that rely on specialist vernacular; this approach overcomes the problems faced with conventional NLP-based methods through the use of domain-specific language models. The variants of our MechBERT models automatically learn the linguistic patterns of scientific text. Meanwhile, an understanding of stress−strain property semantic relations is embedded into the model, owing to the specificity of the pretraining corpus; a task that is complex and difficult to approach manually. A visualization of the attention patterns of MechBERT models provides evidence that some understanding of tensile properties and their semantic relation to one another is embedded during the pretraining stages of the language models. The same level of understanding is not present in general English-language BERT models. As such, it is reasonable to assume that the domain-specific knowledge representations have automatically been learned and can be utilized for not only information- extraction purposes but also other downstream NLP tasks. The extractive capabilities of these property-specific language models have been showcased; it has been demonstrated that MechBERT variants outperform related models on question-answering tasks surrounding stress−strain information, with an increase of 25.39% (answerable questions) and 22.59% (unanswerable questions included) in the F1 score over BERTBASE counterparts. Due to the nature by which extractive questions target information about materials and their associated properties, the improvements in performance highlight that the stages which one would normally associate with a conventional information-extraction pipeline have been properly learnt by the language model, and they can therefore be successfully implemented for the purpose of information extraction. While further testing is required to determine the efficacy of these language models in a fully fledged information-extraction system, initial results show that MechBERT models are capable of achieving an F1 score of up to 83.50% in extractive tasks within the domain. As a secondary observation, the significant impact of specialized pretraining for downstream performance within that domain has been realized. PureMechBERT models have been found to outperform other language models on the domain-specific evaluation set, even those that have more pretraining data that are still closely related, such as MatSciBERT,35 SciBERT,24 and other scientifically aligned Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1886 pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as BERT models. PureMechBERT impressively outperformed larger BERT-based language models, such as BERTLARGE and RoBERTaLARGE, in domain-specific question-answering tasks. These large variants are three times the size of our MechBERT models, in terms of model parameters and have been pretrained on many more data; for example, the RoBERTa and DeBERTa v3 models have been pretrained on a total of 160 GB of text data.20,45 In comparison, the PureMechBERT models have been pretrained on only a 7.4 GB corpus, i.e., less than 5% of the larger models; yet they are still able to achieve better results on the domain-specific data set by 7.63 and 7.54% in F1 score and exact-match score, respectively, when compared to RoBERTaLARGE. Our PureMechBERT model is able to outperform the DeBERTa v3 model of the same size; although, it is beaten by the large DeBERTa v3 variant. Despite this, the disparity between PureMechBERT and DeBER- TaLARGE v3 models (3.9% F1 score and 2.19% exact-match score) is smaller than that between the PureMechBERT model and other large BERT models. Moreover, since Pure- MechBERT is a much smaller model than DeBERTaLARGE v3, it is able to process samples at a rate six times faster. This is valuable for information-extraction systems, which need to balance precision with performance. This finding has pertinent implications for language-model applications from a perspec- tive of improving the operational efficiency of AI-based processes while simultaneously gaining energy sustainability. ■ ASSOCIATED CONTENT Data Availability Statement All of the scripts used to pretrain the MechBERT and PureMechBERT models are available online, as is the code that was used to process and evaluate the models.47 The QAannotation tool used to create the domain-specific evaluation data set is available online.42 The evaluation data set itself is provided in Supporting Information. The pretrained and fine-tuned models are available on the Molecular Engineering Group HuggingFace.48 *sı Supporting Information The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857. Details on pretraining hyperparameters, fine-tuning hyperparameter optimization, domain-specific evaluation summary for all models tested; document object identifiers (DOIs) for each paper that was used as the input corpus that trained the MechBERT models (partitioned into three DOI lists that pertain to papers from three publishing houses); and discussion and comparison between ChatGPT and our MechBERT models in the context of extractive question-answering for research studies (PDF) ■ AUTHOR INFORMATION Corresponding Author Jacqueline M. Cole − Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, U.K.; ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Didcot OX11 0QX, U.K.; Research Complex at Harwell, Rutherford Appleton Laboratory, Didcot OX11 0FA, U.K.; orcid.org/0000-0002-1552- 8743; Phone: +44 (0)1223 337470; Email: jmc61@ cam.ac.uk Authors Pankaj Kumar − Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, U.K.; ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Didcot OX11 0QX, U.K.; Research Complex at Harwell, Rutherford Appleton Laboratory, Didcot OX11 0FA, U.K. Saurabh Kabra − ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Didcot OX11 0QX, U.K.; Present Address: Neutron Sciences Directorate, One Bethel Valley Rd, Oak Ridge, Tennessee 37831, United States Complete contact information is available at: https://pubs.acs.org/10.1021/acs.jcim.4c00857 Author Contributions J.M.C. conceived the overarching project. J.M.C., S.K., and P.K. designed the study. P.K. performed the model pretraining and fine-tuning, data extraction, and analyzed the data under the PhD supervision of J.M.C. and cosupervision of S.K. P.K. drafted the manuscript with assistance from J.M.C. The final manuscript was read and approved by all authors. Notes The authors declare no competing financial interest. ■ ACKNOWLEDGMENTS J.M.C. is grateful for the BASF/Royal Academy of Engineering Research Chair in Data-Driven Molecular Engineering of Functional Materials, which is partly sponsored by the Science and Technology Facilities Council (STFC) via the ISIS Neutron and Muon Source; this Chair also supports a PhD studentship (for P.K.). Shu Huang and Taketomo Isazawa from the Molecular Engineering group, Cavendish Laboratory, University of Cambridge, are thanked for their technical assistance. The authors are indebted to the Argonne Leadership Computing Facility, which is a DOE Office of Science Facility, for use of its research resources, under contract No. DE-AC02-06CH11357. ■ REFERENCES (1) Olivetti, E. A.; Cole, J. M.; Kim, E.; Kononova, O.; Ceder, G.; Han, T. Y.-J.; Hiszpanski, A. M. Data-driven materials research enabled by natural language processing and information extraction. Appl. Phys. Rev. 2020, 7, 041317. (2) Cole, J. M. A Design-to-Device Pipeline for Data-Driven Materials Discovery. Acc. Chem. Res. 2020, 53, 599−610. (3) Swain, M. C.; Cole, J. M. ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature. J. Chem. Inf. Model. 2016, 56, 1894−1904. (4) Mavracǐc,́ J.; Court, C. J.; Isazawa, T.; Elliott, S. R.; Cole, J. M. ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science. J. Chem. Inf. Model. 2021, 61, 4280−4289. (5) Isazawa, T.; Cole, J. M. Single Model for Organic and Inorganic Chemical Named Entity Recognition in ChemDataExtractor. J. Chem. Inf. Model. 2022, 62, 1207−1213. (6) Isazawa, T.; Cole, J. M. Automated Construction of a Photocatalysis Dataset for Water-Splitting Applications. Sci. Data 2023, 10, 651. (7) Sierepeklis, O.; Cole, J. M. A thermoelectric materials database auto-generated from the scientific literature using ChemDataEx- tractor. Sci. Data 2022, 9, 648. (8) Zhao, J.; Cole, J. M. A database of refractive indices and dielectric constants auto-generated using ChemDataExtractor. Sci. Data 2022, 9, 192. Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1887 https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.4c00857/suppl_file/ci4c00857_si_001.pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?goto=supporting-info https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.4c00857/suppl_file/ci4c00857_si_001.pdf https://pubs.acs.org/action/doSearch?field1=Contrib&text1="Jacqueline+M.+Cole"&field2=AllField&text2=&publication=&accessType=allContent&Earliest=&ref=pdf https://orcid.org/0000-0002-1552-8743 https://orcid.org/0000-0002-1552-8743 mailto:jmc61@cam.ac.uk mailto:jmc61@cam.ac.uk https://pubs.acs.org/action/doSearch?field1=Contrib&text1="Pankaj+Kumar"&field2=AllField&text2=&publication=&accessType=allContent&Earliest=&ref=pdf https://pubs.acs.org/action/doSearch?field1=Contrib&text1="Saurabh+Kabra"&field2=AllField&text2=&publication=&accessType=allContent&Earliest=&ref=pdf https://pubs.acs.org/doi/10.1021/acs.jcim.4c00857?ref=pdf https://doi.org/10.1063/5.0021106 https://doi.org/10.1063/5.0021106 https://doi.org/10.1021/acs.accounts.9b00470?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.accounts.9b00470?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.6b00207?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.6b00207?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.6b00207?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.1c00446?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.1c00446?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.1c01199?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.1c01199?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1038/s41597-023-02511-6 https://doi.org/10.1038/s41597-023-02511-6 https://doi.org/10.1038/s41597-022-01752-1 https://doi.org/10.1038/s41597-022-01752-1 https://doi.org/10.1038/s41597-022-01752-1 https://doi.org/10.1038/s41597-022-01295-5 https://doi.org/10.1038/s41597-022-01295-5 pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as (9) Huang, D.; Cole, J. M. A database of thermally activated delayed fluorescent molecules auto-generated from scientific literature with ChemDataExtractor. Sci. Data 2024, 11, 80. (10) Dong, Q.; Cole, J. M. Auto-generated database of semi- conductor band gaps using ChemDataExtractor. Sci. Data 2022, 9, 193. (11) Court, C. J.; Cole, J. M. Auto-generated materials database of Curie and Neél temperatures via semi-supervised relationship extraction. Sci. Data 2018, 5, 180111. (12) Huang, S.; Cole, J. M. A database of battery materials auto- generated using ChemDataExtractor. Sci. Data 2020, 7, 260. (13) Beard, E. J.; Sivaraman, G.; Vázquez-Mayagoitia, A.; Vishwanath, V.; Cole, J. M. Comparative dataset of experimental and computational attributes of UV/vis absorption spectra. Sci. Data 2019, 6, 307. (14) Kumar, P.; Kabra, S.; Cole, J. M. Auto-generating databases of Yield Strength and Grain Size using ChemDataExtractor. Scientific Data 2022, 9, 292. (15) Hall, E. O. The Deformation and Ageing of Mild Steel: III Discussion of Results. Proceedings of the Physical Society. Section B 1951, 64, 747−753. (16) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. In NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates Inc., 2017. (17) Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre- training of Deep Bidirectional Transformers for Language Under- standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics; Human Language Technologies, Vol. 1 (Long and Short Papers): Minneapolis, MN, 2019; pp 4171−4186. (18) Brown, T. B. et al. Language Models are Few-Shot Learners. In NIPS'20: Proceedings of the 34th International Conference on Neural Information Processing Systems; Curran Associates Inc., 2020. (19) Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. (20) Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019, arXiv:1907.11692. (21) Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P. J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2023, arXiv:1910.10683. (22) Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C. H.; Kang, J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234−1240. (23) Huang, A.; Wang, H.; Yang, Y. FinBERT�A Deep Learning Approach to Extracting Textual Information. SSRN Electron. J. 2023, 40, 806−841. (24) Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Association for Computational Linguistics: Hong Kong, China, 2019; pp 3615−3620. (25) Chithrananda, S.; Grand, G.; Ramsundar, B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. 2020, arXiv:2010.09885. (26) Martin, L.; Muller, B.; Ortiz Suárez, P. J.; Dupont, Y.; Romary, L.; de la Clergerie, E.; Seddah, D.; Sagot, B. CamemBERT: a Tasty French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computa- tional Linguistics, 2020. (27) Huang, S.; Cole, J. M. BatteryBERT: A Pretrained Language Model for Battery Database Enhancement. J. Chem. Inf. Model. 2022, 62, 6365−6377. (28) Zhao, J.; Huang, S.; Cole, J. M. OpticalBERT and OpticalTable-SQA: Text- and Table-Based Language Models for the Optical-Materials Domain. J. Chem. Inf. Model. 2023, 63, 1961−1981. (29) Wolf, T. et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. 2020, arXiv:1910.03771. (30) http://www.crossref.org/ (accessed April 2024). (31) Zhu, M.; Cole, J. M. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. J. Chem. Inf. Model. 2022, 62, 1633−1643. (32) Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning Books and Movies: Towards Story- like Visual Explanations by Watching Movies and Reading Books. In 2015 IEEE International Conference on Computer Vision (ICCV); IEEE, 2015. (33) Hoffmann, J. et al. Training Compute-Optimal Large Language Models. In Proceedings of the 36th International Conference on Neural Information Processing Systems; Curran Associates Inc., 2022. (34) Wu, Y. et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. 2016, arXiv:1609.08144. (35) Gupta, T.; Zaki, M.; Krishnan, N. M. A.; Mausam. MatSciBERT: A materials domain language model for text mining and information extraction. npj Comput. Mater. 2022, 8, 102. (36) Rasley, J.; Rajbhandari, S.; Ruwase, O.; He, Y. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, 2020; pp 3505−3506. (37) Rajpurkar, P.; Jia, R.; Liang, P. Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics, 2018. (38) Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2016. (39) Biewald, L. Experiment Tracking with Weights and Biases. 2020. https://www.wandb.com/. (40) http://github.com/facebook/react (accessed April 2024). (41) Django Software Foundation version 2.2, Django. 2019. https:// djangoproject.com. (42) Kumar, P. http://github.com/gh-PankajKumar/QA-Annotator (accessed April 2024). (43) Pietsch, M.; Möller, T.; Kostic, B.; Risch, J.; Pippi, M.; Jobanputra, M.; Zanzottera, S.; Cerza, S.; Blagojevic, V.; Stadelmann, T.; Soni, T.; Lee, S. deepset (deepset) � huggingface.co. 2019. https:// huggingface.co/deepset (accessed March 2024). (44) He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding- enhanced BERT with Disentangled Attention. International Confer- ence on Learning Representations. 2021, arXiv:2006.03654. (45) He, P.; Gao, J.; Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. 2023, arXiv:2111.09543. (46) Vig, J. A Multiscale Visualization of Attention in the Transformer Model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations; Association for Computational Linguistics: Florence, Italy, 2019; pp 37−42. (47) Kumar, P. https://github.com/gh-PankajKumar/MechBERT- codebase/, accessed Jan 28, 2025 (48) Cambridge Molecular Engineering (Molecular Engineering). 2024. https://huggingface.co/CambridgeMolecularEngineering, accessed Jan 28, 2025. Journal of Chemical Information and Modeling pubs.acs.org/jcim Article https://doi.org/10.1021/acs.jcim.4c00857 J. Chem. Inf. Model. 2025, 65, 1873−1888 1888 https://doi.org/10.1038/s41597-023-02897-3 https://doi.org/10.1038/s41597-023-02897-3 https://doi.org/10.1038/s41597-023-02897-3 https://doi.org/10.1038/s41597-022-01294-6 https://doi.org/10.1038/s41597-022-01294-6 https://doi.org/10.1038/sdata.2018.111 https://doi.org/10.1038/sdata.2018.111 https://doi.org/10.1038/sdata.2018.111 https://doi.org/10.1038/s41597-020-00602-2 https://doi.org/10.1038/s41597-020-00602-2 https://doi.org/10.1038/s41597-019-0306-0 https://doi.org/10.1038/s41597-019-0306-0 https://doi.org/10.1038/s41597-022-01301-w https://doi.org/10.1038/s41597-022-01301-w https://doi.org/10.1088/0370-1301/64/9/303 https://doi.org/10.1088/0370-1301/64/9/303 https://doi.org/10.1093/bioinformatics/btz682 https://doi.org/10.1093/bioinformatics/btz682 https://doi.org/10.1111/1911-3846.12832 https://doi.org/10.1111/1911-3846.12832 https://doi.org/10.1021/acs.jcim.2c00035?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.2c00035?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.2c01259?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.2c01259?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.2c01259?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as http://www.crossref.org/ https://doi.org/10.1021/acs.jcim.1c01198?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.1c01198?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1021/acs.jcim.1c01198?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as https://doi.org/10.1038/s41524-022-00784-w https://doi.org/10.1038/s41524-022-00784-w https://www.wandb.com/ http://github.com/facebook/react https://djangoproject.com https://djangoproject.com http://github.com/gh-PankajKumar/QA-Annotator https://huggingface.co/deepset https://huggingface.co/deepset https://github.com/gh-PankajKumar/MechBERT-codebase/ https://github.com/gh-PankajKumar/MechBERT-codebase/ https://huggingface.co/CambridgeMolecularEngineering pubs.acs.org/jcim?ref=pdf https://doi.org/10.1021/acs.jcim.4c00857?urlappend=%3Fref%3DPDF&jav=VoR&rel=cite-as