Learning functional properties of proteins with language models

Unsal, Serbulent; Atas, Heval; Albayrak, Muammer; Turhan, Kemal; Acar, Aybar; Doğan, Tunca

doi:10.1038/s42256-022-00457-9

Unsal S., Atas H., Albayrak M., Turhan K., Acar A. C., Doğan T.

NATURE MACHINE INTELLIGENCE, vol.4, no.3, pp.227-245, 2022 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 4 Issue: 3
Publication Date: 2022
DOI Number: 10.1038/s42256-022-00457-9
Journal Name: NATURE MACHINE INTELLIGENCE
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED)
Pages: pp.227-245
Hacettepe University Affiliated: Yes

Abstract

Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized properties of proteins; however, studies indicate that these methods should be further improved to effectively solve critical problems in biomedicine and biotechnology, which can be achieved by better representing the data at hand. Novel data representation approaches mostly take inspiration from language models that have yielded ground-breaking improvements in natural language processing. Lately, these approaches have been applied to the field of protein science and have displayed highly promising results in terms of extracting complex sequence–structure–function relationships. In this study we conducted a detailed investigation over protein representation learning by first categorizing/explaining each approach, subsequently benchmarking their performances on predicting: (1) semantic similarities between proteins, (2) ontology-based protein functions, (3) drug target protein families and (4) protein–protein binding affinity changes following mutations. We evaluate and discuss the advantages and disadvantages of each method over the benchmark results, source datasets and algorithms used, in comparison with classical model-driven approaches. Finally, we discuss current challenges and suggest future directions. We believe that the conclusions of this study will help researchers to apply machine/deep learning-based representation techniques to protein data for various predictive tasks, and inspire the development of novel methods.