wwPDB 2023 News
Tribute to Dr. Olga Kennard
The wwPDB consortium would like to pay tribute to Dr. Olga Kennard OBE FRS upon the sad news of her passing. Her pioneering work on the development of crystallographic databases laid the groundwork for modern molecular structure data archiving and the subsequent scientific breakthroughs that have made use of these data.
Olga was renowned for establishing the CCDC (Cambridge Crystallographic Data Centre) to maintain the Cambridge Structural Database (CSD) for small molecules. The CSD was first established by Olga in 1965, based on activities in her research group and has become the world’s repository for small-molecule organic and metal-organic crystal structures. Olga collected these data so that she could study how crystals form and her surveys were fundamental in the development of “crystal engineering”. Now containing over one million structures from X-ray and neutron diffraction analyses, this database of accurate 3D structures has become an essential resource to scientists around the world.
The increased interest and breakthroughs in solving biological molecular structures led to the founding of the PDB (Protein Data Bank) by Walter Hamilton at BNL (Brookhaven National Laboratory). Olga worked with Walter to support the foundation of the PDB archive, which was initially operated jointly by BNL and CCDC (see the 1971 PDB announcement in Nature New Biology). While data processing was carried out at BNL, CCDC was responsible for organization of the data archive, and the experience of Olga and CCDC in data archiving proved hugely beneficial. Nowadays, the small molecules contained in biological structures archived in the PDB are validated using CCDC software which incorporates the knowledge embedded in the CSD.
Left to right: Helen M. Berman, Janet Thornton, Shoshana Wodak, and Olga Kennard at the PDB-SwissProt Symposium in Jerusalem in 1996.
Olga was a person of great integrity and drive and, in an age before computers had really developed, she saw the value of cross-data analysis to derive principles governing how small molecules interact. Very few scientists can claim that their work has enabled thousands of papers and investigations. Olga’s foresight and determination to establish and maintain the CSD means she is among those giants on whose shoulders many other scientists stand.
See also Celebrating Dr Olga Kennard OBE FRS, Founder of the Cambridge Structural Database, 1924 – 2023 at CCDC
PDB entries with extended CCD or PDB IDs will be distributed in PDBx/mmCIF format only
wwPDB, in collaboration with the PDBx/mmCIF Working Group, has set plans to extend the length of accession codes (IDs) for PDB and Chemical Component Dictionary (CCD) entries in the future. PDB entries containing these extended IDs will not be supported by the legacy PDB file format. (see previous announcement)
CCD ID extension
CCD entries are currently identified by unique three-character alphanumeric IDs. At current growth rates, we anticipate running out of three-character IDs before 2024. After this point, the wwPDB will issue five-character alphanumeric accession codes for CCD IDs in the OneDep system. To avoid confusion with current four-character PDB IDs, four-character codes will not be used. Owing to limitations of the legacy PDB file format, PDB entries containing the new five-character ID codes will only be distributed in PDBx/mmCIF format.
In addition, wwPDB has reserved a set of CCD IDs: 01 - 99, DRG, INH, LIG that will never be used in the PDB. These reserved codes can be used for new ligands during structure determination so that they can be identified as new upon deposition and added to the CCD during biocuration.
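As a minimal sketch of how software might sort CCD IDs under these rules, the Python function below separates reserved, current, and extended codes. The function name and return labels are hypothetical, and treating any code of up to three characters as a current ID is an assumption not stated in the announcement.

```python
import re

# Reserved codes named in the announcement: 01-99, DRG, INH, LIG.
RESERVED_CCD_IDS = {f"{n:02d}" for n in range(1, 100)} | {"DRG", "INH", "LIG"}


def classify_ccd_id(ccd_id: str) -> str:
    """Return 'reserved', 'legacy', 'extended', or 'invalid' (hypothetical labels)."""
    code = ccd_id.strip().upper()
    if code in RESERVED_CCD_IDS:
        return "reserved"
    if not re.fullmatch(r"[A-Z0-9]+", code):
        return "invalid"
    if len(code) <= 3:
        # Assumption: every code of up to three characters is a current CCD ID.
        return "legacy"
    if len(code) == 5:
        return "extended"
    # Four-character codes are not issued, to avoid confusion with PDB IDs.
    return "invalid"
```

Code that stores CCD IDs in fixed three-character fields would need to be widened before it could hold the "extended" case.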
PDB ID extension
wwPDB will be extending PDB ID length to eight characters prefixed by ‘pdb_’, e.g., pdb_00001abc. Each PDB entry has a corresponding Digital Object Identifier (DOI), often required for manuscript submission to journals and described in publications by the structure authors. Extended PDB IDs and corresponding PDB DOIs have been included in the PDBx/mmCIF formatted atomic coordinate files for all new and re-released entries since August 2021.
For example, a PDB entry issued with the 4-character PDB ID 1abc will have the extended PDB ID (pdb_00001abc) and corresponding PDB DOI (10.2210/pdb1abc/pdb), as listed in the _database_2 PDBx/mmCIF category:
PDB 1abc pdb_00001abc 10.2210/pdb1abc/pdb
For example, a PDB entry issued with the 8-character PDB ID pdb_00099xyz, after all 4-character IDs are consumed, will be listed as:
PDB pdb_00099xyz pdb_00099xyz 10.2210/pdb_00099xyz/pdb
After all four-character PDB IDs are consumed, newly-deposited PDB entries will only be issued extended PDB ID codes, and PDB entries will only be distributed in PDBx/mmCIF format. PDB entries with four-character PDB IDs will remain unchanged.
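The ID and DOI patterns in the two examples above can be sketched in Python as follows. This is an illustration, not wwPDB code. The function names are hypothetical, and the rule used to tell the two DOI forms apart (four leading zeros after the prefix) is an assumption inferred only from the two examples shown.

```python
import re


def to_extended_pdb_id(pdb_id: str) -> str:
    """Return the extended form, e.g. '1abc' -> 'pdb_00001abc'."""
    code = pdb_id.strip().lower()
    if re.fullmatch(r"pdb_[0-9a-z]{8}", code):
        return code  # already extended
    if re.fullmatch(r"[0-9][0-9a-z]{3}", code):
        # Left-pad the four-character ID with zeros to eight characters.
        return "pdb_" + code.rjust(8, "0")
    raise ValueError(f"not a recognizable PDB ID: {pdb_id!r}")


def pdb_doi(pdb_id: str) -> str:
    """Return the DOI in the form shown in the _database_2 examples."""
    extended = to_extended_pdb_id(pdb_id)
    body = extended[len("pdb_"):]
    if body.startswith("0000"):
        # Assumption: entries that fit in four characters keep the older DOI form.
        return f"10.2210/pdb{body[4:]}/pdb"
    return f"10.2210/{extended}/pdb"
```

For instance, `pdb_doi("1abc")` gives `10.2210/pdb1abc/pdb` and `pdb_doi("pdb_00099xyz")` gives `10.2210/pdb_00099xyz/pdb`, matching the rows above.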
wwPDB is asking users and software developers to review their code and remove any current limitations on PDB and CCD ID lengths, and to enable use of PDBx/mmCIF format files. Example files with extended PDB and/or CCD IDs are available via github to assist code revisions, see https://github.com/wwPDB/extended-wwPDB-identifier-examples. To learn about PDBx/mmCIF, please visit https://mmcif.wwpdb.org/.
For any further information please contact us at firstname.lastname@example.org.
The number of available 3-character CCD IDs annually.
Small Angle Scattering News
An outcome of a project aimed to test and benchmark different approaches for modeling SAS profiles from PDB coordinates has been published:
A round-robin approach provides a detailed assessment of biomolecular small-angle scattering data reproducibility and yields consensus curves for benchmarking
Trewhella, J., Vachette, P., Bierma, J., Blanchet, C., Brookes, E., Chakravarthy, S., Chatzimagas, L., Cleveland, T. E., Cowieson, N., Crossett, B., Duff, A. P., Franke, D., Gabel, F., Gillilan, R. E., Graewert, M., Grishaev, A., Guss, J. M., Hammel, M., Hopkins, J., Huang, Q., Hub, J. S., Hura, G. L., Irving, T. C., Jeffries, C. M., Jeong, C., Kirby, N., Krueger, S., Martel, A., Matsui, T., Li, N., Perez, J., Porcar, L., Prange, T., Rajkovic, I., Rocco, M., Rosenberg, D. J., Ryan, T. M., Seifert, S., Sekiguchi, H., Svergun, D., Teixeira, S., Thureau, A., Weiss, T. M., Whitten, A. E., Wood, K. & Zuo, X.
(2022) Acta Cryst. D78: 1315-1336 doi: 10.1107/S2059798322009184
In total, 171 SAXS and 76 SANS measurements for five proteins (ribonuclease A, lysozyme, xylanase, urate oxidase and xylose isomerase) were collected and analyzed centrally. In the process, new methods for comparing and merging data were developed. The data produced for this effort have been deposited in the SAS Biological Data Bank (SASBDB) as consensus data along with the contributing individual data sets.
In addition, a chapter describing the work done to establish the 2017 publication guidelines for biomolecular SAS, the establishment of the SASBDB, and the evolution and outcomes of the benchmarking project has been published:
Chapter One - Data quality assurance, model validation, and data sharing for biomolecular structures from small-angle scattering
(2023) Methods in Enzymology 678: 1-22 doi: 10.1016/bs.mie.2022.11.002
These publications reflect the activities of the wwPDB Small Angle Scattering task force (SAStf) that first met with Chair Jill Trewhella in 2012. The SAStf was instrumental in progressing the important work that has led to biomolecular SAS being increasingly accepted as a mainstream structural biology technique.
Prototype of PDB NextGen Archive now available
A prototype of a next generation archive repository for the PDB is now available. The archive, called “NextGen”, hosts structural model files in PDBx/mmCIF and PDBML formats at files-nextgen.wwpdb.org. This enriched PDB archive provides annotation from external database resources in the metadata in addition to the content provided in the structure model files in the PDB main archive at files.wwpdb.org.
This prototype provides sequence annotation from external resources such as UniProt, SCOP2 and Pfam at atom, residue, and chain levels. This mapping information is derived from the Structure Integration with Function, Taxonomy and Sequence (SIFTS) project (https://www.ebi.ac.uk/pdbe/docs/sifts/), a service developed and maintained by the PDBe and UniProt teams at EMBL-EBI. Sequence mappings are provided in _pdbx_sifts_unp_segments and _pdbx_sifts_xref_db_segments categories for each segment, _pdbx_sifts_xref_db at residue level, and _atom_site at atom level.
The PDB NextGen Repository is currently updated monthly, on the first Wednesday of the month at 00:00 UTC; this schedule is subject to change in the future. You can access these NextGen files at the following locations:
Data are organized by entry ID under a two-letter hash code formed from the third-from-last and second-from-last characters of the ID. This hash code will remain consistent once PDB ID codes are extended beyond four characters with the pdb_ prefix.
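A minimal Python sketch of this hash rule follows. The function name is hypothetical, and the base path and the use of the extended ID as the directory name are assumptions based on the single example URL given for entry pdb_00008aly.

```python
# Base path taken from the example URL for entry pdb_00008aly (assumed stable).
NEXTGEN_BASE = "https://files-nextgen.wwpdb.org/pdb_nextgen/data/entries/divided"


def nextgen_entry_url(entry_id: str) -> str:
    """Build the NextGen directory URL for an extended PDB ID."""
    code = entry_id.strip().lower()
    if len(code) < 3:
        raise ValueError(f"entry ID too short: {entry_id!r}")
    # Third-from-last and second-from-last characters form the hash code.
    hash_code = code[-3:-1]
    return f"{NEXTGEN_BASE}/{hash_code}/{code}/"
```

Because the hash uses only characters counted from the end of the ID, a four-character ID and its extended form map to the same two-letter directory.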
Some examples are shown below:
Access entry pdb_00008aly at https://files-nextgen.wwpdb.org/pdb_nextgen/data/entries/divided/al/pdb_00008aly/
Both PDBx/mmCIF and PDBML are provided at this location. For entry pdb_00008aly:
Please contact email@example.com with any questions.
Enhanced Collection of Starting Models
A new PDBx/mmCIF category, _pdbx_initial_refinement_model, has been introduced to improve the information collected about starting models for X-ray, 3DEM and NMR methods.
Experimentally derived and computed models will be distinguished. The provenance of the resource from which the starting model was obtained (e.g., PDB, AlphaFoldDB, RoseTTAFold, etc.) and its accession code/identifier will be captured, if publicly available.
For the full definition, see pdbx_initial_refinement_model. An example is below:
_pdbx_initial_refinement_model.type 'experimental model'
wwPDB strongly recommends that all PDB users and software developers review their code and adopt this definition for future applications.
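As a rough illustration of reading this category, the Python sketch below collects simple key-value items such as the type line in the example above. The function name is hypothetical. It uses shell-style tokenizing from the standard library as a stand-in for real mmCIF quoting rules, and it ignores looped and multi-line values, so a dedicated mmCIF parser should be preferred for real files.

```python
import shlex


def read_mmcif_items(text: str, category: str) -> dict[str, str]:
    """Collect single-line 'category.item value' pairs for one category."""
    prefix = category + "."
    items: dict[str, str] = {}
    for line in text.splitlines():
        # Assumption: shell-style quoting is close enough for simple values.
        tokens = shlex.split(line)
        if len(tokens) == 2 and tokens[0].startswith(prefix):
            items[tokens[0][len(prefix):]] = tokens[1]
    return items
```

Applied to the example line with category `_pdbx_initial_refinement_model`, this returns a dictionary whose `type` entry is `experimental model`.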
Structure Predictors: Use ModelCIF for Computed Structure Models
ModelCIF (GitHub) is a data information framework developed for and by computational structural biologists to describe structural models of macromolecules derived from computational methods. It provides an extensible data representation for deposition, archiving, and public dissemination of these models of proteins to enable delivery of Findable, Accessible, Interoperable, and Reusable (FAIR) data to users worldwide.
A. Overview of the ModelCIF extension of PDBx/mmCIF. B. Schematic representation of ModelCIF data specifications. ModelCIF includes definitions for input data used in template-based and template-free modeling; reference information for macromolecular sequences and small molecule components; local and global CSM quality metrics; and metadata information regarding modeling protocol, CSM classification (ab initio, homology, etc.) and descriptions of associated files.
ModelCIF is an extension of the Protein Data Bank Exchange/macromolecular Crystallographic Information Framework (PDBx/mmCIF), which is the global data standard for representing experimentally-determined, three-dimensional (3D) structures of macromolecules and associated metadata. The PDBx/mmCIF framework and its extensions (e.g., ModelCIF) are managed by the wwPDB in collaboration with relevant community stakeholders such as the wwPDB ModelCIF Working Group.
This semantically rich and extensible data framework for representing computed structure models (CSMs) accelerates the pace of scientific discovery. Furthermore, use of this data standard promotes interoperation among structural biology data resources, with ModelCIF currently used by the ModelArchive, AlphaFold DB, and MODBASE repositories. A manuscript was recently submitted to bioRxiv describing the architecture, contents, and governance of ModelCIF as well as tools and processes for maintaining and extending the data standard [1].
Visit the ModelCIF GitHub for more information about this data information framework.
[1] Vallat B, Tauriello G, Bienert S, Haas J, Webb BM, et al. ModelCIF: An extension of PDBx/mmCIF data representation for computed structure models. bioRxiv doi: 10.1101/2022.12.06.518550.
PDB Reaches a New Milestone: 200,000+ Entries
Depositors: Download this image, write the number of structures deposited, and tag us in your photos
With this week's update, the PDB archive contains a record 200,069 entries. The archive passed 150,000 structures in 2019 and 100,000 structures in 2014.
Established in 1971, this central, public archive has reached this critical milestone thanks to the efforts of structural biologists throughout the world who contribute their experimentally-determined protein and nucleic acid structure data.
wwPDB data centers support online access to three-dimensional structures of biological macromolecules that help researchers understand many facets of biomedicine, agriculture, and ecology, from protein synthesis to health and disease to biological energy. Many milestones have been reached since the archive released the 100,000th structure in 2014. PDB data have been seminal in understanding SARS-CoV-2, and provided the foundation for the development of AI/ML techniques for predicting protein structure. The 50th anniversary of the PDB was celebrated throughout 2021.
Today, the archive is quite large, containing more than 3,000,000 files related to these PDB entries that require more than 1086 Gbytes of storage. PDB structures contain more than 1.8 billion non-hydrogen atoms.
Function follows form
In the 1950s, scientists had their first direct look at the structures of proteins and DNA at the atomic level. Determination of these early three-dimensional structures by X-ray crystallography ushered in a new era in biology, one driven by the intimate link between form and biological function. As the value of archiving and sharing these data was quickly recognized by the scientific community, the Protein Data Bank (PDB) was established as the first open access digital resource in all of biology by an international collaboration in 1971 with data centers located in the US and the UK.
Among the first structures deposited in the PDB were those of myoglobin and hemoglobin, two oxygen-binding molecules whose structures were elucidated by Chemistry Nobel Laureates John Kendrew and Max Perutz. With this week's regular update, the PDB welcomes 266 new structures into the archive. These structures join others vital to drug discovery, bioinformatics and education.
The PDB is growing rapidly, increasing in size by ~160% since 2011 (doubling in size every 6-8 years). In 2022, an average of 275 new structures were released to the scientific community each week. The resource is accessed hundreds of millions of times annually by researchers, students, and educators intent on exploring how different proteins are related to one another, to clarify fundamental biological mechanisms and discover new medicines.
Twenty Years of Collaboration
Since its inception, the PDB has been a community-driven enterprise, evolving into a mission-critical international resource for biological research. The wwPDB partnership was established in July 2003 with PDBe, PDBj, and RCSB PDB. Today, the collaboration includes partners BMRB (joined in 2006) and EMDB (2021).
The wwPDB ensures that these valuable PDB data are securely stored, expertly managed, and made freely available for the benefit of scientists and educators around the globe. wwPDB data centers work closely with community experts to define deposition and annotation policies, resolve data representation issues, and implement community validation standards. In addition, the wwPDB works to raise the profile of structural biology with increasingly broad audiences.
Each structure submitted to the archive is carefully curated by wwPDB staff before release. New depositions are checked and enhanced with value-added annotations and linked with other important biological data to ensure that PDB structures are discoverable and interpretable by users with a wide range of backgrounds and interests.
wwPDB eagerly awaits the next 100,000 structures and the invaluable knowledge these new data will bring.
Time-stamped Copies of PDB and EMDB Archives
A snapshot of the PDB Core Archive as of January 2, 2023 is available.
A snapshot of the PDB Core archive (ftp://ftp.wwpdb.org, https://s3.rcsb.org) as of January 2, 2023 has been added to ftp://snapshots.wwpdb.org, https://s3snapshots.rcsb.org (AWS), and ftp://snapshots.pdbj.org. Snapshots have been archived annually since 2005 to provide readily identifiable data sets for research on the PDB archive.
The directory 20230102 includes the 199,755 experimentally-determined structures and experimental data available at that time. Atomic coordinates and related metadata are available in PDBx/mmCIF, PDB, and XML file formats. The date and time stamp of each file indicates the last time the file was modified. The snapshot of PDB Core Archive is 1086 GB.
A snapshot of the EMDB Core archive (ftp://ftp.ebi.ac.uk/pub/databases/emdb/) as of January 2, 2023 can be found in ftp://ftp.ebi.ac.uk/pub/databases/emdb_vault/20230102/ and ftp://snapshots.pdbj.org/20230102/. The snapshot of EMDB Core Archive contains map files and their metadata within XML files for both released and obsoleted entries (24186 and 262, respectively) and is 8.9 TB in size.