Structure Formats
We have discussed about some file formats used in materials databases such as JSON, HDF5 and CSV. In this section, we will focus on the file formats used to represent atomic and crystal structures. These file formats are essential for storing and exchanging structural information in materials science and crystallography.
Crystallographic Information Framework (CIF)¶
Crystallographic Information Framework (CIF) is a standard text file format for representing crystallographic information developed by International Union of Crystallography (IUCr). CIF is the standard file format for the submission and retrieval of crystallographic information. CIF is a human-readable file format that contains information about the crystal structure, including the unit cell parameters, atomic coordinates, symmetry operations, and other crystallographic data. CIF can also include metadata such as the experimental conditions, data collection parameters, and references to the original research articles.
A CIF file consists of data items that are organized into categories. Each data item is associated with a tag that describes the information it contains. The data items are typically arranged in a tabular format, with each row representing a different data item. CIF files can be opened and viewed using a text editor or specialized crystallographic software like Jmol, Mercury, or VESTA. It contains loop_ for repeating data items.
An example of CIF file of Zinc (Zn) is shown below:
# generated using pymatgen
data_Zn
_symmetry_space_group_name_H-M P6_3/mmc
_cell_length_a 2.61435980
_cell_length_b 2.61435980
_cell_length_c 4.87316102
_cell_angle_alpha 90.00000000
_cell_angle_beta 90.00000000
_cell_angle_gamma 120.00000000
_symmetry_Int_Tables_number 194
_chemical_formula_structural Zn
_chemical_formula_sum Zn2
_cell_volume 28.84510390
_cell_formula_units_Z 2
loop_
_symmetry_equiv_pos_site_id
_symmetry_equiv_pos_as_xyz
1 'x, y, z'
2 '-x, -y, -z'
3 'x-y, x, z+1/2'
4 '-x+y, -x, -z+1/2'
5 '-y, x-y, z'
6 'y, -x+y, -z'
7 '-x, -y, z+1/2'
8 'x, y, -z+1/2'
9 '-x+y, -x, z'
10 'x-y, x, -z'
11 'y, -x+y, z+1/2'
12 '-y, x-y, -z+1/2'
13 '-y, -x, -z+1/2'
14 'y, x, z+1/2'
15 '-x, -x+y, -z'
16 'x, x-y, z'
17 '-x+y, y, -z+1/2'
18 'x-y, -y, z+1/2'
19 'y, x, -z'
20 '-y, -x, z'
21 'x, x-y, -z+1/2'
22 '-x, -x+y, z+1/2'
23 'x-y, -y, -z'
24 '-x+y, y, z'
loop_
_atom_type_symbol
_atom_type_oxidation_number
Zn0+ 0.0
loop_
_atom_site_type_symbol
_atom_site_label
_atom_site_symmetry_multiplicity
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_occupancy
Zn0+ Zn0 2 0.33333333 0.66666667 0.25000000 1Protein Data Bank (PDB) File Format¶
Protein Data Bank (PDB) file format is a standard file format for representing the three-dimensional structures of biological macromolecules, such as proteins, nucleic acids, and carbohydrates. The PDB file format was developed by the Protein Data Bank, an international repository of experimentally determined macromolecular structures.
XYZ File Format¶
XYZ file format is a simple text file format for representing molecular and atomic coordinates. The XYZ file format is widely used in computational chemistry and molecular modeling to store the 3D coordinates of atoms in a molecule or crystal structure. The XYZ file format is human-readable and consists of a header line followed by lines containing the atomic symbols and their corresponding x, y, and z coordinates.
An example of XYZ file format for water molecule () is shown below:
3
Water molecule
O 0.000000 0.000000 0.000000
H 0.757000 0.586000 0.000000
H -0.757000 0.586000 0.000000The first line of the XYZ file contains the total number of atoms in the system. The second line is an optional comment line that can provide additional information about the system. The subsequent lines contain the atomic symbols and their x, y, and z coordinates in Angstroms.
XYZ file format only contains atomic coordinates and does not include information about the unit cell parameters, symmetry operations, or other crystallographic data. It is a lightweight and versatile file format that can be easily read and written by various molecular modeling software packages.
Extended XYZ File Format¶
Extended XYZ file (.extxyz) format is an extension of the standard XYZ file format that includes additional information about the atoms or molecules in the system. In addition to the atomic coordinates, the extxyz file format can include information such as atomic charges, velocities, forces, and other properties. The extxyz file format is commonly used in machine learning and molecular dynamics simulations to store atomic data along with additional information.
An example of extxyz file format for Zn is shown below:
2
Lattice="2.6143598557 0 0 -1.3071799278 2.2641020496 0 0 0 4.8731608391" Properties=species:S:1:pos:R:3 pbc="T T T"
Zn 0.00000 1.50940 1.21829
Zn 1.30718 0.75470 3.65487In this example, the first line of the extxyz file contains the total number of atoms in the system. The second line specifies the properties of the atoms, including the species (atomic symbol) and the atomic positions in Cartesian coordinates. The Lattice keyword specifies the lattice vectors of the unit cell, and the pbc keyword indicates the periodic boundary conditions. The Properties keyword defines the properties of the atoms, such as the species and positions. The subsequent lines contain the atomic species and their corresponding x, y, and z coordinates.
VASP POSCAR File Format¶
POSCAR file format is used by the Vienna Ab initio Simulation Package (VASP) to represent crystal structures. The POSCAR file contains information about the unit cell parameters, atomic coordinates, and lattice vectors of a crystal structure. The POSCAR file format is a plain text file that can be opened and edited using a text editor. The POSCAR file format is widely used in computational materials science and solid-state physics for performing electronic structure calculations and molecular dynamics simulations.
An example of POSCAR file format for Zn is shown below:
Zn2
1.0
2.6143598557 0.0000000000 0.0000000000
-1.3071799278 2.2641020496 0.0000000000
0.0000000000 0.0000000000 4.8731608391
Zn
2
Direct
0.333333343 0.666666687 0.250000000
0.666666657 0.333333313 0.750000000In this example, the first line of the POSCAR file contains the name of the system. The second line is the scaling factor for the lattice vectors. The next three lines contain the lattice vectors of the unit cell. The subsequent lines contain the atomic species, the number of atoms of each species, and the atomic coordinates in fractional coordinates.