Archiving research data
Research data must remain accessible, readable and usable over the long term. For the long-term digital preservation that follows publication, the digital objects must meet certain requirements.
For example, research data are created with different software and are accordingly available in different file formats.
However, not all file formats are equally suited for digital preservation. A general rule of digital preservation is to prefer open, well-documented file formats that are in widespread use (e.g. CSV, XML, DOCX, TXT, PDF/A) over proprietary formats (e.g. XLS, DOC). Certain file formats have been declared suitable for long-term digital archiving (see the table below).
In some cases, digital preservation may require format migration or emulation of a format’s original system environment. These measures may be necessary to keep the data technically interpretable and readable over the long term and to avoid losing information. ZB MED carries out both tasks.
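As an illustration of what a format migration step can look like, the following minimal Python sketch re-encodes a semicolon-delimited Latin-1 text file as UTF-8, comma-separated CSV and then checks that no cell value was lost. The function names, the source encoding and the source delimiter are assumptions chosen for this example; the sketch does not describe the workflow ZB MED or Rosetta actually uses.

```python
import csv
import io


def migrate_to_utf8_csv(raw: bytes,
                        source_encoding: str = "latin-1",
                        source_delimiter: str = ";") -> bytes:
    """Re-encode delimited text as UTF-8, comma-separated CSV (hypothetical helper)."""
    text = raw.decode(source_encoding)
    rows = list(csv.reader(io.StringIO(text, newline=""),
                           delimiter=source_delimiter))
    out = io.StringIO(newline="")
    # The csv module quotes any cell that contains a comma or a quote character.
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue().encode("utf-8")


def same_content(raw: bytes,
                 migrated: bytes,
                 source_encoding: str = "latin-1",
                 source_delimiter: str = ";") -> bool:
    """Return True if the original and the migrated file hold identical cell values."""
    original_rows = list(csv.reader(
        io.StringIO(raw.decode(source_encoding), newline=""),
        delimiter=source_delimiter))
    migrated_rows = list(csv.reader(
        io.StringIO(migrated.decode("utf-8"), newline="")))
    return original_rows == migrated_rows


if __name__ == "__main__":
    raw = "id;name\n1;M\u00fcller, A.\n".encode("latin-1")
    migrated = migrate_to_utf8_csv(raw)
    print(same_content(raw, migrated))
```

Comparing the parsed cell values rather than the raw bytes is deliberate: a migration changes the byte representation by design, so the check has to confirm that the information content is unchanged.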
ZB MED uses the Rosetta system from the company Ex Libris as the technical infrastructure for digital preservation.
For more information, see our pages on digital preservation.
Recommended preservation formats for research data
Type of data | File formats suitable for digital preservation | Standard, widespread file formats | Examples of sources and applications |
Audio | AIFF (*.aiff, *.aif), Matroska (*.mka), MXF (*.mxf), WAVE (*.wav) | AAC (*.aac, *.m4a, *.mp4), AIFF (*.aiff, *.aif), BWF (*.bwf), FLAC (*.flac), Matroska (*.mka), MP3 (*.mp3), MXF (*.mxf), OGG (*.ogg), OPUS (*.opus), WAVE (*.wav) | Interviews, surveys |
Biomaterial data | CSV (*.csv), TXT (*.txt), XML (*.xml) | CSV (*.csv), FASTA (*.fasta), FASTQ (*.fq, *.fastq), PDB (*.pdb, *.ent, *.brk), TXT (*.txt), XLS (*.xls), XML (*.xml) | DNA sequencers, mass spectrometers, microarrays, spectrophotometers |
Classifications, thesauri, codes | PDF/A (*.pdf), XML (*.xml) | DOC (*.doc, *.docx), PDF (*.pdf), XML (*.xml) | Institutions |
Databases | SQL (*.sql) | CSV (*.csv), HDF5 (*.hdf5, *.he5, *.h5), MS Access (*.mdb, *.accdb), dBase (*.dbf), SIARD (*.siard), SQL (*.sql) | Institutions |
Geospatial data | GML (*.gml), MIF/MID (*.mif/ *.mid) | ESRI Shapefiles (*.shp), GML (*.gml), KML (*.kml), MapInfo (*.tab), MID (*.mid), MIF (*.mif) | Vector and raster data |
Image data | JPEG 2000 (*.jp2), PNG (*.png), SVG (*.svg), TIFF (*.tif, *.tiff) | DICOM (*.dcm), EPS (*.eps), GIF (*.gif), Illustrator (*.ai), JPEG 2000 (*.jp2), JPG (*.jpg, *.jpeg), PDF (*.pdf), PNG (*.png), STL (*.stl), SVG (*.svg), TIFF (*.tif, *.tiff) | Cameras, microscopes, MRI and CT scans, ultrasonic, X-ray and sonography instruments |
Image data 3D | OBJ (*.obj, *.mod, in ASCII format), VRML (*.vrml, *.wrl), X3D (*.x3d) | COLLADA (*.dae), DXF (*.dxf), FBX (*.fbx), OBJ (*.obj, *.mod), PLY (*.ply), STL (*.stl), VRML (*.vrml, *.wrl), X3D (*.x3d) | 3D technologies such as stereolithography |
Markup language | XML (*.xml) | HTML (*.html), SGML (*.sgml), XML (*.xml) | Websites |
Sensor data | CSV (*.csv), PDF (*.pdf), TXT (*.txt) | CSV (*.csv), PDF (*.pdf), TXT (*.txt), XLS (*.xls, *.xlsx), XML (*.xml) | Thermal sensors, pressure sensors, polysomnography, ECG, EEG |
Spreadsheets | CSV (*.csv) | CSV (*.csv), ODS (*.ods, *.odt, *.odg, *.odc, *.odf), OOXML (*.docx, *.docm), PDF/A (*.pdf), XLS (*.xls, *.xlsx) | Data from research, clinical care |
Statistical data | CSV (*.csv), R (*.r) | CSV (*.csv), data (*.csv, *.txt), DDI (*.xml), R (*.r), SAS (*.sas7bdat, *.sd2, *.tpt), SPSS (*.sav), SPSS Portable (*.por), STATA (*.dta) | Data from research, clinical care |
Text files | PDF/A (*.pdf), TXT Unicode (*.txt, *.asc, *.c, *.h, *.cpp, *.m, *.py etc. in ASCII format), XML (*.xml) | DOC (*.doc, *.docx), ODT (*.odt), PDF (*.pdf), PowerPoint (*.ppt), RTF (*.rtf), TXT (*.txt) | Documentation, reports, findings, administrative data |
Video | Matroska (*.mkv), MXF (*.mxf) | AVI (*.avi), Matroska (*.mka, *.mkv), MPEG-2 (*.mpg, *.mpeg, *.m2v, *.mpg2), MPEG-4 (*.mp4, *.m4a, *.m4v), MXF (*.mxf), QuickTime (*.mov, *.qt), Windows Media (*.wmv) | Cameras, CT scans, ultrasonic instruments |
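The table can also be applied programmatically, for example to pre-check files before deposit. The following Python sketch is a minimal illustration that covers only four rows of the table (audio, databases, image data, spreadsheets). The dictionary names, the `classify` function and its return labels are assumptions made for this example, and formats that appear in both columns are treated as suitable for preservation. PDF is left out because the file extension alone cannot distinguish PDF from PDF/A.

```python
from pathlib import Path

# Extensions from the "File formats suitable for digital preservation" column
# (subset of the table above, for illustration only).
PRESERVATION_FORMATS = {
    "audio": {".aiff", ".aif", ".mka", ".mxf", ".wav"},
    "databases": {".sql"},
    "image data": {".jp2", ".png", ".svg", ".tif", ".tiff"},
    "spreadsheets": {".csv"},
}

# Extensions listed only in the "Standard, widespread file formats" column.
WIDESPREAD_ONLY = {
    "audio": {".aac", ".m4a", ".bwf", ".flac", ".mp3", ".ogg", ".opus"},
    "databases": {".csv", ".hdf5", ".he5", ".h5", ".mdb", ".accdb",
                  ".dbf", ".siard"},
    "image data": {".dcm", ".eps", ".gif", ".ai", ".jpg", ".jpeg", ".stl"},
    "spreadsheets": {".ods", ".xls", ".xlsx"},
}


def classify(filename: str, data_type: str) -> str:
    """Classify a file by its extension for the given type of data.

    Returns "preservation", "widespread" or "not listed" (labels are
    hypothetical and not part of the ZB MED recommendations).
    """
    extension = Path(filename).suffix.lower()
    if extension in PRESERVATION_FORMATS.get(data_type, set()):
        return "preservation"
    if extension in WIDESPREAD_ONLY.get(data_type, set()):
        return "widespread"
    return "not listed"


if __name__ == "__main__":
    for name, kind in [("scan.TIF", "image data"),
                       ("results.xlsx", "spreadsheets"),
                       ("dump.sql", "databases")]:
        print(name, "->", classify(name, kind))
```

A file classified as "widespread" is not necessarily unsuitable for deposit; the label only signals that the format is not among those the table marks as suitable for digital preservation, so a format migration may be worth considering.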
Sources
ETH Zurich: suitable file formats for digital preservation
DANS: suitable file formats for digital preservation
DARIAH-DE (humanities): suitable file formats for digital preservation
Nestor-Handbook: Digital Curation of Research Data: Experiences of a Baseline Study in Germany
Forschungsdaten-Info (in German)
Contact
Birte Lindstädt
Head of Research Data Management
Phone: +49 (0)221 478-97803
Uta Parmaksiz
Digital Preservation of Research Data
Phone: +49 (0)221 999 892 648
Related links
Digital preservation at ZB MED
Metadata in digital preservation
OAIS