Introduction

The CODECHECK process describes a workflow for a reproduction of computations as part of a scientific peer review. CODECHECK follows a set of principles that allow many different variations into concrete implementations. The requirements for a successful CODECHECK are intentionally kept to a minimum, as are the requirements on how codechecking is conducted, or how the procedure is documented. At the end of the CODECHECK stands a CODECHECK report document, written by the codechecker and understandable to a person with some expertise in the scientific field of the related article. Besides the human-readable information in the CODECHEK report, there is a small set of metadata elements that are part of a CODECHECK procedure which are worth capturing in a more structured format.

This metadata is saved in the CODECHECK configuration file, which is specified in this document in version 1.0. The CODECHECK configuration file can serve as the identifier of a CODECHECK bundle, i.e. all the files part of a CODECHECK. The CODECHEK bundle is not formally specified, as its contents are largely at the discretion of the codechecker. The CODECHECK configuration file, however, is formally specified to enable automated extraction and development of tools to support codechecking. Both the author and the codechecker contribute information to the configuration file.

In the future, this information enables both meta-research about code within peer-reviews and more user-friendly assitance systems for authors, codecheckers, and publisher’s staff.

Note

This specification is result of a scientific collaborative project. Help improving it by providing your feedback.

Notational conventions

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” are to be interpreted as described in RFC 2119.

tl;dr for authors

If you are an author and just want to get the minimal codecheck.yml file prepared to the community workflow without reading the whole technical specification, then please use the following template. The template has a bit more than the strictly mandatory fields, but these extra fields are important for you as a creator to get credit. Please validate your final configuration file with a YAML Validator.

---
version: https://codecheck.org.uk/spec/config/1.0/

manifest:
  - file: name of output file 1 (e.g. figure.pdf)
    comment: short description of output file, e.g. ('Figure 1 in the paper', 'Result of running model variant A')
  - file: result.csv
    comment: short description of output file, e.g. ('Figure 1 in the paper', 'Result of running model variant A')

paper:
  title: "A good paper"
  authors:
    - name: Josiah Carberry
      ORCID: 0000-0002-1825-0097
  reference: https://doi.org/preprint.1

Format, name and encoding

Format: YAML 1.1 or later

Name: codecheck.yml

The file MUST be encoded in UTF-8.

Versioning

This document specifies version 1.0. The specification uses a major.minor semantic versioning scheme. Non-breaking changes can be introduced in a minor version release. The latest version of the specification can be found at https://codecheck.org.uk/spec/config/latest.

Storage location

The codecheck.yml is stored at the root of the project folder where all files related to the CODECHECK are saved. The folder where the codecheck.yml file is stored is called the CODECHECK bundle. It is the folder that also includes a directory codecheck (or .codecheck) for all files created during codechecking, see CODECHECK bundle documentation.

Content

Explicit document and directive

The file MUST include three dashes (---), the document start marker, to seperate the directive from document content.

The file SHOULD define the YAML version in the directive. While YAML supports bare (YAML 1.2)/implicit (YAML 1.1) documents, an explicit indication of the format is preferable for the CODECHECK use case. A codecheck.yml file is therefore an explicit document.

Version

The file SHOULD include a root-level node version with a URL denoting the used version of the CODECHECK configuration file specification. If no version is provided, the latest version SHOULD be assumed by software tools, but these tools CAN also abort processing the codecheck.yml with an informative message.

Example

%YAML 1.1
---
version: https://codecheck.org.uk/spec/config/1.0/

Manifest list

The configuration file MUST have a root-level sequence (i.e., a list) of files called manifest that form the manifest. All files part of the manifest must be recreated during a CODECHECK.

Each manifest sequence item MUST have a node file providing the relative path to a file that is part of the computational workflow. The relative paths MUST be relative to the location of the codecheck.yml. Each manifest sequence item MAY have a node comment with human-readable information about said file.

Example

---
version: https://codecheck.org.uk/spec/config/1.0/

manifest:
  - file: outputData.csv
    comment: data/output/one.csv
  - file: fig1.pdf
  - file: resultVectors.txt
  - file: appendix_figures.pdf
    comment: "appendix of paper, starting at page 12"

Author and submission metadata

The configuration file SHOULD include minimal metadata about the paper, i.e. the title and the author(s) of the paper whose workflow is submitted to the CODECHECK. For this information, the configuration file SHOULD have a root-level sequence paper. This information might be added after the CODECHECK or edited, e.g., after publication, therefore all sub-elements are optional.

The element paper SHOULD have a child item title with the title of the submission or publication.

The element paper SHOULD have a child sequence authors. The child nodes of authors sequence are called “author item”. There MUST be at least one author item, which is the corresponding author of the workflow under review. The corresponding author MUST be the first author item in the authors sequence. However, “authors” may be used very broadly and should not only list all authors but can include all types of contributors, e.g., software engineers, infrastructure service staff, etc.

Each author item MUST have a child name with the author’s name. Each author item SHOULD have a child ORCID with the author’s ORCID identifier. The value of the MUST can be the plain ORCID, e.g., 0000-0000-0000-0000, without URL prefix (i.e., without https://orcid.org/...).

If the workflow accompanies a preprinted article or concerns an article under review, a reference to the article SHOULD be put in the node reference under the root-level node paper. Ideally the identifier is a DOI in form of a resolvable URL, or a identifiable text string such as arXiv:2001.10641, or a short text “Under review at X”/”Paper to appear in Y”.

Example

---
# [...]

paper:
  title: "A good paper"
  authors:
    - name: Josiah Carberry
      ORCID: 0000-0002-1825-0097
    - name: John Doe
  reference: https://doi.org/preprint.1

The configuration file CAN have a root level node source with a textual description or a single URL to describe the source of the checked material. The field SHOULD be used if the material used for the check is drawn from multiple sources so that the repository node (see Codecheck metadata) and the metadata accessible via that URL can not sufficiently describe provenance of code or data files.

Example

---
# [...]

source: Data is available at https://download.url/dataset/123456/v2 and code can be found in an attachment to the submitted manuscript.

Codecheck metadata

Further important metadata is created during the CODECHECK process. The codecheck.yml started by the author is extended with this information by the codechecker. If a codechecker changes the meaning of any content provided by the author in the configuration file, they SHOULD clearly mark these changes in the form of a comment, in addition to a transparent record through the file being under version control.

The configuration file MUST include minimal metadata about the codechecker in a root-level sequence codechecker with at least one child element. Each item in the codechecker sequence MUST have one node name with the codechecker’s name. Each item in the codechecker sequence SHOULD have a child ORCID as defined in Author and submission metadata.

The configuration file MUST have a root-level node report with a unique identifier for the published CODECHECK report, such as a URL or DOI, ideally in a resolvable format.

The CODECHECK CAN add further fields with the following names and semantics:

Example

---
manifest:
  - file: outputData.csv
    comment: data/output/one.csv
  - file: fig1.pdf

codechecker:
  - name: S. Eglen
    ORCID: 0000-0001-8607-8025
  - name: Daniel N.
    ORCID: 0000-0002-0024-5046

report: https://doi.org/10.5281/zenodo.3674056
summary: |
    The check was straightforward as all material was provided and documented well, but computations took about 3 hours to run.
repository: https://github.com/codecheckers/Piccolo-2020
check_time: "2019-01-01 13:00:00"
certificate: 2020-001

Additional content

The file codecheck.yml may include any number of other nodes or sequences to support specific instances of a CODECHECK process. For clarity these SHOULD be named in a way that clearly identifies the origin and use case, e.g. by prepending a common prefix to node names or using a single parent node.

Example

---
# [...]

publishing_inc_identifier: 12345
publishing_inc_handler: Ed Editor

TheBestRepository:
  recordId: 1a2b3c
  checksum: cdce90c878462d073b31aec21ccee48e3366250a6baafd215fa73d1c6bc0357b

Minimal example

---
manifest:
  - file: fig1.pdf

Full example

%YAML 1.1
---
version: https://codecheck.org.uk/spec/config/1.0/

manifest:
  - file: outputData.csv
    comment: originally stored at data/output/one.csv
  - file: fig1.pdf
    comment: Figure 1
  - file: resultVectors.txt
    comment: output vectors in plain text format
  - file: appendix_figures.pdf
    comment: "appendix of paper, starting at page 12"

paper:
  title: "A good paper"
  authors:
    - name: Josiah Carberry
      ORCID: 0000-0002-1825-0097
    - name: John Doe
  reference: https://doi.org/preprint.1

codechecker:
  - name: S. Eglen
    ORCID: 0000-0001-8607-8025

report: https://doi.org/abcde.12345

summary: |
  The check was straightforward as all material was provided anddocumented well, but computations took about 3 hours to run.
    
  The created figures seem to match the ones provided in the article. The content of other output files was not checked.
repository:
  - https://github.com/codecheckers/example-workflow
  - https://github.com/codecheckers/example-data
check_time: "2019-01-01 13:00:00"
certificate: 2020-999

More examples can be found in the repositories of the codecheckers organisation on GitHub: https://github.com/codecheckers/.