Abstract

Coronavirus disease 2019 (COVID-19) emerged in December 2019, when the
first case was reported in Wuhan, China, and has since turned into a
pandemic with 27 million cases (as of September 9th, 2020). Currently,
there are over 95,000 complete genome sequences
of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the
virus causing COVID-19, in public databases, accompanying a growing
number of studies. Nevertheless, there is still much to learn about
variation in the viral population as the virus evolves and continues to
spread. We analyzed SARS-CoV-2 genomes to identify the most variable
sites, as well as the stable, conserved ones, in samples collected in the
Netherlands until June 2020. We identified the most frequent mutations
in different geographic regions. We also performed a phylogenetic study focused
on the Netherlands to detect novel variants emerging in the late stages
of the pandemic and forming local clusters. We investigated the S and N
proteins of SARS-CoV-2 genomes from the Netherlands and identified the
most variable and most stable sites to guide the development of
diagnostic assays and vaccines. We observed that while the SARS-CoV-2 genome has accumulated
mutations, diverging from the reference sequence, the variation landscape
is dominated by four mutations globally, suggesting that the current
reference does not represent the viruses currently circulating. In addition,
we detected novel variants of SARS-CoV-2 almost unique to the
Netherlands that form localized clusters and region-specific
sub-populations, indicating community spread. We explored SARS-CoV-2
variants in the Netherlands until June 2020 within a global context; our
results provide insight into the viral population diversity for
localized efforts in tracking the transmission of COVID-19, as well as
sequence-based approaches in diagnostics and therapeutics. We emphasize
that little diversity is observed globally in recent samples despite the
increased number of mutations relative to the established reference
sequence. We suggest that sequence-based analyses opt for a consensus
representation that adequately covers the observed genomic variation, to
speed up diagnostics and vaccine design.