Vasu Menon

Investigation of the Genetics of Coronaviruses and the Similarities Between Them

This project was done in high school in collaboration with Sohini Das and Eva Engel, and was presented at the North Carolina Junior Science and Humanities Symposium (NCJSHS). It was mentored by Dr. Eric Chi and sponsored by the National Science Foundation.


Overview

In December 2020, Dr. Tomokazu Konishi published a paper that used principal component analysis (PCA) to compare the genetics of coronaviruses and influenza, revealing clear structure in coronavirus genetic sequences. Our research asked two questions: does this structure persist when other dimension reduction methods are applied, and do any underlying patterns reveal further relationships among coronaviruses?

We hypothesized that re-applying PCA and diffusion mapping within the resulting clusters and subclusters would reveal additional structure in the data.

Methods & Results

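Coronavirus genetic sequences (symbols A, G, C, T, and -) were first one-hot encoded, so that each sequence position became a five-element indicator vector. We then applied PCA and diffusion mapping iteratively, repeating both methods within the clusters they produced.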
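As a rough illustration of the encoding and PCA steps, here is a minimal NumPy sketch. It is not the project's actual code: the function names, the toy sequences, and the choice of SVD-based PCA are all assumptions made for the example, and it assumes the sequences are already aligned to equal length, with "-" treated as an alignment gap.

```python
import numpy as np

ALPHABET = "AGCT-"  # symbols listed in the text; "-" is assumed to be an alignment gap


def one_hot_encode(sequences):
    """Turn equal-length sequences into a (n_sequences, length * 5) 0/1 matrix."""
    index = {symbol: i for i, symbol in enumerate(ALPHABET)}
    length = len(sequences[0])
    encoded = np.zeros((len(sequences), length, len(ALPHABET)))
    for row, seq in enumerate(sequences):
        if len(seq) != length:
            raise ValueError("sequences must be aligned to the same length")
        for pos, symbol in enumerate(seq.upper()):
            encoded[row, pos, index[symbol]] = 1.0
    # Flatten each sequence's (length, 5) indicator grid into one feature vector.
    return encoded.reshape(len(sequences), -1)


def pca(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T


# Toy sequences invented for illustration only (not real viral data).
toy_sequences = ["AGCT-AGC", "AGCTTAGC", "TTGACC-A", "TTGACCGA"]
X = one_hot_encode(toy_sequences)
scores = pca(X, n_components=2)
```

In this toy example the first two sequences differ at one position and the last two differ at one position, so the first principal component separates the two pairs. Applying the method iteratively would amount to calling `pca` again on only the rows belonging to one cluster.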

  • PCA and diffusion maps identified 4 distinct clusters of coronaviruses:
    • Cluster 1 - Sarbecoviruses (SARS-related and bat coronaviruses)
    • Cluster 2 - Orthocoronavirinae (bird, bat, and mammal coronaviruses)
    • Cluster 3 - Embecoviruses (rodent coronaviruses)
    • Cluster 4 - Betacoronavirus 1 (large mammal coronaviruses)
  • Viruses within clusters and subclusters aligned closely with existing taxonomic classifications (e.g., subcluster 2a = Alphacoronaviruses, 2b = Buldecoviruses, 2c = Merbecoviruses).
  • Certain unclassified viruses showed strong similarities to classified viruses (e.g., an unclassified Merbecovirus in subcluster 2c clustered with Pipistrellus bat coronavirus HKU5).
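The diffusion-mapping step used alongside PCA to find these clusters can be sketched in a similar spirit. The version below is a basic textbook diffusion map, assuming a Gaussian kernel and a median-distance bandwidth; the kernel, bandwidth rule, function name, and toy points are assumptions for illustration, since the summary does not state which settings the project used.

```python
import numpy as np


def diffusion_map(X, n_components=2, epsilon=None, t=1):
    """Embed the rows of X with a basic diffusion map (Gaussian kernel)."""
    # Pairwise squared Euclidean distances between rows.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    if epsilon is None:
        # Assumed heuristic: median of the nonzero squared distances.
        epsilon = np.median(sq_dists[sq_dists > 0])
    K = np.exp(-sq_dists / epsilon)
    d = K.sum(axis=1)
    # Symmetric conjugate of the Markov matrix P = D^-1 K; it has the same eigenvalues.
    A = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Convert to right eigenvectors of P, then drop the first (constant) one.
    psi = eigvecs / np.sqrt(d)[:, None]
    return psi[:, 1:n_components + 1] * eigvals[1:n_components + 1] ** t


# Toy feature matrix standing in for one-hot encoded sequences (hypothetical data).
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
embedding = diffusion_map(points, n_components=1)
```

With two well-separated groups of points, the sign of the first diffusion coordinate splits them into two clusters. Subclusters could then be sought by running `diffusion_map` again on the rows of a single cluster.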

Implications

These findings could be used to classify novel coronaviruses, identify strains with pandemic potential, predict the genetic structure of future coronavirus variants, and examine host-jumping patterns among coronaviruses.