2022
Functional Genomics Platform, A Cloud-Based Platform for Studying Microbial Life at Scale
Edward E. Seabolt, Gowri Nayar, Harsha Krishnareddy, Akshay Agarwal, Kristen L. Beck, Ignacio Terrizzano, Eser Kandogan, Mark Kunitomi, Mary Roth, Vandana Mukherjee, James H. Kaufman
IEEE/ACM Transactions on Computational Biology and Bioinformatics 19(2), 940-952, 2022
Predicting Epitope Candidates for SARS-CoV-2
Akshay Agarwal, Kristen L. Beck, Sara Capponi, Mark Kunitomi, Gowri Nayar, Edward Seabolt, Gandhar Mahadeshwar, Simone Bianco, Vandana Mukherjee, James H. Kaufman
Viruses 14(8), 2022
Abstract
Epitopes are short amino acid sequences that define the antigen signature to which an antibody or T cell receptor binds. In light of the current pandemic, epitope analysis and prediction are paramount to improving serological testing and developing vaccines. In this paper, known epitope sequences from SARS-CoV, SARS-CoV-2, and other Coronaviridae were leveraged to identify additional antigen regions in 62K SARS-CoV-2 genomes. Additionally, we present epitope distribution across SARS-CoV-2 genomes, locate the most commonly found epitopes, and discuss where epitopes are located on proteins and how epitopes can be grouped into classes. The mutation density of different protein regions is presented using a big data approach. It was observed that there are 112 B cell and 279 T cell conserved epitopes between SARS-CoV-2 and SARS-CoV, with more diverse sequences found in Nucleoprotein and Spike glycoprotein.
2021
Functional profiling of COVID-19 respiratory tract microbiomes
Niina Haiminen, Filippo Utro, Ed Seabolt, Laxmi Parida
Scientific Reports 11(1), 6433, 2021
Abstract
In response to the ongoing global pandemic, characterizing the
molecular-level host interactions of the new coronavirus
SARS-CoV-2 responsible for COVID-19 has been at the center of
unprecedented scientific focus. However, when the virus enters
the body it also interacts with the micro-organisms already
inhabiting the host. Understanding the virus-host-microbiome
interactions can yield additional insights into the biological
processes perturbed by viral invasion. Alterations in the gut
microbiome species and metabolites have been noted during
respiratory viral infections, possibly impacting the lungs via
gut-lung microbiome crosstalk. To better characterize microbial
functions in the lower respiratory tract during COVID-19
infection, we carry out a functional analysis of previously
published metatranscriptome sequencing data of bronchoalveolar
lavage fluid from eight COVID-19 cases, twenty-five
community-acquired pneumonia patients, and twenty healthy
controls. The functional profiles resulting from comparing the
sequences against annotated microbial protein domains clearly
separate the cohorts. By examining the associated metabolic
pathways, distinguishing functional signatures in COVID-19
respiratory tract microbiomes are identified, including decreased
potential for lipid metabolism and glycan biosynthesis and
metabolism pathways, and increased potential for carbohydrate
metabolism pathways. The results include overlap between previous
studies on COVID-19 microbiomes, including decrease in the
glycosaminoglycan degradation pathway and increase in
carbohydrate metabolism. The results also suggest novel
connections to consider, possibly specific to the lower
respiratory tract microbiome, calling for further research on
microbial functions and host-microbiome interactions during
SARS-CoV-2 infection.
Analysis and forecasting of global real time RT-PCR primers and
probes for SARS-CoV-2
Gowri Nayar, Edward E Seabolt, Mark Kunitomi, Akshay Agarwal, Kristen L Beck, Vandana Mukherjee, James H Kaufman
Scientific Reports 11(1), 8988, 2021
Abstract
Rapid tests for active SARS-CoV-2 infections rely on reverse
transcription polymerase chain reaction (RT-PCR). RT-PCR uses
reverse transcription of RNA into complementary DNA (cDNA) and
amplification of specific DNA (primer and probe) targets using
polymerase chain reaction (PCR). The technology makes rapid and
specific identification of the virus possible based on sequence
homology of nucleic acid sequence and is much faster than tissue
culture or animal cell models. However, the technique can lose
sensitivity over time as the virus evolves and the target
sequences diverge from the selective primer sequences. Different
primer sequences have been adopted in different geographic
regions. As we rely on these existing RT-PCR primers to track and
manage the spread of the Coronavirus, it is imperative to
understand how SARS-CoV-2 mutations, over time and
geographically, diverge from existing primers used today. In this
study, we analyze the performance of the SARS-CoV-2 primers in
use today by measuring the number of mismatches between primer
sequence and genome targets over time and spatially. We find that
there is a growing number of mismatches, an increase of 2% per
month, as well as a high specificity of virus based on geographic
location.
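The mismatch measurement described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy sequences, and the choice of scoring a primer by its best ungapped placement (smallest Hamming distance over all same-length windows) are assumptions made for illustration. Reverse complements, ambiguity codes, and indels are ignored.

```python
from collections import defaultdict


def min_mismatches(primer: str, target: str) -> int:
    """Smallest Hamming distance between the primer and any same-length window of target."""
    k = len(primer)
    if k == 0 or len(target) < k:
        raise ValueError("target must be at least as long as a non-empty primer")
    return min(
        sum(a != b for a, b in zip(primer, target[i:i + k]))
        for i in range(len(target) - k + 1)
    )


def mean_mismatches_by_month(primer, genomes):
    """genomes: iterable of (month_label, sequence) pairs.

    Returns a dict mapping each month label to the mean best-placement
    mismatch count of the primer over that month's sequences.
    """
    per_month = defaultdict(list)
    for month, seq in genomes:
        per_month[month].append(min_mismatches(primer, seq))
    return {m: sum(v) / len(v) for m, v in sorted(per_month.items())}


# Hypothetical toy data: one primer and three short "genomes" with collection months.
PRIMER = "ACGTAC"
GENOMES = [
    ("2020-03", "TTACGTACGG"),  # contains the primer exactly
    ("2020-03", "TTACGAACGG"),  # one substitution in the primer site
    ("2020-04", "TTACCAACGG"),  # two substitutions in the primer site
]

if __name__ == "__main__":
    print(mean_mismatches_by_month(PRIMER, GENOMES))
```

Grouping by a region label instead of a month label would give the corresponding geographic view with the same function.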
Semi-Supervised Pipeline for Autonomous Annotation of SARS-CoV-2 Genomes
Kristen L. Beck, Edward Seabolt, Akshay Agarwal, Gowri Nayar, Simone Bianco, Harsha Krishnareddy, Timothy A. Ngo, Mark Kunitomi, Vandana Mukherjee, James H. Kaufman
Viruses 13(12), 2021
Abstract
SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. However, autonomous genome annotation of SARS-CoV-2 genes, proteins, and domains is not readily accomplished by existing methods and results in missing or incorrect sequences. To overcome this limitation, we developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on the use of a single reference genome and by overcoming atypical genomic traits that challenge traditional bioinformatic methods. We analyzed an initial corpus of 66,000 SARS-CoV-2 genome sequences collected from labs across the world using our method and identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction, compared to proteome references, including Replicase polyprotein 1ab (with its transcriptional slippage site). Compared to other published tools, such as Prokka (base) and VAPiD, we yielded a 6.4- and 1.8-fold increase in protein annotations. Our method generated 13,000,000 gene, protein, and domain sequences—some conserved across time and geography and others representing emerging variants. We observed 3362 non-redundant sequences per protein on average within this corpus and described key D614G and N501Y variants spatiotemporally in the initial genome corpus. For spike glycoprotein domains, we achieved greater than 97.9% sequence identity to references and characterized receptor binding domain variants. We further demonstrated the robustness and extensibility of our method on an additional 4000 variant diverse genomes containing all named variants of concern and interest as of August 2021. 
In this cohort, we successfully identified all keystone spike glycoprotein mutations in our predicted protein sequences with greater than 99% accuracy as well as demonstrating high accuracy of the protein and domain annotations. This work comprehensively presents the molecular targets to refine biomedical interventions for SARS-CoV-2 with a scalable, high-accuracy method to analyze newly sequenced infections as they arise.
Re-purposing software for functional characterization of the
microbiome
Laura-Jayne Gardiner, Niina Haiminen, Filippo Utro, Laxmi Parida, Ed Seabolt, Ritesh Krishna, James H Kaufman
Microbiome 9(1), 4, 2021
Abstract
Widespread bioinformatic resource development generates a
constantly evolving and abundant landscape of workflows and
software. For analysis of the microbiome, workflows typically
begin with taxonomic classification of the microorganisms that
are present in a given environment. Additional investigation is
then required to uncover the functionality of the microbial
community, in order to characterize its currently or potentially
active biological processes. Such functional analysis of
metagenomic data can be computationally demanding for
high-throughput sequencing experiments. Instead, we can directly
compare sequencing reads to a functionally annotated database.
However, since reads frequently match multiple sequences equally
well, analyses benefit from a hierarchical annotation tree, e.g.
for taxonomic classification where reads are assigned to the
lowest taxonomic unit.
2020
Monitoring the microbiome for food safety and quality using deep shotgun sequencing
Kristen L Beck, Niina Haiminen, David Chambliss, Stefan Edlund, Mark Kunitomi, B Carol Huang, Nguyet Kong, Balasubramanian Ganesan, Robert Baker, Peter Markwell, Ban Kawas, Matthew Davis, Robert J Prill, Harsha Krishnareddy, Ed Seabolt, Carl H Marlowe, Sophie Pierre, André Quintanar, Laxmi Parida, Geraud Dubois, James Kaufman, Bart C Weimer
bioRxiv, Cold Spring Harbor Laboratory, 2020
Abstract
In this work, we hypothesized that shifts in the food microbiome can be used as an indicator of unexpected contaminants or environmental changes. To test this hypothesis, we sequenced total RNA of 31 high protein powder (HPP) samples of poultry meal pet food ingredients. We developed a microbiome analysis pipeline employing a key eukaryotic matrix filtering step that improved microbe detection specificity to >99.96% during in silico validation. The pipeline identified 119 microbial genera per HPP sample on average with 65 genera present in all samples. The most abundant of these were Bacteroides, Clostridium, Lactococcus, Aeromonas, and Citrobacter. We also observed shifts in the microbial community corresponding to ingredient composition differences. When comparing culture-based results for Salmonella with total RNA sequencing, we found that Salmonella growth did not correlate with multiple sequence analyses. We conclude that microbiome sequencing is useful to characterize complex food microbial communities, while additional work is required for predicting specific species' viability from total RNA sequencing. Competing Interest Statement: The authors were employed by private or academic organizations as described in the author affiliations at the time this work was completed. IBM Corporation, Mars Incorporated, and Bio-Rad Laboratories are members of the Consortium for Sequencing the Food Supply Chain. The authors declare no other competing interests.
Functional pathways in respiratory tract microbiome separate COVID-19 from community-acquired pneumonia patients
Niina Haiminen, Filippo Utro, Ed Seabolt, Laxmi Parida
bioRxiv, Cold Spring Harbor Laboratory, 2020
Abstract
In response to the global pandemic of the last four months, some progress has been made in understanding the molecular-level host interactions of the new coronavirus SARS-CoV-2 responsible for COVID-19. However, when the virus enters the body it interacts not only with the host but also with the micro-organisms already inhabiting the host. Understanding the virus-host-microbiome interactions can yield additional insights into the biological processes perturbed by the viral invasion. We carry out a comparative functional analysis of bronchoalveolar lavage fluid of eight COVID-19, twenty-five community-acquired pneumonia (CAP) patients and twenty healthy controls. The resulting functional profiles clearly separate the cohorts, even more sharply than just their corresponding taxonomic profiles. We also detect distinct pathway signatures in the respiratory tract microbiome that consistently distinguish COVID-19 patients from both the CAP and healthy cohorts. These include increased vitamin, drug, nucleotide, and energy metabolism during SARS-CoV-2 infection, contrasted with decreased amino acid and carbohydrate metabolism. This comparative analysis indicates consistent differences in COVID-19 respiratory tract metatranscriptomes compared to CAP and healthy samples. Competing Interest Statement: The PRROMenade methodology is associated with patent applications currently pending review at the USPTO.
IBM Functional Genomics Platform, A Cloud-Based Platform for Studying Microbial Life at Scale
Ed Seabolt, Gowri Nayar, Harsha Krishnareddy, Akshay Agarwal, Kristen L. Beck, Eser Kandogan, Mark Kunitomi, Mary Roth, Ignacio Terrizzano, James Kaufman, Vandana Mukherjee
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1--1, Institute of Electrical and Electronics Engineers (IEEE), 2020
Abstract
The rapid growth in biological sequence data is revolutionizing our understanding of genotypic diversity and challenging conventional approaches to informatics. With the increasing availability of genomic data, traditional bioinformatic tools require substantial computational time and the creation of ever-larger indices each time a researcher seeks to gain insight from the data. To address these challenges, we pre-computed important relationships between biological entities spanning the Central Dogma of Molecular Biology and captured this information in a relational database. The database can be queried across hundreds of millions of entities and returns results in a fraction of the time required by traditional methods. In this paper, we describe IBM Functional Genomics Platform (formerly known as OMXWare), a comprehensive database relating genotype to phenotype for bacterial life. Continually updated, IBM Functional Genomics Platform today contains data derived from 200,000 curated, self-consistently assembled genomes. The database stores functional data for over 68 million genes, 52 million proteins, and 239 million domains with associated biological activity annotations from Gene Ontology, KEGG, MetaCyc, and Reactome. IBM Functional Genomics Platform maps all of the many-to-many connections between each biological entity including the originating genome, gene, protein, and protein domain. Various microbial studies, from infectious disease to environmental health, can benefit from the rich data and connections. We describe the data selection, the pipeline to create and update the IBM Functional Genomics Platform, and the developer tools (Python SDK and REST APIs) which allow researchers to efficiently study microbial life at scale.
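The idea of pre-computing many-to-many relationships in a relational database can be sketched with Python's built-in sqlite3 module. The schema, table names, and rows below are hypothetical and chosen only for illustration; they are not the platform's actual schema, SDK, or REST API.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical genome/protein schema."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE genome  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        -- Link table: one row per (genome, protein) pair, stored ahead of query time.
        CREATE TABLE genome_protein (
            genome_id  INTEGER NOT NULL REFERENCES genome(id),
            protein_id INTEGER NOT NULL REFERENCES protein(id),
            PRIMARY KEY (genome_id, protein_id)
        );
        """
    )
    con.executemany(
        "INSERT INTO genome VALUES (?, ?)",
        [(1, "genome_A"), (2, "genome_B"), (3, "genome_C")],
    )
    con.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [(10, "protein_X"), (11, "protein_Y")],
    )
    con.executemany(
        "INSERT INTO genome_protein VALUES (?, ?)",
        [(1, 10), (2, 10), (2, 11), (3, 11)],
    )
    return con


def genomes_with_protein(con: sqlite3.Connection, protein_name: str) -> list:
    """Return the names of all genomes linked to the named protein, sorted."""
    rows = con.execute(
        """
        SELECT g.name
        FROM genome AS g
        JOIN genome_protein AS gp ON gp.genome_id = g.id
        JOIN protein AS p ON p.id = gp.protein_id
        WHERE p.name = ?
        ORDER BY g.name
        """,
        (protein_name,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    con = build_toy_db()
    print(genomes_with_protein(con, "protein_X"))
```

Because the link table is populated once, a lookup becomes an indexed join rather than a fresh sequence search, which is the trade-off the abstract describes.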
Hierarchically Labeled Database Indexing Allows Scalable Characterization of Microbiomes
Filippo Utro, Niina Haiminen, Enrico Siragusa, Laura-Jayne Gardiner, Ed Seabolt, Ritesh Krishna, James H Kaufman, Laxmi Parida
iScience 23(4), 100988, 2020
Abstract
Increasingly available microbial reference data allow interpreting the composition and function of previously uncharacterized microbial communities in detail, via high-throughput sequencing analysis. However, efficient methods for read classification are required when the best database matches for short sequence reads are often shared among multiple reference sequences. Here, we take advantage of the fact that microbial sequences can be annotated relative to established tree structures, and we develop a highly scalable read classifier, PRROMenade, by enhancing the generalized Burrows-Wheeler transform with a labeling step to directly assign reads to the corresponding lowest taxonomic unit in an annotation tree. PRROMenade solves the multi-matching problem while allowing fast variable-size sequence classification for phylogenetic or functional annotation. Our simulations with 5% added differences from reference indicated only 1.5% error rate for PRROMenade functional classification. On metatranscriptomic data PRROMenade highlighted biologically relevant functional pathways related to diet-induced changes in the human gut microbiome.
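One way to picture the multi-matching problem is a read that matches several references equally well and is then assigned to the deepest node of the annotation tree shared by all of them. The Python sketch below illustrates only that tree step, under the assumption that "lowest taxonomic unit" means the lowest common ancestor of the matched labels. It does not reproduce PRROMenade or its Burrows-Wheeler index, and the tree and function names are hypothetical.

```python
def path_to_root(node, parent):
    """Return [node, parent(node), ..., root] using a child -> parent mapping."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_node(labels, parent):
    """Deepest tree node that is an ancestor of (or equal to) every label."""
    labels = list(labels)
    if not labels:
        raise ValueError("at least one label is required")
    # Candidate nodes, ordered from the first label up to the root.
    common = path_to_root(labels[0], parent)
    for label in labels[1:]:
        ancestors = set(path_to_root(label, parent))
        # Filtering keeps the deepest-first order of the candidates.
        common = [n for n in common if n in ancestors]
    if not common:
        raise ValueError("labels do not share a root")
    return common[0]


# Hypothetical toy annotation tree (child -> parent).
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
}

if __name__ == "__main__":
    # A read matching two Escherichia species is assigned to the genus.
    print(lowest_common_node(["Escherichia coli", "Escherichia albertii"], PARENT))
```

The same function applies to a functional hierarchy if the mapping describes functional categories instead of taxa.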
Integrative and Conjugative Elements (ICE) and Associated Cargo Genes within and across Hundreds of Bacterial Genera
James H Kaufman, Ignacio Terrizzano, Gowri Nayar, Ed Seabolt, Akshay Agarwal, Ilya B Slizovskiy, Noelle Noyes
bioRxiv, Cold Spring Harbor Laboratory, 2020
Abstract
Horizontal gene transfer mediated by integrative and conjugative elements (ICE) is considered an important evolutionary mechanism of bacteria. It allows organisms to quickly evolve new phenotypic properties including antimicrobial resistance (AMR) and virulence. The rate of ICE-mediated cargo gene exchange has not yet been comprehensively studied within and between bacterial taxa. In this paper we report a big data analysis of ICE and associated cargo genes across over 200,000 bacterial genomes representing 1,345 genera. Our results reveal that half of bacterial genomes contain one or more known ICE features ("ICE genomes"), and that the associated genetic cargo may play an important role in the spread of AMR genes within and between bacterial genera. We identify 43 AMR genes that appear only in ICE genomes and never in non-ICE genomes. A further set of 95 AMR genes are found >5x more often in ICE versus non-ICE genomes. In contrast, only 29 AMR genes are observed more frequently (at least 5:1) in non-ICE genomes compared to ICE genomes. Analysis of NCBI antibiotic susceptibility assay data reveals that ICE genomes are also over-represented amongst phenotypically resistant isolates, suggesting that ICE processes are critical for both genotypic and phenotypic AMR. These results, as well as the underlying big data resource, are important foundational tools for understanding bacterial evolution, particularly in relation to important bacterial phenotypes such as AMR. Competing Interest Statement: The authors have declared no competing interest.
2018
Context Analytics: Vision, Architecture, Opportunity
E Kandogan, M Roth, I Terrizzano, E Seabolt, P Schwarz, H Krishnareddy, A Agarwal
2018 IEEE 34th International Conference on Data Engineering Workshops (ICDEW), pp. 1--8
Abstract
Business actions are often situated in a complex system of people, data and software, and the pace and quality of decisions often rely on how well knowledge work is coordinated. A tremendous amount of contextual data exists within an enterprise in data management, business analytics, and visualization systems, enterprise applications, and collaboration and social networking tools that capture the flow of 'work' accurately and completely across people, data and software. We believe that if such contextual data is captured and integrated, it offers significant potential to support knowledge work and to accelerate the productivity of knowledge workers. In this paper, we argue for context analytics, broadly referring to analytics on knowledge work and activity. We propose a context graph to flexibly represent knowledge work, including people, data assets, and tools, as well as the context around them. We also propose a reference architecture that is specifically designed for integration and analytics, illustrate how to populate the context graph with contextual data from a variety of systems, and show how analytics can be flexibly computed over the graph such that the graph serves as both input and output. Finally, we describe techniques such as contextual search and activity summarization to create a contextual user experience for discovery, governance, and collaboration in an enterprise setting.
Contextual Intelligence for Unified Data Governance
Ed Seabolt, Eser Kandogan, Mary Roth
Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Association for Computing Machinery, 2018
Abstract
Current data governance techniques are very labor-intensive, as teams of data stewards typically rely on best practices to transform business policies into governance rules. As data plays an increasingly key role in today's data-driven enterprises, current approaches do not scale to the complexity and variety present in the data ecosystem of an enterprise as an increasing number of data requirements, use cases, applications, tools and systems come into play. We believe techniques from artificial intelligence and machine learning have potential to improve discoverability, quality and compliance in data governance. In this paper, we propose a framework for 'contextual intelligence', where we argue for (1) collecting and integrating contextual metadata from a variety of sources to establish a trusted unified repository of contextual data use across users and applications, and (2) applying machine learning and artificial intelligence techniques over this rich contextual metadata to improve discoverability, quality and compliance in governance practices. We propose an architecture that unifies governance across several systems, with a graph serving as a core repository of contextual metadata, accurately representing data usage across the enterprise and facilitating machine learning. We demonstrate how our approach can enable ML-based recommendations in support of governance best practices.
Exploiting Functional Context in Biology: Reconsidering Classification of Bacterial Life
J H Kaufman, E Seabolt, M Kunitomi, A Agarwal, K Beck, H Krishnareddy, B C Weimer
2018 IEEE 34th International Conference on Data Engineering Workshops (ICDEW), pp. 17--20
Abstract
Ontologies are built in various domains such as biology, chemistry, and business. Ontologies as knowledge bases have great potential to serve as providers of context for analytics not only to yield more relevant results but also to provide meaning in explaining results. Simply put, analysis without context ignores the underlying meaning in data. In this paper, we discuss one important example, how traditional classification of organisms in biology can become obsolete given the tremendous amount of genetic data now being analyzed under the lens of gene ontologies. Gene ontologies provide a functional context to how organisms operate and perform the functions of life. Ontologies such as gene ontologies encapsulate collective intelligence of scientists based on many decades of work. In this paper, we put forth a vision of contextual analytics in the field of genetics powered with big data and describe blueprints of a new approach to classification. There are many dimensions to biological function. We demonstrate how the cellular processes of interest can provide the context for classification of thousands of organisms based on their functional potential.
2003
Alkaline Hydrolysis of O,S-Diethyl Phenylphosphonothioate and p-Nitrophenyl Diethyl Phosphate in Latex Dispersions
Edward E. Seabolt, Warren T. Ford
Langmuir 19(13), 5378-5382, 2003
Abstract
Rates of hydrolysis of O,S-diethyl phenylphosphonothioate (DEPP) and p-nitrophenyl diethyl phosphate (Paraoxon) in 0.1 M aqueous NaOH dispersions of a cross-linked poly(2-ethylhexyl methacrylate) latex containing styrylmethyl(trimethyl)ammonium chloride units were measured by 31P NMR spectroscopy. The reactions followed second-order kinetics to 75% conversion. The rate constants of hydrolysis of both DEPP and Paraoxon were up to six times greater than those in the absence of the latex. Hydrolysis of DEPP gave an 85/15 ratio of products from P-S versus P-O bond breaking in the absence of latex and a 90/10 ratio in the presence of latex. 31P NMR relaxation times and visual observations show that DEPP, Paraoxon, the products of DEPP hydrolysis, and p-nitrophenoxide ion all partition from water into the latex. The diethyl phosphate ion that is produced from Paraoxon partitions into water. The kinetics at these high concentrations of DEPP and Paraoxon do not fit the enzymelike and ion exchange models that have been applied to the kinetics of reactions of lower concentrations of substrates in latex dispersions.
2002
Effects of Filler Particle/Elastomer Distribution and Interaction on Composite Mechanical Properties
Liliane Bokobza, Gilles Garnaud, James E Mark, Jagdish M Jethmalani, Edward E Seabolt, Warren T Ford
Chemistry of Materials 14(1), 162--167, 2002
Abstract
Some new characterization results are reported for composites prepared from methyl acrylate monomer and from reinforcing silica particles at various degrees of dispersion. In some cases, 3-(trimethoxysilyl)propyl methacrylate groups were grafted onto the silica (PMA) through participation in the methyl acrylate polymerization used to form an elastomeric PMA matrix. In some cases, the usual random dispersion of the silica particles was “aged” or converted into regular arrays within the monomer prior to its polymerization. As an alternative, placing chloropropyltrimethoxysilane groups on the particle surfaces was used to obtain random arrangements in which the strong bonding between the particles and elastomer was suppressed. In another approach, the particle dispersion was first dried and then blended into the monomer before its polymerization, thereby giving an aggregated arrangement. These various composites were characterized with regard to their mechanical properties in elongation (using techniques allowing a close approach to elastic equilibrium), and with regard to chain orientation (using birefringence measurements and infrared spectroscopy). The elastomers having randomly dispersed and regularly dispersed silica dispersions were very similar in mechanical properties and chain orientation, but extensibility was significantly improved by decreasing the strength of the particle-elastomer bonding. Additional improvements in extensibility, and associated increases in toughness, were obtained when these same particles were aggregated.
1999
Hydrolysis of nerve agent analogs in presence of hydroxide and highly lipophilic cationic polymer latex
Edward Eugene Seabolt
Master's Thesis, 1999