Data Science

Machine Learning

Machine learning aims to find patterns and trends in biomedical datasets. Drawing on a strong background in mathematics, statistics, and computer science, we build supervised models (e.g. linear, tree-based, neural, and Bayesian) and unsupervised models (e.g. clustering, association rules, and dimensionality reduction). Our team applies medical informatics approaches to handle imbalanced datasets and to guide sampling strategies, feature selection, parameter tuning, and model optimization. Potential applications include enhancing treatment selection, improving prognostic accuracy, and subtyping cohorts based on genomics and phenotypes.
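As a minimal illustration of one way to handle an imbalanced dataset, the sketch below computes per-class weights that are inversely proportional to class frequency, using the common heuristic n_samples / (n_classes * class_count). The function name and the toy cohort are assumptions made for this example, not a description of a specific pipeline.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class, inversely proportional to its frequency.

    Heuristic: n_samples / (n_classes * class_count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy cohort: 8 controls (label 0) and 2 cases (label 1).
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)
```

Weights like these can be passed to a learner that supports per-class or per-sample weighting, so that errors on the rare class count for more during training.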

Computational Genomics

We use computational and statistical tools together with state-of-the-art machine learning algorithms to analyze the complex structure of the genome. Our goal is to help our customers unravel the biological processes underlying their DNA. The scale of the problem can grow from monogenic analyses to a whole-genome approach that considers all 3 billion base pairs of the human genome. Our current work focuses on developing polygenic risk scores (PRS) and inferring genetic ancestry from genotype and questionnaire data.
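In its simplest additive form, a polygenic risk score is a weighted sum of effect-allele dosages (0, 1, or 2 copies per variant), with weights taken from association study effect sizes. The sketch below shows that calculation under stated assumptions: the variant IDs, effect sizes, and function name are hypothetical, and a real score would also need steps such as variant matching, allele alignment, and quality control.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Sum effect_size * dosage over variants present in both inputs.

    dosages: variant ID -> number of effect alleles carried (0, 1, or 2).
    effect_sizes: variant ID -> per-allele weight (e.g. a log odds ratio).
    Variants missing from either input are skipped.
    """
    shared = dosages.keys() & effect_sizes.keys()
    return sum(effect_sizes[v] * dosages[v] for v in shared)


# Hypothetical variant IDs and weights, for illustration only.
effect_sizes = {"variant_a": 0.5, "variant_b": -0.25, "variant_c": 0.125}
dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

score = polygenic_risk_score(dosages, effect_sizes)
print(score)
```

Raw scores are usually interpreted relative to a reference population rather than on their own, which this sketch does not attempt.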

Clinical and Public Health Informatics

Analyzing electronic health records (EHR) with statistical and machine learning methods helps identify clinical and public health insights. Working with real-world data typically involves extracting, cleaning, transforming, and loading it (the ETL process) into standardized datasets. Patient-level information ranges from conditions, visits, symptoms, observations, care sites, and drugs all the way to genomics, ultrasounds, X-rays, CT scans, wearables, and social media. Our team has experience performing ETL on biomedical datasets and transforming them into the OMOP Common Data Model (CDM) from the Observational Health Data Sciences and Informatics (OHDSI) consortium. We help customers organize and structure their data into readable formats and annotate it using controlled vocabularies and biomedical ontologies such as ICD, SNOMED, LOINC, and RxNorm. Ultimately, analyses of this data can be showcased in interactive dashboards and interpretable machine learning models.
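To make the transform step of an ETL concrete, the sketch below maps one raw diagnosis record to a dictionary shaped like a row of the OMOP condition_occurrence table. It is an illustration under stated assumptions: the raw field names, the function name, and the concept IDs in the lookup are hypothetical placeholders, and a real mapping would use the OMOP vocabulary tables rather than a hard-coded dictionary.

```python
from datetime import date

# Hypothetical lookup with placeholder concept IDs; real IDs come from
# the OMOP vocabulary tables, not from a hand-written dictionary.
SOURCE_TO_CONCEPT = {"ICD10:E11.9": 1001, "ICD10:I10": 1002}


def to_condition_occurrence(raw):
    """Clean one raw diagnosis record and reshape it as a CDM-style row."""
    # Normalize whitespace and case before looking up the code.
    source_value = raw["code"].strip().upper()
    key = f"{raw['vocabulary'].strip().upper()}:{source_value}"
    return {
        "person_id": int(raw["patient_id"]),
        # 0 marks a source code with no matching standard concept.
        "condition_concept_id": SOURCE_TO_CONCEPT.get(key, 0),
        "condition_start_date": date.fromisoformat(raw["diagnosed_on"]),
        "condition_source_value": source_value,
    }


# Hypothetical raw record as it might arrive from a source system.
raw = {
    "patient_id": "42",
    "vocabulary": "icd10",
    "code": " e11.9 ",
    "diagnosed_on": "2021-03-15",
}
row = to_condition_occurrence(raw)
print(row)
```

Keeping the original code in condition_source_value alongside the mapped concept makes it possible to audit or redo the mapping later.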