Recent and Current Projects
Removing Bias in Sequence Models of Protein Fitness, Doctoral Thesis Topic
Unsupervised sequence models of protein fitness have emerged as powerful tools for designing therapeutics and industrial enzymes, yet they are strongly biased towards designs that are close to their training data. This bias hinders their ability to generate functional sequences far from natural sequences, as is often required when designing new functions. To address this problem, we introduce a de-biasing approach that enables the comparison of protein sequences across mutational depths, overcoming the sequence similarity bias in natural sequence models. We demonstrate that our method improves natural sequence model predictions of experimentally measured variant function across mutational depths. Using case studies of proteins in which very few variants remain functional far from the wild type, we show that our method improves the recovery of top-performing variants in these sparsely functional regimes. Our method is applicable to any unsupervised fitness prediction model, any function, and any protein, and can thus be easily incorporated into any computational protein design pipeline. These studies could lead to more efficient and cost-effective computational methods for designing diverse functional proteins, and could inform experimental library design to take full advantage of machine learning capabilities.
Pre-print
Deep Learning Prediction of Enzyme Optimum pH, Doctoral Research Topic
Compiled database of 200+ measurements of point-mutation effects on pH tolerance across 50 enzymes
Developed large language modeling methods to infer biological drivers of pH tolerance in enzymes. Major contributor.
Pre-print
Protein Design using Structure-based Residue Preferences, Doctoral Research Topic
An unsupervised design approach that learns residue mutation preferences from local structural dependencies. Major contributor.
Co-author paper submitted to Nature Structural & Molecular Biology June 2023.
Pre-print
An in silico method to assess antibody fragment polyreactivity, Graduate Research Topic
Used AWS servers to host an online machine learning model to predict nanobody poly-specificity and visualize sequence biometrics. Users can visit
http://18.224.60.30:3000/ to input nanobody sequences and get predictions. Work published in Nature Communications:
Harvey, E.P., Shin, JE., Skiba, M.A. et al. An in silico method to assess antibody fragment polyreactivity. Nat Commun 13, 7554 (2022).
https://doi.org/10.1038/s41467-022-35276-4
Learning PET hydrolase activity from sparse experiment data, Graduate Research Topic
Developed machine learning methods to learn and predict from sparse, disparate enzyme activity data. Work in collaboration with the National Renewable Energy Laboratory (NREL) to develop plastic-eating enzymes.
First co-author paper will be submitted in July 2023.
Past Projects
Vertical Resolution Requirements to Simulate Transpacific Ozone Pollution, Graduate Research Topic
Observations show that chemical plumes injected to high altitudes (the free troposphere) by
convection, volcanoes, or stratospheric intrusions can retain their identity as well-defined
vertical layers for a week or more as they are transported on intercontinental scales. Global
atmospheric models fail to reproduce these persistent plumes due to rapid numerical diffusion.
Under realistic shear flow, plumes filament down to the grid scale where any advection scheme
collapses to first order and rapid numerical dissipation occurs. Eastham and Jacob (2017) found
that the primary limitation to resolving the plumes was vertical resolution, as general
circulation models (GCMs) have prioritized increasing horizontal resolution over vertical
resolution. The new 132-level (L132) GEOS-5 model has the potential to improve the transport of transcontinental plumes
and to better represent intercontinental influences on non-linear surface ozone chemistry. Improvement in transport is
evaluated via comparisons between inert tracer plumes simulated in L132 and the 72-level
GEOS-5 model (L72). I focus on transpacific Asian pollution plumes transported to North
America, for which we have extensive complementary aircraft, satellite, surface, and ozonesonde
measurements, to evaluate the L132 model's quantification of Asian pollution influence on
western US surface ozone.
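The resolution dependence of numerical diffusion described above can be illustrated with a toy problem. The sketch below is not GEOS-5 code: it is a minimal 1-D example, assuming a periodic domain, a constant wind, and a first-order upwind scheme (the order to which schemes collapse at the grid scale). The function names, the plume width, and the domain size are all hypothetical choices made for illustration.

```python
def upwind_advect(q, courant, steps):
    """First-order upwind advection on a periodic 1-D grid (flow toward higher index)."""
    for _ in range(steps):
        # q[i - 1] wraps around at i = 0, which gives the periodic boundary.
        q = [q[i] - courant * (q[i] - q[i - 1]) for i in range(len(q))]
    return q


def plume_peak_after_transport(cells_per_unit, domain=100, plume_width=4,
                               distance=100, courant=0.5):
    """Peak value of a unit-amplitude plume after transport over `distance` length units.

    `cells_per_unit` sets the grid resolution; every other quantity is held
    fixed in physical units so that runs at different resolutions are comparable.
    """
    n = domain * cells_per_unit
    start = (domain // 2 - plume_width // 2) * cells_per_unit
    width = plume_width * cells_per_unit
    q = [1.0 if start <= i < start + width else 0.0 for i in range(n)]
    steps = round(distance * cells_per_unit / courant)
    return max(upwind_advect(q, courant, steps))


coarse_peak = plume_peak_after_transport(cells_per_unit=1)
fine_peak = plume_peak_after_transport(cells_per_unit=2)
print(f"coarse grid peak: {coarse_peak:.3f}")
print(f"fine grid peak:   {fine_peak:.3f}")
```

In this toy setting the finer grid retains more of the plume's peak, since the effective diffusion of the upwind scheme shrinks with the grid spacing. A real model adds shear, three dimensions, and higher-order schemes, so this only shows the direction of the effect.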
I primarily identified GEOS-5 model bugs and physical inconsistencies, and collaborated with NASA scientists to fix them and ensure valid model
performance and output. To do this, I developed an understanding of the FORTRAN 77 and 90 model framework and, in particular,
familiarized myself with the code structure of the chemistry and dynamics modules.
Languages: Python (for analysis), FORTRAN
Symposium Presentation on Research
Dog Prediction Neural Networks, Fall 2018 AC209a
We optimized and compared the ability of ResNets, convolutional neural networks, and artificial neural networks
to predict the breeds of 20,000+ purebred dogs. We used Keras to implement and test
our networks.
Language: Python
Project Website
Ozone Laminae Prediction Algorithm, Fall 2017 EPS236
I developed an algorithm to detect ozone laminae off the coast of Northern California, using data from Trinidad
Head, CA ozonesondes. The algorithm was able to filter out high-frequency noise, define the free troposphere, and
recognize high ozone peaks that fit the criteria of free tropospheric ozone laminae.
Language: R
Presentation Download
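The three steps of the ozone laminae algorithm (noise filtering, defining the free troposphere, peak recognition) can be sketched as follows. This is a hedged illustration in Python rather than the original R code, and it is not the project's actual method: the moving-average filter, the fixed 3 to 10 km altitude window, the 15 ppb excess threshold, and all function and parameter names are assumptions chosen for the example. The sounding is synthetic, not Trinidad Head data.

```python
import math


def moving_average(values, window):
    """Centered moving average; the window shrinks near the ends of the profile."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def find_laminae(alt_km, ozone_ppb, ft_bottom_km=3.0, ft_top_km=10.0,
                 smooth_window=5, background_window=41, min_excess_ppb=15.0):
    """Return (altitude, ozone) pairs for candidate free tropospheric laminae.

    All thresholds are illustrative assumptions, not published criteria.
    """
    smooth = moving_average(ozone_ppb, smooth_window)        # step 1: filter noise
    background = moving_average(smooth, background_window)   # slowly varying profile
    laminae = []
    for i in range(1, len(smooth) - 1):
        in_free_troposphere = ft_bottom_km <= alt_km[i] <= ft_top_km   # step 2
        is_peak = smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
        excess = smooth[i] - background[i]
        if in_free_troposphere and is_peak and excess >= min_excess_ppb:  # step 3
            laminae.append((alt_km[i], smooth[i]))
    return laminae


def layer(z, center_km, amplitude_ppb, sd_km=0.3):
    """Gaussian ozone layer used to build the synthetic sounding."""
    return amplitude_ppb * math.exp(-0.5 * ((z - center_km) / sd_km) ** 2)


# Synthetic sounding: 50 ppb baseline, a layer at 6 km (free troposphere),
# a layer at 1 km (below the assumed window), and alternating +/-2 ppb noise.
alt_km = [i / 10 for i in range(121)]
ozone_ppb = [50 + layer(z, 6.0, 40) + layer(z, 1.0, 30) + (2 if i % 2 == 0 else -2)
             for i, z in enumerate(alt_km)]

laminae = find_laminae(alt_km, ozone_ppb)
print(laminae)
```

On this synthetic profile the layer at 1 km is excluded only because it falls outside the assumed altitude window, which is why the definition of the free troposphere matters as much as the peak criterion.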
Analysis of Advection Schemes for Application in a Turbulent Propeller Wake, Fall 2018 AM205
We coded and tested three advection schemes, Essentially Non-Oscillatory (ENO), Superbee, and
Monotonic Upstream-centered Scheme for Conservation Laws (MUSCL), using standard 1-D and 2-D test problems. We applied the
lowest-error schemes to a steady-state velocity field produced by a weather balloon propeller in the stratosphere.
Languages: Python, MATLAB
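As an illustration of the kind of 1-D test used to compare advection schemes, the sketch below advects a square pulse once around a periodic domain using a flux-limited scheme with the Superbee limiter, and compares it with first-order upwind. It is a minimal sketch under stated assumptions (constant velocity, Courant number between 0 and 1, a hypothetical 6-cell pulse), not the project's code, and it covers only the Superbee limiter, not ENO or MUSCL.

```python
def superbee(r):
    """Superbee flux limiter: phi(r) = max(0, min(2r, 1), min(r, 2))."""
    return max(0.0, min(2.0 * r, 1.0), min(r, 2.0))


def no_limiter(r):
    """phi = 0 removes the correction and leaves first-order upwind."""
    return 0.0


def step(q, c, limiter):
    """One step of flux-limited advection on a periodic grid.

    Flow is toward higher index and c is the Courant number (0 <= c <= 1).
    """
    n = len(q)
    corr = []  # limited second-order correction at each cell's right face
    for i in range(n):
        dq_up = q[i] - q[i - 1]          # difference across the left (upwind) face
        dq_dn = q[(i + 1) % n] - q[i]    # difference across the right face
        phi = limiter(dq_up / dq_dn) if dq_dn != 0.0 else 0.0
        corr.append(0.5 * c * (1.0 - c) * phi * dq_dn)
    return [q[i] - c * (q[i] - q[i - 1]) - (corr[i] - corr[i - 1]) for i in range(n)]


def advect(q, c, steps, limiter):
    for _ in range(steps):
        q = step(q, c, limiter)
    return q


# Square pulse, 6 cells wide, advected exactly once around a 100-cell domain.
pulse = [1.0 if 47 <= i < 53 else 0.0 for i in range(100)]
upwind_result = advect(pulse, 0.5, 200, no_limiter)
superbee_result = advect(pulse, 0.5, 200, superbee)
print(f"upwind peak:   {max(upwind_result):.3f}")
print(f"superbee peak: {max(superbee_result):.3f}")
```

Passing the limiter as a function keeps the update formula shared between schemes, so other limiters can be compared on the same test by swapping one argument.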
Verification of Goldbach’s Conjecture, Spring 2018 CS205
We designed a simple algorithm in C for verifying Goldbach's conjecture and developed several parallel
implementations of the code to identify the best strategies for tackling the problem as integer size increases.
We tested the following forms of parallelism: OpenMP shared-memory parallelism, MPI distributed-memory parallelism,
hybrid MPI-OpenMP parallelism, and OpenACC GPU-accelerated computing.
Language: C
Project Website
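A simple serial verification of Goldbach's conjecture can be sketched as below. This is a hedged Python illustration, not the project's C implementation: the sieve-based approach and the function names are assumptions made for the example. It checks that every even number from 4 up to a limit can be written as the sum of two primes.

```python
def prime_sieve(limit):
    """Sieve of Eratosthenes: is_prime[k] is 1 if k is prime, else 0 (limit >= 1)."""
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # Clear every multiple of p starting at p*p.
            is_prime[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
    return is_prime


def goldbach_pair(n, is_prime):
    """Return the pair (p, n - p) with the smallest prime p, or None if none exists."""
    for p in range(2, n // 2 + 1):
        if is_prime[p] and is_prime[n - p]:
            return p, n - p
    return None


def verify_goldbach(limit):
    """Return the first even n in [4, limit] with no prime pair, or None if all pass."""
    is_prime = prime_sieve(limit)
    for n in range(4, limit + 1, 2):
        if goldbach_pair(n, is_prime) is None:
            return n
    return None


print(goldbach_pair(100, prime_sieve(100)))
print(verify_goldbach(10_000))
```

Once the sieve is built, each even number is checked independently of the others, which is what makes the loop over n a natural candidate for splitting across threads, processes, or GPU workers.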
Investigating Extracellular Electron Transport in Ammonia Oxidizing Bacteria, Undergraduate Research Project
Anaerobic ammonia oxidizing (Anammox) bacteria can anaerobically capture aqueous carbon dioxide and convert nitrite and
ammonia to dinitrogen.
Their slow doubling time of 7-22 days, complicated symbiosis with other microbes, and the inability to isolate them
remain major hurdles to their use in both carbon capture and energy-efficient wastewater treatment.
Scientific research on Anammox focuses on optimizing their growth conditions by understanding
their carbon fixation pathways, their population dynamics, and the exchange of materials between members of the Anammox
consortium.
Researchers have examined the variation of growth and activity of Anammox with different inorganic or organic
electron donors.
However, the need for the addition of chemical electron donors could be satisfied by supplying
electrons directly via current.
Extracellular electron transfer (EET) through cellular electron carriers,
notably c-type cytochrome, is used for transfer of electrons between cells and toward the surrounding environment.
Specifically, extracellular c-type cytochrome could allow the Anammox cell to utilize the electric current
supplied by electrodes in cellular metabolic processes.
Extracellular electron transport is attributed to extracellular membrane electron carriers such as quinones
(notably anthraquinone-2,6-disulfonate (AQDS)), phenazines, flavins (Kotloski & Gralnick, 2013), and heme
proteins (notably c-type cytochrome (Rosenbaum, Aulenta, Villano, & Angenent, 2011)).
Furthermore, these surface electron carriers facilitate the transport of electrons from the surface of the cell to
the internal metabolic processes within the cell (Thrash & Coates, 2008).
Examination of the structure of the Anammox extracellular matrix reveals an abundance of c-type cytochrome. These proteins form the majority of the
extracellular complex and give the Anammox aggregate its vermilion color (Kartal et al., 2011).
C-type cytochrome, an electron transport protein in the electron transport chain, has previously been
demonstrated to be a feasible extracellular electron carrier in Geobacter, Desulfovibrio, and various other
denitrifiers (Thrash & Coates, 2008).
According to the figure above (de Almeida, Wessels et al. 2016), c-type cytochrome redox is strongly correlated with hydrazine
dehydrogenase and hydrazine synthase.
It may be presumed that the addition of electrons to c-type cytochrome will result in a negative charge
accumulation in the Anammoxosome lumen and result in a stronger proton-motive force to generate more ATP.
Previous studies of extracellular electron transport from electrodes to biofilm bacteria revealed an
increase in population growth and cellular metabolites, which was correlated with discrete current measurements
ranging from 10 to 100 mA (Thrash & Coates, 2008).
I applied a series of different currents to an Anammox bioreactor with two aims:
1) to investigate whether supplying current to Anammox bacteria could increase their metabolic activity, and thus the
nitrogen removal rate, and 2) if so, to determine the optimal current for their growth.
Indoor Air Quality, Undergraduate Research Project
I determined the volumetric flux above various types of lightbulbs to explore the relationship between flow rate and
power consumption. I designed and built a wood/plastic sheet greenhouse structure to prevent outside influence from
introducing artifacts in my data. This work was presented at the Spring 2016 UC Berkeley Student Undergraduate Research
Forum.
Language: MATLAB
Poster Presentation
Biofuels Technology Club, engineer, Spring 2017
The goal of the Biofuels Technology Club is to produce biodiesel from cooking oil from the UC Berkeley Dining Commons.
The ultimate goal is to first develop and optimize a laboratory-scale process and then begin
scaling up the project to a pilot plant.
Each process (titration and removal of free fatty acids, transesterification, and water washing) needed a team of engineers
who were detail-oriented, self-driven, and able to work well with others.
I worked with a team of students on the water washing process, documenting how much water we used to wash out the
glycerol and communicating with the downstream quality-testing process to ensure that our biodiesel product met standards.
Project Website