How Genes Work
A different way to think of diseases, and how Software Engineering and genetics have a lot in common.
This post is part of a series I will be making about the book The Gene by Siddhartha Mukherjee. Check the end of this post for a summary of the book.
I want to start this series with an illustration of how genes and genetics work. While I have a small scientific background (having taken biology and other science classes in both high school and college) and while I find the topic fascinating, I’m not working in the field nor am I a biology researcher. Yet, the fact that I understand this—at least at a high level—is a testament to how good of a writer Siddhartha Mukherjee is.
This post will start with how to think about diseases, then get technical about how genes actually work, and end by looking at all of this from a software engineering perspective.
What a disease is
We like to think of a disease as something bad and something that is not “normal”. However, Mukherjee emphasizes over and over that this way of thinking about a disease doesn’t fit with how genes work. Instead, a disease or illness is simply a mismatch between an organism’s genome (their genetic makeup) and an organism’s environment.
That is profound because it means that when someone is “sick”, it may be easier to change this person’s environment rather than fix this person’s “sickness”. It also raises the question, “in what kind of environment would this person with this particular disease be successful?”
With that said, this concept is best suited to conventionally genetic diseases: diseases like Huntington’s disease, breast cancer, or sickle cell anemia. When someone gets a cold, it’s kind of a stretch to say that it’s a mismatch between environment and genome (although common advice is to rest up and drink a lot of water, which is a slight change of environment). Instead, I see this as a Venn diagram. On one side we have the environment (which could be any place: for us it’s our planet, but you can break it down into different regions of the planet or even a particular house), and on the other side we have all of the possible human genomes. The genomes that will survive best are those in the intersection of the two. The genomes that aren’t in this intersection are what we colloquially call “genetic diseases”.
Let’s dive a bit deeper into how this works to understand how it all makes sense.
What’s the genome exactly?
I’ve already mentioned the word “genome”, but I haven’t said what it is (I know some readers would know what it is but this is for those who don’t).
The way we humans work is that we have our DNA, which you can almost think of as our identity. Each person has the exact same DNA in all of their cells, but this identity varies from person to person.
With that said, this is not just an “identity”. Another way to think of a person’s DNA is that it is a recipe for what makes that person, who they are, and how they react to what they sense from their environment. Some examples:
- Some people have blue eyes and some people have brown eyes, and what determines the color of a person’s eye is a small part of this huge DNA recipe.
- Some people are claustrophobic, and it turns out that there is a genetic component to it.
The human genome, which you can think of as this really long strand of DNA, is split into “chromosomes” (think of them as huge chunks of the DNA). In total, there are 46 chromosomes (23 pairs), and together they form what we call the human genome.
How DNA is read and processed
Now, let’s dive a step deeper. The DNA is made up of 4 nucleotides. For all intents and purposes, we don’t need to know what these are (lowkey, I only vaguely understand what they are myself). The 4 are: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). These nucleotides are placed side by side and form the DNA. So, someone’s DNA might look like this:
...GTGGTTTCGCTCTCCTAGTACGATTATGAGAGAAGACGCGCTTTGGTGACCGCTCCGCACGCTTGAAGTGCAGATGGACTCTGGGGGGAGTGGTTTCGCACTTCTCCTAG...
Here, each letter represents one of those 4 nucleotides. There is a mechanism that “reads” a section of this DNA and from it creates a protein (again, no need to understand what a protein is yet). The section that is being read is what represents a gene. For instance, within that DNA above, the following could be one gene (which makes a protein):
TAC-GAT-TAT-GAG-AGA-AGA-CGC-GCT-TTG-GTG-ACC-GCT-CCG-CAC-GCT-TGA-AGT-GCA-GAT-GGA-CTC-TGG-GGG-GAG-TGG-TTT-CGC-ACT
Finally, it is important to understand that there are many genes (I believe over 20,000) in human DNA, i.e., over 20,000 chains of 3-letter groups of nucleotides just like the one above.
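To make the "groups of 3 letters" idea concrete, here is a minimal Python sketch that takes the example gene above and chops it into 3-letter chunks. The function name `split_into_codons` and the constant `GENE` are just names I made up for illustration, and I am assuming the reading simply starts at the first letter of the string.

```python
# The example gene from above, written with hyphens between the 3-letter groups.
GENE = ("TAC-GAT-TAT-GAG-AGA-AGA-CGC-GCT-TTG-GTG-ACC-GCT-CCG-CAC-"
        "GCT-TGA-AGT-GCA-GAT-GGA-CTC-TGG-GGG-GAG-TGG-TTT-CGC-ACT")


def split_into_codons(sequence):
    """Group a nucleotide string into consecutive 3-letter chunks."""
    letters = sequence.replace("-", "")  # work on the raw letters only
    if len(letters) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [letters[i:i + 3] for i in range(0, len(letters), 3)]


codons = split_into_codons(GENE)
print(len(codons), "groups, starting with", codons[:3])
```

Stripping the hyphens first means the same function works on a raw strand of letters too, as long as you know where to start reading.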
Fun & Related Fact: The Human Genome Project
The Human Genome Project was a project that aimed at finding all the letters (the A, C, G, and T letters) of the human DNA. This project ended up costing about $2.7B! Which is an insane amount of money, but with the technology of the time, it made sense why it cost so much the first time. Today, this cost has been cut so much that you can have your own genome sequenced for less than $600.
The significance of the Human Genome Project is that it was the first step toward understanding ourselves. The genome is a recipe for how we, humans, work. What’s fascinating is that the genome of other living species is made up of the same exact 4 nucleotides / letters, and what’s also mind blowing is that we share a lot of our DNA with other species, including plants. For instance, from this article:
- We share up to 91% of our DNA with chimps
- We share up to 25% of our DNA with bananas
Finally, the fact that all living species use the same exact 4 nucleotides to make up their DNA is breathtaking. Look at the diversity that life has out of 4 simple letters!!! 🤯💀
Anyways, if you want to learn even more about the human genome, go to www.genome.gov.
How Genes Work
I want to get really technical in this last section and dive into how this whole thing works because it’s important to see the entire process to get an appreciation of this as a software engineer.
Let’s start off with this high level chart that I found in the book:
(1) Genes encode RNAs
Genes, or DNA, are what encode RNAs. First, let’s talk about the difference between the two.
- DNA stands for deoxyribonucleic acid
- RNA stands for ribonucleic acid
Both are made from nucleotides, with one difference: instead of Thymine (T), RNA has Uracil (U). So, wherever you see a T in a DNA sequence, the RNA would have a U. In addition to that, DNA has a double helix structure: two strands held together by “base pairs”, i.e., pairs of nucleotides that always go hand in hand. For instance, you always have an A-T pair or a G-C pair. RNA is a single strand only, so there is no pairing. See the diagram below.
How does DNA encode RNA? Well, there is an enzyme called RNA polymerase (an enzyme is basically a biological unit that brings about a particular chemical reaction) whose job is to “read” the DNA (specifically, read one strand of the DNA) and make RNA out of it. In layman’s terms, it reads the DNA and whenever it sees a nucleotide, it writes the opposing pair into a new strand of RNA. For instance, if it sees a T, it will “append” an A. If it sees an A, it “appends” a U (instead of a T because RNAs have U). If it sees a G, it appends a C and vice versa.
The RNA that comes directly out of this enzyme is called mRNA, which stands for “messenger RNA”.
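The "see a letter, append its opposite" rule is easy to sketch in Python. This is only a toy version under my own assumptions: the names `DNA_TO_RNA` and `transcribe` are made up for illustration, and the real enzyme does a lot more than a lookup table.

```python
# Pairing rule for copying DNA into RNA (RNA uses U where DNA would use T).
DNA_TO_RNA = {"T": "A", "A": "U", "G": "C", "C": "G"}


def transcribe(dna_strand):
    """Build the mRNA letter by letter from one strand of DNA."""
    return "".join(DNA_TO_RNA[letter] for letter in dna_strand)


print(transcribe("TACGAT"))
```

A dictionary fits here because the rule is a fixed one-to-one mapping; any letter other than A, C, G, or T raises a `KeyError`, which is a reasonable way for a sketch to complain about bad input.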
(2) RNAs build Proteins
Now, this mRNA builds proteins. I’ll get to what proteins are in the next section, but first, the process through which this happens is that the mRNA is read in groups of 3 consecutive nucleotides (3 letters, called codons) and converted into an “amino acid” chain. We don’t need to understand what amino acids are except that there are 20 of them and there’s a direct map from each group of 3 nucleotides to 1 amino acid (several different groups can map to the same amino acid). This mapping looks like the following table:
So, the DNA we previously had was:
TAC-GAT-TAT-GAG-AGA-AGA-CGC-GCT-TTG-GTG-ACC-GCT-CCG-CAC-GCT-TGA-AGT-GCA-GAT-GGA-CTC-TGG-GGG-GAG-TGG-TTT-CGC-ACT
This DNA will be converted to the following mRNA (notice how each letter has been converted to its complement):
AUG-CUA-AUA-CUC-UCU-UCU-GCG-CGA-AAC-CAC-UGG-CGA-GGC-GUG-CGA-ACU-UCA-CGU-CUA-CCU-GAG-ACC-CCC-CUC-ACC-AAA-GCG-UGA
Then, that mRNA will be converted to the following amino acid chain (the last group, UGA, is a “stop” signal that ends the chain instead of adding an amino acid):
Met-Leu-Ile-Leu-Ser-Ser-Ala-Arg-Asn-His-Trp-Arg-Gly-Val-Arg-Thr-Ser-Arg-Leu-Pro-Glu-Thr-Pro-Leu-Thr-Lys-Ala
This amino acid chain will then “fold” (i.e., instead of staying as a chain, it would fold itself many times into some sort of oddly shaped ball) and through folding will become a protein!
Now, the human genome has over 20,000 different genes like this that get converted into many different kinds of proteins that do many different things.
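To double-check the example above, here is a small Python sketch of the mRNA-to-amino-acid step. The 64-letter string is the standard genetic code written in the conventional U, C, A, G order; the names `CODON_TABLE`, `THREE_LETTER`, `translate`, and `MRNA` are my own, and I am assuming reading starts at the very first letter (which works here because the example starts with AUG).

```python
from itertools import product

# Standard genetic code in U, C, A, G order; "*" marks the stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# One-letter to three-letter amino acid abbreviations.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def translate(mrna):
    """Read mRNA three letters at a time until a stop codon shows up."""
    letters = mrna.replace("-", "")
    chain = []
    for i in range(0, len(letters) - 2, 3):
        amino_acid = CODON_TABLE[letters[i:i + 3]]
        if amino_acid == "*":  # stop codon: the chain ends here
            break
        chain.append(THREE_LETTER[amino_acid])
    return "-".join(chain)


MRNA = ("AUG-CUA-AUA-CUC-UCU-UCU-GCG-CGA-AAC-CAC-UGG-CGA-GGC-GUG-"
        "CGA-ACU-UCA-CGU-CUA-CCU-GAG-ACC-CCC-CUC-ACC-AAA-GCG-UGA")
print(translate(MRNA))
```

Building the table from one string keeps the sketch short; a hand-written dictionary of all 64 entries would work just as well and might be easier to read.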
(3) Proteins form/regulate Organisms
Before diving into how proteins form or regulate organisms, we have to talk about what a “protein” even is.
From what I understand, you can think of a protein as a biological unit that does one or a couple of very specific jobs. An analogy of this is how a company is run by people in many different professions: there are accountants, janitors, investors, CEOs, engineers, salespeople, etc. The part where this analogy sort of breaks down is that the human body makes those proteins (i.e., those people with different professions) itself. (A side note: it’s kind of meta that I’m using an analogy involving people to explain how the human body, which is what people are, works 🥵).
I am not sure if a company can have over 20,000 different professions, but the human body has over 20,000 genes that can encode for over 20,000 different proteins which in turn do over 20,000 different jobs in the human body (and mind you, some proteins can combine to do a super job, e.g.: the enzyme that reads the DNA to make mRNA is just a combination of multiple proteins together).
With that said, there are different kinds of proteins. Here are two types of proteins:
- There are proteins that regulate gene expression. These proteins essentially turn a particular gene on and off. When a protein turns a gene off, that gene will stop making more of its proteins. When it turns it on, it’ll start making more. These regulatory proteins can also increase or decrease the rate at which those proteins are made.
- There are proteins that take care of cleaning up the body.
In proper biological terms, proteins are commonly grouped into 7 types, which are: (1) antibodies, (2) contractile proteins, (3) enzymes, (4) hormonal proteins, (5) structural proteins, (6) storage proteins, and (7) transport proteins. Fortunately for you, I’m not trying to fit all of biology into one blog post, so feel free to follow the link if you want to learn more!
Now, the way proteins form/regulate organisms is that they just do whatever task they are supposed to do within the body, depending on whether they are expressed or not. For example, cells can consume sugars (lactose, found in milk, and glucose, found in your average sweets, are both sugars), and the way they do that is through a gene that encodes a particular protein that consumes a particular sugar (this would be a cleaning-up-the-body protein). So, if there is a lot of lactose in the body / cells (after drinking a lot of milk), the cells will sense that, and it will trigger a regulatory protein to “turn on” the gene that produces lactose-digesting proteins. The lactose will be consumed by these proteins, and as it is used up, the production of lactose-digesting proteins will be progressively turned off.
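The lactose story is a feedback loop, and a toy simulation makes the on/off behavior easier to see. Everything in this sketch is an assumption of mine: the function name `simulate_lactose_response`, the threshold, and the production, digestion, and decay numbers are all made up for illustration and have nothing to do with real biochemistry.

```python
def simulate_lactose_response(lactose, steps=12, threshold=1.0,
                              production=2.0, digestion_rate=0.5, decay=0.5):
    """Toy feedback loop: lactose switches a gene on, its proteins eat the lactose."""
    protein = 0.0
    history = []
    for _ in range(steps):
        gene_on = lactose > threshold      # the cell "senses" the lactose
        if gene_on:
            protein += production          # gene is on: make more protein
        digested = min(lactose, protein * digestion_rate)
        lactose -= digested                # the proteins consume lactose
        protein *= decay                   # proteins break down over time
        history.append((gene_on, round(lactose, 2), round(protein, 2)))
    return history


history = simulate_lactose_response(10.0)
for step, (gene_on, lactose_left, protein_level) in enumerate(history, start=1):
    print(step, "on " if gene_on else "off", lactose_left, protein_level)
```

With these made-up numbers the gene stays on while there is plenty of lactose, then switches off once the lactose is gone and the leftover proteins fade away.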
(4) Organisms sense Environments, and (5) Environments influence Proteins, RNA
I grouped (4) and (5) together because they are practically simultaneous and easier to explain together.
Now, organisms also sense the environment. From my understanding, they sense it through different proteins and through the different signals that the organism gets from the environment.
So, when we talk about our five senses, something actually happens at the molecular level. When it’s really hot or cold, there’s an actual element or reaction that is sensed (presumably by some proteins), which triggers the organism to do what it needs to do from step (3). In addition, we can ingest stuff from the environment (like the milk with lactose example), which also triggers reactions from step (3).
I don’t have too much to say here except that the environment is IMMENSE. We have the entire earth, and all of the different regions in it could trigger a different biological reaction in the human body. There are so many combinations of things, and this is practically what leads to the biodiversity we have on earth. With all the humans living in vastly different environments, it’s a no-brainer that we all look, act, and are so different because everyone’s human body had to adjust, over millions of years, to their particular environment.
(6) Proteins and RNA regulate Genes
I’ve talked about regulatory proteins above, and this part is essentially that. With it, we’ve come full circle with this “circular flow of biological information”. From here, we simply go back to step (1) where a gene encodes an mRNA.
An Appreciation from a Software Engineer
As a software engineer, I immediately notice the recursive and dynamic nature of this process. We start at genes and end at genes. This is a textbook case of a dynamic program. While beginner textbook examples of dynamic programming usually involve only a single function (such as the Fibonacci sequence, shown below), this one is more of a multi-function dynamic program that involves multiple parts. The closest thing I could think of as a program for this would be a mix between a general parser (e.g., this “Little Language” from MIT’s 6.031 class) and a language parser (like what we did in MIT’s 6.009 to write a Lisp parser, or this digital-inputter class I wrote a while ago for MIT’s 6.004 class).
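from functools import lru_cache

# Precondition: n >= 0
@lru_cache(maxsize=None)  # remember earlier answers so each n is computed once
def fibonacci(n):
    if n == 0:
        return 0
    if n < 2:
        return 1
    return fibonacci(n - 1) + fibonacci(n - 2)  # <- recursion!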
While in school studying computer science, I had to learn how to write recursive functions and deal with dynamic programming (DP). After you get the hang of it, it’s gratifying and harmonious to solve a DP problem because the solution fits in super intuitively. Before you get the hang of it, it SUCKS.
I’ve gotten past the sucky part of DP, and when I absorb this “circular flow of biological information” (that is what the author, Siddhartha Mukherjee, calls it, by the way), I can’t help but think about how I’d formulate this as a DP problem. No doubt this is easily one of the hardest DP problems there is because it just blows out of proportion. It’s not just about solving the mathematical equations that you’d get from the DP statements; it’s also about running a highly memory-intensive program. Let me list some of the variables just to put that into perspective:
- We have over 20,000 genes that need to be expressed.
- Each gene can have a variable length, from 2 to infinity. Obviously, in practice the bound is the length of the human genome divided by 3 (because it takes 3 letters per amino acid in a protein), which is still insane at over 3 billion / 3 = 1 billion.
- The way the amino acid chain folds to form proteins depends on its chemistry and geometry. That’s another set of variables that blows out of proportion.
- Various proteins can be grouped together to form enzymes that influence the system better.
- All the things our human body can POSSIBLY sense from the environment. I have no clue how to set a bound to this one.
- The starting state of the system, i.e., right after the sperm and egg have fused and made a new genome for a new human (a baby / fetus in this case) + everything that is in the mother’s womb.
- The variable of time and any added element from the environment.
- Add to this list mutations, which can happen anytime and change what a particular gene expresses.
With all of that said, this process (which is to determine the state of this system at a future time given inputs + state at current time) is recursive, so it continues to happen up until the human dies, which could possibly be infinitely long (though in practice most people die before they’ve reached what, 80 years old?).
So, you’d be running this recursive function over a continuous time variable, with no clear stopping point. This seems next to impossible to me. Still, this gives me deep appreciation for biology, the human body, and how it all works so beautifully.
Caveat
For those who work directly in the field of either genetics or biology (or even those who have studied this to any level of depth), you know damn well that I’ve simplified a lot of things. I just want to flag this because (1) this entire process is much more complicated than this (e.g., one gene can encode multiple kinds of proteins, and then there are RNAs that don’t encode for any protein and instead serve a different role), and (2) I obviously do not understand this entire process. My hope instead is to give others who may not know this stuff a simplified version of this so that maybe they’d be inspired to read more about it!
The Gene by Siddhartha Mukherjee is one of the most fascinating books I’ve read. Genes are what define us as humans, and in this book, Mukherjee lays out the incredibly complicated field of biology and genetics in plain English, showing how ubiquitous it is in our lives and how it has utterly defined our society in the most potent and intimate ways. This book is full of history, life lessons, and knowledge, and I couldn’t help but write something about it. This post is one piece of history, life lesson, or knowledge that I want to remember from this book.