The DNA Landing Zone
Van Warren
The purpose of this note is to explore whether techniques such as knowledge mapping and structured ontologies can increase the speed and effectiveness with which we can identify, formalize and illustrate gene transcription pathways. Knowledge mapping takes a set of facts extracted from the contents of a paper, typically an abstract, and links them to other similar facts in a graphical fashion. These relationships can then be formalized into one or more pathways, each of which is a structured inventory of a biochemical pathway. This is ontology construction in ever widening circles. The formal, but still qualitative, version of the pathway can then be used as a checklist to account for the states of populations of chemical species participating in each step. Accounting for the numbers of reactants, products and enzymes in a quantitative way is a cellular modeling activity that precedes effective cellular simulation. The goal of cellular simulation is to show us what happens in a rich visual context so that we can understand these complex phenomena and ask useful questions.
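To make the idea of a knowledge map concrete, here is a minimal sketch in Python. It assumes that facts can be written as (subject, relation, object) triples. The five facts are paraphrased from the estrogen receptor story told later in this note, and the function names `build_map` and `trace_pathway` are invented for illustration, not taken from any existing tool.

```python
# Facts as (subject, relation, object) triples, deliberately listed out of order,
# the way they might be harvested from different abstracts.
FACTS = [
    ("ER+E complex", "passes through", "nuclear pore"),
    ("estrogen", "binds", "estrogen receptor"),
    ("estrogen receptor", "forms", "ER+E complex"),
    ("nuclear pore", "gives access to", "DNA landing site"),
    ("DNA landing site", "switches on", "gene transcription"),
]


def build_map(facts):
    """Link facts into a graph: subject -> list of (relation, object)."""
    graph = {}
    for subject, relation, obj in facts:
        graph.setdefault(subject, []).append((relation, obj))
    return graph


def trace_pathway(graph, start):
    """Follow the first outgoing link from each node to get an ordered pathway."""
    path = [start]
    seen = {start}
    node = start
    while node in graph:
        _relation, nxt = graph[node][0]
        if nxt in seen:  # guard against cycles in the map
            break
        path.append(nxt)
        seen.add(nxt)
        node = nxt
    return path


pathway = trace_pathway(build_map(FACTS), "estrogen")
print(" -> ".join(pathway))
```

The ordered list that comes out is the "checklist" form of the pathway. A real knowledge map would branch, so following only the first link is a simplification.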
First let's take a "baby step". Imagine we write a simple display program that lets us interactively visualize the relationships in a database of the confirmed 32,284+ genes in the genome. Those 32,284+ genes average between two and three thousand base pairs in length, approximately 3.6% of human DNA when fully accounted for.
In a simple interface developed one afternoon, 23 human chromosomes are displayed. The user clicks on a particular chromosome to zoom in on a more detailed context. Click on chromosome 14, for example.
Let's say we're looking at Epidermal Growth Factor Receptor. EGFR is a protein that lives on the cell membrane. Its blueprint is given by "EGFR the gene". "EGFR the gene" used to be called "avian erythroblastic leukemia viral oncogene", which translated means "bird blood making leukemia virus cancer causing gene". But as Ridley points out, "genes don't cause cancer". "EGFR the gene" codes for "EGFR the protein", whether broken or not.
In a fit of curiosity we would like to know all the genomic sites that have a gene that codes for EGFR the protein. The fewer genes that code for a critical function, the more problems arise if one of those genes is mutated. Single point failures, whether in biological or engineering systems, are a problem, space shuttle style, so let's find out.
Imagine we had a program into which we typed commands, and out popped pictures. If we type in a gene name, we get a picture that paints the gene in its chromosomal context.
// You type in:
> theGene(EGFR)
[ The short arm of chromosome 7 lights up. ]
// Now I type:
> homologousTo(theGene(EGFR), 100%)
// a deductive database grinds for a while and then
[ The short arm of chromosome 7 lights up. ]
// ... a quality control check, no surprise
// you KNOW what is coming next:
> homologousTo(theGene(EGFR), 70%)
// a deductive database grinds for a while and then
[ several chromosomes light up to varying degrees ]
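The session above could be mocked up along the following lines. This is only a sketch under loud assumptions: the "database" is a three-entry dictionary, the sequences are invented 20-letter strings, and the loci for the two `toyGene` entries are made up. Only the placement of EGFR on the short arm of chromosome 7 comes from the text. Homology is reduced to position-by-position percent identity, which a real tool would replace with proper sequence alignment.

```python
# Toy stand-in for the deductive database. All sequences are invented;
# toyGeneA and toyGeneB and their loci are hypothetical placeholders.
TOY_GENES = {
    "EGFR":     {"locus": "7p",  "seq": "ATGCGACCCTCCGGGACGGC"},
    "toyGeneA": {"locus": "17q", "seq": "ATGCGACCATCCGGAACGGA"},
    "toyGeneB": {"locus": "2q",  "seq": "TTTTTTTTTTCCGGGACGGC"},
}


def percent_identity(a, b):
    """Percentage of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def the_gene(name):
    """Return the locus that should 'light up' for a gene name."""
    return TOY_GENES[name]["locus"]


def homologous_to(name, threshold):
    """Return every locus whose sequence is at least `threshold` percent identical."""
    query = TOY_GENES[name]["seq"]
    return sorted(
        entry["locus"]
        for entry in TOY_GENES.values()
        if percent_identity(query, entry["seq"]) >= threshold
    )


print(the_gene("EGFR"))
print(homologous_to("EGFR", 100))
print(homologous_to("EGFR", 70))
```

At 100% only the gene's own locus comes back, which is the quality control check. Lowering the threshold to 70% pulls in the more distant toy entry as well.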
Now consider the following:
In CML, chronic myelogenous leukemia, a piece of chromosome 22 breaks off and fuses to chromosome 9, and vice versa. The smaller of the two resulting chromosomes is called the Philadelphia chromosome. The process is called translocation.
In breast cancer a more complex suite of translocations occurs, so that the 23 original chromosomes are considerably more mixed up and have decided to put on each other's socks and shirts at a New Year's party, Dr. Seuss style. These rearrangements are also translocations, and they are somewhat repeatable among various cancers.
Now when I type:
> showChromosome(14)
What do I mean?
Do I mean the original chromosome, or the fragmented one with its new partners?
If Farrah Fawcett-Majors divorces Lee, does she change her name right away, or does she wait? We have an identity problem.
If DNA breaks in the middle of a gene, say in an intron between coding exons 1 and 2, new proteins which contain new combinations of motifs can be created, and these new proteins may sometimes confer on the cell that contains them a functional advantage. What may confer an advantage on the cell at one level may be disadvantageous to the organism at a higher level of organization, i.e. she got cancer because of a Turing tape edit courtesy of a Tyler Durden subliminal frame.
We can handle mere strings of text, right? In many disease states, DNA gets rearranged, replicated, or jumbled as mentioned above, causing everything from Down's syndrome, to triple X, to Huntington's chorea. Now what was I saying?
So representing DNA as text strings is fine. We just say:
// && represents the concatenation operator
chromosome9' = chromosome9p && chromosome9q && chromosome22q
chromosome22' = etc.
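As a minimal sketch of the string view, and of one way around the identity problem, each piece of DNA can carry a label saying where it came from. Everything here is an assumption made for illustration: the `Segment` type, the helper names, and the four-letter toy sequences. Only the derivative chromosome 9 is built, following the line above.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    origin: str  # chromosome arm the material originally came from, e.g. "22q"
    seq: str     # toy sequence, invented for illustration


chromosome9p = Segment("9p", "AAAA")
chromosome9q = Segment("9q", "CCCC")
chromosome22q = Segment("22q", "GGGG")


def concat(*segments):
    """The && operator: a chromosome is an ordered list of segments."""
    return list(segments)


def sequence(chromosome):
    """Flatten a chromosome to the plain text-string view."""
    return "".join(segment.seq for segment in chromosome)


def foreign_segments(chromosome, name):
    """Origins of segments that did not start life on chromosome `name`."""
    return [s.origin for s in chromosome if s.origin[:-1] != name]


chromosome9_prime = concat(chromosome9p, chromosome9q, chromosome22q)

print(sequence(chromosome9_prime))
print(foreign_segments(chromosome9_prime, "9"))
```

With provenance attached, a command like `showChromosome` can at least ask which version is meant, because the rearranged chromosome knows it is wearing someone else's socks.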
and we can just go merrily along. Now we are ready to explore one benefit of a formal ontology. Formal ontologies are advertised to increase the consistency of databases, so here is an interesting test that breaks a naively constructed database. We have to do a real biochemical example, so hold on for a second:
DNA is not just strings, it is a chemical landing platform for various regulatory and transcriptional proteins.
[Image: the estrogen receptor, ER, shown as a matched set]
Consider the protein called Estrogen Receptor, ER, here shown as a matched set. ER is a large molecule when compared to its ligand Estrogen (E), which looks like cholesterol. E comes sailing through the cell membrane, which also looks like cholesterol. Inside the cytoplasm, ER is waiting with open arms.
When ER sees E, I'm a little embarrassed to say, it's love at first sight and the two form a complex ER+E. The complex is really just one entity, like oxygen and hemoglobin, but it's also two things holding tightly to each other. So love goes. Keep breathing. For brevity, ER+E is ERE from now on.
As an aside for the computer scientist: All proteins are sequences chosen from an alphabet of twenty amino acids. Proteins, even when folded, have two ends left by the manufacturing department. The lead amino acid has a nose, called N, which comes first, and the other end, called C for Caboose, is attached last. At the Caboose end of ER a funny folded shape waits for only one thing, docking with E, or things that look like E. Such details save lives, but I digress...
Now on the locomotive nose end of ER, called N, is a short sequence of amino acids called a "nuclear localization signal", or NLS. This pattern says, "take me to your SMPTE leader", because in this film noir, the nucleus is where the action is.
Cut to scene: the nucleus is studded with nuclear pores, which are complex little portals like something out of a space movie.
When the night club bouncer sees the NLS signal bobbing around on the end of the estrogen receptor complex, it says to the nuclear pore, "OPEN SESAME, like that big circular wheel in Stargate".
Then, without shedding a single tear, ERE goes through the nuclear pore. ERE has now made the BIG TIME. The safe has been cracked, the fix is in and the check is in the mail. ERE bobs around until it locates a specific DNA sequence entitled:
"AGGTCA nnn TGACCT"
where n stands for anything. Then something cool happens.
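Finding such a landing site in a string of DNA is a pattern match. The sketch below assumes that "anything" means any of the four bases A, C, G or T, and uses Python's standard `re` module. The test sequence `toy_dna` and the function names are invented for illustration.

```python
import re

# AGGTCA, then any three bases, then TGACCT.
LANDING_SITE = re.compile(r"AGGTCA[ACGT]{3}TGACCT")

# Translation table for base pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]


def find_landing_sites(dna):
    """Return the 0-based start position of every match in `dna`."""
    return [match.start() for match in LANDING_SITE.finditer(dna.upper())]


toy_dna = "ccgtAGGTCAcagTGACCTgga"  # invented test sequence

print(find_landing_sites(toy_dna))
print(reverse_complement("AGGTCA"))
```

The second call shows that TGACCT is the reverse complement of AGGTCA, so the two halves of the site are the same word read on opposite strands, each facing the other.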
ERE, a protein, binds to the Turing Tape Itself, DNA, at the AGGTCA... site. If only one ERE is bound, nothing happens. But if another ERE binds to DNA facing the other way at ...TGACCT, they exchange glances like Nicolas Cage and John Travolta. This "Face Off" bends the DNA and somehow turns on the gene that the pair of EREs are close to, and transcription of that gene makes you start talking, thinking and acting like whatever that hormone wants at that particular moment in that particular tissue.
This same chain of events, this same set of cause and effect relationships applies to progesterone, testosterone and other "choleSTERicOID" hormones. So now we have some insight into how these little creatures rule the world, and trust me, they do.
Interestingly, this "it takes two receptors to tango" property happens in other places, like at the cell surface with the EGF receptor, where an entire signal transduction pathway is activated. Drugs like Tamoxifen (ER) and Herceptin (Her2 - a shipwrecked EGFR) get in the way of this kind of receptor signaling and people's lives are saved, or at least extended.
Now back to our main point: robust representation schemes for computing with these things.
DNA strings are landing zones. When these receptors make it to the nucleus like salmon after a trip up the ladder, the DNA is flexed like a note on Jimmy Page's Les Paul. Bent DNA becomes transcriptionally activated, and is quickly paid a visit by an RNA polymerase II complex assembling in a scene out of "Whole Lotta Love".
So any representation scheme we use for DNA, that is, any ontology, must admit into its world, the fact that DNA is not just a sequence of letters, but a shape, a charge distribution and a landing zone determined by some runway numbers called a consensus sequence or "hormone response element" (HRE) to be specific. The shape, charge and sequence of the landing zones are dynamic and essential.
Now to the crux of the matter and a quick note on simulation. When we simulate a bouncing ball numerically, we don't really simulate any particular ball completely, because that would take as many computers as there are atoms in the ball. Instead we use an abstraction of the behavior of the ball called "the equations of motion". We then march the equations of motion forward in time, tracking the location of the ball subject to the external forces impressed upon it. We then draw the ball and say, "there is your ball". But then you say, "it doesn't squish right", or, "can I see that in a blue?", and we write some more equations and the simulation gets better.
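For the ball, "marching the equations of motion" can be as short as the sketch below. It assumes a point mass under constant gravity, a fixed time step, and a floor at height zero that returns a fixed fraction of the impact speed. The constants and the names `step` and `simulate` are choices made for this illustration, not anything the ball insists on.

```python
G = 9.81            # gravitational acceleration in m/s^2
RESTITUTION = 0.8   # assumed fraction of speed kept after each bounce


def step(y, v, dt):
    """Advance height y (m) and velocity v (m/s) by one time step dt (s)."""
    v = v - G * dt          # dv/dt = -g
    y = y + v * dt          # dy/dt = v
    if y < 0.0:             # the ball went through the floor: reflect it
        y = -y
        v = -v * RESTITUTION
    return y, v


def simulate(y0, v0, dt, n_steps):
    """Return the list of heights visited, starting from (y0, v0)."""
    y, v = y0, v0
    heights = [y]
    for _ in range(n_steps):
        y, v = step(y, v, dt)
        heights.append(y)
    return heights


heights = simulate(y0=1.0, v0=0.0, dt=0.001, n_steps=2000)
print(f"height after 2 seconds: {heights[-1]:.3f} m")
```

"It doesn't squish right" would be answered by adding a deformation model to `step`. The abstraction grows only where the questions demand it.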
But in the case of simulating DNA and its multidimensional transcriptional state at any given moment there is an intermediate position we can take. We could conceivably do the full electrostatics many-body problem and spend gobs of computer time to tell us what we already know, "That certain complexes bind to DNA at certain consensus sequences and have this or that effect". It is much less expensive, and probably just as informative to run a symbolic, or abstract simulation that reproduces the transcriptional behaviors that we know occur and then to ask questions about what are the implications of a given pathway being interrupted or altered. Now do we need a formal ontology to do this? Do knowledge maps help us to inventory these pathways faster, or to interconnect them more accurately or effectively? Inquiring minds want to know. I want to warp as rapidly as possible to effective simulations carrying only those formalisms that are truly necessary and effective in getting us there.
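A symbolic simulation of the estrogen story might look like the following sketch. It assumes a drastically simplified cell: counts of molecules in a few compartments, one landing site with two half-sites, and four rules applied in a fixed order on every tick. The `Cell` fields, the rule names, and the `nls_intact` switch are all inventions for illustration. Nothing here models rates, shapes, or charges.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    estrogen_in_cytoplasm: int = 0
    free_receptors: int = 2
    complexes_in_cytoplasm: int = 0
    complexes_in_nucleus: int = 0
    half_sites_bound: int = 0   # 0, 1 or 2 (AGGTCA and TGACCT)
    gene_on: bool = False
    nls_intact: bool = True     # set False to "break" nuclear import


def bind_ligand(cell):
    """E meets ER in the cytoplasm and forms the complex."""
    n = min(cell.estrogen_in_cytoplasm, cell.free_receptors)
    cell.estrogen_in_cytoplasm -= n
    cell.free_receptors -= n
    cell.complexes_in_cytoplasm += n


def import_to_nucleus(cell):
    """The nuclear pore admits complexes only if the NLS is intact."""
    if cell.nls_intact:
        cell.complexes_in_nucleus += cell.complexes_in_cytoplasm
        cell.complexes_in_cytoplasm = 0


def bind_dna(cell):
    """Complexes in the nucleus occupy any free half-sites."""
    n = min(cell.complexes_in_nucleus, 2 - cell.half_sites_bound)
    cell.complexes_in_nucleus -= n
    cell.half_sites_bound += n


def activate(cell):
    """The gene turns on only when both half-sites are occupied."""
    cell.gene_on = cell.half_sites_bound == 2


RULES = [bind_ligand, import_to_nucleus, bind_dna, activate]


def run(cell, ticks=3):
    for _ in range(ticks):
        for rule in RULES:
            rule(cell)
    return cell


print(run(Cell(estrogen_in_cytoplasm=2)).gene_on)
print(run(Cell(estrogen_in_cytoplasm=1)).gene_on)
print(run(Cell(estrogen_in_cytoplasm=2, nls_intact=False)).gene_on)
```

Asking "what if this step is interrupted?" becomes a one-line change to the starting state, which is the kind of question the full electrostatics calculation answers only at great expense.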
Do we want to carry a heavyweight representation scheme (HRS) around with us like a ball and chain every time we are comparing two DNA strands? Not really. But do we need to have an HRS if we need it for critical sections of an accurate simulation? Absolutely. Those that are going places in this business require it.
Summary
I wanted to make the following points with this note:
1) knowledge mapping the biotechnology literature is a rapid route to a complete and computable description of the transcriptional, signaling and regulatory pathways.
2) knowledge maps can be "reverse compiled" into a set of formal pathways.
3) these formal pathways constitute an ontology for DNA and its proteins.
4) cell simulations can then be computed using the relationships that exist in these pathways without need to resort to direct electrostatic simulation. DNA and its binding proteins are like an algebra that represents a complex phenomenon. It should be possible to execute the logical and functional relationships within an equilibrium and statistical framework that provides a meaningful insight into what happens when genes mutate, or break.
5) in the clinical environment, one could calibrate a given simulation run with the results of a gene chip study of a specific patient's tumor. Gene expression patterns in tumors are like fingerprints. Correlating gene expression patterns with tissue specific cell simulations could enable custom tailored therapies to be developed and proven in silico. The goal would be to provide a "Darwinian bypass" so that the custom therapies would be specific to all cells in the cancer clone, the holy grail of cancer therapy.
6) the image below suggests what this may look like in the near future:
[Image not preserved]
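As a small illustration of the "equilibrium and statistical framework" in point 4, the sketch below uses the standard single-site binding relation, occupancy = c / (Kd + c), where c is the concentration of the binding protein and Kd is its dissociation constant. Treating the two half-sites as independent, so that the chance both are occupied is the square of the single-site occupancy, is a simplifying assumption of this sketch, and the numbers are arbitrary.

```python
def occupancy(conc, kd):
    """Equilibrium fraction of time one site is occupied (same units for both)."""
    return conc / (kd + conc)


def both_sites_occupied(conc, kd):
    """Chance both half-sites are occupied, assuming they bind independently."""
    return occupancy(conc, kd) ** 2


normal = both_sites_occupied(conc=1.0, kd=1.0)
# Hypothetical mutation in the landing site that weakens binding tenfold.
mutated = both_sites_occupied(conc=1.0, kd=10.0)

print(f"normal site:  {normal:.3f}")
print(f"mutated site: {mutated:.4f}")
```

Because activation needs both sites at once, a tenfold loss in binding strength costs far more than tenfold in activation, which is one way a small edit to the landing zone can have a large effect.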