Knowledge Mapping: Details of Fact Extraction

L. Van Warren
Warren Design Vision
3/16/2000

Revised
1/23/2001


Introduction

An essential process of codifying knowledge is separating what we know from how we found it out. This reduces intellectual clutter and allows us to develop clearer lines of reasoning. A second essential is to visualize information as a knowledge map. Knowledge maps are produced by the process of fact extraction.

The graphical representation of a fact requires more display real estate than an equivalent textual one. However, graphical representations have the advantage of being more rapidly understood by the human perceptual system, since the graph of a fact map solicits two-dimensional recognition behaviors rather than reading behaviors. Text representations are essentially one-dimensional, and the image of meaning must be constructed in the reader's mind. Thus previously covered detail can be obscured by new incoming information. In a knowledge map, all the information is available simultaneously, although readers may still choose to track a particular line of reasoning.

Facts can be extracted from the corpus of textbooks, papers and journals. Completely automated fact extraction is difficult. Processing of journal papers is complicated by subtleties of inflection, nuance of expression, specialized nomenclature and idiosyncratic writing styles. For knowledge mapping to work properly, text must be sieved to extract cause and effect relationships, facts that can be represented in a recordable and computable form. Fact extraction is best done with the interactive involvement of human experts.

In this example we take a contemporary abstract and translate it into a set of facts represented as a list of relationships between entities of the form "a" relates to "b". Eventually we will make large knowledge maps by combining small ones. Knowledge maps are plots of relationships of the form:

"a" "is related to" "b"

as in

"gene x" "codes for" "protein y"

Characterization Leads to Bias

The human mind has a tendency to personify or characterize nouns in such a way that relative importance is assigned to the objects. This ranking affects the priority with which these objects will be stored and located for later use. Consider the following simple statement:

jack goes down the hill

In parsing this sentence, it is our tendency to focus on "jack" the personality, "jack" the person, and not on the place called "the hill". Instead, all objects whether "jack" or "the hill" should be placed on an even footing while depicting the relationships between them.

This depersonalization of jack is not done to slight jack's humanity, but rather to call attention to the bias of characterization that is threaded through human thinking. This bias can be reduced by extracting 1) the objects from the sentence and 2) the relationship(s) between the objects, and then depicting these relationships graphically. As facts accumulate, ranking may be assigned by the number of arcs flowing into or out of a particular entity. In the case of scientific knowledge, viewing this constellation of arcs can tell us what ground has been thoroughly trodden, and what ground should be explored more carefully. Our goal is to create small knowledge maps from abstracts and to link these small knowledge maps into large ones.

Now you might ask about the real line:

jack and jill go down the hill

There are two facts in this compound statement:

"jack" "goes down" "the hill"

"jill" "goes down" "the hill"

In the full rhyme we add the relation

carrying a pail of water

from which we extract the implicit facts

"jack" "carries" "pail of water"

"jill" "carries" "pail of water"

The article "a" is omitted, as it adds no additional information.
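
The facts above fit in a very small data structure. The following Python sketch is illustrative only and is not part of the original tooling; the names Fact and expand are hypothetical. It stores each fact as a (subject, relation, object) triple and splits a compound subject into one fact per subject.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One relationship between two entities (hypothetical helper type)."""
    subject: str
    relation: str
    obj: str


def expand(subjects, relation, obj):
    """Turn a compound statement into one fact per subject."""
    return [Fact(subject, relation, obj) for subject in subjects]


# "jack and jill go down the hill", "carrying a pail of water"
facts = expand(["jack", "jill"], "goes down", "the hill")
facts += expand(["jack", "jill"], "carries", "pail of water")

for fact in facts:
    print(f'"{fact.subject}" "{fact.relation}" "{fact.obj}"')
```

Keeping every fact in the same three-slot shape is what puts "jack" and "the hill" on an even footing: neither slot is privileged when the facts are later drawn as arcs.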

A more sophisticated example is given below:


Example

Consider the following example of a typical abstract, excerpted from The Journal of Cell Biology:

The Journal of Cell Biology, Volume 148, Number 5, March 6, 2000 871-882

MAD3 Encodes a Novel Component of the Spindle Checkpoint which Interacts with Bub3p, Cdc20p, and Mad2p

Kevin G. Hardwick (a), Raymond C. Johnston (a), Dana L. Smith (b), and Andrew W. Murray (b)
(a) Institute of Cell and Molecular Biology, University of Edinburgh, Edinburgh EH9 3JR, United Kingdom

(b) Department of Physiology, University of California, San Francisco, California 94143

Correspondence to: Kevin G. Hardwick, Institute of Cell and Molecular Biology, University of Edinburgh, King's Buildings, Mayfield Road, Edinburgh EH9 3JR, United Kingdom. Tel:44 131 650 7091 Fax:44 131 650 7037 E-mail:hardwick@holyrood.ed.ac.uk.

We show that MAD3 encodes a novel 58-kD nuclear protein which is not essential for viability, but is an integral component of the spindle checkpoint in budding yeast. Sequence analysis reveals two regions of Mad3p that are 46 and 47% identical to sequences in the NH2-terminal region of the budding yeast Bub1 protein kinase. Bub1p is known to bind Bub3p (Roberts et al. 1994 ) and we use two-hybrid assays and coimmunoprecipitation experiments to show that Mad3p can also bind to Bub3p. In addition, we find that Mad3p interacts with Mad2p and the cell cycle regulator Cdc20p. We show that the two regions of homology between Mad3p and Bub1p are crucial for these interactions and identify loss of function mutations within each domain of Mad3p. We discuss roles for Mad3p and its interactions with other spindle checkpoint proteins and with Cdc20p, the target of the checkpoint.

Key Words: MAD3, checkpoint, BUB3, CDC20, MAD2

This is a knowledge map for the abstract:

[Figure: knowledge map for the abstract]


Step 1: Sentential Expansion:

In this step each sentence of the abstract is placed on a separate line for WHAT versus HOW culling.
In this example we are only interested in WHAT the cell is doing, not HOW it was figured out. An expert human reader makes this decision interactively.

Click checkbox only if sentence contains WHAT information.

We show that MAD3 encodes a novel 58-kD nuclear protein which is not essential for viability, but is an integral component of the spindle checkpoint in budding yeast.
Sequence analysis reveals two regions of Mad3p that are 46 and 47% identical to sequences in the NH2-terminal region of the budding yeast Bub1 protein kinase.
Bub1p is known to bind Bub3p (Roberts et al. 1994 ) and we use two-hybrid assays and coimmunoprecipitation experiments to show that Mad3p can also bind to Bub3p.
In addition, we find that Mad3p interacts with Mad2p and the cell cycle regulator Cdc20p.
We show that the two regions of homology between Mad3p and Bub1p are crucial for these interactions and identify loss of function mutations within each domain of Mad3p.
We discuss roles for Mad3p and its interactions with other spindle checkpoint proteins and with Cdc20p, the target of the checkpoint.

Step 2: Text Reduction:

The reader has checked those sentences that remain for fact extraction. In this example, all sentences survive.

We show that MAD3 encodes a novel 58-kD nuclear protein which is not essential for viability, but is an integral component of the spindle checkpoint in budding yeast.
Sequence analysis reveals two regions of Mad3p that are 46 and 47% identical to sequences in the NH2-terminal region of the budding yeast Bub1 protein kinase.
Bub1p is known to bind Bub3p (Roberts et al. 1994 ) and we use two-hybrid assays and coimmunoprecipitation experiments to show that Mad3p can also bind to Bub3p.
In addition, we find that Mad3p interacts with Mad2p and the cell cycle regulator Cdc20p.
We show that the two regions of homology between Mad3p and Bub1p are crucial for these interactions and identify loss of function mutations within each domain of Mad3p.
We discuss roles for Mad3p and its interactions with other spindle checkpoint proteins and with Cdc20p, the target of the checkpoint.


Step 3: Interactive Fact Extraction:

A fact is a relationship between two entities.
Relationships are typically verbs or verb phrases.
Entities are typically nouns or noun phrases and are always quoted.


The KaKer assigns priorities to adjectives as follows:


Unnamed entities are given unique temporary names on-the-fly.
The relation "is located in" is always used to tag "where" information.
The relation "is found in" is always used to indicate organism of origin.
The relation "is part of" is always used when indicating a cellular component.

We show that MAD3 encodes a novel 58-kD nuclear protein which is not essential for viability, but is an integral component of the spindle checkpoint in budding yeast.
Sequence analysis reveals two regions of Mad3p that are 46 and 47% identical to sequences in the NH2-terminal region of the budding yeast Bub1 protein kinase.

Note that these facts came from the first sentence, but required lookahead to the second sentence for the name "Mad3p". Otherwise we have to create a unique temporary name "lamda", and then go back later and purge it out.

Fact: "MAD3" is a "gene".
Fact: "Mad3p" is a "protein".
Fact: "MAD3" codes for "Mad3p". // "encodes" is a synonym verb phrase for "codes for"
Fact: "Mad3p" is located in "nucleus".
Fact: "Mad3p" has a molecular weight of "58-kD".
Fact: "Mad3p" is [novel].
Fact: "Mad3p" is not essential for "viability". // this will change to "not required"
Fact: "Mad3p" is located in "spindle checkpoint".
Fact: "Mad3p" is found in "budding yeast".
Fact: "spindle checkpoint" is a "cell component".
Fact: "spindle checkpoint" is part of "(budding) yeast".

 

Sequence analysis reveals two regions of Mad3p that are 46 and 47% identical to sequences in the NH2-terminal region of the budding yeast Bub1 protein kinase.
Bub1p is known to bind Bub3p (Roberts et al. 1994 ) and we use two-hybrid assays and coimmunoprecipitation experiments to show that Mad3p can also bind to Bub3p.

Note that these facts came from the second sentence, but required lookahead to the third sentence for the name "Bub1p".

Fact: "Mad3p" shares homologies with "Bub1p". // Should we quantify the consensus here?
Fact: "Mad3p" binds to "Bub3p". // The "binds to" relation is extremely important for CellWorld.
Fact: "Yeast" buds.
Fact: "Bub3p" is a "protein".
Fact: "Bub1p" is a "kinase". // this has a graphical interpretation, i.e. show me all proteins that are kinases.
Fact: "kinase" phosphorylates "substrate".



 
In addition, we find that Mad3p interacts with Mad2p and the cell cycle regulator Cdc20p.

Fact: "Mad3p" interacts with "Mad2p". // in a graphical query, the "interacts with" relation could be treated as synonymous with "complexes with"
Fact: "Mad3p" interacts with "Cdc20p". // one fact per fact!
Fact: "Cdc20p" is a "cell cycle regulator".

We show that the two regions of homology between Mad3p and Bub1p are crucial for these interactions and identify loss of function mutations within each domain of Mad3p.
We discuss roles for Mad3p and its interactions with other spindle checkpoint proteins and with Cdc20p, the target of the checkpoint.

Fact: "Mutations" may cause "loss of function".



Waystation

Two kinds of facts emerge. Explicit facts come from simple translation of the text. Implicit facts come from the understanding that the trained reader brings to the table. Implicit facts must be enumerated in order to connect the explicit facts to the network of facts at large. The fact "genes" code for "proteins" is an example. "Everyone" knows this. Yet an automatic system will not function properly unless this critical, obvious-to-a-human fact is provided.

To enable rapid, semi-automated processing, knowledge extraction workers, or "data miners", would have text analysis helper programs that find synonyms for nouns and for relationships. Preferred words could then be used, reducing the search burden and increasing the consistency and compactness of the knowledge map. The use of the knowledge map, in the context of CellWorld, will be discussed in the sequel.


Translation

It is now necessary to translate these computable facts into knowledge graphs. A knowledge graph is a directed graph that connects entities via arrows. Prior to translation we "collect our facts". We use the syntax of GraphViz, kindly provided by the dot group [Examples].

Fact: "MAD3" is a "gene".
Fact: "Mad3p" is a "protein".
Fact: "MAD3" codes for "Mad3p".
Fact: "Mad3p" is found in "nucleus".
Fact: "Mad3p" weighs "58-kD".
Fact: "Mad3p" is [novel].
Fact: "Mad3p" is not required for "viability".
Fact: "Mad3p" is located in "spindle checkpoint".
Fact: "Mad3p" is found in "budding yeast".
Fact: "spindle checkpoint" is a "cell component".
Fact: "spindle checkpoint" is part of "(budding) yeast".
Fact: "Mad3p" shares homologies with "Bub1p".
Fact: "Mad3p" binds to "Bub3p".
Fact: "Yeast" buds.
Fact: "Bub3p" is a "protein".
Fact: "Bub1p" is a "kinase".
Fact: "kinase" phosphorylates "substrate".
Fact: "Mad3p" interacts with "Mad2p".
Fact: "Mad3p" interacts with "Cdc20p".
Fact: "Cdc20p" is a "cell cycle regulator".

First we collect our nodes, which are usually noun entities. This can be done automatically.

Entities
MAD3, gene, Mad3p, protein, nucleus, 58-kD, viability, spindle checkpoint, budding yeast, cell component, yeast, Bub1p, Bub3p, kinase, substrate, Mad2p, Cdc20p, cell cycle regulator.

Second we sort our nodes so that alphabetically related concepts are near each other. This is also automatic.
Entities
58-kD, Bub1p, Bub3p, budding yeast, Cdc20p, cell component, cell cycle regulator, gene, kinase, Mad2p, MAD3, Mad3p, nucleus, protein, spindle checkpoint, substrate, viability, yeast.
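
Both automatic steps are easy to sketch. The Python fragment below is an assumption about how they might be done, not the author's tool; the helper names are hypothetical and only a few of the collected facts are used. Entities are gathered in order of first appearance, then sorted without regard to case so that MAD3 and Mad3p land next to each other.

```python
facts = [
    ("MAD3", "is a", "gene"),
    ("Mad3p", "is a", "protein"),
    ("MAD3", "codes for", "Mad3p"),
    ("Mad3p", "is found in", "nucleus"),
    ("Mad3p", "weighs", "58-kD"),
    ("Mad3p", "binds to", "Bub3p"),
]


def collect_entities(facts):
    """Return each entity once, in order of first appearance."""
    names = []
    for subject, _relation, obj in facts:
        names.extend([subject, obj])
    # dict.fromkeys drops duplicates while preserving insertion order
    return list(dict.fromkeys(names))


def sort_entities(entities):
    """Sort alphabetically, ignoring upper/lower case."""
    return sorted(entities, key=str.casefold)


entities = collect_entities(facts)
print(", ".join(entities))
print(", ".join(sort_entities(entities)))
```

A case-sensitive sort would place all capitalized names ahead of "budding yeast" and "gene", which is why the sketch folds case before comparing.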

Then we group loosely related concepts. This is manual, but straightforward, a uniquely human activity.

Entities
58-kD,
Bub1p, Bub3p, Cdc20p, Mad2p, Mad3p,
MAD3,
cell component, cell cycle regulator, kinase, spindle checkpoint, nucleus, gene, protein, substrate,
viability,
budding yeast, yeast.

We then assign the concepts to existing categories, or create new categories.

Entities
genes: MAD3
proteins: Bub1p, Bub3p, Cdc20p, Mad2p, Mad3p
components: cell component, cell cycle regulator, kinase, spindle checkpoint, nucleus, gene, protein, substrate
organisms: yeast
properties: 58-kD, viable, budding

We now use colors that have been assigned to the categories. These are drawn from the X11 standard colors, but can also be specified in terms of hue, saturation and lightness later on.

genes: reds
proteins: oranges
components: yellows
organisms: greens
properties: grays

It turns out that the act of assigning colors to categories is a very powerful conceptual tool. Creating and coloring categories significantly reduces the need to add implicit facts. It also reduces clutter.

Now we can declare our nodes in dot format, naming the graph by its citation. Later it will become part of the larger knowledge map.

graph J_Cell_Bio_148_5_871
{
graph [ color = green] ; /* background attribute doesn't work in dotty or neato*/
node [shape=ellipse, style=filled, color=crimson] MAD3;
node [color=darkorange] Bub1p; Bub3p; Cdc20p; Mad2p; Mad3p;
node [color=lightyellow] "cell component"; "cell cycle regulator"; kinase; "spindle checkpoint"; nucleus; gene; protein; substrate;
node [color=mediumseagreen] yeast;
node [shape=box, color=lightgray] "58-kD"; viable; budding;
}
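
Node declarations like these could be generated from the category and color tables rather than typed by hand. The Python sketch below is one hedged way to do it; the dictionaries, the function names, and the choice to quote every node name are assumptions made for illustration (dot accepts quoted names, so quoting everything is a simplification, not a requirement). Only some of the entities are included.

```python
CATEGORIES = {
    "genes": ["MAD3"],
    "proteins": ["Bub1p", "Bub3p", "Cdc20p", "Mad2p", "Mad3p"],
    "components": ["cell component", "kinase", "nucleus"],
    "organisms": ["yeast"],
}

COLORS = {
    "genes": "crimson",
    "proteins": "darkorange",
    "components": "lightyellow",
    "organisms": "mediumseagreen",
}


def quote(name):
    """Wrap a name in double quotes, escaping any quotes inside it."""
    return '"' + name.replace('"', '\\"') + '"'


def node_statements(categories, colors):
    """Return one dot 'node' line per category."""
    lines = []
    for category, names in categories.items():
        nodes = " ".join(quote(name) + ";" for name in names)
        lines.append(f"node [color={colors[category]}] {nodes}")
    return lines


for line in node_statements(CATEGORIES, COLORS):
    print(line)
```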

Running this through dotty produces the listing of nodes:

We can now repeat this process for the relations between entities. We say that "is located in" is synonymous with "is found in". Note that "found with" is different from "found in", so we would not call those two relations synonyms. We also say that "is located in" is synonymous with "is part of". Again, very slight differences in encoding can have a large impact on the utility of the knowledge map. This is why the process must be semi-automatic. Human readers are much better at reducing ambiguity than machines. A famous example is the sentence "time flies like an arrow", which is said to have many possible interpretations, but far fewer correct ones. Finally, "binds to" will be considered synonymous with "interacts with".
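
Once a human has decided which relations are synonyms, applying the decision is mechanical. The Python sketch below assumes a simple lookup table of preferred wordings; the table name and function name are hypothetical, and the entries are the synonym choices stated in this paper.

```python
# Synonym decisions made by the human reader: variant -> preferred wording
PREFERRED = {
    "is located in": "is found in",
    "is part of": "is found in",
    "binds to": "interacts with",
    "encodes": "codes for",
}


def normalize(relation):
    """Replace a relation by its preferred wording, if one is defined."""
    return PREFERRED.get(relation, relation)


facts = [
    ("Mad3p", "is located in", "spindle checkpoint"),
    ("spindle checkpoint", "is part of", "budding yeast"),
    ("Mad3p", "binds to", "Bub3p"),
    ("Mad3p", "interacts with", "Mad2p"),
]

normalized = [(s, normalize(r), o) for s, r, o in facts]
relations = sorted({r for _s, r, _o in normalized})
print(relations)
```

Relations without an entry pass through unchanged, so "found with" would never be silently merged into "found in".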

Our final list of relations is:

"is a", "codes for", "is found in", "weighs", "is not required for", "shares homologies with", "buds", "phosphorylates", "interacts with"

Fact: "MAD3" is a gene.
Fact: "Mad3p" is a protein.
Fact: "Bub3p" is a "protein".
Fact: "Bub1p" is a "kinase".
Fact: "Cdc20p" is a "cell cycle regulator".
Fact: "spindle checkpoint" is a "cell component".

Fact: "Mad3p" is found in nucleus.
Fact: "Mad3p" is found in "spindle checkpoint".
Fact: "Mad3p" is found in "budding yeast".
Fact: "spindle checkpoint" is found in "(budding) yeast".

Fact: "Mad3p" interacts with "Bub3p"
Fact: "Mad3p" interacts with "Mad2p".
Fact: "Mad3p" interacts with "Cdc20p".

Fact: "Mad3p" shares homologies with Bub1p.
Fact: "MAD3" codes for "Mad3p".


Fact: "Mad3p" weighs "58-kD".
Fact: "Mad3p" is not essential for "viability".
Fact: "Yeast" buds.

Fact: "kinase" phosphorylates "substrate".
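
The regrouped list can be tallied mechanically. This short Python sketch simply counts the relation of each fact listed above, in the same order; it is a bookkeeping aid written for illustration, not part of the original workflow.

```python
from collections import Counter

# One entry per fact in the regrouped list, keeping only the relation
relations = (
    ["is a"] * 6
    + ["is found in"] * 4
    + ["interacts with"] * 3
    + [
        "shares homologies with",
        "codes for",
        "weighs",
        "is not essential for",
        "buds",
        "phosphorylates",
    ]
)

tally = Counter(relations)
print(len(relations), "facts")
print(len(tally), "distinct relations")
print(tally["is a"], 'facts use "is a"')
```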

Out of 19 implicit and explicit facts, only 9 distinct relations appear, and 6 of the facts use "is a". We now translate these to directed graph form a "chunk" at a time. This step can be automated!
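
As a rough illustration of that automation, the Python sketch below turns (subject, relation, object) triples into dot edge statements. The function names are hypothetical, and quoting every name is a simplifying assumption; the hand-written listings in this paper quote only names that contain spaces or punctuation.

```python
def quote(name):
    """Wrap a name in double quotes, escaping any quotes inside it."""
    return '"' + name.replace('"', '\\"') + '"'


def edge_statement(subject, relation, obj):
    """Render one fact as a labeled dot edge."""
    return f"{quote(subject)} -> {quote(obj)} [label={quote(relation)}];"


def digraph(name, facts):
    """Wrap the edge statements in a named digraph block."""
    lines = [f"digraph {name}", "{"]
    lines += ["    " + edge_statement(*fact) for fact in facts]
    lines.append("}")
    return "\n".join(lines)


facts = [
    ("MAD3", "codes for", "Mad3p"),
    ("Mad3p", "interacts with", "Bub3p"),
    ("Bub1p", "is a", "kinase"),
]

print(digraph("J_Cell_Bio_148_5_871", facts))
```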

Fact: "MAD3" is a gene.
Fact: "Mad3p" is a protein.
Fact: "Bub3p" is a "protein".
Fact: "Bub1p" is a "kinase".
Fact: "Cdc20p" is a "cell cycle regulator".
Fact: "spindle checkpoint" is a "cell component".

MAD3 -> gene [label = "is a"] ;
Mad3p-> protein [label = "is a"] ;
Bub3p-> protein [label = "is a"] ;
Bub1p -> kinase [label = "is a"] ;
Cdc20p-> "cell cycle regulator" [label = "is a"] ;
"spindle checkpoint" -> "cell component" [label = "is a"];

Fact: "Mad3p" is found in nucleus.
Fact: "Mad3p" is found in "spindle checkpoint".
Fact: "Mad3p" is found in "budding yeast".
Fact: "spindle checkpoint" is found in "(budding) yeast".

Mad3p -> nucleus [label = "is found in"]
Mad3p -> "spindle checkpoint" [label = "is found in"]
Mad3p -> "yeast" [label = "is found in"]
"spindle checkpoint" -> "budding" [label = "is found in"]

Fact: "Mad3p" interacts with "Bub3p"
Fact: "Mad3p" interacts with "Mad2p".
Fact: "Mad3p" interacts with "Cdc20p".

"Mad3p" -> "Bub3p" [label = "interacts with"];
"Mad3p"-> "Mad2p" [label = "interacts with"];
"Mad3p"-> "Cdc20p" [label = "interacts with"];

Fact: "Mad3p" shares homologies with Bub1p.
Fact: "MAD3" codes for "Mad3p".

Fact: "Mad3p" weighs "58-kD".
Fact: "Mad3p" is not essential for "viability".
Fact: "Yeast" buds.
Fact: "kinase" phosphorylates "substrate".

"Mad3p" -> Bub1p [label = "shares\nhomologies"];
"MAD3" -> "Mad3p" [label = "codes for"];
"Mad3p" -> "58-kD" [label = "weighs"];
"Mad3p" -> "viability" [label="is not essential", style=dashed];
"budding" -> "yeast" [label="a kind of"];
"kinase" -> "substrate" [label="phosphorylates"];

digraph J_Cell_Bio_148_5_871
{
    graph [fontsize=16, fontname="Comic Sans", fontcolor=black, center=1];

    /* nodes, colored by category */
    node [shape=ellipse, style=filled, color=crimson, fontsize=20, fontname="Comic Sans"] MAD3;
    node [color=darkorange] Bub1p; Bub3p; Cdc20p; Mad2p; Mad3p;
    node [color=lightyellow] "cell component"; "cell cycle regulator"; kinase; "spindle checkpoint"; nucleus; gene; protein; substrate;
    node [color=mediumseagreen] yeast;
    node [shape=box, color=lightgray] "58-kD"; viability; budding;

    edge [len=2.0, fontsize=16, fontname="Comic Sans"];

    /* "is a" facts */
    MAD3 -> gene [label="is a"];
    Mad3p -> protein [label="is a"];
    Bub3p -> protein [label="is a"];
    Bub1p -> kinase [label="is a"];
    Cdc20p -> "cell cycle regulator" [label="is a"];
    "spindle checkpoint" -> "cell component" [label="is a"];

    /* "is found in" facts */
    Mad3p -> nucleus [label="is found in"];
    Mad3p -> "spindle checkpoint" [label="is found in"];
    Mad3p -> yeast [label="is found in"];
    "spindle checkpoint" -> budding [label="is found in"];

    /* "interacts with" facts */
    Mad3p -> Bub3p [label="interacts with"];
    Mad3p -> Mad2p [label="interacts with"];
    Mad3p -> Cdc20p [label="interacts with"];

    /* remaining facts */
    Mad3p -> Bub1p [label="shares\nhomologies"];
    MAD3 -> Mad3p [label="codes for"];
    Mad3p -> "58-kD" [label="weighs"];
    Mad3p -> viability [label="is not essential", style=dashed];
    budding -> yeast [label="a kind of"];
    kinase -> substrate [label="phosphorylates"];
}
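
Earlier we noted that entities may be ranked by the number of arcs flowing into or out of them. With the graph written down, that count is easy to compute. The Python sketch below copies the 19 edges of the listing above as (tail, head) pairs and tallies arcs per entity; the function name is hypothetical and edge labels are ignored.

```python
from collections import Counter

# (tail, head) pairs copied from the digraph listing
edges = [
    ("MAD3", "gene"), ("Mad3p", "protein"), ("Bub3p", "protein"),
    ("Bub1p", "kinase"), ("Cdc20p", "cell cycle regulator"),
    ("spindle checkpoint", "cell component"),
    ("Mad3p", "nucleus"), ("Mad3p", "spindle checkpoint"),
    ("Mad3p", "yeast"), ("spindle checkpoint", "budding"),
    ("Mad3p", "Bub3p"), ("Mad3p", "Mad2p"), ("Mad3p", "Cdc20p"),
    ("Mad3p", "Bub1p"), ("MAD3", "Mad3p"), ("Mad3p", "58-kD"),
    ("Mad3p", "viability"), ("budding", "yeast"), ("kinase", "substrate"),
]


def arc_counts(edges):
    """Count arcs flowing into or out of each entity."""
    counts = Counter()
    for tail, head in edges:
        counts[tail] += 1
        counts[head] += 1
    return counts


counts = arc_counts(edges)
for name, n in counts.most_common(3):
    print(name, n)
```

Mad3p dominates with 11 arcs, which is unsurprising for the subject of the abstract; in a larger combined map, the same count would show which entities have been studied heavily and which have not.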


Observations and Improvements

With this first version of the graph in hand, we can see that we did not need the implicit facts if we define the meaning of the color codes in a legend. This will help reduce clutter.

These changes are now incorporated and added to form the final knowledge map shown at the beginning of the paper.

Results and Conclusion

Two implications arise from the processing of this one abstract in vacuo. First, kinases appear to play a role in cell cycle regulation. Second, kinases phosphorylate. Perhaps every biochemist might be expected to know these things. But the beauty is that someone who is not a biochemist might now be able to make a similar observation. Are we trying to put biochemists out of business? Absolutely not. But something that is obvious to one biochemist might not be evident to another. The exercise also indicates that hidden assumptions and implications can be made explicitly visible. This means that, even for experts, a knowledge map could be useful, and someday perhaps indispensable. Since the very definition of an expert is "one who knows a lot about a few things", this argument seems reasonable.

Informally, there are two kinds of mental processing, recognition and reconstruction. Recognition means remembering something one has seen before AFTER seeing it again. Reconstruction is the ability to produce from scratch a complete recollection of the information. A knowledge map addresses a difficult issue when drinking from the firehose - persistence of memory.

An additional value of knowledge maps is that they illustrate transitive relations, that is, relations one node removed. In this graph we might expect to find Bub1p and the other proteins in the nucleus, since something they bind to is found in the nucleus. This relationship might not be immediately evident to readers of the equivalent text. We might also infer that important aspects of cell cycle regulation are occurring in the nucleus.
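
A one-node-removed inference of this kind can be sketched in a few lines of Python. The rule used here (if two entities interact and one has a known location, the other may share it) is an assumption made for illustration, and its output should be read as hypotheses for a human to check, not as extracted facts. The function name and the "may be found in" wording are hypothetical.

```python
facts = [
    ("Mad3p", "is found in", "nucleus"),
    ("Mad3p", "interacts with", "Bub3p"),
    ("Mad3p", "interacts with", "Mad2p"),
    ("Mad3p", "interacts with", "Cdc20p"),
]


def suggest_locations(facts):
    """Propose locations for entities that interact with a located entity."""
    located = {}
    for subject, relation, obj in facts:
        if relation == "is found in":
            located.setdefault(subject, set()).add(obj)

    suggestions = set()
    for subject, relation, obj in facts:
        if relation != "interacts with":
            continue
        # Treat the interaction as symmetric: try both directions
        for known, partner in ((subject, obj), (obj, subject)):
            for place in located.get(known, ()):
                if place not in located.get(partner, ()):
                    suggestions.add((partner, "may be found in", place))
    return sorted(suggestions)


for hypothesis in suggest_locations(facts):
    print(hypothesis)
```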

Although this procedure makes it appear tedious, an experienced "dataminer" can process this information quite rapidly, particularly with the assistance of interactive tools. Work is proceeding on building just such interactive tools for creation of a biochemistry and molecular biology knowledge map for CellWorld.


Acknowledgments

Discussions with several people have helped me refine these ideas in the context of biochemistry, molecular biology, cancer and CellWorld. The first was Sara-Jayne Farmer who along with various web searchers introduced me to the bibliography of Bayesian belief networks, including the work of Wray Buntine. Also appreciated were discussions with Mark Turpin, Leo Blume and Marilyn Fulper. Availability of the Netica software allowed rapid prototyping of belief networks. In Washington, D.C., Robert Beckman showed me an example of insurance fraud detection using NetMap. Sincere appreciation also to Hardwick et al. and the Journal of Cell Biology for this paper excerpt. Gratitude is expressed to John Ellson of Lucent Technologies and Stephen North of AT&T Research for their wonderful work on the dot program. Thanks also to Nicholas Warren who created a Flash animation of this process.