There have been a lot of nice visualizations of metadata relationships recently (this is the end of the 2000s).

= Linguistic Syntax Decomposition

This has some relationship with the linguistic syntactic decomposition trees and lexical analysis pioneered by Noam Chomsky et al.,

http://www.beaugrande.com/TheoryMetaTheory%202.jpg

http://content.ll-0.com/lweaver/syntax_tree.jpg

but is expanded to display other kinds of relationships. Some software even allows you to interactively assemble, or correct, computational linguistic decompositions of existing sentences.

http://mac.softpedia.com/screenshots/20-151_1.jpg

One purpose of the actual work (not the visualization but the actual decomposition) is grammar correction (maybe helpful for spell check also) and automatic translation (which would also be combined with some statistical co-occurrence frequency models...).

= Semantic Decomposition

Semantic Decomposition is sort of an attempt to create a language-neutral representation of content so that a rendition in a natural language can be synthesized (rendered), if you like.

For those not familiar with Prolog and other artificial languages from the 1980s or before,

http://www.nyu.edu/pages/linguistics/workbook/lehner/christop.gif

such languages are specialized in expressing relationships (e.g. is-a).

Other useful expressions are "semantic network" and, by extension, "mind map". One thing that is unclear about semantics is the level at which we aspire to establish the semantic data.

http://jcmc.indiana.edu/vol11/issue3/bers.image005.gif

 

= Topical Summarization

The first degree of summarization would be to identify the main words and use them in an attempt to locate the text in a reference / classification space. Typically, after removing the most common words (e.g. "a", "the", "is"), such a system would scan each word and remove "morphological" variations (i.e. use as verb, noun, plural form, ...). The main form of each word would then have a pre-existing location in the reference thesaurus / dictionary. So the output of such a pass would be a list of words and their parent locations in an indexing space of some sort.
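A minimal sketch of such a first pass (the stopword list, the crude suffix-stripping stand-in for real morphological folding, and the tiny thesaurus table mapping base forms to classification codes are all invented for illustration):

```python
import re
from collections import Counter

STOPWORDS = {"a", "the", "is", "be", "for", "and", "of", "to"}   # toy list

# Hypothetical thesaurus: base form -> location in some indexing space
THESAURUS = {"cat": "599.75", "bird": "598", "run": "796.42"}

def base_form(word):
    """Very crude stand-in for folding morphological variations."""
    for suffix in ("ning", "ing", "ers", "er", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def first_pass(text):
    words = [w.lower() for w in re.findall(r"[a-zA-Z']+", text)]
    bases = [base_form(w) for w in words if w not in STOPWORDS]
    counts = Counter(bases)
    # Output: (word, count, location in the reference space if known)
    return [(w, n, THESAURUS.get(w)) for w, n in counts.most_common()]
```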

The thesaurus mapping totally affects the result here. Right away we have the issue of distance. It's not because something is at Dewey 101.023 that it's conceptually further away from 902.035 than from 435.123. Similarly, the gap between 101.235 and 191.453 is not a meaningful measure within a main category, if you like. This is already a difficult topic in itself, as it opens the ontology (classification) issue. Of course, the first-degree work is sometimes done for you in scientific articles when they include classification tags.

== Ontologies

We use ontology here in the classical computational linguistics sense. A classical ontology would be bird -> hawk ... etc. Often an ontology works well within a particular domain but can become really complicated as a useful idea in a more generic context. For example, Ford is-a car, so each time the word Ford is found it also adds weight to car. So at some point we know it's an article related to cars even if the word car is not used frequently.
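A sketch of that weight-propagation idea, with a tiny hand-made is-a table (both the table and the decay factor are invented for illustration):

```python
from collections import defaultdict

# Tiny hypothetical is-a ontology: child -> parent
IS_A = {"hawk": "bird", "sparrow": "bird", "ford": "car", "toyota": "car",
        "car": "vehicle", "bird": "animal"}

def propagate(word_counts, decay=0.5):
    """Each counted word also adds (decayed) weight to its ancestors,
    so a Page full of 'Ford' still scores highly for 'car'."""
    weights = defaultdict(float)
    for word, count in word_counts.items():
        weight, node = float(count), word
        while node is not None:
            weights[node] += weight
            weight *= decay
            node = IS_A.get(node)
    return dict(weights)

# e.g. propagate({"ford": 8, "car": 1}) gives 'car' a weight of 5.0
```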

http://www.scientific-computing.com/images/scwjulaug05ontologies1.gif

== Thesaurus

Similarly, the thesaurus is used to disambiguate a word used in multiple domains. That is, a word by itself might have multiple usages, and its synonyms and antonyms might have non-overlapping domains, so their presence within the same content helps refine the domain of definition that should be considered.
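A sketch of that refinement, assuming a hypothetical sense inventory where each sense of an ambiguous word carries the related words of its domain; the sense whose domain overlaps most with the rest of the Page wins:

```python
# Hypothetical sense inventory: word -> {sense: related domain words}
SENSES = {
    "bank": {"finance": {"money", "loan", "account", "credit"},
             "river":   {"water", "shore", "stream", "fishing"}},
}

def disambiguate(word, page_words):
    """Pick the sense whose domain words overlap most with the Page."""
    senses = SENSES.get(word)
    if not senses:
        return None
    return max(senses, key=lambda s: len(senses[s] & set(page_words)))

# disambiguate("bank", ["the", "loan", "was", "approved"]) -> "finance"
```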

http://www.visualthesaurus.com/landing/?ad=ddc.large&word=cat&utm_source=ddc&utm_medium=large&utm_campaign=VT

If a few questions are allowed (user guidance) then the categorization can probably quickly converge to something much more precise.

== Word Co-occurrence

So assume we have the initial granularity of an article of some length (what we call a logical Page; it could be a chapter). Within that Page a number of unique strings (space-separated sequences of characters) are used. We make upper/lower case irrelevant for the first word of a sentence (?). Say that initial list has 2000 such "words". We eliminate all the overly generic words in English ("a", "is", "be", "the", "for"). Then we try to combine all the "morphological" variations of a word (run: ran, run, running, runs, runner), and maybe now we have 500 words used in our Page. Each language probably has a few cases where the plural or a conjugation produces an identical spelling for two different words. These strings are known from the dictionary, and a second pass simply uses the thesaurus and the other words in the Page to disambiguate the word. Of course here we are talking about non-fiction, as poetry explicitly plays with multiple meanings and so would not work well under such an analytic scheme.

Now we have a list of 500 words and we want to build a co-occurrence table. One way to do it is to keep a buffer of 3 sentences (the previous, the current and the next sentence), where a sentence is period-terminated. One way would be to assign 3 points to the word immediately before or after, 2 points if within the same sentence, and 1 point if in the previous or next sentence. So we advance word by word through the text (so we are always in the current sentence). We look at the base word before and after, and if it is a "counted" word (i.e. not "a") then we add the pair to the co-occurrence list, for example "wild"-"cat". So at the end of this we have our initial list of words and a new list of paired words (e.g. wild : cat and cat : wild is the same pair).
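A minimal sketch of that scoring pass, assuming the text has already been split into period-terminated sentences and reduced to lists of base-form words with the generic words removed (the 3/2/1 point scheme is the one just described):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_scores(sentences):
    """Score symmetric word pairs: 3 points if adjacent, 2 points if in the
    same sentence, 1 point if one sentence apart.  `sentences` is a list of
    lists of base-form words, generic words already removed."""
    scores = defaultdict(int)
    for i, sent in enumerate(sentences):
        adjacent = {frozenset(p) for p in zip(sent, sent[1:]) if p[0] != p[1]}
        for pair in adjacent:
            scores[pair] += 3
        for w1, w2 in combinations(sent, 2):
            pair = frozenset((w1, w2))
            if len(pair) == 2 and pair not in adjacent:
                scores[pair] += 2                  # same sentence, not adjacent
        if i + 1 < len(sentences):                 # previous/next sentence
            for w1 in sent:
                for w2 in sentences[i + 1]:
                    if w1 != w2:
                        scores[frozenset((w1, w2))] += 1
    return scores

# e.g. cooccurrence_scores([["wild", "cat", "hunts"], ["cat", "sleeps"]])
# gives the pair {wild, cat} 3 + 1 = 4 points.
```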

http://www.textanalysis.info/

= Abstracting Text

So as a first pass we now have the frequency of base words used (we can sort words by how many times they are used) and also a way to sort pairs of words. We could create another derivative of word pairs: triplets. For example, keep all the pairs that co-occurred more than once and now, for each of these occurrences, see if there is co-occurrence when that pair is present with a third base word. Maybe at that step we stretch to 2 sentences before and 2 after and count 3 points if in the same sentence as the pair, 2 points if in the previous or next sentence and 1 point if 2 sentences away. There is probably a variation on that which takes into account the average word count per sentence to modulate this, so the result is more similar across writing styles. Another way would be more pyramidal: co-occur with the word before and after, then 2 words before and after, and so on... The point to make here is that if we were to use this for comparative purposes, the two texts would obviously need to use the same occurrence computation model.
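One way to write down that pair-to-triplet extension (the ±2 sentence window and the 3/2/1 weights follow the description; the "co-occurred more than once" threshold becomes min_pair_score, and pair_scores is the table produced by the sketch above):

```python
from collections import defaultdict

def triplet_scores(sentences, pair_scores, min_pair_score=2):
    """For pairs scoring at least `min_pair_score`, score a third base word:
    3 points if it sits in the same sentence as the pair, 2 points if one
    sentence away, 1 point if two sentences away."""
    frequent = [p for p, s in pair_scores.items() if s >= min_pair_score]
    scores = defaultdict(int)
    for i, sent in enumerate(sentences):
        present = set(sent)
        for pair in frequent:
            if not pair <= present:
                continue
            lo, hi = max(0, i - 2), min(len(sentences), i + 3)
            for j in range(lo, hi):
                weight = 3 - abs(i - j)            # 3, 2 or 1 point
                for w in sentences[j]:
                    if w not in pair:
                        scores[(pair, w)] += weight
    return scores
```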

So the next level of summarization, after creating some sort of classification index, would be to reduce the text, compressing the amount of characters to a much smaller set. A particular system generates a particular point system, which we use to skim out sentences with a low point count. Sentences are eliminated until we arrive at the requested size (e.g. the description must fit in 200 words). What we generate like this is a collection of sentences from the text. We probably maintain their initial presentation order. As expressed here we probably over-amplify longer sentences, so we would probably add some penalty for long sentences, defined as deviation from the average sentence length or something.
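A sketch of that sentence-skimming step, reusing the pair scores from above; the long-sentence penalty here (divide by the sentence's length relative to the average) is just one way to implement the deviation idea:

```python
from itertools import combinations

def extract_summary(sentences, pair_scores, max_chars=1000):
    """Keep the highest-scoring sentences, in their original order, until the
    result fits in `max_chars`.  `sentences` is a list of
    (original_text, base_words) tuples."""
    avg_len = sum(len(t) for t, _ in sentences) / len(sentences)
    scored = []
    for idx, (text, words) in enumerate(sentences):
        points = sum(pair_scores.get(frozenset(p), 0)
                     for p in combinations(set(words), 2))
        penalty = max(len(text) / avg_len, 1.0)    # penalize long sentences
        scored.append((points / penalty, idx, text))
    keep, total = [], 0
    for points, idx, text in sorted(scored, reverse=True):
        if total + len(text) <= max_chars:
            keep.append((idx, text))
            total += len(text)
    return " ".join(text for _, text in sorted(keep))
```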

== Content Compression

What we have here so far is also a way to zoom on content that makes more sense than scaling the font size down to 2, for example. Of course such abstracted ("zoomed-in") text might appear surrealistic at best when read as is, but it still conveys more relevant information than simply graphically scaling content. This sort of technique can be scaled to the content of a whole book, for example to produce the one-page version of the book.

== Content Linkages

The other possible output is to instead focus on revealing important co-occurrences. This becomes very related to (it expands the basic idea of) extracting sets of keywords for content retrieval tasks (searching). This can be a list of words, or a graph displaying the relationships between the main words. The value of such things often depends on the data source type: police reports, news reports, legal cases, medical publications, ... have a certain form where sometimes factual information extraction could make sense, as the domain is partly restricted to start with. In such cases it's possible that higher-level analysis can be performed, at least for retrieval purposes. That is, the set of source data is known to usually have a date, a city or some geographical location where this happened or is a concern, etc. Knowbot agents, for example, could scan news reports for all crimes involving a yo-yo in cities less than 100 miles from Detroit and feed that into your news browser.
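A small sketch of turning the pair scores into such a graph, here just an adjacency map between the strongest-linked of the top words, which could feed either a keyword list for retrieval or a relationship display:

```python
from collections import defaultdict

def keyword_graph(pair_scores, top_n=20, min_score=3):
    """Keep edges scoring at least `min_score` between the `top_n` words
    with the highest total co-occurrence score."""
    totals = defaultdict(int)
    for pair, score in pair_scores.items():
        for w in pair:
            totals[w] += score
    top = set(sorted(totals, key=totals.get, reverse=True)[:top_n])
    graph = defaultdict(dict)
    for pair, score in pair_scores.items():
        w1, w2 = sorted(pair)
        if score >= min_score and w1 in top and w2 in top:
            graph[w1][w2] = graph[w2][w1] = score
    return graph
```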

= Abstracting Structure in Text

Something of value for this Book is capturing structure from a text stream that does not explicitly reveal it, for our presentation style. Essentially, many books don't have a structured table of contents, so digital readers that allow reformatting the text to any arbitrary resolution lose page references and such, making communication about content much more difficult ("on this web page, line 47, when the display size is medium in Firefox 3.0 and the browser is in full-screen mode on a 1920x1080 display..."). Part of the process of converting to this Book format implies revealing the text's logical paragraphs, which are the base object of reference.

A similar simple technique to reveal structure could work as follows. We can use "return of line" - "empty line" to identify paragraphs, the same way we dealt with sentences. Our job is then to partition these into small sets of such paragraphs, which we call in this Book "logical Paragraphs", and to tag a sub-title onto each. Without human interaction, we might consider it enough for now to just create one sub-text level this way: if the article has main sub-titles this would create a second level of sub-titles; if it has none then it would only create one level of sub-title.

So, assuming we have already built our list of co-occurring words, the first step would be, for each paragraph in the source text, to list all co-occurring words. We probably also want the list of all words that exist only once in the whole Page and put these in their paragraph's bucket. Our desired result is to end up with a sub-title tag for a set of 1 to N paragraphs. The text generation in this case does not have to be a complete sentence; it could be "wild cats and food". However, it has to be unique within the Page. We don't want two sub-titles to be the same, so a sub-title's length would need to grow if two Paragraphs ended up with the same sub-title in the first pass. We also don't generally want single-word sub-titles. Finally, relative initial paragraph length probably means something: we probably don't want to group into a larger logical Paragraph a paragraph that is already much larger than the others.
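A rough sketch of that sub-title pass, labelling each logical Paragraph with its strongest co-occurring word pair and growing the label when two labels collide (the grouping of source paragraphs into logical Paragraphs and the single-word/length rules are left out of this sketch):

```python
def subtitle_blocks(paragraph_words, pair_scores):
    """paragraph_words: one list of base-form words per logical Paragraph.
    Returns one sub-title per Paragraph, unique within the Page."""
    titles = []
    for words in paragraph_words:
        present = set(words)
        candidates = [(s, tuple(sorted(p))) for p, s in pair_scores.items()
                      if p <= present]
        label = list(max(candidates)[1]) if candidates else sorted(present)[:2]
        title = " ".join(label)
        while title in titles and len(label) < len(present):
            # collision: grow the sub-title with another word from the group
            label.append(next(w for w in sorted(present) if w not in label))
            title = " ".join(label)
        titles.append(title)
    return titles
```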

= Co-occurrence and Search Engines

Tested on Google:
search garcia on google: Results 1 - 10 of about 143,000,000
search occurence : Results 1 - 10 of about 48,200,000
occurrence + garcia: Results 1 - 10 of about 1,150,000
garcia + occurrence : Results 1 - 10 of about 1,350,000
occurence AND garcia: Results 1 - 10 of about 1,370,000 
garcia and occurence:  Results 1 - 10 of about 1,370,000 
Note: '+' seems not to be commutative???

What we are showing here is that looking up (this is an approximation) the presence of two words in a document can obviously reduce the number of initial prospect returns. It also shows it's a weak thing by itself.



http://jude-users.com/en/modules/weblog/index.php?user_id=5&start=20

http://nwb.slis.indiana.edu/research.html

just lost map of science

http://books.alerme.com/images/GemCyclo.jpg

 

Non-boring displays of most popular words:

http://www.cornel.stuffo.info/?p=127

http://thegrangehall.com/2008/11/03/the-week-in-grange/

 http://digitalscholarship.wordpress.com/2008/06/22/using-text-analysis-tools-for-comparison-mole-chocolate-cake/

 http://www.wordle.net/

 

http://www.slipperybrick.com/wp-content/uploads/2007/01/the-great-gatsby.jpg

 

http://www.infovis.net/printMag.php?num=103&lang=2

 

 The problem with iconology: if I provide these icons, do they mean something different?

http://www.mopo.ca/uploaded_images/shapes-750266.jpg

 

Looks like Brad is still alive

http://wbpaley.com/brad/