Automating Mini-Ontology Generation from Canonical Tables
by
Stephen Lynn
A thesis proposal submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of
Master of Science
Department of Computer Science
Brigham Young University
October 2007
Abstract
In this thesis work we propose, develop, and test MOGO (a Mini-Ontology GeneratOr). MOGO automates the generation of mini-ontologies from canonicalized tables of data, helping anyone trying to organize large amounts of existing data into a more searchable and accessible form. By using a number of heuristic rules for selecting, enhancing, and modifying ontology elements, MOGO allows users to automatically, semi-automatically, or manually generate conceptual mini-ontologies from canonicalized tables of data. Ideally, MOGO operates fully automatically while allowing users to intervene to direct and correct when necessary so that they can always satisfactorily complete the translation of canonicalized tables into mini-ontologies. We test MOGO experimentally using a set of tables selected by a third party, and we evaluate MOGO's output mini-ontologies using a set of metrics covering concept/value recognition, relationship discovery, and constraint discovery.
Introduction
From libraries filled with millions of books to the Internet available to anyone with a web browser, the amount of information available in the world is growing exponentially. With this information explosion come new challenges in organizing and finding information that is relevant to a user's needs. Most of the available information does not follow any consistent format or structure, making it difficult to extract in a way that supports queries beyond common keyword searching. One possible solution to this problem is structuring the information on the Internet into standardized forms (ontologies) that represent the inherent concepts, relationships, and constraints found in the information. Exposing the information in an ontological model enables an entirely new class of search algorithms, allowing queries to be expressed more completely and more explicitly than the standard keyword searches available today.
Few use ontology-based representations to organize information on the Internet because creating an ontology takes too much time and effort and requires a high degree of expertise. TANGO [9] is a project that reduces the time, effort, and degree of expertise needed by automating the process of creating an ontology from the concepts, relationships, and constraints found in sets of tabular data. As the second component of the overall TANGO project, MOGO (a Mini-Ontology GeneratOr) develops and implements the necessary algorithms and user interfaces for automatically, semi-automatically, or manually generating mini-ontologies from canonicalized tables of data. (The first component of the TANGO project interprets raw tables found on the web and elsewhere and reorganizes them as canonical tables. The third component merges a set of mini-ontologies into a large ontology representing a body of knowledge that is usable as a means of organizing information on the Internet.)
Related Work
Automating the creation of ontologies has become a widely researched area over the past few years, and researchers from many different backgrounds have contributed a variety of solutions. A common approach in the area of Natural Language Processing (NLP) attempts to learn ontologies by finding the terms, concepts, relations, and concept hierarchies existing in large collections of unstructured text documents. The lack of structure and appropriate metadata in these documents has so far made these approaches less than accurate, thus requiring significant human post-processing before the results can actually be used [3]. Our approach differs from typical NLP approaches by using tabular data as the source information. Tabular data is useful in the creation of ontologies because humans have already structured the data into a form representing the relationships it contains. This structure makes the automatic discovery of relationship information much more effective than algorithms based solely on unstructured text documents.
Researchers in the area of reverse engineering have approached the problem of automatic generation of ontologies in many ways. Benslimane et al. [1] focus on generating OWL ontologies using HTML web forms in conjunction with the database schema associated with the forms. While this method shows promise, their approach differs from ours in that it relies on access to an underlying database schema and is based on web forms rather than tables. The recent survey paper by Canfora and Di Penta [2] details how the majority of recent reverse engineering research focuses on using either database schemas or source code repositories as the input data for generating ontologies. Many of these projects have output goals similar to MOGO's, but the input data is drastically different.
Pivk et al. [8] have approached automatic ontology creation in a manner similar to MOGO. Their approach (implemented as a system called TARTAR) uses tabular data as the input in the same way MOGO does, with the eventual output being an ontology represented using F-Logic frames. F-Logic frames have their roots in object-oriented program modeling and are a formal way to represent object identity, complex objects, inheritance, polymorphic types, query methods, and encapsulation [7]. TARTAR focuses primarily on using statistical methods for string recognition and grouping to discover concepts and relationships in a table. Our approach makes use of some similar pattern-matching heuristics but also places a strong emphasis on heuristics employing linguistic clues to discover concepts, relationships, and constraints in a table. Additionally, MOGO is designed to be easily extended without requiring changes to the main code base. It is unclear whether TARTAR's implementation can be adjusted without access to the original source code.
Thesis Statement
In this thesis work, we develop a tool, called MOGO, to accurately generate mini-ontologies from canonicalized tables of data automatically, semi-automatically, or manually. Using a set of test tables selected by a third party, we plan to evaluate MOGO's accuracy in the following areas: concept/value recognition, relationship discovery, and constraint discovery.
Project Description
MOGO takes as input canonicalized tables of data based on Wang notation [10]. This notation preserves the labels found in the source table as well as their associated data values. The notation organizes label information in simple data structures called dimensions. Each dimension corresponds to a different axis of the table, similar to the different axes of a multi-dimensional array. Combining these dimensions allows every data cell to be referenced using one element from each dimension. Because Wang notation can represent any set of tabular data independent of layout, MOGO is agnostic to the data's original form.
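As an illustration of this addressing scheme, the following sketch shows how choosing one element from each dimension references a data cell. The Python structures and names here are purely illustrative, not MOGO's actual representation:

```python
# Each dimension is one axis of the table, holding that axis's labels.
# (Hypothetical dimension names; the sample data is from Figure 1.)
dimensions = {
    "Location": ["Delaware", "Maine", "Oregon", "Washington"],
    "Measure": ["Population", "Latitude", "Longitude"],
}

# Every data cell is addressed by one label from each dimension.
cells = {
    ("Delaware", "Population"): "817,376",
    ("Delaware", "Latitude"): "45",
    ("Delaware", "Longitude"): "-90",
}

def lookup(location, measure):
    """Reference a data cell by one element from each dimension."""
    return cells.get((location, measure))
```

Because the lookup depends only on the label combination, the original layout of the table (row-major, column-major, nested headers) no longer matters.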
To further enhance MOGO's ability to produce a useful mini-ontology, we extend standard Wang notation so that information beyond row and column labels and data values is preserved in canonicalized form. These enhancements include the identification of a table's title, caption, and footnotes as well as row, column, and value augmentations such as units of measure.
Based on the canonicalized input data, MOGO tries to produce a mini-ontology that conforms to the OSM data modeling language [5]. OSM provides a standard way of representing concepts, relationships, and constraints. MOGO uses the following basic steps to automatically generate a mini-ontology:
Concept/Value Recognition: MOGO extracts the set of concepts found in each of the dimensions and associates the table's data values with the appropriate concepts.
Relationship Discovery: MOGO adds relationship information to the concepts using structural and linguistic clues.
Constraint Discovery: MOGO adds constraint information to the mini-ontology by examining the table's data values.
MOGO performs all of these steps automatically and allows the user to: accept the mini-ontology without review, make adjustments to the mini-ontology, or manually rebuild the mini-ontology.
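The steps above can be sketched as a three-stage pipeline. The function names and toy return values below are illustrative only; they are not MOGO's actual API:

```python
# Stage 1: extract object sets from the dimensions and attach data values.
def recognize_concepts(table):
    # Toy output loosely modeled on the sample table of Figure 1.
    return {"table": table, "concepts": ["Region", "State"]}

# Stage 2: add relationship information using structural/linguistic clues.
def discover_relationships(ontology):
    ontology["relationships"] = [("Region", "State")]
    return ontology

# Stage 3: add constraints by examining the data values.
def discover_constraints(ontology):
    ontology["constraints"] = ["State -> Region is functional"]
    return ontology

def generate_mini_ontology(table):
    """Run the three stages in order to produce a mini-ontology."""
    return discover_constraints(discover_relationships(recognize_concepts(table)))
```

After the pipeline runs, the user can accept the result as-is, adjust it, or rebuild it manually.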
To illustrate how MOGO works, we use the table of geo-political data in Figure 1 as an example. We compiled a small amount of data from multiple tables to create a single sample table that illustrates the various facets of MOGO's processing abilities.
Region and State Information

Location        Population (2000)   Latitude   Longitude
Northeast           2,122,869
  Delaware            817,376          45         -90
  Maine             1,305,493          44         -93
Northwest           9,690,665
  Oregon            3,559,547          45        -120
  Washington        6,131,118          43        -120

Figure 1. Sample Table.
Figure 2 shows the sample table of Figure 1 in canonicalized form. MOGO uses XML as the input format for canonicalized tables. Input XML must validate against an XML-Schema document previously developed by others as part of the TANGO project. The Table tag contains a number of attributes useful to the overall TANGO project for uniquely identifying different tables. It also contains a title attribute holding the table's title, if there is one. Each tag in the XML has an object identifier (OID) for uniquely identifying the different nodes. CategoryNodes contain all of the labels found in the table. The CategoryParentNodes section captures the tree structure of the labels in each dimension. The DataCells section contains all of the data values in the table as well as references back to the labels that give the values a meaningful context. The final section, Augmentations, describes all of the augmentations found in the table, which can include row, column, data, or table augmentations such as footnotes, values in labels (like the value 2000 in Figure 1), and units of measure.
. . .
. . .
Figure 2. XML version of canonicalized table.
Figure 3 shows a graphical representation of the canonicalized table in Figure 2. Each dimension of the table forms a tree structure, with the depth of the tree determined by how many levels of label nesting exist in the dimension. The second dimension in the canonicalized table has no label value, so a placeholder label of [Dimension2] is used. Each label in the dimension represents a node in the tree and connects to other tree nodes using a solid black line. Data values, at the bottom of the figure, connect to one node from each dimension using a dashed line. The dotted line connecting the Population node and the value 2000 indicates that 2000 is an augmentation of the Population node. The title of the table is also captured and marked as such.
Figure 3. Graphical view of canonicalized sample table.
Concept/Value Recognition
MOGO extracts concepts from a canonicalized table using a set of concept recognition algorithms and assigns the appropriate data values to those concepts. Each concept recognition algorithm conforms to a standard interface, making it easy to augment MOGO with additional heuristic algorithms. MOGO implements five concept recognition algorithms and executes them in order until every piece of the canonicalized table is recognized as either a concept or a value. Each algorithm marks the items it recognizes, and subsequent algorithms evaluate only unmarked items; once all items are marked, MOGO skips any remaining algorithms.
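The marking loop just described might look like the following sketch, in which two toy recognizers stand in for MOGO's actual heuristic algorithms:

```python
def run_recognizers(items, algorithms):
    """Run each recognizer in order; later recognizers only see
    items that earlier recognizers left unmarked."""
    marked = {}  # item -> "concept" or "value"
    for algo in algorithms:
        unmarked = [i for i in items if i not in marked]
        if not unmarked:
            break  # everything recognized; skip remaining algorithms
        marked.update(algo(unmarked))
    return marked

# Toy recognizers (illustrative only): one marks numeric strings as
# values, and a fallback marks whatever remains as concepts.
numeric = lambda items: {i: "value" for i in items if i.replace(",", "").isdigit()}
fallback = lambda items: {i: "concept" for i in items}

result = run_recognizers(["Population", "817,376"], [numeric, fallback])
```

The standard interface (a callable from unmarked items to markings) is what allows new heuristics to be added without touching the driver loop.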
A concept is synonymous with an object set in the OSM data modeling language. According to OSM, an object set identifies a group of objects or values [5]. Object sets, either lexical or non-lexical, are the ontological elements representing the different concepts found in a table. A lexical object set is one whose members are printable and represent themselves (e.g., telephone numbers, names of companies). In OSM a lexical object set is visually represented by a box with a dashed border. A non-lexical object set's members are object identifiers that are non-printable (e.g., identifiers that stand for persons or companies). In OSM a non-lexical object set is visually represented by a box with a solid border.
The first algorithm uses lexical clues to determine to which dimension labels the table's data values belong. MOGO uses Wordnet, an electronic lexical database [6], to compare each data value to its corresponding dimension labels. A data value is said to belong to a label if the data value is a hyponym of at least one of the label's senses and is not a hyponym of any other dimension label associated with that data value. If the majority of the data values associated with a dimension label belong to that label as hyponyms, MOGO flags the label as a potential object set. After evaluating all the dimension labels, if all the flagged labels belong to the same dimension, MOGO marks all of the labels in that dimension as lexical object sets and associates the corresponding data values with those object sets. Otherwise, MOGO clears the flags and proceeds to the next algorithm.
The second concept recognition algorithm also uses Wordnet, but in this case the objective is to determine whether a label is an instance of its parent label. Each dimension has at least one label, referred to as the root label. Below that, a dimension can contain several levels of label nesting. Beginning with the labels directly under the dimension's root label, MOGO uses Wordnet to look up each unmarked label in a dimension and retrieve that label's list of inherited hypernyms. The inherited hypernym list in Wordnet includes a word's direct hypernyms; it also recursively includes the hypernyms of each hypernym until a word with no hypernyms is reached. A label is said to be an instance of its parent label if either the parent label, or the name of the object set the parent label is assigned to, is found in the label's inherited hypernym list. If the majority of the labels at one level of label nesting are instances of that level's parent label, MOGO marks all the labels at that level as values, creates an unnamed object set, and assigns the values to the object set. MOGO evaluates each succeeding level of label nesting in like manner until the leaf labels have been evaluated.
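The instance-of test can be sketched as follows. A small hand-coded hypernym map stands in for Wordnet here; the real system queries Wordnet for the inherited hypernym lists:

```python
# Hand-coded stand-in for Wordnet's hypernym links (illustrative only).
HYPERNYMS = {
    "delaware": ["american_state"],
    "american_state": ["state"],
    "state": ["region"],
    "region": ["location"],
}

def inherited_hypernyms(word):
    """Collect direct hypernyms, then recursively the hypernyms of each
    hypernym, until words with no hypernyms are reached."""
    result = []
    for h in HYPERNYMS.get(word, []):
        result.append(h)
        result.extend(inherited_hypernyms(h))
    return result

def is_instance_of(label, parent_label):
    """A label is an instance of its parent if the parent appears in the
    label's inherited hypernym list."""
    return parent_label in inherited_hypernyms(label)
```

In the sample table, this is the test that lets MOGO treat Delaware as an instance of a label higher in the Location dimension.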
MOGO provides a name-finding service, available at each step of the process, for assigning names to unnamed concepts. Titles, footnotes, captions, and augmentations can contain words that are helpful for naming unnamed concepts. The combined set of words from these sources forms a pool of possible concept names. Given an unnamed concept, MOGO retrieves the inherited hypernym list of each value assigned to the concept, compares the list with each of the words in the naming pool, and assigns the concept a name if one of the words in the pool is a direct match to a word in the hypernym list. MOGO uses this name-finding service to find an appropriate name for any unnamed object sets produced by the previous algorithm.
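A sketch of the name-finding service follows, again with a hand-coded stand-in for Wordnet's inherited hypernym lists (the map and pool contents are illustrative):

```python
# Stand-in for Wordnet: each value's full inherited hypernym list.
HYPERNYMS = {
    "oregon": ["american_state", "state", "region", "location"],
    "maine": ["american_state", "state", "region", "location"],
}

def find_name(values, naming_pool):
    """Return the first word from a value's hypernym list that also
    appears in the pool of words drawn from titles, footnotes,
    captions, and augmentations; None leaves the concept unnamed."""
    for value in values:
        for word in HYPERNYMS.get(value, []):
            if word in naming_pool:
                return word
    return None

# Pool drawn from the sample table's title "Region and State Information".
name = find_name(["oregon", "maine"], {"region", "state", "information"})
```

This mirrors how the object sets in the sample table end up named Region and State from words found in the table's title.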
Data frames provide a mechanism for recognizing different types of objects from strings of data using regular-expression recognizers [4]. The third concept recognition algorithm takes each unmarked dimension label and attempts to classify, using data frames, all the data values associated in a row or column with that label. If all the data values in a row or column have the same type, MOGO temporarily associates that type with the dimension label. After MOGO classifies all the labels for a dimension, if there are at least two labels in the dimension of different types, MOGO flags all of the labels in the dimension as lexical object sets and associates the corresponding data values with the object sets. Requiring two different types avoids misidentifying object sets in a table uniformly populated by data of the same type, such as a table full of percentages or of currency values.
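A minimal sketch of data-frame classification follows; the two regular-expression recognizers are illustrative stand-ins, not MOGO's actual data frame library:

```python
import re

# Each data frame pairs a type name with a regular-expression recognizer.
DATA_FRAMES = [
    ("GeographicCoordinate", re.compile(r"^-?\d{1,3}$")),
    ("Population", re.compile(r"^\d{1,3}(,\d{3})+$")),
]

def classify(value):
    """Return the name of the first data frame whose recognizer matches."""
    for name, pattern in DATA_FRAMES:
        if pattern.match(value):
            return name
    return None

def column_type(values):
    """A label is typed only if every value in its row/column agrees."""
    types = {classify(v) for v in values}
    return types.pop() if len(types) == 1 else None
```

With at least two differently typed labels in a dimension (e.g., Population versus Latitude), the dimension's labels can be flagged as object sets.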
The fourth concept recognition algorithm tries to identify concepts among sibling labels. MOGO first classifies each unmarked dimension label using data frames. For each set of sibling labels that have the same data frame classification, MOGO marks the labels as values, creates an object set, names the object set with the data frames name, and associates the sibling labels with the new object set.
If none of the prior algorithms successfully mark all items in the canonicalized table as object sets or values, MOGO defaults to marking all of the unmarked labels in the last dimension as lexical object sets. If the data values associated with those labels are currently unassigned, MOGO assigns the data values to the newly created object sets. For any unmarked labels in the remaining dimensions, MOGO groups the labels that are at the same level of nesting in each dimension, treats the labels as values, creates unnamed object sets for each group of labels, associates the values with the newly created object sets, and uses the name finding service to find appropriate names for the object sets. For any remaining data values that are not currently assigned to an object set, MOGO creates a new unnamed object set and assigns the values to that object set.
Figure 4 shows the object sets MOGO identifies in our sample table. The second algorithm creates object sets out of each level of label nesting in the Location dimension. The inherited hypernym list for each label in the Location dimension contains either the parent label, or the name of the object set to which the parent label is assigned. MOGO marks the labels at each level of nesting as values, creates unnamed lexical object sets for each level of nesting, and assigns the values from each level to the corresponding object set. The naming service finds names for these object sets in the title of the table and assigns the names Region and State to the two unnamed object sets. The final algorithm creates lexical object sets for each of the labels in the [Dimension2] dimension because it is the final dimension in the table and none of its labels are marked by any of the previous algorithms. MOGO also assigns the associated data values, none of which are assigned to an object set by previous algorithms, to the newly created object sets.
Figure 4. Discovered object sets and value assignments.
Relationship Discovery
With all of the concepts identified and the values assigned to those concepts, MOGO next identifies all of the relationships that exist between the different concepts. MOGO adds relationship information to the object sets using a set of relationship discovery algorithms. Each relationship discovery algorithm conforms to a standard interface making it easy to augment MOGO with additional heuristic algorithms. MOGO implements four relationship discovery algorithms. We execute the algorithms in order, passing the newly discovered relationship information on to the next algorithm.
The first relationship discovery algorithm extracts relationship information from the different dimension trees. For each dimension, MOGO creates relationship sets between the object sets from that dimension anywhere an edge exists in the dimension trees. When labels at one level of nesting have been merged into a single object set, MOGO only creates one relationship set between the parent object set and the child object set. If sibling object sets (object sets coming from labels in the same level of label nesting) do not have any object sets higher in the tree to be related to, MOGO creates an object set of unknown type, labels it with the dimensions name (if there is one), and creates relationship sets between this new object set and each of the sibling object sets.
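The edge-walking step above can be sketched as follows, using a toy parent-to-children map that mirrors the Location dimension of the sample table:

```python
# Dimension tree as parent -> children (illustrative structure only).
tree = {
    "Location": ["Region"],  # root label down to the merged object sets
    "Region": ["State"],
    "State": [],
}

def relationship_sets(tree):
    """Create one relationship set for every parent-child edge."""
    return [(parent, child)
            for parent, children in tree.items()
            for child in children]
```

When a whole level of label nesting has been merged into one object set (as Region and State were), each tree level contributes a single edge, so only one relationship set is created per parent/child pair.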
Figure 5 shows the relationship sets MOGO adds between the different object sets. MOGO associates the Region and State object sets because they come from different levels of the same dimension. The Population, Latitude, and Longitude object sets are sibling object sets with no parent object set to associate with. MOGO creates an object set of unknown type, and associates the sibling object sets with the newly created object set. Object sets of unknown type are visually represented as a box with no border.
Figure 5. Relationship sets from dimension trees.
The next relationship discovery algorithm modifies the generated ontology relationship sets using lexical clues. Wordnet provides a way for MOGO to analyze object set labels and the sets of values associated with object sets to discover semantic relationship information such as hypernyms, hyponyms, holonyms, and meronyms. Hypernyms and hyponyms translate to generalization/specialization relationships (represented as an empty triangle). Holonyms and meronyms translate to aggregation relationships (represented as a filled-in triangle). MOGO looks for more specific relationship information by examining each object set involved in a relationship set to see if the labels or values in the two object sets contain any semantic relationship information. If they do, MOGO adjusts the relationship set by replacing it with an aggregation or generalization/specialization.
If aggregations are found between the different object sets from one dimension, MOGO looks for any generalization/specializations that might exist in the table. Using Wordnet, MOGO looks up the inherited hypernym list of each object set label participating in the aggregation. If the dimension's root label is in the majority of the inherited hypernym lists of the different object sets, MOGO creates a new lexical object set, labels it with the dimension's root label, and associates this new object set with each of the object sets that participate in the aggregation using generalization/specialization.
Figure 6 shows the sample table's ontology elements after MOGO modifies them using lexical clues. Using Wordnet, MOGO finds that Delaware is an instance of an American State, which is a hyponym of region. MOGO uses this information to create an aggregation from the Region object set to the State object set. Because the Region and State object sets come from the same dimension, MOGO checks whether the dimension's root label is in the inherited hypernym lists of those object sets. MOGO successfully finds the root label (Location) in the inherited hypernym lists, so it creates a non-lexical object set, names it using the dimension's root label, and associates this object set with the existing object sets using generalization/specialization.
Figure 6. Relationship sets after linguistic processing.
The third relationship discovery algorithm uses a library of data frame recognizers to find relationship information in the canonicalized table's labels and augmentations. When labels match a particular data frame, MOGO modifies the generated object sets and relationships associated with those labels based on the ontology information associated with that data frame. For data frame matches in augmentations, MOGO creates the appropriate object sets for that data frame and forms n-ary relationship sets between the newly created object set and the object set associated with the augmentation.
Figure 7 shows the results of MOGO applying this algorithm. MOGO finds a data frame match on the Latitude and Longitude labels. This data frame contains information about geographical coordinates, and the corresponding ontology information found in the data frame is added. The second match is for the Population column augmentation (2000). MOGO extracts the numeric value, creates a singleton object with the value 2000, and creates a ternary relationship set among this new object, the Population object set, and the unnamed object set already related to the Population object set.
Figure 7. Relationship sets after data frame recognizers.
The final relationship discovery algorithm merges ontology fragments into a mini-ontology. Ontology fragments are made up of all of the ontology elements that are interconnected via some type of relationship set. There will always be at least one ontology fragment per dimension, but there may be more if relationships were not found between each of the object sets in a dimension. MOGO joins the ontology fragments by creating an n-ary relationship set between the ontology fragments' link-in points. An ontology fragment's link-in point is the object set in the fragment that came from the highest-level label or labels in the dimension. If one of the link-in points is a placeholder object set and there is only one other ontology fragment, MOGO removes the placeholder object set and the n-ary relationship set, and transfers all of the removed object set's relationships to the remaining ontology fragment's link-in point.
In our sample table, there are two ontology fragments. Figure 8 shows the result of merging the two sample ontology fragments into a mini-ontology. MOGO removes the placeholder object set from the one ontology fragment because there is only one other ontology fragment. MOGO then assigns the orphaned relations to the link-in point of the other ontology fragment (the Location object set).
Figure 8. Mini-ontology results from fragment merge.
Constraint Discovery
MOGO adds constraints to the mini-ontology using a set of constraint discovery algorithms. Each constraint discovery algorithm conforms to a standard interface making it easy to augment MOGO with additional heuristic algorithms. MOGO implements four constraint discovery algorithms.
The first constraint discovery algorithm adds constraints to generalization/specialization relationships that exist in the mini-ontology. A generalization/specialization relationship can be constrained to be a union, a mutual exclusion, or a partition. MOGO constrains a generalization/specialization relationship to be a union (represented as an empty triangle containing a U) if every value in the generalization object set also appears in at least one of the specialization object sets. MOGO adds a mutual-exclusion constraint (represented as an empty triangle containing a +) if there is no overlap among the values in the specialization object sets. When the generalization/specialization satisfies both the union and mutual-exclusion conditions, MOGO assigns a partition constraint (represented as an empty triangle containing both a U and a +) to the relationship.
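These three checks reduce to set operations on the values assigned to each object set, as in this sketch (the value sets below come from the sample table):

```python
def constrain(generalization, specializations):
    """Classify a generalization/specialization as a union, a mutual
    exclusion, a partition (both), or unconstrained (None)."""
    # Union: every generalization value appears in some specialization.
    union = generalization <= set.union(*specializations)
    # Mutual exclusion: no overlap among the specialization value sets.
    disjoint = all(a.isdisjoint(b)
                   for i, a in enumerate(specializations)
                   for b in specializations[i + 1:])
    if union and disjoint:
        return "partition"
    if union:
        return "union"
    if disjoint:
        return "mutual exclusion"
    return None

location = {"Northeast", "Northwest", "Delaware", "Maine"}
result = constrain(location, [{"Northeast", "Northwest"}, {"Delaware", "Maine"}])
```

For the sample table, the Region and State value sets cover all Location values and do not overlap, which is why a partition constraint is assigned.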
The second constraint discovery algorithm looks for any computed values in the table. Tables often include columns or rows that contain the summation, average, or other aggregates of values in the table. MOGO examines the values related to object sets that come from dimensions with label nesting. By computing aggregates of the values from related object sets and comparing them to given values to test whether the aggregates hold, MOGO captures these constraints and adds them as annotations to the mini-ontology.
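A sketch of the aggregate check follows, using the population figures from the sample table (only summation is shown; the real algorithm also tests averages and other aggregates):

```python
# Values related to a parent label and to its nested child labels.
region_population = {"Northeast": 2122869}
state_population = {"Delaware": 817376, "Maine": 1305493}

def is_summation(parent_value, child_values):
    """Test whether a parent's value is the sum of its children's values;
    if so, the relationship is captured as a computed-value constraint."""
    return parent_value == sum(child_values)
```

When the test holds, MOGO records the constraint as an annotation on the mini-ontology rather than as a structural element.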
The third constraint discovery algorithm looks for functional relationship sets. A functional relationship set (represented by an arrowhead on the range side of a relationship-set line) exists when an object in one object set (the domain object set) maps to at most one object in another object set (the range object set). Each of the data values in a table is functionally determined by the set of dimension labels associated with those values. MOGO identifies the object sets that contain the table's data values and marks the relationship sets coming into those object sets as functional. Object sets whose assigned values are dimension labels are handled separately: MOGO evaluates each of these object sets to see if the values assigned to the object set functionally map to the values assigned to any related object sets. If so, MOGO marks the relationship set as functional.
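The functional test on dimension-label values can be sketched as follows, using the State-to-Region pairs from the sample table:

```python
def is_functional(pairs):
    """A relationship set is functional if each domain value maps to at
    most one range value across all observed (domain, range) pairs."""
    mapping = {}
    for domain, rng in pairs:
        if mapping.setdefault(domain, rng) != rng:
            return False  # one domain value maps to two range values
    return True

# State -> Region pairs drawn from the sample table of Figure 1.
pairs = [("Delaware", "Northeast"), ("Maine", "Northeast"),
         ("Oregon", "Northwest"), ("Washington", "Northwest")]
```

Here each state maps to exactly one region, so the State-to-Region relationship set is marked functional; the reverse direction is not, since a region maps to several states.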
The final constraint discovery algorithm determines if objects in an object set participate mandatorily or optionally in associated relationship sets. Optional participation is represented in OSM as an o placed on the object sets connection point to a relationship set line. MOGO identifies object sets whose objects have optional participation in relationship sets by considering empty value cells in the canonicalized table. MOGO determines where the non-existing values should be and marks participation in any relationship sets between one of these object sets and any other object set as optional.
Figure 9 shows the results of the constraint discovery algorithms. MOGO determines that there are no values assigned to the Location object set that are not also assigned to the Region or State object sets. The values in the Region and State object sets are also found to be mutually exclusive. Thus, MOGO assigns a partition constraint to the generalization/specialization relationship in the mini-ontology. Looking for possible aggregate values, MOGO determines that the population values related to the Region object set values are the summation of the population values related to the State object set. MOGO thus adds the constraint Region.Population = sum(Population);Region to the mini-ontology. The Population, Latitude, and Longitude object sets contain the data values from the canonicalized table, so MOGO marks the relationship sets coming into these object sets as functional. The Region and State object sets contain values from dimension labels. Because the values assigned to the State object set functionally determine the values assigned to the Region object set, MOGO marks the relationship set from the State object set to the aggregation connecting it to the Region object set as functional. The canonicalized table contains four empty data cells. These non-existing values belong to the Longitude and Latitude object sets. MOGO thus marks participation of object sets in any relationship sets coming into either of these object sets as optional. Because these object sets were replaced by an ontology fragment associated with a data frame, MOGO marks the connection between the Location object set and the relationship set that comes into the Geographical Coordinates object set as optional.
Figure 9. Final mini-ontology produced by MOGO.
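The aggregate-value check described above (which yields the Region.Population = sum(Population);Region constraint) can be sketched as a simple sum comparison. The function name and the grouped-data layout are assumptions for illustration only; they do not reflect MOGO's internal data structures.

```python
# Hedged sketch of the aggregate-value heuristic: a parent object set's
# values are aggregates if each parent value equals the sum of the
# values of its associated child object-set members.

def is_aggregate(parent_values, child_values_by_parent):
    """True if every parent value is the sum of its children's values."""
    return all(
        parent_values[p] == sum(child_values_by_parent[p].values())
        for p in parent_values
    )

# Illustrative Region/State population data.
region_pop = {"West": 300}
state_pops = {"West": {"Utah": 100, "Nevada": 200}}
assert is_aggregate(region_pop, state_pops)
```

When the check succeeds for every parent value, MOGO can safely add the corresponding summation constraint to the mini-ontology.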
Validation
We plan to evaluate MOGO using a test set of tables found on the Internet by a third-party participant. We will ask the participant to capture the URL and HTML source code of twenty different web pages that contain tables. Because tables can vary drastically in form and complexity, we will ask that the test tables meet the following criteria: the tables should come from at least three distinct sites; the tables should contain a mix of simple tables (one-dimensional with no label nesting) and complex tables (multi-dimensional with or without label nesting); and the tables should come from a common domain.
MOGO will generate a mini-ontology for each canonicalized table. We will evaluate generated mini-ontologies using the following criteria:
Concept/Value Recognition. Every table has a fixed number of concepts to which the data values belong. We will observe how many concepts MOGO correctly identifies, how many it misses, and how many concepts it identifies that are not really concepts. We will also observe how many data values are assigned to the correct concept, and how many are incorrectly assigned.
Relationship Discovery. We will evaluate relationship discovery by observing how many valid relationship sets MOGO discovers, how many it discovers that are invalid, and how many valid relationship sets MOGO does not discover.
Constraint Discovery. We will evaluate constraint discovery by observing how many valid constraints MOGO discovers, how many invalid constraints it discovers, and how many valid constraints MOGO does not discover.
From these observations, we will be able to report MOGO's accuracy in terms of precision and recall.
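The precision and recall computations for each of the three criteria above reduce to the same pattern. The following is a minimal sketch; the variable names and sample relationship-set labels are illustrative assumptions, not data from the planned experiment.

```python
# Precision: fraction of discovered items that are valid.
# Recall: fraction of valid items that were discovered.
# 'discovered' is what MOGO reports; 'valid' is the expert gold standard.

def precision_recall(discovered, valid):
    true_positives = len(discovered & valid)
    precision = true_positives / len(discovered) if discovered else 1.0
    recall = true_positives / len(valid) if valid else 1.0
    return precision, recall

# Illustrative relationship-discovery results.
discovered = {"Region-State", "State-Population", "State-Color"}
valid = {"Region-State", "State-Population", "State-Latitude"}
print(precision_recall(discovered, valid))  # 2/3 precision, 2/3 recall
```

The same computation applies to concept/value recognition and constraint discovery, with the sets holding concepts or constraints instead of relationship sets.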
It is necessary to point out that when building ontologies there is often no single right answer. For any given data set there can be multiple ontologies that are valid conceptualizations of it. For that reason, the evaluation must be done manually by a trained expert in the field of data conceptualization.
Thesis Schedule
Milestone                               Deadline
High Level Design                       October 2007
Implementation/Coding                   January 2008
Validation                              January 2008
Submit Thesis to Advisor                February 2008
Submit Thesis to Committee Members      February 2008
Thesis Defense                          March 2008
Annotated Bibliography
1. Benslimane, S.M., Benslimane, D., Malki, M.: Acquiring OWL Ontologies from Data-intensive Web Sites. Proceedings of the 6th International Conference on Web Engineering, ACM Press, Palo Alto, California, USA (2006)
Interesting paper about the use of HTML forms in conjunction with database schemas to reverse engineer an ontology. Shares a similar end goal with our approach but has very different input data and processing methodology.
2. Canfora, G., Di Penta, M.: New Frontiers of Reverse Engineering. International Conference on Software Engineering-Future of Software Engineering, IEEE Computer Society, Washington, DC (2007)
Survey paper reviewing current approaches in reverse engineering for generating ontologies. Helpful for discovering current reverse engineering techniques for ontology creation that might be applicable to our approach.
3. Cimiano, P.: Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer, New York, New York (2006)
Provides an extensive review of recent natural language processing algorithms for learning ontologies from unstructured text documents. Was helpful in discovering what was currently being done in that field that might be helpful for our solution and to avoid duplicating work that has already been done.
4. Embley, D.: Programming with Data Frames for Everyday Data Items. National Computer Conference, Anaheim, California (1980)
Data frames provide a nice way of extracting information about standard data types. Beyond being able to recognize something as a string or a number, data frames can recognize complex types of data such as geographical coordinates. MOGO uses data frames to enhance the growing mini-ontology with more specific relationships based on the recognition of more complex types of data.
5. Embley, D.W., Kurtz, B.D., Woodfield, S.N.: Object-oriented Systems Analysis: A Model-driven Approach. Prentice-Hall (1992)
Formally defines the data modeling language OSM. This is the data modeling language used in this paper for representing ontologies.
6. Fellbaum, C.: WordNet: An Electronic Lexical Database. Bradford Books (1998)
WordNet is an electronic lexical database that can be used to discover relationships between words as well as determine which words are derivatives of other words. MOGO uses WordNet in its heuristics to discover semantics about concepts, relationships, and data.
7. Kifer, M., Lausen, G., Wu, J.: Logical Foundations of Object-oriented and Frame-based Languages. Journal of the Association for Computing Machinery 42 (1995) 741-843
Provides a detailed description of F-Logic frames as an alternate form for representing an ontology. Article was helpful in understanding alternate data modeling languages.
8. Pivk, A., Sure, Y., Cimiano, P., Gams, M., Rajkovic, V., Studer, R.: Transforming Arbitrary Tables into Logical Form with TARTAR. Data & Knowledge Engineering 60 (2007) 567-595
Describes an alternative approach for processing tabular data on the web to generate an ontology. While the goals of this approach are quite similar to MOGO, the implementation is significantly different.
9. Tijerino, Y.A., Embley, D.W., Lonsdale, D.W., Ding, Y., Nagy, G.: Toward Ontology Generation from Tables. World Wide Web: Internet and Web Information Systems 8 (September 2004) 251-285
The TANGO project is the source of an NSF grant that deals with the development of a process for automatically generating a growing ontology from a group of previously known data tables. This project fulfills the second piece of the overall project by providing a way to take a canonicalized table of data and convert it into a mini-ontology which can then be used to build the growing ontology.
10. Wang, X.: Tabular Abstraction, Editing and Formatting. PhD Dissertation, Department of Computer Science, University of Waterloo (1996)
Describes a process for representing any table of data in a normalized form. This is helpful in this project because it greatly reduces the type and format of input that MOGO must deal with.
Brigham Young University
Graduate Committee Approval
of a thesis proposal submitted by
Stephen Lynn
This thesis proposal has been read by each member of the following graduate committee and by majority vote has been found to be satisfactory.
_______________________________     _______________________________
Date                                David W. Embley, Chair

_______________________________     _______________________________
Date                                Deryle Lonsdale

_______________________________     _______________________________
Date                                Dan Ventura

_______________________________     _______________________________
Date                                Parris K. Egbert
Graduate Coordinator
We refer to the output of MOGO as a mini-ontology. While the output is a valid ontology that represents the concepts, relationships, and constraints found in the original source table, it is "mini" because it represents only one table and because the overall goal is to combine many of these mini-ontologies into one large ontology.
We refer to the input to MOGO as a canonicalized table. Tables are canonical when they conform to a standard set of rules as defined by an XML-Schema document.