Research Project: Choose, investigate, and report on a research topic
pertaining to the problem of merging in information integration. See project specification
- Information Extraction (IE)
[Eikvil99] Line Eikvil. Information Extraction
from World Wide Web: A Survey. Norwegian Computing Center Technical Report 945,
1999.
[Muslea99] Ion Muslea. Extraction patterns
for information extraction tasks: a survey.
Proceedings of the American Association for Artificial Intelligence (www.aaai.org), 1999.
- IE from online documents
- Ontology based IE
[ECJL+99] D.W. Embley, E.M. Campbell, Y.S. Jiang, S.W. Liddle, D.W. Lonsdale,
Y.-K. Ng, and R.D. Smith.
Conceptual-Model-Based Data Extraction from Multiple-Record Web Documents,
Data and Knowledge Engineering, November 1999.
- Learning-based IE
[Freitag98] Dayne Freitag. Information
extraction from HTML: application of a general machine learning approach.
American Association for Artificial Intelligence (www.aaai.org), 1998.
[CM97] Califf & Mooney. Relational learning
of pattern match rules for information extraction.
ACL97 workshop on natural language learning, 1997.
[Soderland99] S. Soderland. Learning information extraction rules for semi-structured
and free text. Journal of Machine Learning, 1999.
- Wrapper generation
- Hard coded
[HGNY+97] Joachim Hammer, Hector Garcia-Molina, Svetlozar Nestorov, Ramana
Yerneni, Marcus Breunig, and Vasilis Vassalos.
Template-based wrappers in the TSIMMIS system.
In Proceedings of the Twenty-Third ACM SIGMOD International Conference on
Management of Data, pages 532-535, Tucson, Arizona, 1997.
- Learning based
[KWD97] N. Kushmerick, D. Weld, and R. Doorenbos.
Wrapper induction for information extraction.
Proceedings of the 15th International Joint Conference on Artificial Intelligence
(IJCAI-97), pages 729-735, 1997.
[Hsu98] Chun-Nan Hsu. Initial results on wrapping
semistructured web pages with finite-state transducers and contextual rules.
In the 1998 Workshop on AI and Information Integration, Madison, Wisconsin, July 26-27, 1998.
AAAI Press.
[MMK99] I. Muslea, S. Minton, and C. Knoblock.
A Hierarchical Approach to Wrapper Induction.
In Proceedings of the 3rd Conference on Autonomous Agents, 1999.
[BLP01] David Buttler, Ling Liu, and Calton Pu. A Fully Automated Object
Extract System for the Web. In Proceedings of the 21st International Conference on
Distributed Computing Systems (ICDCS-21), April 16-19, 2001, Phoenix, Arizona, USA. IEEE Computer Society (IEEE Press).
- Union free regular expression
[CMM01] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. ROADRUNNER:
Towards Automatic Data Extraction from Large Web Sites. Proceedings of the 27th VLDB Conference, September 11-14, 2001, Roma, Italy.
- Wrapper Maintenance
[LM00] K. Lerman and S. Minton. Learning the Common Data Structure of Data.
In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-2000), Menlo Park, July 26-30, 2000.
AAAI Press.
- IE from free text
- Information Integration (II)
- Theoretical Views
- Global as View (GaV)
[GPQR+97] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman,
Y. Sagiv, J.D. Ullman, V. Vassalos, and J. Widom.
The TSIMMIS Approach to Mediation: Data Models and Languages.
Journal of Intelligent Information Systems, Vol. 8, No. 2, pages 117-132,
1997.
[Hammer99] J. Hammer. The Information
Integration Wizard (Iwiz) Project.
Report on Work in Progress, Univ. of Florida, Technical Report TR99-019, Gainesville,
FL 32611-6120.
- Local as View (LaV)
[RLO96] A. Rajaraman, A.Y. Levy and J.J. Ordille.
Querying heterogeneous information sources using source descriptions.
In 22nd International Conference on Very Large Data Bases, pages
251-262, India, September 1996.
[DG97] O.M. Duschka and M.R. Genesereth. Infomaster—
An Information Integration Tool. In the Proceedings of the International Workshop on Intelligent
Information Integration, Freiburg, Germany, Sep. 1997.
[BE02] J. Biskup and D.W. Embley. Extracting
information from heterogeneous information sources using ontologically specified
target views,
Information Systems, 2002.
[LMSS95] A. Levy, A.O. Mendelzon, Y. Sagiv, and D. Srivastava.
Answering Queries Using Views. PODS 95.
[MLF00] T. Millstein, A. Levy, and M. Friedman. Query containment for
data integration systems. In the 19th ACM Symposium on Principles of Database Systems,
Dallas, Texas, May 15-17, 2000.
- GaV vs. LaV
[Ullman97] J.D. Ullman. Information
Integration Using Logical Views.
In Proceedings of the 6th International Conference on Database
Theory (ICDT-97), Lecture Notes in Computer Science, pages 19-40, Springer-Verlag,
1997.
[CGL01] A. Cali, G.D. Giacomo, and M. Lenzerini. Models for information integration:
turning local-as-view into global-as-view. In Proceedings of International
Workshop on Foundations of Models for Information Integration (10th Workshop in the series Foundations of Models
and Languages for Data and Objects), 2001.
- Procedures
- Source discovery
- Information retrieval (IR)
[Chakrabarti99] S. Chakrabarti. Recent
results in automatic Web resource discovery,
ACM Computing Surveys, Vol. 31, No. 4es, December 1999.
- Text classification
[BM98] L.D. Baker and A.K. McCallum.
Distributional Clustering of Words for Text Classification, SIGIR'98.
[Mccallum] Rainbow. http://www-2.cs.cmu.edu/~mccallum/bow/rainbow/
- Crawling the hidden Web
[Liddle02] S.T. Yau.
Extracting Data Behind Web Forms,
Technical Report, Brigham Young University, Provo, February 2002.
[RG01] S. Raghavan and H. Garcia-Molina.
Crawling the hidden Web.
Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.
- Ontology-based source discovery
[Tang00] June Tang. A Probabilistic Model for Classification of Multiple-Record Web Documents.
[ENX01] D.W. Embley, Y.-K. Ng, and L. Xu.
Recognizing Ontology-Applicable Multiple-Record Web Documents,
to appear in ER2001.
[Wang01] Q. Wang. Ontology-Based Binary Categorization of Multiple-Record
Web Documents Using a Probabilistic Retrieval Model, Master of Science
thesis, Brigham Young University, Provo, Utah, 2001.
- Schema mapping
- Survey
[RB01] E. Rahm and P.A. Bernstein.
A survey of approaches to automatic schema matching.
The VLDB Journal 10(4), pages 334-350, 2001.
- Rule based
[LNE89] J.A. Larson, S.B. Navathe and R. Elmasri. A theory of attribute
equivalence in databases with applications to schema integration. IEEE
Transactions on Software Engineering. 15(4), pages 449—463, 1989.
[KS96] V. Kashyap and A.P. Sheth. Semantic
and schematic similarities between database objects: a context-based approach.
VLDB Journal: Very Large Data Bases, 5(4), pages 276-304, 1996.
[MZ98] T. Milo and S. Zohar. Using Schema
Matching to Simplify Heterogeneous Data Translation. In Proceedings of 24th
International Conference on Very Large Data Bases, pages 122-133, 1998.
[CGLN+99] D. Calvanese, G.D. Giacomo, M. Lenzerini, D. Nardi, and
R. Rosati. A Principled Approach to Data
Integration and Reconciliation in Data Warehousing. In Proceedings of
the International Workshop on Design and Management of Data Warehouses (DMDW'99),
Heidelberg, Germany, June 14-15, 1999.
[MWJ99] P. Mitra, G. Wiederhold, and J. Jannink.
Semi-automatic Integration of Knowledge Sources. In Proceedings of Fusion'99,
Sunnyvale, USA.
[MHH00] R. Miller, L. Haas, and M. Hernandez.
Schema Mapping as Query Discovery.
In Proceedings of VLDB, September 10-14, Cairo, Egypt, 2000.
[MBR01] J. Madhavan, P.A. Bernstein, and E. Rahm.
Generic Schema Matching with Cupid.
In Proceedings of VLDB, September 11-14, Roma, Italy, 2001.
- Learning based
[LC94] W. Li, and C. Clifton. Semantic Integration in Heterogeneous Databases Using
Neural Networks. In Proceedings of 20th International Conference on Very
Large Databases, Pages 1—12, Santiago, Chile, September 12-15, 1994.
[DDH01] A. Doan, P. Domingos, A. Halevy.
Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach.
In Proceedings of the ACM SIGMOD Conference on Management of Data,
May 21-24, Santa Barbara, California, 2001.
- Data merging
- Arbitration
[Revesz95] P.Z. Revesz. On the
Semantics of Arbitration. International Journal of Algebra and Computation, Vol. 7,
Issue 2, pages 133—160, 1997.
[KP97] S. Konieczny and R. Pino-Perez.
On the Logic of Merging,
KR’98.
- Data mining/Machine learning
[HS97] M. Hernandez and S. Stolfo.
Real-World Data is Dirty: Data Cleansing and The Merge/Purge Problem.
Journal of Data Mining and Knowledge Discovery, 1997.
[TKM01] S. Tejada, C.A. Knoblock and S. Minton.
Learning Object Identification Rules for Information Integration.
Information Systems, pages 607-633, Vol. 26, Issue 8, Dec. 2001.
[FLMC01] W. Fan, H. Lu, S.E. Madnick and D. Cheung. Discovering and
Reconciling Value Conflicts for Numerical Data Integration. Information
Systems, pages 635-656, Vol. 26, Issue 8, Dec. 2001.
[LLL01] W. Low, M. Lee, and T. Ling. A Knowledge-based Approach for Duplicate
Elimination in Data Cleaning. Information Systems, pages 585-606, Vol. 26,
Issue 8, Dec. 2001.
- Data quality analysis
[LM98] J. Lin, and A.O. Mendelzon. Merging Databases
Under Constraints. International
Journal of Cooperative Information Systems, Vol. 7, No. 1, pages 55-76, 1998.
[Embury01] S.M. Embury. Adapting Integrity Enforcement Techniques for Data
Reconciliation. Information Systems, pages 657—689, Vol. 26, Issue 8, Dec.
2001.
- Midterm Exam: This exam will cover material only from the following
papers:
- Final Exam: This exam
will cover topics presented in class either by the instructor
or class members and supported by readings found in the
literature read for the class. It will be comprehensive and will also include
the papers covered in the midterm.
- Based on total points, final grades will be calculated as follows:
A 93.3-100% A- 90-93.2%
B+ 86.7-89.9% B 83.3-86.6% B- 80-83.2%
C+ 76.7-79.9% C 73.3-76.6% C- 70-73.2%
D+ 66.7-69.9% D 63.3-66.6% D- 60-63.2%
E Below 60%
I, W, UW: given according to University Policies
- 1 May:
Conceptual-Model-Based Data Extraction from Multiple-Record Web Documents, Dr. Embley.
PowerPoint or PDF
BYU Onto Demo, Dr. Embley
Project 1 -- Start (don't procrastinate; spring term goes fast)
- 3 May:
Information Extraction
from World Wide Web A Survey, Li Xu.
PowerPoint or PDF
Extraction patterns
for information extraction tasks: a survey, Li Xu.
Automatically constructing
a dictionary for information extraction tasks, Li Xu.
PowerPoint or PDF
- 6 May:
Relational learning of pattern match rules for information extraction, Tim Chartrand.
PowerPoint
Information extraction from HTML: application of a general machine learning approach, David Marble.
PowerPoint
or
PDF Handouts
- 8 May:
Template-based wrappers in the TSIMMIS
system, Li Xu.
PowerPoint
Wrapper induction for information extraction,
Reema Al-Kamha.
PowerPoint
- 10 May:
Initial results on wrapping semistructured
web pages with finite-state transducers and contextual rules, Lars Olson.
PowerPoint
Learning information extraction rules for semi-structured and free text, Alan E. Wessman.
PowerPoint
A Hierarchical Approach to Wrapper
Induction, Tim Chartrand.
PowerPoint
- 13 May:
A Fully Automated Object
Extract System for the Web, David Marble. PowerPoint
Learning the Common Data Structure of
Data, Jeff Roth. PowerPoint
Project 1 Due
- 15 May:
ROADRUNNER:
Towards Automatic Data Extraction from Large Web Sites, Li Xu. PowerPoint
Recognizing Ontology-Applicable Multiple-Record Web Documents, Dr. Embley.
PowerPoint
Project 2
- 17 May:
Information Retrieval Models and Bayesian Learning, Li Xu, PowerPoint
Project 1 Presentations
- 20 May:
Information Retrieval Models and Bayesian Learning (cont.), Li Xu, PowerPoint
Recent
results in automatic Web resource discovery, Cui Tao
PowerPoint
Distributional
Clustering of Words for Text Classification, Yihong Ding
PowerPoint
- 22 May:
Extracting Data Behind Web Forms, Helen Chen
PowerPoint
Crawling the hidden Web, Joe Zhou
PowerPoint
A Probabilistic Model for Binary Categorization of Multiple-Record Web Documents, Li Xu, PowerPoint
- 24 May:
Ontology-Based Binary Categorization of Multiple-Record Web Documents Using a Probabilistic Retrieval Model, Li Xu, PowerPoint
Information Integration Using Logical
Views, Li Xu, PowerPoint
- 27 May:
Memorial Day Holiday -- No class
- 29 May:
The TSIMMIS Approach to Mediation: Data Models and Languages, Craig Parker, PowerPoint
The Information
Integration Wizard (Iwiz) Project, Muhammed Al-Muhammed, PowerPoint
Querying heterogeneous information sources using source descriptions, Yihong Ding, PowerPoint
Infomaster—
An Information Integration Tool, Cui Tao, PowerPoint
Answering Queries Using Views, Li Xu, PowerPoint
Models for information integration:
turning Local-as-view into global-as-view, Li Xu, PowerPoint
- 31 May:
No class
Project 2 Due
- 3 June:
Midterm Exam
Discovering Direct and Indirect Matches for Schema Elements, Li Xu, PowerPoint
Project 3, PowerPoint
- 5 June:
Project 2 Discussion, PowerPoint
Extracting
information from heterogeneous information sources using ontologically specified
target views, Dr. Embley,
PowerPoint
- 7 June:
A survey of approaches to automatic schema matching, Li Xu, PowerPoint
A theory of attribute
equivalence in databases with applications to schema integration, Reema Al-Kamha, PowerPoint
- 10 June:
Using Schema
Matching to Simplify Heterogeneous Data Translation, Craig Parker,
PowerPoint
A Principled Approach to Data
Integration and Reconciliation in Data Warehousing, Alan Wessman,
PowerPoint
Generic Schema Matching with Cupid, Li Xu, PowerPoint
- 12 June:
Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach, Li Xu, PowerPoint
Semantic Integration in Heterogeneous Databases Using
Neural Networks, Jeff Roth,
PowerPoint
Schema Mapping as Query Discovery, Helen Chen,
PowerPoint
- 14 June:
On the
Semantics of Arbitration, Lars Olson,
PowerPoint
On the Logic of Merging, Muhammed Al-Muhammed,
PowerPoint
"Research Project"/Final Exam, Due June 20th
Project 3 Due
- 17 June:
Learning Object Identification Rules for Information Integration, Joe Zhou,
PowerPoint
Permitting Inconsistent Data in Data Storage, Lars Olson,
PowerPoint
Project 3 Discussion PowerPoint
- The newsgroup for this class is news:byu.class.s02.c-s.c652
- We will use the newsgroup for course announcements as well as a forum
for posting and answering questions.
Last modified on Monday, June 10, 2002 at 4:00 PM