Year of Publication
The amount of information available on the web is growing constantly. As a result, theproblem of retrieving any desired information is getting more difficult by the day. Toalleviate this problem, several techniques are currently being used, both for locatingpages of interest and for extracting meaningful information from the retrieved pages.Information extraction (IE) is one such technology that is used for summarizingunrestricted natural language text into a structured set of facts. IE is already being appliedwithin several domains such as news transcripts, insurance information, and weatherreports. Various approaches to IE have been taken and a number of significant resultshave been reported.In this thesis, we describe the application of IE techniques to the domain of universityweb pages. This domain is broader than previously evaluated domains and has a varietyof idiosyncratic problems to address. We present an analysis of the domain of universityweb pages and the consequences of having them input to IE systems. We then presentUniversityIE, a system that can search a web site, extract relevant pages, and processthem for information such as admission requirements or general information. TheUniversityIE system, developed as part of this research, contributes three IE methods anda web-crawling heuristic that worked relatively well and predictably over a test set ofuniversity web sites.We designed UniversityIE as a generic framework for plugging in and executing IEmethods over pages acquired from the web. We also integrated in the system a genericweb crawler (built at the University of Kentucky) and ported to Java and integrated anexternal word lexicon (WordNet) and a syntax parser (Link Grammar Parser).
Janevski, Angel, "UniversityIE: Information Extraction From University Web Pages" (2000). University of Kentucky Master's Theses. 217.