Exploiting Hierarchical Information in Web Content Mining

Abstract

The Web continues to grow at a staggering rate as more information is added and as more users come online. Information overload has become a problem for users as well as for web service providers. A large number of web pages contain useful hierarchical information, which can be used to improve the utility of the additional information. An important form of such hierarchical information is the file tree structure, in which directories contain files and subdirectories. In this paper, we present methods for detecting and using hierarchical information to categorize and better represent the content of web pages.

Description

✨ Summary

The paper “Exploiting Hierarchical Information in Web Content Mining” was published in 1999 by Denis M. Sullivant, Vivek Jain, and Amin Vahdat. It discusses methods to leverage hierarchical information found in web structures to improve web content categorization and represent the content of web pages more effectively. This involves analyzing elements like the file tree structure inherent in many web directories, which aids in the automatic sorting and organization of web data.

The paper highlights techniques for detecting such hierarchical data and proposes strategies for using this information to address the issue of information overload on the internet.
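To make the idea of a file tree structure concrete, here is a minimal Python sketch. It is an illustration only, not the method from the paper: it assumes the hierarchy is visible in URL paths, and the function names `build_path_tree` and `category_of`, as well as the example URLs, are hypothetical.

```python
from urllib.parse import urlparse


def build_path_tree(urls):
    """Group URLs into a nested dict that mirrors their directory structure."""
    tree = {}
    for url in urls:
        parsed = urlparse(url)
        # Split the path into non-empty segments,
        # e.g. "/a/b/c.html" -> ["a", "b", "c.html"].
        segments = [s for s in parsed.path.split("/") if s]
        # One subtree per host, then one nested dict per path segment.
        node = tree.setdefault(parsed.netloc, {})
        for segment in segments:
            node = node.setdefault(segment, {})
    return tree


def category_of(url, depth=1):
    """Use the first `depth` directory names as a crude category label.

    Assumption: a path ending in "/" names a directory; otherwise the
    last segment is treated as a file name and ignored.
    """
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    directories = segments if path.endswith("/") else segments[:-1]
    return "/".join(directories[:depth]) or "(root)"


if __name__ == "__main__":
    example_urls = [
        "http://example.com/sports/tennis/news.html",
        "http://example.com/sports/golf/index.html",
        "http://example.com/index.html",
    ]
    print(build_path_tree(example_urls))
    for u in example_urls:
        print(u, "->", category_of(u, depth=2))
```

Pages that share a directory prefix end up under the same node, so the directory names can serve as rough category labels. Real sites often hide or flatten their hierarchy, so a sketch like this would only be a starting point.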

A web search found no significant citations that directly reference this work in subsequent studies or industrial applications, which suggests it may have had limited direct impact on further research or development in the field. This may be due to how much web technologies and content mining methods have changed since its publication.