DECOM - Papers presented at events

Um novo retrato da web brasileira [A new portrait of the Brazilian Web].
(2005) Modesto, Marco; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio; Castillo, Carlos; Baeza-Yates, Ricardo
The objective of this paper is to evaluate quantitative and qualitative characteristics of the Brazilian Web, comparing current estimates with estimates obtained five years earlier. Much of the Web's content is dynamic and volatile, which makes collecting it in its entirety infeasible. The evaluation was therefore carried out on a sample of the Brazilian Web collected in March 2005. The results are estimated consistently, using an effective methodology already applied in similar studies of the Webs of other countries. The main aspects observed in this work include the distribution of page languages, the use of open versus proprietary tools for generating dynamic pages, the distribution of document formats, the distribution of domain types, and the distribution of links to external Web sites.
WIM : an information mining model for the web.
(2005) Baeza-Yates, Ricardo; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents a model to mine information in applications involving Web and graph analysis, referred to as the WIM (Web Information Mining) model. We demonstrate the model characteristics using a Web warehouse. The Web data in the warehouse is modeled as a graph, where nodes represent Web pages and edges represent hyperlinks. In the model, objects are always sets of nodes and belong to one class. We have physical objects containing attributes directly obtained from Web pages and links, such as the title of a Web page or the start and end pages of a link. Logical objects can be created by performing predefined operations on any existing object. In this paper we present the model components, propose a set of eleven operators and give examples of views. A view is a sequence of operations on objects, and it represents a way to mine information in the graph. As practical examples, we present views for clustering nodes and for identifying related item sets.
The evolution of web content and search engines.
(2006) Baeza-Yates, Ricardo; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
The Web grows at a fast pace and little is known about how new content is generated. The objective of this paper is to study the dynamics of content evolution in the Web, giving answers to questions like: How much new content has evolved from old Web content? How much of the Web content is biased by ranking algorithms of search engines? We used four snapshots of the Chilean Web containing documents of all the Chilean primary domains, crawled in four distinct periods of time. If a page in a newer snapshot has content of a page in an older snapshot, we say that the source is a parent of the new page. Our hypothesis is that when pages have parents, in a portion of pages there was a query that related the parents and made possible the creation of the new page. Thus, part of the Web content is biased by the ranking function of search engines. We also define a genealogical tree for the Web, where many pages are new and do not have parents and others have one or more parents. We present the Chilean Web genealogical tree and study its components. To the best of our knowledge this is the first paper that studies how old content is used to create new content, relating a search engine ranking algorithm with the creation of new pages.
Genealogical trees on the web : a search engine user perspective.
(2008) Baeza-Yates, Ricardo; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents an extensive study of the evolution of textual content on the Web, which shows how some new pages are created from scratch while others are created using already existing content. We show that a significant fraction of the Web is a byproduct of the latter case. We introduce the concept of Web genealogical tree, in which every page in a Web snapshot is classified into a component. We study these components in detail, characterizing the copies and identifying the relation between a source of content and a search engine, by comparing page relevance measures, documents returned by real queries performed in the past, and click-through data. We observe that sources of copies are returned by queries and clicked more often than other documents.
