DECOM - Papers presented at events
Permanent URI for this collection: http://www.hml.repositorio.ufop.br/handle/123456789/581
6 results
Search results
Item Syntactic similarity of web documents. (2003) Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents and compares two methods for evaluating the syntactic similarity between documents. The first method uses a Patricia tree constructed from the original document; similarity is computed by searching for the text of each candidate document in the tree. The second method uses the concept of shingles to obtain a similarity measure for each document pair: each shingle from the original document is inserted into a hash table, in which the shingles of each candidate document are then looked up. Given an original document and a set of candidates, both methods find the documents that have some similarity relationship with the original document. Experimental results were obtained using a plagiarized-document generator system applied to 900 documents collected from the Web. Measured by the arithmetic mean of the absolute differences between the expected and obtained similarity, the shingle-based algorithm scored 4.13% and the Patricia-tree algorithm scored 7.50%.

Item Geração de impressão digital para recuperação de documentos similares na web [Fingerprint generation for retrieving similar documents on the Web] (2004) Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents a mechanism for generating the “fingerprint” of a Web document. This mechanism is part of a system for detecting and retrieving documents from the Web that have a similarity relation to a suspicious document. The process is composed of three stages: a) generation of a fingerprint of the suspicious document, b) gathering of candidate documents from the Web, and c) comparison of each candidate document with the suspicious document. In the first stage, the fingerprint of the suspicious document is used as its identification. The fingerprint is composed of representative sentences of the document. In the second stage, the sentences composing the fingerprint are used as queries submitted to a search engine.
The documents identified by the URLs returned by the search engine are collected to form a set of candidate documents for the similarity check. In the third stage, the candidate documents are compared “in place” with the suspicious document. The focus of this work is on the generation of the fingerprint of the suspicious document. Experiments were performed using a collection of plagiarized documents constructed specially for this work. For the best fingerprint evaluated, on average 87.06% of the source documents used in the composition of the plagiarized document were retrieved from the Web.

Item Um novo retrato da web brasileira. [A new portrait of the Brazilian Web] (2005) Modesto, Marco; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio; Castilho, Carlos; Yates, Ricardo Baeza
The objective of this paper is to evaluate quantitative and qualitative characteristics of the Brazilian Web, comparing current estimates with estimates obtained five years earlier. Much of the Web's content is dynamic and volatile, which makes it infeasible to collect in its entirety. The evaluation was therefore carried out on a sample of the Brazilian Web collected in March 2005. The results are estimated consistently, using an effective methodology already employed in similar studies of the Webs of other countries. Among the main aspects observed in this work are the distribution of page languages, the use of open versus proprietary tools for generating dynamic pages, the distribution of document formats, the distribution of domain types, and the distribution of links to external Web sites.

Item WIM : an information mining model for the web. (2005) Yates, Ricardo Baeza; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents a model for mining information in applications involving Web and graph analysis, referred to as the WIM (Web Information Mining) model. We demonstrate the model's characteristics using a Web warehouse.
The Web data in the warehouse is modeled as a graph, where nodes represent Web pages and edges represent hyperlinks. In the model, objects are always sets of nodes and belong to one class. Physical objects contain attributes obtained directly from Web pages and links, such as the title of a Web page or the start and end pages of a link. Logical objects can be created by performing predefined operations on any existing object. In this paper we present the model components, propose a set of eleven operators, and give examples of views. A view is a sequence of operations on objects, and it represents a way to mine information in the graph. As practical examples, we present views for clustering nodes and for identifying related item sets.

Item The evolution of web content and search engines. (2006) Yates, Ricardo Baeza; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
The Web grows at a fast pace and little is known about how new content is generated. The objective of this paper is to study the dynamics of content evolution in the Web, answering questions such as: How much new content has evolved from old Web content? How much of the Web content is biased by the ranking algorithms of search engines? We used four snapshots of the Chilean Web containing documents of all the Chilean primary domains, crawled in four distinct periods of time. If a page in a newer snapshot contains content from a page in an older snapshot, we say that the older page (the source) is a parent of the new page. Our hypothesis is that, for a portion of the pages that have parents, there was a query that related the parents and made the creation of the new page possible. Thus, part of the Web content is biased by the ranking function of search engines. We also define a genealogical tree for the Web, where many pages are new and do not have parents and others have one or more parents. We present the Chilean Web genealogical tree and study its components.
To the best of our knowledge, this is the first paper that studies how old content is used to create new content, relating a search engine ranking algorithm to the creation of new pages.

Item Genealogical trees on the web : a search engine user perspective. (2008) Yates, Ricardo Baeza; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents an extensive study of the evolution of textual content on the Web, which shows how some new pages are created from scratch while others are created using already existing content. We show that a significant fraction of the Web is a byproduct of the latter case. We introduce the concept of a Web genealogical tree, in which every page in a Web snapshot is classified into a component. We study these components in detail, characterizing the copies and identifying the relation between a source of content and a search engine by comparing page relevance measures, documents returned by real queries performed in the past, and click-through data. We observe that sources of copies are returned by queries more frequently, and clicked more often, than other documents.