A. Csiszárik, S. Lestyán, A. Lukács: Efficient Apriori based algorithms for privacy preserving frequent itemset mining. In: Proceedings of the 5th IEEE Conference on Cognitive Infocommunications (CogInfoCom 2014), 431-435, 2014. (doi)
In this paper, Apriori-based, distributed, privacy-preserving frequent itemset mining algorithms are considered. Our secure algorithms are designed to fit into the secure multiparty computation model of privacy-preserving computation.
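As background for the Apriori framework the paper builds on, the following is a minimal, non-secure sketch of the classic level-wise Apriori search (it illustrates candidate generation and pruning only, not the distributed, privacy-preserving protocols of the paper):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Classic Apriori: level-wise search for frequent itemsets.

    transactions: list of sets; min_support: absolute count threshold.
    Returns a dict mapping frozenset -> support count.
    """
    # Count frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation from items of frequent (k-1)-itemsets,
        # with the Apriori prune: every (k-1)-subset must be frequent.
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

For example, over the transactions {a,b,c}, {a,b}, {a,c}, {b,c} with support threshold 2, every single item and every pair is frequent, but the triple {a,b,c} is not.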
A. Lukács, Z. Nagy: Dynamic Log Analysis, Infocommunications Journal 4 (2), 37-41, 2012. (pdf)
This article reviews a new log analysis solution based on multidimensional data cubes. It introduces the process of dynamic log analysis, including real-time log compression during log management, a new log parsing language, and an efficient regular-expression processing engine for this language. The article assesses the new online analytical processing (OLAP) tool implemented for log analysis. The algorithmic and software-technology developments in the resulting system create a new class of real-time log processing and analysis tools.
N. Bánfi, L. Dudás, Z. Fekete, J. Göbölös-Szabó, A. Lukács, A. Nagy, A. Szabó: City Sentinel - VAST 2011 Mini Challenge 1 award: "Outstanding integration of computational and visual methods", Visual Analytics Science and Technology (VAST), 2011 IEEE Conference on, 305-306, 2011. (doi, pdf)
City Sentinel is a visual analytics software capable of handling a large collection of textual documents by combining diverse text mining and visualization tools. The tool was applied in the VAST Challenge 2011, Mini Challenge 1, to millions of tweet messages. City Sentinel helped extract hidden information from the tweets, and to analyze and locate a hypothetical epidemic outbreak.
A visually supported search and browsing system for network-type data is presented. A prototype was applied to the VAST Challenge 2009, Flitter Mini Challenge.
M. Kurucz, L. Lukács, D. Siklósi, A. A. Benczúr, K. Csalogány, A. Lukács: Telephone call network data mining: A survey with experiments. In: Handbook of Large-Scale Random Networks, Springer Berlin Heidelberg, 2008, 489-530. (doi, pdf)
We survey some results of social network modeling and analysis relevant for telephone call networks and illustrate these results over the call logs of major Hungarian telephone companies. Our unique data sets span millions of users and a long time range, and include sufficiently rich sociodemographic information on the users. We explore properties that give stronger intuition on how contacts within real social networks arise, and point out properties unexplained by current network evolution models.
We cover four areas: link prediction, which investigates measures of connection strength between members of the community; geographic contact distribution modeling, which explains how small-world networks arise and span large physical distances; high-level partitioning and clustering, which identifies the top-level organizational principles of the network; and finally, classification in the presence of a network with strong influence between connected nodes.
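As an illustration of the link-prediction area above, the following sketch computes two classic connection-strength scores, common neighbors and the Jaccard coefficient, over a toy adjacency structure. The graph and function names are illustrative assumptions; the survey itself evaluates such measures on real call graphs:

```python
def common_neighbors(adj, u, v):
    """Number of shared neighbors of u and v; adj maps node -> set of neighbors."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Jaccard coefficient: shared neighbors normalized by the neighborhood union."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

# A hypothetical 4-node call graph as an undirected adjacency dict.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(common_neighbors(adj, 1, 2), jaccard(adj, 1, 2))
```

A higher score suggests a stronger (existing or likely future) tie between the two nodes.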
Halting a computer or biological virus outbreak requires a detailed understanding of the timing of the interactions between susceptible and infected individuals. While current spreading models assume that users interact uniformly in time, following a Poisson process, a series of recent measurements indicates that the intercontact time distribution is heavy tailed, corresponding to a temporally inhomogeneous bursty contact process. Here we show that the non-Poisson nature of the contact dynamics results in prevalence decay times significantly larger than predicted by the standard Poisson process based models. Our predictions are in agreement with the detailed time resolved prevalence data of computer viruses, which, according to virus bulletins, show a decay time close to a year, in contrast with the 1 day decay predicted by the standard Poisson process based models.
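A small numerical illustration of the heavy-tail effect discussed above: sampling exponential (Poisson-process) and Pareto intercontact times with the same mean shows how much more probable long contact gaps become under a bursty process. The distributions and parameters below are illustrative assumptions, not the paper's fitted values:

```python
import random

def tail_fraction(samples, threshold):
    """Empirical probability that an intercontact time exceeds threshold."""
    return sum(1 for x in samples if x > threshold) / len(samples)

random.seed(0)
n = 100_000
mean = 1.0

# Poisson contact process: exponential intercontact times with mean 1.
expo = [random.expovariate(1 / mean) for _ in range(n)]

# Bursty contact process: Pareto intercontact times, scaled to the same mean.
alpha = 1.5
xm = mean * (alpha - 1) / alpha  # scale factor giving mean 1 for alpha > 1
pareto = [xm * random.paretovariate(alpha) for _ in range(n)]

threshold = 10 * mean
# Theoretical tails: exp(-10) ~ 5e-5 for the exponential,
# (xm / 10)^alpha ~ 6e-3 for the Pareto, two orders of magnitude larger.
print(tail_fraction(expo, threshold))
print(tail_fraction(pareto, threshold))
```

Long quiet periods like these are what slow the decay of prevalence far below the Poisson prediction.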
We propose a new, heterogeneous data warehouse architecture in which a first-phase, traditional relational OLAP warehouse coexists with second-phase data stored in a compressed form optimized for data mining. Aggregations and metadata for the entire time frame are stored in the first-phase relational database. The main advantage of the second phase is its reduced I/O requirement, which enables very high-throughput processing by sequential, read-only data stream algorithms. It becomes feasible to run speed-optimized queries and data mining operations over the entire time frame of the most granular data. The second phase also enables long-term data storage and analysis in a very efficient compressed format, at low storage cost even for historical data. The proposed architecture fits existing data warehouse solutions. We show the effectiveness of the two-phase data warehouse through a case study of a large web portal.
While current studies on complex networks focus on systems that change relatively slowly in time, the structure of the most visited regions of the Web is altered on a timescale of hours to days. Here we investigate the visitation dynamics of a major news portal, representing the prototype of such a rapidly evolving network. The nodes of the network can be classified into stable nodes, which form the time-independent skeleton of the portal, and news documents. The visitation of the two node classes is markedly different: the skeleton acquires visits at a constant rate, while a news document's visitation peaks within a few hours. We find that the visitation pattern of a news document decays as a power law, in contrast with the exponential prediction provided by simple models of site visitation. This is rooted in the inhomogeneous nature of the browsing pattern characterizing individual users: the time interval between consecutive visits by the same user to the site follows a power-law distribution, in contrast with the exponential expected for Poisson processes. We show that the exponent characterizing an individual user's browsing pattern determines the power-law decay in a document's visitation. Finally, our results document the fleeting quality of news and events: while fifteen minutes of fame is still an exaggeration in the online media, we find that access to most news items decays significantly within 36 hours of posting.
Integration of data mining and database management systems could significantly ease the process of knowledge discovery in large databases. We consider implementations of frequent itemset mining algorithms, in particular pattern-growth algorithms similar to the top-down FP-growth variations, tightly coupled to relational database management systems. Our implementations remain within the confines of the conventional relational database facilities like tables, indices, and SQL operations. We compare our algorithm to the most promising previously proposed SQL-based FIM algorithm. Experiments show that our method performs better in many cases, but still has severe limitations compared to the traditional stand-alone pattern-growth method implementations. We identify the bottlenecks of our SQL-based pattern-growth methods and investigate the applicability of tightly coupled algorithms in practice.
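A toy illustration of frequent itemset counting kept entirely inside a relational engine, here sqlite3 with a self-join that finds frequent 2-itemsets. This is a simplified sketch of the SQL-based setting, not the paper's tightly coupled pattern-growth implementation:

```python
import sqlite3

# Transactions in the standard (tid, item) relational layout.
rows = [(1, "a"), (1, "b"), (1, "c"),
        (2, "a"), (2, "b"),
        (3, "a"), (3, "c"),
        (4, "b"), (4, "c")]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (tid INTEGER, item TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Frequent 2-itemsets via a self-join, computed entirely by the database:
# pair up items of the same transaction (t1.item < t2.item avoids duplicates),
# group, and keep pairs meeting the support threshold.
min_support = 2
pairs = con.execute("""
    SELECT t1.item, t2.item, COUNT(*)
    FROM t AS t1 JOIN t AS t2
      ON t1.tid = t2.tid AND t1.item < t2.item
    GROUP BY t1.item, t2.item
    HAVING COUNT(*) >= ?
""", (min_support,)).fetchall()
print(pairs)
```

Higher levels of the search would need further joins or temporary tables, which is exactly where the I/O bottlenecks discussed in the paper appear.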
This is a survey in Hungarian about algorithms for clustering and dimension reduction.
A. A. Benczúr, K. Csalogány, K. Hum, A. Lukács, B. Rácz, Cs. I. Sidló, M. Uher, and L. Végh: Architecture for Mining Massive Web Logs with Experiments, In: Proceedings of the HUBUSKA Open Workshop on Generic Issues of Knowledge Technologies, 2005. (pdf)
We introduce an experimental web log mining architecture with advanced storage and data mining components. The aim of the system is to provide a flexible base for web usage mining of large-scale Internet sites. We present experiments over logs of the largest Hungarian Web portal, [origo] (http://www.origo.hu), which among others provides online news and magazines, community pages, software downloads, free email, as well as a search engine. The portal comprises over 130,000 pages and receives 6,500,000 HTML hits on a typical workday, producing 8 GB of raw server logs per day that remain 400 MB even after cleaning and preprocessing, thus overrunning the storage capabilities of typical commercial systems.
As results of our experiments over the [origo] site, we present certain distributions related to the number of hits and sessions, some of which are surprising and differ from the expected simple power-law distribution. We investigate the problem of the excessive number of redundant frequent sequences in web log data. We describe our method for characterizing user navigation patterns by comparing them with the choices of a memoryless Markov process. Finally, we demonstrate the effectiveness of our clustering technique based on a mixture of singular value decomposition and refined heuristics.
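The comparison with a memoryless Markov process can be sketched as follows: estimate first-order transition probabilities from sessions, then score an observed navigation sequence under the memoryless model; sequences far more frequent than their Markov score suggest genuine navigation patterns. The session data and function names below are illustrative assumptions:

```python
from collections import Counter

def markov_transitions(sessions):
    """Estimate first-order transition probabilities P(b | a) from sessions
    (each session is a list of visited pages in order)."""
    pair_counts = Counter()
    out_counts = Counter()
    for s in sessions:
        for a, b in zip(s, s[1:]):
            pair_counts[(a, b)] += 1
            out_counts[a] += 1
    return {ab: c / out_counts[ab[0]] for ab, c in pair_counts.items()}

def markov_sequence_prob(trans, seq):
    """Probability of following seq step by step under the memoryless model."""
    p = 1.0
    for a, b in zip(seq, seq[1:]):
        p *= trans.get((a, b), 0.0)
    return p

sessions = [["home", "news", "sport"],
            ["home", "news", "home"],
            ["home", "mail"]]
trans = markov_transitions(sessions)
print(markov_sequence_prob(trans, ["home", "news", "sport"]))
```

Here P(news | home) = 2/3 and P(sport | news) = 1/2, so the memoryless model assigns the sequence probability 1/3.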
We present a generalized scheme for the preprocessing and high-density compression of log files. The aim of the method is to provide a base for long-term storage in a form appropriate for direct processing by data mining algorithms. Experimental runs on real data show that our scheme compresses data to 2-3% of its original size, outperforming general-purpose compression utilities such as bzip2, in some cases by a factor of 10. Encoding is also nearly three times faster than bzip2, with the running time dominated by the preprocessing phase. Decompression is as fast as gzip. This efficiency is achieved by semantic and standard compression techniques organized into modular pipelines. A pipeline for a specific field is composed of standard encoding schemes preceded by invertible transformations customizable for the actual data set. The discussed implementation demonstrates the flexibility of the pipeline concept by inlaying a novel compression algorithm. Our implementation was designed for the largest Hungarian Internet content provider and fulfills the additional requirements of production environments.
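The pipeline idea, an invertible field-specific transformation followed by a standard codec, can be sketched for a timestamp field as delta encoding followed by zlib. This is a minimal illustration of the concept, not the production implementation described above:

```python
import zlib

def delta_encode(values):
    # Invertible transform: keep the first value, then store differences.
    # Nearly sorted timestamps become small, highly compressible deltas.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def compress_field(values):
    # Stage 1: semantic transform; stage 2: general-purpose codec.
    payload = ",".join(map(str, delta_encode(values))).encode()
    return zlib.compress(payload)

def decompress_field(blob):
    deltas = [int(x) for x in zlib.decompress(blob).decode().split(",")]
    return delta_decode(deltas)
```

Swapping in a different transform per field (dictionary coding for URLs, say) is what makes the pipeline modular.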
Algorithms based on rapidly mixing Markov chains are discussed that produce nearly uniformly distributed random elements in abelian groups of finite order. Let A be an abelian group generated by a set S. Then one can generate ε-nearly uniform random elements of A using |S| log(|A|/ε) log(|A|) additions and the same number of random bits.
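A minimal sketch of the underlying idea for the cyclic group Z_n: a random walk that repeatedly adds a randomly chosen generator or its inverse converges to the uniform distribution on the group. The step count and parameters below are illustrative, not the bound stated above:

```python
import random
from collections import Counter

def near_uniform_element(n, gens, steps, rng=random):
    """Random walk on the cyclic group Z_n: start at the identity and
    repeatedly add a uniformly chosen generator or its inverse.  When
    gens generate Z_n and steps is large enough, the endpoint is nearly
    uniformly distributed."""
    x = 0
    for _ in range(steps):
        g = rng.choice(gens)
        x = (x + rng.choice((g, -g))) % n
    return x

# Empirical check on Z_5 with generator set {1}: endpoint frequencies
# over many independent walks should be close to uniform.
random.seed(1)
counts = Counter(near_uniform_element(5, [1], 25) for _ in range(5000))
print(sorted(counts.items()))
```

For larger groups or smaller ε, the number of steps must grow accordingly, which is what the stated bound quantifies.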
A. Lukács: Non-reversible random walks on groups, 2002.
A new proof and bound for the mixing rate of non-reversible random walks on directed Cayley graphs of finite groups is given.
Let X be a vertex-transitive graph and let S be an arbitrary finite subset of its vertices. Denote by N(S) the set of vertices adjacent to S but not in S. Babai and Szegedy proved that for an infinite, connected, locally finite X with subexponential growth we have |N(S)|/|S| >= 2/(d(S)+2), where d(S) is the diameter of S. The aim of this note is to provide a slightly better, tight lower bound on this quantity. We prove that |N(S)|/|S| >= 2/(d(S)+1) under the same conditions.
We present results on the structure of graphs with polynomial growth. For certain classes of these graphs we prove that they contain subgraphs which are similar to the lattice graph L_d. These results are then applied to investigate isoperimetric properties of certain classes of graphs with polynomial growth. In addition these structural results can also be applied to
Let X be a locally finite, vertex-transitive, infinite graph with polynomial growth. Then there exists a quotient group of Aut(X) which contains a finitely generated nilpotent subgroup N with the same growth rate as X. We show that X contains a subgraph which is finitely contractible onto the h-dimensional lattice, where h is the Hirsch number of N.
A near-representation of a group G is a map M: G -> GL(W) that is nearly a homomorphism, where W is a complex linear space. We show that such a near-representation of a finite group is necessarily close to, and hence derived from, an exact representation, provided that the error of the map is small enough. The required error bound does not depend on the order of the group G, only on the dimension and the size of the near-representation.
On the way to proving this result we establish approximate versions of several results from representation theory, such as approximate equivalence with unitary near-representations, decomposition into stable irreducible components, approximate intertwining operators, and an approximate version of Schur's lemma.