Arya Mazumdar (Research)
Reliability in Large-Scale Storage
With the advent of large-scale distributed storage systems, cloud computing, and commercial data-storage applications, there is renewed interest in the coding and information theory that governs the reliability of data. In these large networked databases, fast-update and quick-repair requirements have to be integrated with reliable data-storage protocols. These requirements bring new dimensions and parameters into the traditional optimization problems of information theory.
In this project we propose, for the first time, a model of large-scale storage that accounts for the topology of storage networks; previously, storage topology was not a concern of code designers. We further study the update-efficiency and local-repair (fast-recovery) properties of error-correcting codes suitable for storage. In all of these directions, we will analyze the fundamental limits of such systems, as well as propose explicit (fast algorithmic) constructions of codes. Tools from graph and network theory, combinatorics, and optimization will be used to find performance limits and devise coding algorithms.
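The local-repair idea can be illustrated with a minimal sketch (not one of the project's constructions): each small group of data blocks gets its own XOR parity, so a single lost node is rebuilt from its local group alone rather than from the whole system. The function names here are purely illustrative.

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def encode(data_blocks, group_size):
    """Split blocks into local groups and append one XOR parity per group."""
    groups = [data_blocks[i:i + group_size]
              for i in range(0, len(data_blocks), group_size)]
    return [group + [xor_blocks(group)] for group in groups]

def repair(group, lost_index):
    """Rebuild one lost block by XORing the survivors in its group only."""
    survivors = [b for i, b in enumerate(group) if i != lost_index]
    return xor_blocks(survivors)

blocks = [b"node0", b"node1", b"node2", b"node3"]
coded = encode(blocks, group_size=2)   # two groups, each with its own parity
assert repair(coded[0], 1) == b"node1" # recovered from two local reads only
```

The trade-off studied in the project shows up even here: smaller groups mean faster repair (fewer reads) but more parity overhead per stored block.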
Bioinformatics and Related Applications
We have recently started to look at potential applications of graph-based codes (LDPC codes), iterative decoding algorithms, and network coding to the prediction of both gene functionality and cell fate. Traditionally, many such systems, for example gene regulatory networks, are modeled by artificial neural networks.
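For readers unfamiliar with iterative decoding, the simplest instance is Gallager-style bit flipping on a parity-check graph. The sketch below is illustrative only (a toy Hamming parity-check matrix stands in for a real LDPC code, and it says nothing about the biological application): each round flips the bit with the largest fraction of failing checks.

```python
def syndrome(H, word):
    """Parity-check results: 1 marks an unsatisfied check."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]

def bit_flip_decode(H, received, max_iters=20):
    """Iteratively flip the bit with the largest fraction of failing checks."""
    word = list(received)
    col_weight = [sum(row[i] for row in H) for i in range(len(word))]
    for _ in range(max_iters):
        s = syndrome(H, word)
        if not any(s):
            break  # all parity checks satisfied: a valid codeword
        frac = [sum(s[r] for r in range(len(H)) if H[r][i]) / col_weight[i]
                for i in range(len(word))]
        word[frac.index(max(frac))] ^= 1
    return word

# Small (7,4) Hamming parity-check matrix as a toy stand-in for an LDPC code.
H = [[1, 1, 1, 0, 1, 0, 0],
     [1, 1, 0, 1, 0, 1, 0],
     [1, 0, 1, 1, 0, 0, 1]]

codeword = [1, 0, 1, 1, 0, 0, 1]
assert syndrome(H, codeword) == [0, 0, 0]

noisy = codeword[:]
noisy[2] ^= 1                                  # one bit corrupted
assert bit_flip_decode(H, noisy) == codeword   # error corrected
```

The appeal for network models is that each update is local: a node changes state based only on its graph neighbors, much as nodes in a regulatory network respond to their neighbors.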
Data Compression + Error-Correction
Algorithms for lossy/lossless compression and for error correction have been at the core of the digital revolution. This project focuses on the set of applications in which both lossy compression and noise resilience are required. Examples include the storage of high-resolution imagery on imperfect semiconductor (flash) memory and real-time video surveillance over jammed or noisy channels.
The state-of-the-art solution is “separation”: the serial concatenation of an off-the-shelf compression algorithm with an off-the-shelf error-correcting code. However, as the investigators have recently shown, for worst-case guarantees the separated solution is far from optimal, even asymptotically. This provides the principal motivation for a multifaceted investigation of the combinatorial, geometric, algebraic, and information-theoretic aspects of the joint source-channel coding problem.
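The separation baseline itself is easy to demonstrate. The sketch below uses two arbitrary off-the-shelf components, zlib for compression and a 3x repetition code with majority-vote decoding for channel protection; both are placeholders chosen for brevity, not the codes analyzed in the project.

```python
import zlib

def channel_encode(data, reps=3):
    """Repetition code: repeat every byte `reps` times."""
    return bytes(b for byte in data for b in [byte] * reps)

def channel_decode(received, reps=3):
    """Majority vote over each group of `reps` received bytes."""
    out = bytearray()
    for i in range(0, len(received), reps):
        group = received[i:i + reps]
        out.append(max(set(group), key=group.count))
    return bytes(out)

message = b"high-resolution sensor frame " * 10

# Separation: compress first, then apply the channel code.
coded = channel_encode(zlib.compress(message))

# Corrupt one byte in every 3-byte repetition group; majority still wins.
noisy = bytearray(coded)
for i in range(0, len(noisy), 3):
    noisy[i] ^= 0xFF

decoded = zlib.decompress(channel_decode(bytes(noisy)))
assert decoded == message
```

Note the fragility that motivates the joint approach: a compressed stream is dense in information, so any residual channel error the outer code fails to correct can render the whole stream undecompressable, whereas a joint design can degrade gracefully.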
Ranking and Ordinal Data
With the emergence of Big Data platforms in the social and life sciences, it is becoming of paramount importance to develop efficient lossless and lossy data-compression methods catering to the needs of such information systems. Although many near-optimal compression methods exist for classical text, image, and video data, they tend to perform poorly on data that naturally appears in fragmented or ordered form. This is especially the case for so-called ordinal data, arising in crowd-voting, recommender systems, and genome-rearrangement studies. There, information is represented on a “relative,” rather than an “absolute,” scale, and the particular constraints of the ordering cannot be properly captured by simple dictionary constructions. This project therefore seeks to improve the operational performance of a number of data-management, cloud-computing, and communication systems by developing theoretical, algorithmic, and software solutions for ordinal data compaction.
The main goal of the project is to develop the first general and comprehensive theoretical framework for ordinal compression. In particular, we propose to investigate new distortion measures for ordinal data and rate-distortion functions for lossy ordinal compression; rank-aggregation and learning methods for probabilistic ordinal models, used for ordinal clustering and quantization; and smooth compression and compressive computing in the ordinal domain. The proposed analytical framework will also make it possible to address the algorithmic challenges that arise in compressing complete, partial, and weak rankings. The accompanying software solutions may hence find broad application in areas as diverse as theoretical computer science (sorting, searching, and selection), machine learning (clustering and learning to rank), and gene prioritization and phylogeny (reconstruction of lists of influential genes and of ancestral genomes, respectively).
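One standard distortion measure for rankings, and a natural candidate in a rate-distortion theory for ordinal data, is the Kendall tau distance: the number of item pairs that two rankings order differently. A minimal sketch (illustrative function name, not project code):

```python
from itertools import combinations

def kendall_tau(pi, sigma):
    """Number of item pairs ordered differently by rankings pi and sigma."""
    pos_pi = {item: i for i, item in enumerate(pi)}
    pos_sigma = {item: i for i, item in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] - pos_pi[b]) * (pos_sigma[a] - pos_sigma[b]) < 0
    )

# Two crowd-sourced rankings of the same four items.
assert kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]) == 0  # identical rankings
assert kendall_tau([1, 2, 3, 4], [2, 1, 3, 4]) == 1  # one adjacent swap
assert kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]) == 6  # full reversal: C(4,2) pairs
```

Under such a measure, lossy ordinal compression amounts to covering the space of permutations with balls of bounded Kendall tau radius, which is why permutation-code techniques enter the picture.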
Machine Learning Applications
Currently we are looking at two different applications: