Big Data Analytics Group

Dr. Nagiza F. Samatova
Professor, Department of Computer Science, North Carolina State University
Senior Scientist, Computer Science and Mathematics Division, Oak Ridge National Laboratory
Contact: samatova@csc.ncsu.edu


Software List:

  1. Maximum Parallel Clique Enumeration (MCE): This is a scalable, parallel algorithm for the NP-hard clique enumeration problem (http://freescience.org/cs/parallelclique/) (Matt Schmidt)
  2. Network Instance-Based Biased Subgraph Search (NIBBS): The method is capable of comparing hundreds of genome-scale metabolic networks to identify metabolic subsystems that are statistically biased toward phenotype-expressing organisms (http://freescience.org/cs/NIBBS/) (Matt Schmidt)
  13. alpha-beta, Motif Finder: The method allows for identification of protein functional modules that, in addition to metabolic subsystems, could include their regulators, sensors, transporters, and even uncharacterized proteins that are predicted to be related to the target phenotype (http://freescience.org/cs/ABClique/) (Matt Schmidt)
  4. Bi-clustering: The approach allows for identification of phenotype-related modules that are analyzed to identify potential phenotype-related cross-talking pathways (http://freescience.org/cs/bi-clustering/) (Kevin Wilson)
  15. Dense ENriched Subgraph Enumeration (DENSE): The algorithm incorporates partial prior knowledge about the proteins involved in a phenotype-related process and enriches that knowledge with newly identified sets of functionally associated proteins present in individual phenotype-expressing organisms (http://freescience.org/cs/DENSE/) (William Hendrix)
  16. Subsystem Network Biomarkers: The method comparatively analyzes organismal networks to identify subsystems that are biased towards a phenotype, without density or size constraints and without requiring input parameters (http://freescience.org/cs/phenotype-biased-biclusters/) (Kanchana Padmanabhan)
  7. System Phenotype-related Interplaying Components Enumerator (SPICE): The method iteratively enumerates statistically significant and phenotype-relevant cellular subsystems and can be applied to both network data and gene-expression data (http://freescience.org/cs/SPICE) (Zhengzhang Chen)
  8. Network Biomarker Biological Significance (Hierarchical Modularity Score (HMS)): A complementary method for functional annotation and functional enrichment analysis of hierarchically organized protein functional modules that takes into account an inherent design principle of biological networks, hierarchical modularity (http://freescience.org/cs/HMS/) (Kanchana Padmanabhan)
  9. In-situ Sort-And-B-spline Error-bounded Lossy Abatement (ISABELA): This is a library for in-situ, embarrassingly parallel, B-spline-based lossy compression of scientific floating-point data with user-controlled accuracy bounds (http://freescience.org/cs/ISABELA) (Sriram Lakshminarasimhan)
  10. ISABELA-QA: A parallel query processing engine over ISABELA-compressed scientific data that is lightweight in memory and storage and capable of multi-core, multi-node, and GPU execution (Sriram Lakshminarasimhan)
  11. Analytics-driven Lossless dAta Compression for Rapid In-situ Indexing, sToring, and querYing (ALACRI2TY): The method provides a fast, memory-lightweight query processing (and lossless compression) engine for scientific floating-point data, optimized for heterogeneous access patterns (Isha Arkatkar and a large NCSU team of PhD students)
  12. In-Situ Orthogonal Byte Aggregate Reduction Compression (ISOBAR): A preconditioner-based, high-throughput lossless compression technique for hard-to-compress scientific datasets (http://freescience.org/cs/ISOBAR) (Eric Schendel)
  13. Prm_causality: A tool for data-driven, semi-automatic inference of plausible phenomenological models (http://freescience.org/cs/prm_causality/) (Saurabh Pendse)
  14. Forecast Oriented Feature Elimination-based Classification of Adverse Spatio-Temporal Extremes (FORECASTER): A forecast-oriented feature elimination-based ensemble of classifiers for robust forecasting of adverse spatio-temporal extremes (software available upon request) (Zhengzhang Chen)
  15. Forecast Error Detection and Correction (DETECTOR): A library for detecting and correcting prediction errors in extreme event forecasts (software available upon request) (Zhengzhang Chen)
  16. Anomalous Community Generator: A library for detecting predictive and phase-biased communities in contrasting groups of networks (software available upon request) (Zhengzhang Chen)
  17. Metabolic pathway alignment: Tool for evolutionary analysis of metabolic pathways via multiple metabolic pathway alignment (software available upon request) (Wenbin Chen)
  18. Protein network alignment algorithm: Tool for evolutionary analysis of protein functional association networks via alignment of multiple protein functional association networks (software available upon request) (Wenbin Chen)
  19. pR: Parallel R for High Performance Statistical Computing: Open-source software distributed through R’s CRAN: http://cran.r-project.org/src/contrib/Descriptions/RScaLAPACK.html (Paul Breimyer)
    Impact: Distributed through more than 30 mirror sites in about 20 countries. The RScaLAPACK library is part of the Red Hat Linux distribution through RPM packages
  20. mpiBlast-pio: Efficient parallelization of NCBI BLAST. Open-source software distributed from: http://mpiblast.lanl.gov/Downloads.Stable.html (Heshan Lin)
    Impact: The incorporation of performance optimizations in the NCSU-ORNL research prototype pioBLAST into the popular open-source mpiBLAST software
  21. Shewanella Knowledgebase: An integrated knowledgebase for the DOE Shewanella Federation
    Impact: A continuously growing knowledgebase used by a large community of users
  22. Vonode: A de novo sequencing tool that exploits the potential of high-resolution MS/MS data by using a unique tag scoring function and a novel type of spectrum graph (http://compbio.ornl.gov/Vonode) (Chongle Pan)
  23. ProRata: A data analysis algorithm for quantitative proteomics that addresses two critical needs: (1) to combine multiple replicates and assess the reproducibility of measurements in order to obtain reliable quantification information, and (2) to compare two unlabeled field samples of interest to a labeled reference sample grown in the laboratory, because metabolic labeling cannot be applied to a field sample (http://code.google.com/p/prorata/) (Chongle Pan)
    Impact: More than 1,000 downloads since June 2006; featured in the Journal of Proteome Research, Vol. 5, No. 11, 2006
  24. Sipros: An algorithm to identify peptide sequences and quantify their 15N incorporation level in stable isotope probing experiments (http://code.google.com/p/sipros/) (Chongle Pan)
  25. MLOC: A parallel Multi-level Layout Optimization framework for Compressed scientific spatio-temporal data at extreme scale (http://freescience.org/cs/MLOC/) (Zhenhuan Gong)
  26. PARLO: A parallel run-time layout optimization framework that achieves multi-level data layout optimization for scientific applications at run-time, before data is written to storage (http://freescience.org/cs/PARLO/) (Zhenhuan Gong)
  27. APLOD: Analytics-driven Precision Level Of Detail: Floating-point data reorganization library enabling error-bounded, level-of-detail data access (http://freescience.org/cs/APLOD/) (John Jenkins)
  28. Crosstalk Network Biomarker Discovery: A systematic methodology to identify inter-subsystem crosstalk biomarkers by fusing multiple biological clues and overlaying organism specific information (http://freescience.org/cs/Crosstalk/) (Kanchana Padmanabhan)
  29. Out-of-place Deinterleaving (OPD): A high-performance, energy-efficient deinterleaving method, generic to any data type, that exploits data cache prefetching, reduces memory accesses, and optimizes the use of complete cache-line writes (http://freescience.org/cs/Out-of-Place-Deinterleaving) (Eric Schendel)
  30. ISOBAR Hybrid Framework: A hybrid framework that interleaves I/O with data compression to improve I/O throughput while reducing dataset size (http://freescience.org/cs/ISOBAR_Hybrid) (Eric Schendel)