- Large scale feature subset selection
- Incremental and online learning in nonstationary environments
- Applied machine learning for comparative metagenomics
Large Scale Feature Subset Selection
An ever-increasing number of applications generate massive amounts of high-dimensional data: not only is the cardinality of the data growing rapidly, but so is its dimensionality. Such applications include the analysis of data generated by social networks, media networks, blogs, healthcare informatics, and genomics, to name a few. Unfortunately, not all of the features in these data are informative or meaningful, and we generally do not know in advance which features are. Therefore, we need algorithms that extract the meaningful and informative features while remaining scalable to massive data sets.
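As a concrete illustration of the filter-style selection involved, features can be ranked by their mutual information with the class labels and the top-scoring subset retained. The sketch below (in Python, with illustrative function names; it is a minimal ranking filter, not the bootstrap Neyman-Pearson test itself) shows the idea for discrete features:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete vectors."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))   # joint probability
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def rank_features(X, y, k):
    """Score each column of X by its MI with y; return the top-k column indices."""
    scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]
```

Each feature is scored independently here, which is what keeps such filters cheap enough to scale; more sophisticated criteria additionally account for redundancy among the selected features.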
- G. Ditzler, R. Polikar, and G. Rosen, “A bootstrap based Neyman-Pearson test for identifying variable importance,” IEEE Transactions on Neural Networks and Learning Systems, 2015, vol. 26, no. 4, pp. 880-886.
- G. Ditzler, M. Austen, R. Polikar, and G. Rosen, “Scalable Subset Selection and Variable Importance,” IEEE Symposium on Computational Intelligence in Data Mining, 2014. (travel award)
Incremental & Lifelong Learning with Ensembles
Two of the more common assumptions that applied machine learning researchers make are that: (1) the training and testing data are sampled from a fixed, albeit unknown, probability distribution, and (2) the classes are represented by roughly equal numbers of samples. When the first assumption fails because new data arrive over time from a changing distribution, we face concept drift (a.k.a. learning in nonstationary environments); when the second fails, we face class imbalance. We developed two incremental multiple classifier system solutions, namely Learn++.NIE and Learn++.CDS, that explicitly address the joint problem of learning from data streams exhibiting both concept drift and class imbalance, a problem that has been largely understudied in the literature. We demonstrated that these approaches are quite useful in practice and that they outperform state-of-the-art algorithms on skew-insensitive statistics, which account for minority-class performance rather than reporting a simple error rate. We have also developed algorithms for learning new, potentially imbalanced, classes with an ensemble of classifiers.
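The general shape of such a stream ensemble, in spirit if not in the exact Learn++.NIE formulation, is to train one base learner per incoming batch and re-weight every member by a skew-insensitive score on the newest batch, so stale models fade as the distribution drifts. The sketch below is a minimal illustration under those assumptions; the nearest-centroid base learner, the weighting rule, and all names are illustrative, not the published algorithm:

```python
import numpy as np

class NearestCentroid:
    """Tiny stand-in base learner: classify by nearest class centroid."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[np.argmin(d, axis=1)]

def balanced_accuracy(y_true, y_pred):
    """Skew-insensitive score: mean of per-class recalls."""
    return float(np.mean([np.mean(y_pred[y_true == c] == c)
                          for c in np.unique(y_true)]))

class DriftEnsemble:
    """One member per batch; members are re-weighted on each new batch."""
    def __init__(self):
        self.members, self.weights = [], []
    def partial_fit(self, X, y):
        self.members.append(NearestCentroid().fit(X, y))
        # Score every member on the newest batch; members at or below
        # chance (0.5 balanced accuracy for two classes) get near-zero weight.
        self.weights = [max(balanced_accuracy(y, m.predict(X)) - 0.5, 1e-3)
                        for m in self.members]
    def predict(self, X):
        preds = np.array([m.predict(X) for m in self.members])  # (members, samples)
        w = np.array(self.weights)
        classes = np.unique(preds)
        # Weighted vote: per class, sum the weights of members voting for it.
        scores = np.array([(w[:, None] * (preds == c)).sum(axis=0) for c in classes])
        return classes[np.argmax(scores, axis=0)]
```

Weighting against the newest batch is what gives the ensemble its adaptivity: a member trained before a drift scores poorly on post-drift data and loses influence, while the skew-insensitive score keeps a member from earning a high weight by simply predicting the majority class.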
- G. Ditzler, M. Roveri, C. Alippi, and R. Polikar, “Adaptive strategies for learning in nonstationary environments: a survey,” IEEE Computational Intelligence Magazine, 2015, vol. 10, no. 4, pp. 12-25.
- G. Ditzler and R. Polikar, “Incremental learning of concept drift from streaming imbalanced data,” in IEEE Transactions on Knowledge & Data Engineering, 2013, vol. 25, no. 10, pp. 2283–2301.
- G. Ditzler, G. Rosen, and R. Polikar, “Domain Adaptation Bounds for Multiple Expert Systems Under Concept Drift,” International Joint Conference on Neural Networks, 2014. (travel award & best student paper)
Applied Machine Learning in Life Sciences
Metagenomics is the study of genetic material obtained directly from an environmental sample, meaning that everything in the sample (i.e., all of the organisms) is sequenced. This differs from traditional genomics, in which a single genome is generally sequenced. One important aspect of metagenomic sequencing is that it allows researchers to characterize the bacterial and functional profiles of the environment from which the sample was collected. These profiles are important because bacteria are hypothesized to play a crucial role not only in human health, but also in our surrounding environments.
We have applied our expertise in feature subset selection to 16S and metagenomic data to help microbial ecologists determine the protein families and microorganisms that best differentiate multiple phenotypes within an environmental study. We worked closely with a colleague to implement information-theoretic feature subset selection algorithms in QIIME, a heavily used analysis tool in microbial ecology, as well as standalone tools that are freely available without restriction to researchers interested in using them (see http://goo.gl/YAnxRD).
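One practical detail when applying discrete information-theoretic selectors to abundance profiles is that relative abundances are continuous, so a common preprocessing step is to discretize each column, e.g. by quantile binning. A minimal sketch of that step (the function names and default bin count are illustrative assumptions):

```python
import numpy as np

def quantile_bin(column, n_bins=4):
    """Map a continuous abundance column onto quantile bins 0..n_bins-1."""
    edges = np.quantile(column, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(column, edges)

def discretize_table(X, n_bins=4):
    """Bin every column of a samples-by-features abundance matrix."""
    return np.column_stack([quantile_bin(X[:, j], n_bins) for j in range(X.shape[1])])
```

The binned table can then be fed to any discrete selector, such as a mutual-information ranking, to score taxa or protein families against phenotype labels.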
- G. Ditzler, J. Calvin Morrison, Y. Lan, and G. Rosen, “Fizzy: Feature selection for metagenomics,” BMC Bioinformatics, 2015, vol. 16, no. 358.
- G. Ditzler, R. Polikar, and G. Rosen, “Multi-Layer and Recursive Neural Networks for Metagenomic Classification,” IEEE Transactions on Nanobioscience, 2015, vol. 14, no. 6, pp. 608-616.
- J.-L. Bouchot, W. Trimble, G. Ditzler, Y. Lan, S. Essinger, and G. Rosen, “Advances in machine learning for processing and comparison of metagenomic data,” Computational Systems Biology, Springer, 2014.
- G. Ditzler and G. Rosen, “Feature Subset Selection for Inferring Relative Importance of Taxonomy,” ACM International Workshop on Big Data in Life Sciences, 2014. (invited and travel award)