"Recovery Patterns for Iterative Methods in a Parallel Unstable Environment", SIAM Journal on Scientific Computing, vol. 30, no. 1, pp. 102-116, 2007.
"Algorithmic Based Fault Tolerance Applied to High Performance Computing", Journal of Parallel and Distributed Computing, vol. 69, no. 4, pp. 410-416, April, 2009.
"Constructing Resilient Communication Infrastructure for Runtime Environments", Parallel Computing: From Multicores and GPU's to Petascale: IOS Press, pp. 441-451, 2010.
Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs, no. UT-CS-10-663: Innovative Computing Laboratory, University of Tennessee, November, 2010.
"Self-Healing Network for Scalable Fault-Tolerant Runtime Environments", Future Generation Computer Systems, vol. 26, no. 3, pp. 479-485, March, 2010.
On Scalability for MPI Runtime Systems, no. ICL-UT-11-05: Innovative Computing Laboratory, University of Tennessee, May, 2011.
"A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI", 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012), Christos Kaklamanis, Theodore Papatheodorou and Paul Spirakis eds., Springer-Verlag, Rhodes, Greece, August 27-31, 2012.
"An Evaluation of User-Level Failure Mitigation Support in MPI", Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI, Springer, Vienna, Austria, September 23-26, 2012.