"Thread-level parallelization and optimization of NWChem for the Intel MIC architecture", Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM@PPoPP 2015, San Francisco, CA, USA, February 7-8, 2015, 2015.
"Performance Tuning of Fock Matrix and Two-Electron Integral Calculations for NWChem on Leading HPC Platforms", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS13) held as part of SC13, 11/2013.
"Optimization of Parallel Particle-to-Grid Interpolation on Leading Multicore Platforms", IEEE Transactions on Parallel Distributed Systems, vol. 23, issue 10, pp. 1915 - 1922, October 2012.
"Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning", International Conference for High Performance Computing, Networking, Storage and Analysis (SC11), Seattle, Washington, ACM/IEEE, 11/2011.
"Gyrokinetic Particle-in-Cell Optimization on Emerging Multi- and Manycore Platforms", Parallel Computing, vol. 37, no. 9, pp. 501-520, sept, 2011.
"Gyrokinetic Toroidal Simulations on Leading Mult- and Manycore HPC Systems", (submitted to) Supercomputing, April, 2011.
"Auto-tuning the 27-point stencil for multicore", In Proc. iWAPT2009: The Fourth International Workshop on Automatic Performance Tuning, Tokyo, Japan, The Parallel Computing Laboratory, 2009.
"Stencil Computation Optimization and Auto-Tuning on State-of-the-art Multicore Architectures", Proc.\ 2008 ACM/IEEE Conf.\ on Supercomputing (SC 2008), pp. 1–12, 2008.