One of the best ways to learn about Data Science, and to stay informed about new developments and ideas, is to read and study papers on the various subjects.
Because Data Science is a very hot topic, there are tons of papers covering a huge variety of subjects. Whichever topic interests you most, there is a lot of information to search through and read.
To make your search easier, we compiled a list of the best papers in data science, divided into 3 big topics: General, Clustering Algorithms and Machine Learning.
Here is the list!
General Papers
- MapReduce: Simplified Data Processing on Large Clusters
Authors: Jeffrey Dean and Sanjay Ghemawat
Year: 2004
- Dynamo: Amazon’s Highly Available Key-value Store
Authors: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
Year: 2007
- Bigtable: A Distributed Storage System for Structured Data
Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Year: 2006
- NoSQL Databases
Author: Christof Strauch
Year: 2009
- The Pathologies of Big Data
Author: Adam Jacobs
Year: 2009
- Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics
Author: Ralph Kimball
Year: 2011
- Big data: The next frontier for innovation, competition, and productivity
Authors: James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers
Year: 2011
- Dremel: Interactive Analysis of Web-Scale Datasets
Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
Year: 2010
- Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications
Authors: Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, Hari Balakrishnan
Year: 2001
- Cassandra – A Decentralized Structured Storage System
Authors: Avinash Lakshman and Prashant Malik
Year: 2009
- Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems
Authors: Antony Rowstron and Peter Druschel
Year: 2001
- Interpreting the Data: Parallel Analysis with Sawzall
Authors: Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan
Year: 2005
- RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems
Authors: Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, Zhiwei Xu
Year: 2011
- The Google File System
Authors: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Year: 2003
- Spanner: Google’s Globally-Distributed Database
Authors: Google Team
Year: 2012
- Large-scale Incremental Processing Using Distributed Transactions and Notifications
Authors: Daniel Peng, Frank Dabek
Year: 2010
- A Relational Model of Data for Large Shared Data Banks
Author: E. F. Codd
Year: 1970
- Pasting Small Votes for Classification in Large Databases and On-Line
Author: Leo Breiman
Year: 1999
- Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Authors: Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts
Year: 2013
- Megastore: Providing Scalable, Highly Available Storage for Interactive Services
Authors: Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh
Year: 2011
- F1: A Distributed SQL Database That Scales
Authors: Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, Himani Apte
Year: 2013
- Top 10 algorithms in data mining
Authors: Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg
Year: 2007
- Show and Tell: A Neural Image Caption Generator
Authors: Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan
Year: 2014
- Data Science and its Relationship to Big Data and Decision Making
Authors: Foster Provost and Tom Fawcett
Year: 2013
- Mining Contrast Subspaces
Authors: Lei Duan, Guanting Tang, Jian Pei, James Bailey, Guozhu Dong, Akiko Campbell and Changjie Tang
Year: 2014
- Experimental evidence of massive-scale emotional contagion through social networks
Authors: Adam D. I. Kramer, Jamie E. Guillory and Jeffrey T. Hancock
Year: 2013
- Preventing False Discovery in Interactive Data Analysis is Hard
Authors: Moritz Hardt and Jonathan Ullman
Year: 2014
- ClusCite: Effective Citation Recommendation by Information Network-Based Clustering
Authors: Xiang Ren, Jialu Liu, Xiao Yu, Urvashi Khandelwal, Quanquan Gu, Lidan Wang and Jiawei Han
Year: 2014
- Reducing the Sampling Complexity of Topic Models
Authors: Aaron Q. Li, Amr Ahmed, Sujith Ravi and Alexander J. Smola
Year: 2014
- LSTM: A Search Space Odyssey
Authors: Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink and Jürgen Schmidhuber
Year: 2015
- Semi-Supervised Learning with Ladder Networks
Authors: Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund and Tapani Raiko
Year: 2015
- Towards Neural Network-based Reasoning
Authors: Baolin Peng, Zhengdong Lu, Hang Li and Kam-Fai Wong
Year: 2015
Clustering Algorithms
- Algorithms for hierarchical clustering: An overview
Authors: Fionn Murtagh and Pedro Contreras
Year: 2012
- SLINK: An optimally efficient algorithm for the single-link cluster method
Author: R. Sibson
Year: 1972
- Optimal algorithms for complete linkage clustering in d dimensions
Authors: Drago Krznaric and Christos Levcopoulos
Year: 2002
- An efficient algorithm for a complete link method
Author: D. Defays
Year: 1977
- Robust Hierarchical Clustering
Authors: Maria Florina Balcan and Pramod Gupta
Year: 2014
- Optimal Implementations of UPGMA and Other Common Clustering Algorithms
Authors: Ilan Gronau and Shlomo Moran
Year: 2007
- An Efficient k-Means Clustering Algorithm: Analysis and Implementation
Authors: Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman and Angela Y. Wu
Year: 2002
- A K-Means Clustering Algorithm
Authors: J. A. Hartigan and M. A. Wong
Year: 1979
- A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
Authors: Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu
Year: 1996
- OPTICS: Ordering Points To Identify the Clustering Structure
Authors: Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander
Year: 1999
- BIRCH: An Efficient Data Clustering Method for Very Large Databases
Authors: Tian Zhang, Raghu Ramakrishnan and Miron Livny
Year: 1996
- CURE: An Efficient Clustering Algorithm for Large Databases
Authors: Sudipto Guha, Rajeev Rastogi and Kyuseok Shim
Year: 2001
- CLARANS: a method for clustering objects for spatial data mining
Authors: Raymond T. Ng and Jiawei Han
Year: 2002
- FCM: The Fuzzy C-Means Clustering Algorithm
Authors: James C. Bezdek, Robert Ehrlich and William Full
Year: 1982
- The Expectation Maximization Algorithm
Author: Frank Dellaert
Year: 2002
- The EM Algorithm
Author: Xiaojin Zhu
Year: 2007
Machine Learning
- Parallel Spectral Clustering in Distributed Systems
Authors: Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, Edward Y. Chang
Year: 2011
- Learning Multiple Layers of Features from Tiny Images
Author: Alex Krizhevsky
Year: 2009
- Distributed Algorithms for Topic Models
Authors: David Newman, Arthur Asuncion, Padhraic Smyth and Max Welling
Year: 2009
- Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation
Authors: U Kang, Brendan Meeder and Christos Faloutsos
Year: 2011
- Large Language Models in Machine Translation
Authors: Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och and Jeffrey Dean
Year: 2007
- Learning using Large Datasets
Authors: Léon Bottou and Olivier Bousquet
Year: 2008
- Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Authors: Samy Bengio, Oriol Vinyals, Navdeep Jaitly and Noam Shazeer
Year: 2015
- Training recurrent networks online without backtracking
Authors: Yann Ollivier and Guillaume Charpiat
Year: 2015
- PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
Authors: U Kang, Charalampos E. Tsourakakis and Christos Faloutsos
Year: 2009
- Learning Deep Architectures for AI
Author: Yoshua Bengio
Year: 2009
- Intriguing properties of neural networks
Authors: Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow and Rob Fergus
Year: 2014
- Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques
Author: Uzi Vishkin
Year: 2010
- Pattern Recognition and Machine Learning
Author: Christopher M. Bishop
Year: 2006
- A Few Useful Things to Know about Machine Learning
Author: Pedro Domingos
Year: 2012
- Map-Reduce for Machine Learning on Multicore
Authors: Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng and Kunle Olukotun
Year: 2006
- Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images
Authors: Anh Nguyen, Jason Yosinski and Jeff Clune
Year: 2014
We hope you like our selection of important papers on Data Science! You can learn a lot about specific Data Science subjects by reading and studying these documents. We’ll keep updating and expanding this list to make it as complete as possible.