CSCI 780: Big Data

Instructors: Smirni, Stathopoulos, and Li
Time/Place: M-S 002, W: 3:30-6:20 (includes a 5-minute break)
Office Hours: By appointment


The term "Big Data" is a recent bandwagon that has attracted the attention of a wide variety of practitioners from science, engineering, business, medicine, and even politics and law. Each area has its own interpretation of the term, with "Big" seemingly the only common denominator. Even within Computer Science, Big Data overlaps with network security, machine learning, data mining, distributed or Cloud computing, high performance computing, and many other areas. I urge everyone to browse through the website for a host of news, information, and articles.

The aim of this class is to introduce some of the challenges and techniques that fall under the general umbrella of Big Data. The class is structured in a seminar format. A reading list of papers on various Big Data topics has been assembled, ranging from data mining and learning to the systems paradigms that support them. A companion undergraduate textbook, "Mining of Massive Datasets" by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman, is available online. The book will help cover background material on topics that the papers in the reading list cover in depth.

Each student will be responsible for two in-class presentations from the reading list. The first presentations will start on January 28, 2015. Papers will be assigned to students on a first-come, first-served basis. A few of the papers on the reading list may constitute more than one presentation; others will be bundled together into a single presentation. The student in charge of a presentation must deliver a high-quality 45-minute presentation, which will be followed by 20-25 minutes of discussion. Class participation in this discussion will be evaluated on a weekly basis and constitutes a significant portion of the grade. Every student is expected to read every paper that is presented.

At the start of every class period, a 10-15 minute quiz will be given on the papers presented and discussed in the previous week. Occasionally (depending on the type of papers), the quiz may be open-notes.

Final grades will be computed as follows:
Class Presentations 45%
Participation 15%
Weekly Quizzes 40%
Extra Credit: Project 20%
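
As an illustration of the weighting above, here is a minimal sketch of how a final grade could be tallied. The weights come from the list above; the function name, the 0-100 score scale, and the way the extra-credit project is added on top of the base grade are assumptions made for this example, not stated course policy.

```python
# Component weights as listed in the syllabus (they sum to 100%).
WEIGHTS = {"presentations": 0.45, "participation": 0.15, "quizzes": 0.40}

# ASSUMPTION: the extra-credit project is added on top of the base grade.
EXTRA_CREDIT_WEIGHT = 0.20


def final_grade(scores, project=None):
    """Return the weighted grade for component scores on a 0-100 scale.

    scores:  dict with one 0-100 score per key in WEIGHTS.
    project: optional 0-100 score for the extra-credit project.
    """
    base = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    bonus = EXTRA_CREDIT_WEIGHT * project if project is not None else 0.0
    return base + bonus
```

For example, scores of 90 (presentations), 80 (participation), and 85 (quizzes) give 40.5 + 12 + 34 = 86.5 before any extra credit.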

Reading List

Privacy Tradeoffs in Predictive Analytics. Stratis Ioannidis (Technicolor); Andrea Montanari (Stanford); Udi Weinsberg (Technicolor); Smriti Bhagat (Technicolor); Nadia Fawaz (Technicolor); Nina Taft (Technicolor), SIGMETRICS 2014

Online Algorithms for Joint Application-VM-Physical-Machine Auto-Scaling in a Cloud. Yang Guo (Bell Labs, Alcatel-Lucent); Alexander L. Stolyar (Bell Labs, Alcatel-Lucent); Anwar Walid (Bell Labs, Alcatel-Lucent), SIGMETRICS 2014

Beyond Random Walk and Metropolis-Hastings Samplers: Why You Should Not Backtrack for Unbiased Graph Sampling. Chul-Ho Lee (North Carolina State University), Xin Xu (North Carolina State University), and Do Young Eun (North Carolina State University), SIGMETRICS 2012

Scalable Big Graph Processing in MapReduce. Lu Qin, University of Technology, Sydney; Jeffrey Xu Yu, The Chinese University of Hong Kong; Lijun Chang, The University of New South Wales; Hong Cheng, The Chinese University of Hong Kong; Chengqi Zhang, University of Technology, Sydney; Xuemin Lin, The University of New South Wales, SIGMOD'14

Stratified-Sampling over Social Networks Using MapReduce. Roy Levin, IBM Haifa Research Lab; Yaron Kanza, Jacobs Technion-Cornell Innovation Institute, Cornell Tech, SIGMOD'14

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets. Nadathur Satish, Intel; Narayanan Sundaram, Intel; Mostofa Patwary, Intel; Jiwon Seo, Stanford University; Jongsoo Park, Intel; Muhammad Hassaan, University of Texas, Austin; Shubho Sengupta, Intel; Zhaoming Yin, Georgia Tech; Pradeep Dubey, Intel, SIGMOD'14

Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing. Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, and Jingren Zhou, Microsoft; Zhengping Qian, Ming Wu, and Lidong Zhou, Microsoft Research, OSDI'14

The Power of Choice in Data-Aware Cluster Scheduling. Shivaram Venkataraman and Aurojit Panda, University of California, Berkeley; Ganesh Ananthanarayanan, Microsoft Research; Michael J. Franklin and Ion Stoica, University of California, Berkeley, OSDI'14

  1. (Deep Learning) Large Scale Distributed Deep Networks. Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng, NIPS 2012: Neural Information Processing Systems

  2. (Deep Learning) Project Adam: Building an Efficient and Scalable Deep Learning Training System. Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman, Microsoft Research, OSDI'14

Scaling Distributed Machine Learning with the Parameter Server. Mu Li, Carnegie Mellon University and Baidu; David G. Andersen and Jun Woo Park, Carnegie Mellon University; Alexander J. Smola, Carnegie Mellon University and Google, Inc.; Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su, Google, Inc., OSDI'14

  1. GraphX: Graph Processing in a Distributed Dataflow Framework. Joseph E. Gonzalez, University of California, Berkeley; Reynold S. Xin, University of California, Berkeley, and Databricks; Ankur Dave, Daniel Crankshaw, and Michael J. Franklin, University of California, Berkeley; Ion Stoica, University of California, Berkeley, and Databricks, OSDI'14

  2. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica. SOSP'13

Exploiting Bounded Staleness to Speed Up Big Data Analytics. Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, and Gregory R. Ganger, Carnegie Mellon University; Phillip B. Gibbons, Intel Labs; Garth A. Gibson and Eric P. Xing, Carnegie Mellon University, ATC'14

TAO: Facebook’s Distributed Data Store for the Social Graph. Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani, Facebook, Inc., ATC'13

Exploiting Iterative-ness for Parallel ML Computations. Henggang Cui, Alexey Tumanov, Jinliang Wei, Lianghong Xu, Wei Dai, Jesse Haber-Kucharsky, Qirong Ho, Greg R. Ganger (Carnegie Mellon University); Phil B. Gibbons (Intel Labs); Garth A. Gibson, Eric P. Xing (Carnegie Mellon University), SoCC'14

Distributed GraphLab: A Framework for Machine Learning in the Cloud. Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein, VLDB 2012, pages 716-727

  1. (classic, basic) MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI'04
  2. (classic) The Hadoop Distributed File System, MSST'10

(counts as two papers)
Top 10 algorithms in data mining. Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg, Knowl. Inf. Syst. (2008) 14:1–37 DOI 10.1007/s10115-007-0114-2 (37 pages)

  3. Trends in big data analytics. Karthik Kambatla, Giorgos Kollias, Vipin Kumar, Ananth Grama, Journal of Parallel and Distributed Computing, Volume 74, Issue 7, July 2014, Pages 2561–2573.

Network Properties Revealed through Matrix Functions. Ernesto Estrada and Desmond J. Higham, SIAM Rev., 52(4), 696–714. (19 pages)

(counts as two papers)
Randomized algorithms for matrices and data. Michael W. Mahoney (54 pages)

(counts as two papers)
The Structure and Function of Complex Networks. M. E. J. Newman, SIAM Rev., 45(2), 167–256. (90 pages)