CSCI 780: Big Data
Instructors: Smirni, Stathopoulos, and Li
Time/Place: M-S 002, W: 3:30-6:20 (includes a 5-minute break)
Office Hours: By appointment
Synopsis/Purpose
"Big Data" has become a bandwagon term that has attracted the
attention of a wide variety of practitioners from science,
engineering, business, medicine, and even politics and law.
Each area has its own interpretation of the term, with "Big"
seemingly the only common denominator. Even within Computer
Science, Big Data overlaps with network security, machine
learning, data mining, distributed or Cloud computing, high
performance computing, and many other areas. I urge everyone
to browse the www.datanami.com website for a host of news,
information, and articles.
The aim of this class is to introduce some of the challenges
and techniques that fall under the general umbrella of Big
Data. The class is structured in a seminar format. A reading
list of papers on various Big Data topics has been assembled,
ranging from data mining and learning to the systems paradigms
that can support them. A companion undergraduate textbook,
"Mining of Massive Datasets" by Jure Leskovec, Anand
Rajaraman, and Jeffrey D. Ullman, is available at
http://infolab.stanford.edu/~ullman/mmds/book.pdf.
The book will help cover background material on topics that
the papers in the reading list cover in depth.
Each student will be responsible for two in-class
presentations from the reading list. The first presentations
will start on January 28, 2015. Papers will be assigned to
students on a first-come, first-served basis. A few of the
papers on the reading list may count as more than one
presentation; others will have to be bundled together into a
single presentation. The student in charge of a presentation
must deliver a high-quality 45-minute presentation, followed
by 20-25 minutes of discussion. Class participation in this
discussion will be evaluated on a weekly basis and constitutes
a significant portion of the grade. It goes without saying
that every student must read every paper that will be
presented.
At the start of every class period, a 10-15 minute quiz will
be given on the papers presented and discussed in the previous
week. Occasionally (depending on the type of papers), the quiz
may be open notes.
Final grades will be computed as follows:
Class Presentations 45%
Participation 15%
Weekly Quizzes 40%
Extra Credit: Project 20%
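As a rough illustration of the arithmetic behind the weights above, the sketch below combines component scores into a final percentage. The weights come from the table; the function name, the 0-100 score scale, and the rule that the extra-credit project is simply added on top of the weighted total (with no cap) are assumptions made for this example, not stated course policy.

```python
# Weights (in percent) taken from the grading table above.
WEIGHTS = {"presentations": 45, "participation": 15, "quizzes": 40}

# ASSUMPTION: the optional project adds up to 20 points on top of the
# 100-point total; the syllabus does not say how extra credit is applied.
EXTRA_CREDIT_WEIGHT = 20


def final_grade(scores, project=0):
    """Return a final percentage.

    scores:  dict mapping each component name to a score in [0, 100].
    project: extra-credit project score in [0, 100] (0 if not attempted).
    """
    points = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    points += EXTRA_CREDIT_WEIGHT * project
    return points / 100


example = {"presentations": 90, "participation": 80, "quizzes": 70}
print(final_grade(example))              # 80.5
print(final_grade(example, project=50))  # 90.5
```

Under these assumptions, a student with 90, 80, and 70 in the three components earns 80.5 without the project, and 90.5 with a project scored at 50.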
Reading List
- 1.
- Privacy Tradeoffs in Predictive Analytics
Stratis Ioannidis (Technicolor); Andrea Montanari (Stanford);
Udi Weinsberg (Technicolor); Smriti Bhagat (Technicolor); Nadia
Fawaz (Technicolor); Nina Taft (Technicolor), Sigmetrics 2014
- 2.
- Online Algorithms for Joint Application-VM-Physical-Machine
Auto-Scaling in a Cloud
Yang Guo (Bell Labs, Alcatel-Lucent); Alexander L. Stolyar (Bell
Labs, Alcatel-Lucent); Anwar Walid (Bell Labs, Alcatel-Lucent),
Sigmetrics 2014
- 3.
- Beyond Random Walk and Metropolis-Hastings Samplers: Why You
Should Not Backtrack for Unbiased Graph Sampling
Chul-Ho Lee (North Carolina State University), Xin Xu (North
Carolina State University) and Do Young Eun (North Carolina
State University), Sigmetrics 2012
- 4.
- Scalable Big Graph Processing in MapReduce
Lu Qin, University of Technology, Sydney; Jeffrey Xu Yu, The
Chinese University of Hong Kong; Lijun Chang, The University of
New South Wales; Hong Cheng, The Chinese University of Hong
Kong; Chengqi Zhang, University of Technology, Sydney; Xuemin
Lin, The University of New South Wales, Sigmod'14
- 5.
- Stratified-Sampling over Social Networks Using MapReduce
Roy Levin, IBM Haifa Research Lab; Yaron Kanza, Jacobs
Technion-Cornell Innovation Institute, Cornell Tech, Sigmod'14
- 6.
- Navigating the Maze of Graph Analytics Frameworks using
Massive Graph Datasets
Nadathur Satish, Intel; Narayanan Sundaram, Intel; Mostofa
Patwary, Intel; Jiwon Seo, Stanford University; Jongsoo Park,
Intel; Muhammad Hassaan, University of Texas, Austin; Shubho
Sengupta, Intel; Zhaoming Yin, Georgia Tech; Pradeep Dubey,
Intel, Sigmod'14
- 7.
- Apollo: Scalable and Coordinated Scheduling for Cloud-Scale
Computing
Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, and Jingren
Zhou, Microsoft; Zhengping Qian, Ming Wu, and Lidong Zhou,
Microsoft Research, OSDI'14
- 8.
- The Power of Choice in Data-Aware Cluster Scheduling
Shivaram Venkataraman and Aurojit Panda, University of
California, Berkeley; Ganesh Ananthanarayanan, Microsoft
Research; Michael J. Franklin and Ion Stoica, University of
California, Berkeley, OSDI'14
- 9.
- (bundle)
- (deep learning) Large Scale Distributed Deep Networks
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen,
Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato,
Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng,
NIPS 2012: Neural Information Processing Systems
- (deep learning) Project Adam: Building an Efficient and
Scalable Deep Learning Training System
Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and
Karthik Kalyanaraman, Microsoft Research, OSDI'14
- 10.
- Scaling Distributed Machine Learning with the Parameter Server
Mu Li, Carnegie Mellon University and Baidu; David G. Andersen
and Jun Woo Park, Carnegie Mellon University; Alexander J.
Smola, Carnegie Mellon University and Google, Inc.; Amr Ahmed,
Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing
Su, Google, Inc., OSDI'14
- 11.
- (bundle)
- GraphX: Graph Processing in a Distributed Dataflow
Framework
Joseph E. Gonzalez, University of California, Berkeley;
Reynold S. Xin, University of California, Berkeley, and
Databricks; Ankur Dave, Daniel Crankshaw, and Michael J.
Franklin, University of California, Berkeley; Ion Stoica,
University of California, Berkeley, and Databricks, OSDI'14
- Discretized Streams: Fault-Tolerant Streaming Computation
at Scale. Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy
Hunter, Scott Shenker, Ion Stoica. SOSP'13
http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf
- 12.
- Exploiting Bounded Staleness to Speed Up Big Data Analytics
Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee,
Abhimanu Kumar, Jinliang Wei, Wei Dai, and Gregory R. Ganger,
Carnegie Mellon University; Phillip B. Gibbons, Intel Labs;
Garth A. Gibson and Eric P. Xing, Carnegie Mellon University,
ATC'14
https://www.usenix.org/system/files/conference/atc14/atc14-paper-cui.pdf
- 13.
- TAO: Facebook’s Distributed Data Store for the Social Graph
Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka,
Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin
Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar,
Yee Jiun Song, and Venkat Venkataramani, Facebook, Inc., ATC'13
https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson
- 14.
- Exploiting Iterative-ness for Parallel ML Computations
Henggang Cui, Alexey Tumanov, Jinliang Wei, Lianghong Xu, Wei
Dai, Jesse Haber-Kucharsky, Qirong Ho, Greg R. Ganger (Carnegie
Mellon University); Phil B. Gibbons (Intel Labs); Garth A.
Gibson, Eric P. Xing (Carnegie Mellon University), SoCC'14
- 15.
- Distributed GraphLab: A Framework for Machine Learning in the
Cloud
Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson,
Carlos Guestrin, Joseph M. Hellerstein, VLDB 2012, pages 716-727
- 16.
- (bundle)
- (classic, basic) MapReduce: Simplified Data Processing on
Large Clusters
Jeffrey Dean and Sanjay Ghemawat, OSDI'04
- (classic) The Hadoop Distributed File System, MSST'10
- 17-18.
- (counts as two papers)
Top 10 algorithms in data mining.
Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang
Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu,
Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand,
Dan Steinberg, Knowl. Inf. Syst. (2008) 14:1–37 DOI
10.1007/s10115-007-0114-2 (37 pages)
http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf
- 19.
- (bundle)
- https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
- https://hbr.org/2013/01/why-it-fumbles-analytics
- Trends in big data analytics. Karthik Kambatla, Giorgos
Kollias, Vipin Kumar, Ananth Grama, Journal of Parallel and
Distributed Computing, Volume 74, Issue 7, July 2014,
Pages 2561–2573.
http://www.sciencedirect.com/science/article/pii/S0743731514000057
- 20.
- Network Properties Revealed through Matrix Functions.
Ernesto Estrada and Desmond J. Higham,
SIAM Rev., 52(4), 696–714. (19 pages)
http://epubs.siam.org/doi/abs/10.1137/090761070
- 21-22.
- (counts as two papers)
Randomized algorithms for matrices and data. Michael W. Mahoney
http://arxiv.org/abs/1104.5557
(54 pages)
- 23-24.
- (counts as two papers)
The Structure and Function of Complex Networks. M. E. J. Newman,
SIAM Rev., 45(2), 167–256. (90 pages)
http://arxiv.org/pdf/cond-mat/0303516.pdf