Preface xvii
1 Introduction: Becoming a Unicorn 1
1.1 Aren't Data Scientists Just Overpaid Statisticians? 2
1.2 How is This Book Organized? 3
1.3 How to Use This Book? 3
1.4 Why is It All in Python, Anyway? 4
1.5 Example Code and Datasets 4
1.6 Parting Words 5
Part I The Stuff You'll Always Use 7
2 The Data Science Road Map 9
2.1 Frame the Problem 10
2.2 Understand the Data: Basic Questions 11
2.3 Understand the Data: Data Wrangling 12
2.4 Understand the Data: Exploratory Analysis 13
2.5 Extract Features 14
2.6 Model 15
2.7 Present Results 15
2.8 Deploy Code 16
2.9 Iterating 16
2.10 Glossary 17
3 Programming Languages 19
3.1 Why Use a Programming Language? What are the Other Options? 19
3.2 A Survey of Programming Languages for Data Science 20
3.2.1 Python 20
3.2.2 R 21
3.2.3 MATLAB® and Octave 21
3.2.4 SAS® 21
3.2.5 Scala® 22
3.3 Python Crash Course 22
3.3.1 A Note on Versions 22
3.3.2 Hello World Script 23
3.3.3 More Complicated Script 23
3.3.4 Atomic Data Types 26
3.4 Strings 27
3.4.1 Comments and Docstrings 28
3.4.2 Complex Data Types 29
3.4.3 Lists 29
3.4.4 Strings and Lists 30
3.4.5 Tuples 31
3.4.6 Dictionaries 31
3.4.7 Sets 32
3.5 Defining Functions 32
3.5.1 For Loops and Control Structures 33
3.5.2 A Few Key Functions 34
3.5.3 Exception Handling 35
3.5.4 Libraries 35
3.5.5 Classes and Objects 35
3.5.6 GOTCHA: Hashable and Unhashable Types 36
3.6 Python's Technical Libraries 37
3.6.1 Data Frames 38
3.6.2 Series 39
3.6.3 Joining and Grouping 40
3.7 Other Python Resources 42
3.8 Further Reading 42
3.9 Glossary 43
3a Interlude: My Personal Toolkit 45
4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning 47
4.1 The Worst Dataset in the World 48
4.2 How to Identify Pathologies 48
4.3 Problems with Data Content 49
4.3.1 Duplicate Entries 49
4.3.2 Multiple Entries for a Single Entity 49
4.3.3 Missing Entries 49
4.3.4 NULLs 50
4.3.5 Huge Outliers 50
4.3.6 Out-of-Date Data 50
4.3.7 Artificial Entries 50
4.3.8 Irregular Spacings 51
4.4 Formatting Issues 51
4.4.1 Formatting is Irregular between Different Tables/Columns 51
4.4.2 Extra Whitespace 51
4.4.3 Irregular Capitalization 52
4.4.4 Inconsistent Delimiters 52
4.4.5 Irregular NULL Format 52
4.4.6 Invalid Characters 52
4.4.7 Weird or Incompatible Datetimes 52
4.4.8 Operating System Incompatibilities 53
4.4.9 Wrong Software Versions 53
4.5 Example Formatting Script 54
4.6 Regular Expressions 55
4.6.1 Regular Expression Syntax 56
4.7 Life in the Trenches 60
4.8 Glossary 60
5 Visualizations and Simple Metrics 61
5.1 A Note on Python's Visualization Tools 62
5.2 Example Code 62
5.3 Pie Charts 63
5.4 Bar Charts 65
5.5 Histograms 66
5.6 Means, Standard Deviations, Medians, and Quantiles 69
5.7 Boxplots 70
5.8 Scatterplots 72
5.9 Scatterplots with Logarithmic Axes 74
5.10 Scatter Matrices 76
5.11 Heatmaps 77
5.12 Correlations 78
5.13 Anscombe's Quartet and the Limits of Numbers 80
5.14 Time Series 81
5.15 Further Reading 85
5.16 Glossary 85
6 Machine Learning Overview 87
6.1 Historical Context 88
6.2 Supervised versus Unsupervised 89
6.3 Training Data, Testing Data, and the Great Boogeyman of Overfitting 89
6.4 Further Reading 91
6.5 Glossary 91
7 Interlude: Feature Extraction Ideas 93
7.1 Standard Features 93
7.2 Features That Involve Grouping 94
7.3 Preview of More Sophisticated Features 95
7.4 Defining the Feature You Want to Predict 95
8 Machine Learning Classification 97
8.1 What is a Classifier, and What Can You Do with It? 97
8.2 A Few Practical Concerns 98
8.3 Binary versus Multiclass 99
8.4 Example Script 99
8.5 Specific Classifiers 101
8.5.1 Decision Trees 101
8.5.2 Random Forests 103
8.5.3 Ensemble Classifiers 104
8.5.4 Support Vector Machines 105
8.5.5 Logistic Regression 108
8.5.6 Lasso Regression 110
8.5.7 Naive Bayes 110
8.5.8 Neural Nets 112
8.6 Evaluating Classifiers 114
8.6.1 Confusion Matrices 114
8.6.2 ROC Curves 115
8.6.3 Area under the ROC Curve 116
8.7 Selecting Classification Cutoffs 117
8.7.1 Other Performance Metrics 118
8.7.2 Lift-Reach Curves 118
8.8 Further Reading 119
8.9 Glossary 119
9 Technical Communication and Documentation 121
9.1 Several Guiding Principles 122
9.1.1 Know Your Audience 122
9.1.2 Show Why It Matters 122
9.1.3 Make It Concrete 123
9.1.4 A Picture is Worth a Thousand Words 123
9.1.5 Don't Be Arrogant about Your Tech Knowledge 124
9.1.6 Make It Look Decent 124
9.2 Slide Decks 124
9.2.1 C.R.A.P. Design 125
9.2.2 A Few Tips and Rules of Thumb 127
9.3 Written Reports 128
9.4 Speaking: What Has Worked for Me 130
9.5 Code Documentation 131
9.6 Further Reading 132
9.7 Glossary 132
Part II Stuff You Still Need to Know 133
10 Unsupervised Learning: Clustering and Dimensionality Reduction 135
10.1 The Curse of Dimensionality 136
10.2 Example: Eigenfaces for Dimensionality Reduction 138
10.3 Principal Component Analysis and Factor Analysis 140
10.4 Scree Plots and Understanding Dimensionality 142
10.5 Factor Analysis 143
10.6 Limitations of PCA 143
10.7 Clustering 144
10.7.1 Real-World Assessment of Clusters 144
10.7.2 k-Means Clustering 145
10.7.3 Gaussian Mixture Models 146
10.7.4 Agglomerative Clustering 147
10.7.5 Evaluating Cluster Quality 148
10.7.6 Silhouette Score 148
10.7.7 Rand Index and Adjusted Rand Index 149
10.7.8 Mutual Information 150
10.8 Further Reading 151
10.9 Glossary 151
11 Regression 153
11.1 Example: Predicting Diabetes Progression 153
11.2 Least Squares 156
11.3 Fitting Nonlinear Curves 157
11.4 Goodness of Fit: R² and Correlation 159
11.5 Correlation of Residuals 160
11.6 Linear Regression 161
11.7 LASSO Regression and Feature Selection 162
11.8 Further Reading 164
11.9 Glossary 164
12 Data Encodings and File Formats 165
12.1 Typical File Format Categories 165
12.1.1 Text Files 166
12.1.2 Dense Numerical Arrays 166
12.1.3 Program-Specific Data Formats 166
12.1.4 Compressed or Archived Data 166
12.2 CSV Files 167
12.3 JSON Files 168
12.4 XML Files 170
12.5 HTML Files 172
12.6 Tar Files 174
12.7 GZip Files 175
12.8 Zip Files 175
12.9 Image Files: Rasterized, Vectorized, and/or Compressed 176
12.10 It's All Bytes at the End of the Day 177
12.11 Integers 178
12.12 Floats 179
12.13 Text Data 180
12.14 Further Reading 183
12.15 Glossary 183
13 Big Data 185
13.1 What is Big Data? 185
13.2 Hadoop: The File System and the Processor 187
13.3 Using HDFS 188
13.4 Example PySpark Script 189
13.5 Spark Overview 190
13.6 Spark Operations 192
13.7 Two Ways to Run PySpark 193
13.8 Configuring Spark 194
13.9 Under the Hood 195
13.10 Spark Tips and Gotchas 196
13.11 The MapReduce Paradigm 197
13.12 Performance Considerations 199
13.13 Further Reading 200
13.14 Glossary 200
14 Databases 203
14.1 Relational Databases and MySQL® 204
14.1.1 Basic Queries and Grouping 204
14.1.2 Joins 207
14.1.3 Nesting Queries 208
14.1.4 Running MySQL and Managing the DB 209
14.2 Key-Value Stores 210
14.3 Wide Column Stores 211
14.4 Document Stores 211
14.4.1 MongoDB® 212
14.5 Further Reading 214
14.6 Glossary 214
15 Software Engineering Best Practices 217
15.1 Coding Style 217
15.2 Version Control and Git for Data Scientists 220
15.3 Testing Code 222
15.3.1 Unit Tests 223
15.3.2 Integration Tests 224
15.4 Test-Driven Development 225
15.5 AGILE Methodology 225
15.6 Further Reading 226
15.7 Glossary 226
16 Natural Language Processing 229
16.1 Do I Even Need NLP? 229
16.2 The Great Divide: Language versus Statistics 230
16.3 Example: Sentiment Analysis on Stock Market Articles 230
16.4 Software and Datasets 232
16.5 Tokenization 233
16.6 Central Concept: Bag-of-Words 233
16.7 Word Weighting: TF-IDF 235
16.8 n-Grams 235
16.9 Stop Words 236
16.10 Lemmatization and Stemming 236
16.11 Synonyms 237
16.12 Part of Speech Tagging 237
16.13 Common Problems 238
16.13.1 Search 238
16.13.2 Sentiment Analysis 239
16.13.3 Entity Recognition and Topic Modeling 240
16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding 240
16.15 Further Reading 241
16.16 Glossary 242
17 Time Series Analysis 243
17.1 Example: Predicting Wikipedia Page Views 244
17.2 A Typical Workflow 247
17.3 Time Series versus Time-Stamped Events 248
17.4 Resampling and Interpolation 249
17.5 Smoothing Signals 251
17.6 Logarithms and Other Transformations 252
17.7 Trends and Periodicity 252
17.8 Windowing 253
17.9 Brainstorming Simple Features 254
17.10 Better Features: Time Series as Vectors 255
17.11 Fourier Analysis: Sometimes a Magic Bullet 256
17.12 Time Series in Context: The Whole Suite of Features 259
17.13 Further Reading 259
17.14 Glossary 260
18 Probability 261
18.1 Flipping Coins: Bernoulli Random Variables 261
18.2 Throwing Darts: Uniform Random Variables 263
18.3 The Uniform Distribution and Pseudorandom Numbers 263
18.4 Nondiscrete, Noncontinuous Random Variables 265
18.5 Notation, Expectations, and Standard Deviation 267
18.6 Dependence, Marginal and Conditional Probability 268
18.7 Understanding the Tails 269
18.8 Binomial Distribution 271
18.9 Poisson Distribution 272
18.10 Normal Distribution 272
18.11 Multivariate Gaussian 273
18.12 Exponential Distribution 274
18.13 Log-Normal Distribution 276
18.14 Entropy 277
18.15 Further Reading 279
18.16 Glossary 279
19 Statistics 281
19.1 Statistics in Perspective 281
19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies 282
19.3 Hypothesis Testing: Key Idea and Example 283
19.4 Multiple Hypothesis Testing 285
19.5 Parameter Estimation 286
19.6 Hypothesis Testing: t-Test 287
19.7 Confidence Intervals 290
19.8 Bayesian Statistics 291
19.9 Naive Bayesian Statistics 293
19.10 Bayesian Networks 293
19.11 Choosing Priors: Maximum Entropy or Domain Knowledge 294
19.12 Further Reading 295
19.13 Glossary 295
20 Programming Language Concepts 297
20.1 Programming Paradigms 297
20.1.1 Imperative 298
20.1.2 Functional 298
20.1.3 Object-Oriented 301
20.2 Compilation and Interpretation 305
20.3 Type Systems 307
20.3.1 Static versus Dynamic Typing 308
20.3.2 Strong versus Weak Typing 308
20.4 Further Reading 309
20.5 Glossary 309
21 Performance and Computer Memory 311
21.1 Example Script 311
21.2 Algorithm Performance and Big-O Notation 314
21.3 Some Classic Problems: Sorting a List and Binary Search 315
21.4 Amortized Performance and Average Performance 318
21.5 Two Principles: Reducing Overhead and Managing Memory 320
21.6 Performance Tip: Use Numerical Libraries When Applicable 322
21.7 Performance Tip: Delete Large Structures You Don't Need 323
21.8 Performance Tip: Use Built-In Functions When Possible 324
21.9 Performance Tip: Avoid Superfluous Function Calls 324
21.10 Performance Tip: Avoid Creating Large New Objects 325
21.11 Further Reading 325
21.12 Glossary 325
Part III Specialized or Advanced Topics 327
22 Computer Memory and Data Structures 329
22.1 Virtual Memory, the Stack, and the Heap 329
22.2 Example C Program 330
22.3 Data Types and Arrays in Memory 330
22.4 Structs 332
22.5 Pointers, the Stack, and the Heap 333
22.6 Key Data Structures 337
22.6.1 Strings 337
22.6.2 Adjustable-Size Arrays 338
22.6.3 Hash Tables 339
22.6.4 Linked Lists 340
22.6.5 Binary Search Trees 342
22.7 Further Reading 343
22.8 Glossary 343
23 Maximum Likelihood Estimation and Optimization 345
23.1 Maximum Likelihood Estimation 345
23.2 A Simple Example: Fitting a Line 346
23.3 Another Example: Logistic Regression 348
23.4 Optimization 348
23.5 Gradient Descent and Convex Optimization 350
23.6 Convex Optimization 353
23.7 Stochastic Gradient Descent 355
23.8 Further Reading 355
23.9 Glossary 356
24 Advanced Classifiers 357
24.1 A Note on Libraries 358
24.2 Basic Deep Learning 358
24.3 Convolutional Neural Networks 361
24.4 Different Types of Layers. What the Heck is a Tensor? 362
24.5 Example: The MNIST Handwriting Dataset 363
24.6 Recurrent Neural Networks 366
24.7 Bayesian Networks 367
24.8 Training and Prediction 369
24.9 Markov Chain Monte Carlo 369
24.10 PyMC Example 370
24.11 Further Reading 373
24.12 Glossary 373
25 Stochastic Modeling 375
25.1 Markov Chains 375
25.2 Two Kinds of Markov Chain, Two Kinds of Questions 377
25.3 Markov Chain Monte Carlo 379
25.4 Hidden Markov Models and the Viterbi Algorithm 380
25.5 The Viterbi Algorithm 382
25.6 Random Walks 384
25.7 Brownian Motion 384
25.8 ARIMA Models 385
25.9 Continuous-Time Markov Processes 386
25.10 Poisson Processes 387
25.11 Further Reading 388
25.12 Glossary 388
25a Parting Words: Your Future as a Data Scientist 391
Index 393