Data Science Terms Every Data Scientist Should Know
A
1. A/B Testing: A statistical method used to compare two versions of a product, webpage, or model to determine which performs better.
2. Accuracy: The proportion of predictions a classification model gets right among all instances it evaluates.
3. AdaBoost: An ensemble learning algorithm that combines many weak classifiers to create a single strong classifier.
4. Algorithm: A step-by-step set of instructions or rules followed by a computer to solve a problem or perform a task.
5. Analytics: The process of examining and interpreting data to extract meaningful insights.
6. Anomaly Detection: Identifying unusual patterns or outliers in data.
7. ANOVA (Analysis of Variance): A statistical method used to analyze the differences among group means in a sample.
8. API (Application Programming Interface): A set of rules that allows one software application to interact with another.
9. AUC-ROC (Area Under the ROC Curve): A metric that summarizes how well a classification model distinguishes between classes across all possible decision thresholds.
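To make terms 2 and 9 concrete, here is a minimal sketch using scikit-learn's metrics; the labels and scores below are made-up toy values:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical ground-truth labels and model outputs
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard class predictions
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities

print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
print(roc_auc_score(y_true, y_score))   # threshold-independent ranking quality
```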
B
10. Batch Gradient
Descent: An optimization algorithm that updates model parameters using
the entire training dataset (different from mini-batch gradient descent)
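A minimal NumPy sketch of batch gradient descent for simple linear regression; the learning rate, iteration count, and toy data are arbitrary illustrative choices:

```python
import numpy as np

# Toy data: y is roughly 3x + 2 plus noise (hypothetical)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
y = 3 * X + 2 + rng.normal(0, 0.1, size=100)

w, b = 0.0, 0.0   # parameters to learn
lr = 0.1          # learning rate

for _ in range(1000):
    error = w * X + b - y
    # Gradients of mean squared error, computed over the FULL dataset
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should approach 3 and 2
```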
11. Bayesian Statistics: A statistical approach that combines prior knowledge with observed data.
12. BI (Business Intelligence): Technologies, processes, and tools that help organizations make informed business decisions.
13. Bias: An error in a model that causes it to consistently predict values away from the true values.
14. Bias-Variance Tradeoff: The balance between the error introduced by bias and the error introduced by variance in a model.
15. Big Data: Large and complex datasets that cannot be easily processed using traditional data processing methods.
16. Binary Classification: Categorizing data into two groups, such as spam or not spam.
17. Bootstrap Sampling: A resampling technique in which random samples are drawn with replacement from a dataset.
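A quick sketch of bootstrap sampling (term 17), estimating the uncertainty of a sample mean; the data and the number of resamples are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=200)  # hypothetical sample

# Draw 10,000 resamples WITH replacement and record each resample's mean
boot_means = [
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
]

# 95% percentile interval for the mean
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.2f}, 95% CI ~ ({low:.2f}, {high:.2f})")
```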
C
18. Categorical Data: Variables that represent categories or groups and can take on a limited, fixed number of distinct values.
19. Chi-Square Test: A statistical test used to determine whether there is a significant association between two categorical variables.
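As an illustration of term 19, SciPy's chi2_contingency runs the test directly on a contingency table; the counts below are invented:

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = group, columns = preference
table = [[30, 10],
         [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")  # small p suggests an association
```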
20. Classification: Categorizing data points into predefined classes or groups.
21. Clustering: Grouping similar data points together based on certain criteria.
22. Confidence Interval: A range of values used to estimate the true value of a parameter with a certain level of confidence.
23. Confusion Matrix: A table used to evaluate the performance of a classification algorithm.
24. Correlation: A statistical measure that describes the degree of association between two variables.
25. Covariance: A measure of how much two random variables change together.
26. Cross-Entropy Loss: A loss function commonly used in classification problems.
27. Cross-Validation: A technique to assess the performance of a model by splitting the data into multiple subsets for training and testing.
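A minimal cross-validation sketch (term 27) with scikit-learn; the model choice and the use of five folds are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train and evaluate on 5 different train/test splits of the same data
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```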
D
28. Data Cleaning: The process of identifying and correcting errors or inconsistencies in datasets.
29. Data Mining: Extracting valuable patterns or information from large datasets.
30. Data Preprocessing: Cleaning and transforming raw data into a format suitable for analysis.
31. Data Visualization: Presenting data in graphical or visual formats to aid understanding.
32. Decision Boundary: The dividing line that separates different classes in a classification problem.
33. Decision Tree: A tree-like model that makes decisions based on a set of rules.
34. Dimensionality Reduction: Reducing the number of features in a dataset while retaining important information.
E
35. Eigenvalue and Eigenvector: Concepts from linear algebra, often employed in dimensionality reduction to transform and simplify complex datasets.
36. Elastic Net: A regularization technique that combines L1 and L2 penalties.
37. Ensemble Learning: Combining multiple models to improve overall performance and accuracy.
38. Exploratory Data Analysis (EDA): Analyzing and visualizing data to understand its characteristics and relationships.
F
39. F1 Score: The harmonic mean of precision and recall, combining both into a single metric for classification models.
40. False Positive and False Negative: Incorrect predictions in binary classification: a false positive predicts the positive class when the true class is negative, and a false negative does the reverse.
41. Feature: A data column used as input for machine learning models to make predictions.
42. Feature Engineering: Creating new features from existing ones to improve model performance.
43. Feature Extraction: Deriving a smaller set of informative features from raw data, typically by transforming the original features rather than simply selecting among them.
44. Feature Importance: Assessing the contribution of each feature to the model's predictions.
45. Feature Selection: Choosing the most relevant features for a model.
G
46. Gaussian Distribution: The bell-shaped probability distribution, also known as the normal distribution, widely used in statistical modeling.
47. Geospatial Analysis: Analyzing and interpreting patterns and relationships within geographic data.
48. Gradient Boosting: An ensemble learning technique in which weak models are trained sequentially, each correcting the errors of the previous one.
49. Gradient Descent: An optimization algorithm used to minimize the error of a model by iteratively adjusting its parameters.
50. Grid Search: A method for tuning hyperparameters by evaluating models at all possible combinations of candidate values.
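A small grid search sketch (term 50) with scikit-learn's GridSearchCV; the parameter grid is an arbitrary example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination of C and kernel below is cross-validated
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```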
H
51. Heteroscedasticity: Unequal variability of errors in a regression model.
52. Hierarchical Clustering: A method of cluster analysis that organizes data into a tree-like structure of clusters, where each level of the tree shows the relationships and similarities between different groups of data points.
53. Hyperparameter: A parameter whose value is set before the training process begins.
54. Hypothesis Testing: A statistical method to test a hypothesis about a population parameter based on sample data.
I
55. Imputation: Filling in missing values in a dataset using various techniques.
56. Inferential Statistics: A branch of statistics that involves making inferences about a population based on a sample of data.
57. Information Gain: A measure used in decision trees to assess the effectiveness of a feature in classifying data.
58. Interquartile Range (IQR): A measure of statistical dispersion, representing the range between the first and third quartiles.
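A short sketch computing the IQR (term 58) and applying the common 1.5 × IQR rule of thumb to flag outliers; the data are invented:

```python
import numpy as np

data = np.array([12, 14, 14, 15, 16, 17, 18, 19, 20, 95])  # 95 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # -> [95]
```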
J
59. Joint Plot: A type of data visualization in Seaborn used for exploring relationships between two variables and their individual distributions.
60. Joint Probability: The probability of two or more events happening at the same time, often used in statistical analysis.
61. Jupyter Notebook: An open-source web application for creating and sharing documents containing live code, equations, visualizations, and narrative text.
K
62. K-Means Clustering: A popular algorithm for partitioning a dataset into distinct, non-overlapping subsets.
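A minimal K-Means sketch (term 62) with scikit-learn; the synthetic blob data and the choice of three clusters are arbitrary:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups (hypothetical)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster assignment for the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the 3 learned centroids
```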
63. K-Nearest Neighbors (KNN): A simple and widely used classification algorithm based on how close a new data point is to other data points.
L
64. L1 Regularization (Lasso): Adding the absolute values of coefficients as a penalty term to the loss function.
65. L2 Regularization (Ridge): Adding the squared values of coefficients as a penalty term to the loss function.
66. Linear Regression: A statistical method for modeling the relationship between a dependent variable and one or more independent variables.
67. Log Likelihood: The logarithm of the likelihood function, often used in maximum likelihood estimation.
68. Logistic Function: A sigmoid (S-shaped) function used in logistic regression to model the probability of a binary outcome.
69. Logistic Regression: A statistical method for predicting the probability of a binary outcome.
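A compact sketch of terms 68 and 69: scikit-learn's LogisticRegression fits the model, and predict_proba exposes the probabilities produced by the logistic function; the data are a toy example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy binary classification data (hypothetical)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)

# Probabilities come from the logistic (sigmoid) function applied
# to a linear combination of the features
print(model.predict_proba(X[:3]))  # [[P(class 0), P(class 1)], ...]
print(model.predict(X[:3]))        # hard labels at the 0.5 threshold
```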
M
70. Machine Learning: A subset of artificial intelligence that enables systems to learn and make predictions from data.
71. Mean Absolute Error (MAE): A measure of the average absolute difference between predicted and actual values.
72. Mean Squared Error (MSE): A measure of the average squared difference between predicted and actual values.
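MAE, MSE, and RMSE (terms 71, 72, and 105) differ only in how they aggregate errors; a NumPy sketch with made-up values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # hypothetical predictions

errors = y_pred - y_true
mae = np.mean(np.abs(errors))   # average absolute difference
mse = np.mean(errors ** 2)      # average squared difference
rmse = np.sqrt(mse)             # back in the units of y

print(mae, mse, rmse)
```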
73. Mean: The average value of a set of numbers.
74. Median: The middle value in a set of sorted numbers.
75. Metrics: Criteria used to assess the performance of a machine learning model, such as accuracy, precision, recall, and F1 score.
76. Model Evaluation: Assessing the performance of a machine learning model using various metrics.
77. Multicollinearity: The presence of high correlation between independent variables in a regression model.
78. Multi-Label Classification: Assigning multiple labels to an input, as opposed to just one.
79. Multivariate Analysis: Analyzing data with multiple variables to understand relationships between them.
N
80. Naive Bayes: A probabilistic classification algorithm based on Bayes' theorem.
81. Normalization: Scaling numerical variables to a standard range.
82. Null Hypothesis: A statistical hypothesis that assumes there is no significant difference between observed and expected results.
O
83. One-Hot Encoding: A technique to convert categorical variables into a binary matrix for machine learning models.
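One-hot encoding (term 83) in one line with pandas; the column values are invented:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 indicator column
print(pd.get_dummies(df, columns=["color"], dtype=int))
```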
84. Ordinal Variable: A categorical variable with a meaningful order but not necessarily equal intervals.
85. Outlier: An observation that deviates significantly from other observations in a dataset.
86. Overfitting: A model that performs well on the training data but poorly on new, unseen data.
P
87. Pandas: The standard Python library for manipulating and analyzing structured data.
88. Pearson Correlation Coefficient: A measure of the linear relationship between two variables.
89. Poisson Distribution: A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space.
90. Precision: The ratio of true positive predictions to the total number of positive predictions made by a classification model.
91. Predictive Analytics: Using data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes.
92. Principal Component Analysis (PCA): A dimensionality reduction technique that transforms data onto a new set of axes, simplifying the information while preserving its most important patterns.
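A PCA sketch (terms 92 and 93) with scikit-learn, projecting the 4-feature iris data onto its two leading components; the choice of two components is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)          # 4 features -> 2 principal components

print(X_2d.shape)                    # (150, 2)
print(pca.explained_variance_ratio_) # share of variance each axis captures
```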
93. Principal Component: The axis that captures the most variance in a dataset in principal component analysis.
94. P-value: The probability, assuming the null hypothesis is true, of obtaining a result as extreme as, or more extreme than, the observed result.
Q
95. Q-Q Plot (Quantile-Quantile Plot): A graphical tool to assess whether a dataset follows a particular theoretical distribution.
96. Quantile: A cut point, or set of cut points, that divides a dataset into equal-sized parts.
R
97. Random Forest: An ensemble learning method that constructs a multitude of decision trees and merges their outputs for more accurate and stable predictions.
98. Random Sample: A sample where each member of the population has an equal chance of being selected.
99. Random Variable: A variable whose possible values are outcomes of a random phenomenon.
100. Recall: The ratio of true positive predictions to the total number of actual positive instances in a classification model.
101. Regression Analysis: A statistical method used for modeling the relationship between a dependent variable and one or more independent variables.
102. Regularization: Adding a penalty term to the cost function to prevent overfitting in machine learning models.
103. Resampling: Techniques such as bootstrapping or cross-validation used to assess the performance of a model.
104. ROC Curve (Receiver Operating Characteristic Curve): A graphical representation of the trade-off between true positive rate and false positive rate at different thresholds in a classification model.
105. Root Mean Square Error (RMSE): The square root of the mean squared error, expressing prediction error in the same units as the target variable.
106. R-squared: A statistical measure that represents the proportion of the variance in the dependent variable explained by the independent variables in a regression model.
S
107. Sampling Bias: A bias in the selection of participants or data points that may affect the generalizability of results.
108. Sampling: The process of selecting a subset of data points from a larger dataset.
109. Scalability: The ability of a system to handle increasing amounts of data or workload.
110. Sigmoid Function: An S-shaped function that maps any real number to a value between 0 and 1, commonly used in binary classification problems.
111. Silhouette Score: A metric used to evaluate the quality of a clustering result.
112. Singular Value Decomposition (SVD): A matrix factorization technique used in dimensionality reduction.
113. Spearman Rank Correlation: A non-parametric measure of correlation between two variables, based on their ranks.
114. Standard Deviation: A measure of the amount of variation or dispersion in a set of values.
115. Stationarity: A property of time series data where statistical properties remain constant over time.
116. Stratified Sampling: A sampling method that ensures proportional representation of subgroups within a population.
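Stratified sampling (term 116) in practice: scikit-learn's train_test_split can preserve class proportions via its stratify argument; the 80/20 split ratio is arbitrary:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps each class's share identical in train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(Counter(y_tr), Counter(y_te))  # same class proportions in both splits
```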
117. Supervised Learning: Learning from labeled data, where the algorithm is trained on a set of input-output pairs.
118. Support Vector Machine (SVM): A supervised machine learning algorithm used for classification and regression analysis.
T
119. t-Distribution: A probability distribution used in hypothesis testing when the sample size is small or the population standard deviation is unknown.
120. Time Series Analysis: Analyzing data collected over time to identify patterns and trends.
121. t-test: A statistical test used to determine if there is a significant difference between the means of two groups.
122. Two-Sample t-test: A statistical test used to compare the means of two independent samples.
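A two-sample t-test (terms 121 and 122) with SciPy; the two groups below are synthetic:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
group_a = rng.normal(loc=100, scale=15, size=50)  # hypothetical control group
group_b = rng.normal(loc=110, scale=15, size=50)  # hypothetical treatment group

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p suggests different means
```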
U
123. Underfitting: A model that is too simple to capture the underlying patterns in the data.
124. Univariate Analysis: Analyzing the variation of a single variable in the dataset.
125. Unsupervised Learning: Learning from unlabeled data, where the algorithm identifies patterns and relationships on its own.
V
126. Validation Set: A subset of data used to assess the performance of a model during training.
127. Variance: The degree of spread or dispersion of a set of values; also, the variability of a model's predictions.
X
128. XGBoost: An open-source library for gradient-boosted decision trees, designed for speed and performance.
Z
129. Zero-Shot Learning: A model's ability to perform a task, or recognize classes, for which it was never given explicit training examples.
130. Z-Score: A standardized score that represents the number of standard deviations a data point is from the mean.
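Finally, z-scores (term 130) computed by hand and with SciPy; the data are made up:

```python
import numpy as np
from scipy.stats import zscore

data = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# By hand: (x - mean) / standard deviation
manual = (data - data.mean()) / data.std()
print(manual)
print(zscore(data))  # same result
```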