List of topics for examination-based assessment
=================================================

You will be given 5 topics (at least one from each section) and will have 90 minutes to complete your assessment.

Data exploration
==================

1. Explain what numerical and categorical variables are.
2. Explain the difference between experimental studies and observational studies. What is a confounding variable? When can we draw conclusions about correlation, and when about causation?
3. What is the difference between a census and sampling? What are the sampling methods? What are the possible sources of sampling bias?
4. How do we define point estimates of sample statistics: measures of center and spread? What does "robust statistics" mean? Which of these measures have this property?
5. What is a box plot? Draw an example and explain the meaning of its contents. How do we deduce that a distribution has a tail, outliers, etc.?

Regression
==================

1. Draw the flow-chart diagram for making predictions using regression as an ML algorithm. Briefly explain each box on the flow-chart: ML model, ML algorithm, quality metric, feature extraction.
2. Write down the formula for the simple linear regression model. How can one interpret the coefficients? Write the formula defining the cost function using the RSS estimator.
3. Write down the sequence of steps of the iterative gradient descent algorithm for finding the minimum of the cost function for simple linear regression. When do we stop iterating, and how do we choose the step size?
4. How do we assess performance? Explain what "training error", "validation error", "generalization error" and "test error" are. What does "cross-validation" mean? Draw an illustrative plot of how these errors typically behave as the complexity of the regression model grows.
5. Explain the sources of prediction error: noise, bias, variance. Draw a simple illustration explaining them. What does the bias-variance trade-off mean?
6. What does "over-fitting" mean? Explain how we can mitigate it by adding an extra term to the cost function: "ridge regression" or "lasso regression". Write the formulas of the respective cost functions. Show an illustrative plot of how the coefficients w behave in each case.
7. Explain the procedure for selecting features for regression with the greedy algorithm.
8. What does "non-parametric regression" mean? Explain the concepts of 1-NN regression, k-NN regression, weighted regression and kernel regression.

Classification
==================

1. Explain the model of the logistic regression classifier. Write down the formulas for the linear score and the logistic link function. How is it extended to the multi-class classification problem?
2. We measure the performance of a classifier using "classification error", "classification accuracy" and the "confusion matrix". Explain what each of these means. What is the "class majority" problem?
3. Write the formula for the quality metric of the logistic classifier: the likelihood function. The best classifier is found using the MLE (maximum likelihood estimation) method and gradient ascent. Write down and explain the final formula of that algorithm. How do we choose the step size?
4. Classification with decision trees. How does one define the classification at the final leaves? How does one measure the quality of the predictions: error and accuracy? Explain a simple greedy algorithm for finding the best decision tree. How does one measure performance?
5. Greedy decision tree learning: what are the steps for building a tree? What are the stopping conditions for splitting in the decision tree model? What is the sign of over-fitting in decision trees, and how does one mitigate this effect: early stopping or pruning? Explain what these mean.
6. What are the strategies for handling missing data in decision trees?
7. The idea of ensemble classifiers and boosting. Explain the concept of weighted weak classifiers and weighted data. Write down the formula for the final model predictions.
8. AdaBoost algorithm: formulas, learning process.

Clustering & Retrieval
======================

1. Explain the TF-IDF representation of documents. Which metrics are most commonly used to search for k-NN documents?
2. What are KD-trees? How does one build and query a KD-tree? What is the complexity of querying, and how does it compare with the complexity of other queries: 1-NN, k-NN?
3. Explain the LSH (locality-sensitive hashing) method. Is it competitive with the KD-tree method?
4. Describe the steps of the k-means clustering algorithm. How do we measure its quality? Comment on its convergence.
5. Explain the probabilistic approach to clustering. The soft assignment can be optimised with the MLE (maximum likelihood estimation) approach. Explain what this means and give some formulas.
6. Explain the "bag-of-words" model for clustering documents.
7. LDA (latent Dirichlet allocation) method. Explain the concept.
8. Hierarchical clustering. Explain the algorithm and illustrate it with a dendrogram.
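
As a revision aid for the k-means topic in the clustering section above, the two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched as below. This is only a minimal illustration under stated assumptions, not a model answer: the function names `kmeans` and `squared_distance`, the random choice of initial centroids, the `seed` parameter and the iteration cap are all hypothetical choices made for the example. The returned cost is the sum of squared distances to the assigned centroids, one common way to measure clustering quality.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means sketch; returns (centroids, assignments, cost)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points picked at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop when the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 2: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Quality measure: sum of squared distances to the assigned centroids.
    cost = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))
    return centroids, assignments, cost


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, labels, cost = kmeans(data, k=2)
    print(centroids, labels, cost)
```

Neither step can increase the cost, which is why the procedure converges, although in general only to a local minimum that depends on the initial centroids.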