# Data mining

## Kaggle

During my 2015 summer internship at Creative Data, I worked on a Kaggle competition whose goal was to identify different hand movements from electroencephalogram (EEG) recordings. I wrote several Python scripts to tackle this problem.

### Algorithms

I first tried some classic machine learning algorithms, such as logistic regression, SVM and random forest. Then I tried neural networks, which performed much better on my validation set. I used a dense neural network made of two pairs of dense and dropout layers, and a convolutional neural network made of a convolution layer, a max-pooling layer and a dense layer. I used a weighted mean of the scores predicted by those two networks to compute my final result.
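The blending step can be sketched as follows. This is only a minimal illustration: the function name `blend_scores`, the weight value and the scores are made up for the example and are not the values used in the competition.

```python
def blend_scores(dense_scores, conv_scores, dense_weight=0.5):
    """Weighted mean of two models' predicted scores, sample by sample."""
    if len(dense_scores) != len(conv_scores):
        raise ValueError("both models must score the same samples")
    if not 0.0 <= dense_weight <= 1.0:
        raise ValueError("dense_weight must lie in [0, 1]")
    conv_weight = 1.0 - dense_weight
    return [dense_weight * d + conv_weight * c
            for d, c in zip(dense_scores, conv_scores)]


# Made-up scores for three samples, purely for illustration.
dense_scores = [0.2, 0.8, 0.6]
conv_scores = [0.4, 0.6, 1.0]
blended = blend_scores(dense_scores, conv_scores, dense_weight=0.25)
```

In practice the weight would typically be chosen on a validation set, for instance by trying a few values and keeping the one with the best validation score.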

### Pre-processing

Seeing that one class was over-represented, I tried re-balancing the classes by keeping only 20% of the data from the largest class. This hurt the results because too much data was discarded, so I did not use it in my final solution. I also applied a band-pass filter, as well as a Common Spatial Pattern algorithm, which I used to create new variables. I also tried reducing the number of features, but this resulted in a greater test error.
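The re-balancing experiment amounts to random under-sampling of the majority class. A minimal sketch, assuming the data fits in plain Python lists; the function name `undersample_majority` and the toy data are hypothetical:

```python
import random


def undersample_majority(samples, labels, majority_label, keep_fraction=0.2, seed=0):
    """Keep a random fraction of the majority class and all other samples."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    n_keep = int(round(len(majority_idx) * keep_fraction))
    kept_majority = set(rng.sample(majority_idx, n_keep))
    # Preserve the original ordering of the samples that survive.
    kept = [i for i, y in enumerate(labels)
            if y != majority_label or i in kept_majority]
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data: 100 samples of class 0 (majority) and 10 of class 1.
labels = [0] * 100 + [1] * 10
samples = list(range(110))
sub_samples, sub_labels = undersample_majority(samples, labels, majority_label=0)
```

The sketch also makes the drawback visible: 80% of the majority-class samples are simply thrown away.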

## Recognition of 3D point clouds

I worked on a school project whose goal was to implement solutions from the paper "Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes", by Andrew E. Johnson and Martial Hebert. The aim was to identify 3D point clouds representing objects on the road (pedestrian, car, etc.).

### Algorithms

We first practised on only two classes, using a Gaussian SVM and a linear SVM. We decided to keep the linear SVM for the multi-class algorithm. We used a one-versus-all SVM and a one-versus-one SVM. For the latter, we tried two kinds of parameter validation: one where the parameter was chosen separately for each pairwise SVM, and one where the same parameter was shared by all SVMs.
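The difference between the two multi-class schemes can be sketched as below. This is an illustration only: the trained SVMs are replaced by hypothetical stand-in functions based on distances to made-up 1-D class centres, and majority voting is one common way to combine one-versus-one classifiers, not necessarily the rule used in the project.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise_decide):
    """Run one binary classifier per pair of classes and take a majority vote."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_decide(a, b, x)] += 1
    return votes.most_common(1)[0][0]


def one_vs_all_predict(x, classes, score):
    """Run one classifier per class and keep the class with the highest score."""
    return max(classes, key=lambda c: score(c, x))


# Hypothetical stand-ins for trained SVMs: made-up 1-D class centres.
centres = {"pedestrian": 0.0, "car": 5.0, "truck": 10.0}
classes = list(centres)


def pairwise_decide(a, b, x):
    # Stand-in for the binary SVM trained on classes a and b only.
    return a if abs(x - centres[a]) <= abs(x - centres[b]) else b


def score(c, x):
    # Stand-in for the decision value of the "class c versus the rest" SVM.
    return -abs(x - centres[c])


ovo_label = one_vs_one_predict(4.2, classes, pairwise_decide)
ova_label = one_vs_all_predict(4.2, classes, score)
```

With k classes, one-versus-one needs k(k-1)/2 binary classifiers against k for one-versus-all, which is why the choice between one parameter per pairwise SVM and one shared parameter matters for the validation cost.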

### Pre-processing

The choice of attributes was based on the recommendations of the article we were studying. We used statistics on the intensity of the points, the bounding box, and the scatter-ness, linear-ness and surface-ness attributes. As one class was over-represented compared to the others, we also rebalanced the classes.
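Two of these attribute families are simple to sketch. The example below assumes an axis-aligned bounding box and a particular set of intensity statistics (mean, standard deviation, minimum, maximum); both choices are assumptions made for illustration, and the function names and toy points are hypothetical.

```python
from statistics import mean, pstdev


def bounding_box_size(points):
    """Extent of the axis-aligned bounding box of a list of (x, y, z) points."""
    xs, ys, zs = zip(*points)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))


def intensity_stats(intensities):
    """Summary statistics of the per-point intensity values."""
    return {
        "mean": mean(intensities),
        "std": pstdev(intensities),  # population standard deviation
        "min": min(intensities),
        "max": max(intensities),
    }


# Toy point cloud with three points and their intensities.
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.5), (0.5, 1.0, 1.5)]
intensities = [10.0, 20.0, 30.0]

box = bounding_box_size(points)
stats = intensity_stats(intensities)
features = list(box) + [stats["mean"], stats["std"], stats["min"], stats["max"]]
```

Each object then becomes one fixed-length feature vector, whatever its number of points, which is what a linear SVM expects as input.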

## School work

I studied data mining in class for three semesters. It is hard to sum up everything we covered during this time and all the algorithms we implemented, but here are some of the topics:

- Unsupervised data mining: k-means, fuzzy k-means, k nearest neighbours, hierarchical clustering, PCA;
- Optimisation: without constraints (gradient descent and Newton), and with constraints;
- Regression: linear and polynomial regression;
- Classification: SVM, neural networks, random forest, bagging, Bayesian decision, logistic regression;
- Focus on SVM: multi-class SVM, hyper-parameter tuning, kernels;
- Others: Lasso, ridge regression.