Financial Analytics Big Data Project
This code runs five basic data tasks, in order, to prepare your data for modelling and to help you choose among different models:
- Automated Data Cleaning
- Human-Assisted Data Cleaning
- Automated Dummy Creation with Automated Supervised Binning
- Automated Model Selection and Comparison
- Human-Assisted Model Selection
autodataclean identifies invalid values, NAs and other missing values, outliers, and unreliable or out-of-range values. It first imports the dataset and checks every value in each column, then checks whether each column is numeric. In non-numeric columns, missing values are replaced with the mode. In numeric columns, NAs and unreliable values are replaced with the mean, and outliers are replaced with the minimum or maximum, depending on whether the outlier is too low or too high.
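The cleaning rules described above can be illustrated with a minimal Python sketch for a single column. It is not the project's implementation: the name auto_clean_column, the use of None for missing entries, the optional valid_range argument, and the 1.5 x IQR rule for flagging outliers are all assumptions made for illustration. It also assumes the column has at least two valid values.

```python
from statistics import mean, mode, quantiles


def auto_clean_column(values, valid_range=None):
    """Clean one column given as a list; None marks a missing entry (assumption)."""
    present = [v for v in values if v is not None]

    # Non-numeric column: fill missing entries with the most frequent value.
    if not all(isinstance(v, (int, float)) for v in present):
        fill = mode(present)
        return [fill if v is None else v for v in values]

    # Numeric column: entries outside valid_range count as unreliable and are
    # treated like missing entries (valid_range is a hypothetical parameter).
    if valid_range is not None:
        low, high = valid_range
        values = [v if v is not None and low <= v <= high else None for v in values]
        present = [v for v in values if v is not None]

    # Assumption: an outlier is anything beyond 1.5 x IQR from the quartiles.
    q1, _, q3 = quantiles(present, n=4)
    spread = q3 - q1
    inliers = [v for v in present if q1 - 1.5 * spread <= v <= q3 + 1.5 * spread]

    # Missing and unreliable entries get the mean of the non-outlier values;
    # outliers are pulled back to the smallest or largest non-outlier value.
    fill = mean(inliers)
    floor, ceiling = min(inliers), max(inliers)
    return [fill if v is None else min(max(v, floor), ceiling) for v in values]


cleaned_numbers = auto_clean_column([1, 2, 3, 4, 5, None, 100])
cleaned_labels = auto_clean_column(["a", None, "b", "a"])
```

In this sketch the missing number is filled with the mean of the remaining values and the extreme value 100 is capped at the largest non-outlier value, while the missing label is filled with the most common label.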
autodataclean(trainingdataset)
autodataclean(trainingdataset,testdataset)
manudataclean follows the same methodology as the automated version, but lets the user choose the value that replaces each incorrect value. In every case it asks the user to select the mean, the median, or any other value the user wants to use.
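The choice between mean, median, and a user-supplied value could look like the following Python sketch. The function manual_fill and its choice argument are hypothetical, and only entries already flagged as None are replaced here. In an interactive session the choice would typically come from a prompt such as input().

```python
from statistics import mean, median


def manual_fill(values, choice):
    """Replace None entries using "mean", "median", or a literal value."""
    present = [v for v in values if v is not None]
    if choice == "mean":
        fill = mean(present)
    elif choice == "median":
        fill = median(present)
    else:
        # Any other choice is used directly as the replacement value.
        fill = choice
    return [fill if v is None else v for v in values]


filled_with_median = manual_fill([1, None, 3, 8], "median")
filled_with_custom = manual_fill([1, None, 3, 8], 0)
```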
manudataclean(trainingdataset)
manudataclean(trainingdataset,testdataset)
autobindummy creates bins based on entropy with respect to the target variable, and then creates dummy variables for those bins.
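As a rough illustration of entropy-based binning followed by dummy creation, the Python sketch below finds a single cut point that minimises the weighted entropy of the target in the two resulting bins, then emits one 0/1 column per bin. The function names and the restriction to a single cut are assumptions; the project may use more bins or a different stopping rule.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_cut(values, labels):
    """Return the threshold whose two bins have the lowest weighted entropy."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    best_threshold, best_score = None, entropy(labels)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold


def bin_dummies(values, threshold):
    """Create one 0/1 indicator per bin for each value."""
    return [{"low": int(v <= threshold), "high": int(v > threshold)} for v in values]


# The threshold is learned on training data and then reused on test data.
threshold = best_cut([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
test_dummies = bin_dummies([2, 11], threshold)
```

Learning the cut on the training set and applying the same cut to the test set keeps the dummy columns consistent between the two datasets.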
autobindummy(trainingdataset,testdataset,targetvariable)
automodel takes as input a file with multiple variables, selects a target variable, and applies several classification models in order to report the most accurate one. The first step is selecting the target variable; the program then uses the K-fold method, with 5 folds, to partition the data into training and testing sets. Each model is trained within the partitions and its accuracy is recorded. Once every model has been tested, a data frame is created with the results of all the models, sorted from highest to lowest accuracy.
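The comparison loop can be sketched in plain Python as follows. The two classifiers here (a majority-class baseline and a 1-nearest-neighbour rule on a single numeric feature) are toy stand-ins chosen for illustration, not the models the project actually compares, and the interleaved fold assignment is likewise an assumption.

```python
from collections import Counter


def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx); fold f tests every k-th row starting at f."""
    for fold in range(k):
        test_idx = [i for i in range(n_rows) if i % k == fold]
        train_idx = [i for i in range(n_rows) if i % k != fold]
        yield train_idx, test_idx


def majority_model(train_x, train_y):
    """Baseline: always predict the most common training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbour_model(train_x, train_y):
    """Predict the label of the closest training value."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict


def compare_models(models, xs, ys, k=5):
    """Return (name, mean accuracy) pairs, most accurate model first."""
    results = []
    for name, fit in models.items():
        fold_scores = []
        for train_idx, test_idx in k_fold_indices(len(xs), k):
            predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
            hits = sum(predict(xs[i]) == ys[i] for i in test_idx)
            fold_scores.append(hits / len(test_idx))
        results.append((name, sum(fold_scores) / k))
    return sorted(results, key=lambda result: result[1], reverse=True)


# Two well-separated clusters, one per class.
xs = list(range(10)) + list(range(20, 30))
ys = [0] * 10 + [1] * 10
ranking = compare_models(
    {"majority": majority_model, "nearest_neighbour": nearest_neighbour_model},
    xs,
    ys,
)
```

The returned list plays the role of the sorted results table: each entry pairs a model name with its accuracy averaged over the 5 folds.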
automodel(trainingdataset,testdataset,targetvariable)
manumodel follows the same methodology as automodel. The only difference is that it asks the user which model to run, so it returns the result of that single model rather than a comparison between models.
manumodel(trainingdataset,testdataset,targetvariable)
This project was completed by:
- Christabelle Santos
- Javier Lameda
- Michail Pintchiouk
- Mounir
- Natalia Hernandez
- Sergio Salas
- Tran Nyugen
- Vamsee Krishna
This project is licensed under the MIT License.