This codebook describes how the data provided are being used and processed.
The script run_analysis.R contains one function called run_analysis. This function is intended to be used with the set of datasets collected from the accelerometers of the Samsung Galaxy S smartphone. A full description is available at the site where the data was obtained:
The data to be used for the project are available here:
The script was designed with the given dataset in mind but might also work with a different one, as long as it has the same structure and is placed in the same set of files in the same directories (note that this has never been tested).
This function reads the files from the "UCI HAR Dataset", and performs the merging, reshaping and aggregation.
Args:
- path: a character value giving the path where the extracted dataset can be found. This can be either an absolute or a relative path. If a relative path is used, make sure that the current working directory is the one you expect. The default is the relative path "UCI HAR Dataset".
Returns:
- A data frame based on the original training and test data sets, reduced to the "std()" and "mean()" variables, grouped by activity label and subject, and aggregated by the mean of each variable (excluding the grouping variables).
The .zip archive provides many data files, which contain:
- the raw data observed
- a preprocessed version
- several explanation files
- some "master data" files (e.g. for the coding of the activities)
The function makes use of the following files (relative to the extracted .zip archive):
- activity_labels.txt
- features.txt
- train/X_train.txt
- train/y_train.txt
- train/subject_train.txt
- test/X_test.txt
- test/y_test.txt
- test/subject_test.txt
(For a detailed description of the data format and content, refer to the documentation at http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones.) In all cases the data are read by read.table with a blank separator.
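As a minimal sketch of the reading step, the following writes a tiny made-up file and reads it back. It assumes that "blank separator" refers to read.table's default sep = "", which treats any run of whitespace as a delimiter; the file contents and the name df.example are illustrative only and do not come from the script.

```r
# Write a tiny whitespace-separated file that mimics the layout of X_train.txt.
# The values are made up for illustration only.
tmp <- tempfile(fileext = ".txt")
writeLines(c("  0.28 -0.02  0.13", " -0.27  0.01 -0.11"), tmp)

# sep = "" (the default) splits on any run of whitespace, so the leading
# and doubled blanks do not produce empty columns.
df.example <- read.table(tmp, sep = "", header = FALSE)

print(dim(df.example))    # number of rows and columns
print(names(df.example))  # automatically generated column names
unlink(tmp)
```

Because the files have no header row, read.table generates the column names V1, V2, and so on, which is why the columns are renamed in a later step.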
The observed data from the training data set X_train.txt and the test data set X_test.txt are read into
the variables df.train and df.test respectively. The union of both data frames is generated by rbind and stored in the
variable df.all. Note that this assumes that the data in the two datasets have the same structure and that the
variables / columns are in the same order.
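The union step can be sketched with two small stand-in data frames. The values are made up; only the variable names df.train, df.test and df.all follow the description above.

```r
# Stand-ins for the data read from X_train.txt and X_test.txt.
df.train <- data.frame(V1 = c(0.1, 0.2), V2 = c(1.1, 1.2))
df.test  <- data.frame(V1 = 0.3, V2 = 1.3)

# Stack the rows: training observations first, test observations after them.
df.all <- rbind(df.train, df.test)

print(df.all)
```

The same row order (training first, then test) has to be used later for the activities and subjects, otherwise the rows would no longer line up.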
The features, which represent the measurement names, are read in as a data frame from features.txt and stored
in the variable df.features. This dataset is filtered to the features (second variable in the data frame)
which end with -std() or -mean(), and the result
is stored in df.features.stdAndMean. It contains the names and positions of the features which represent the standard
deviations and mean values.
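A possible way to express this filter is a regular expression passed to grepl. The sketch below takes the description literally ("ends with"), so the pattern is anchored with $; the exact pattern used in run_analysis.R is not shown in this codebook, so the pattern is an assumption. The five feature names are a small illustrative sample.

```r
# A small stand-in for features.txt: position (V1) and feature name (V2).
df.features <- data.frame(
  V1 = 1:5,
  V2 = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
         "tBodyAccMag-mean()", "tBodyAccMag-std()",
         "fBodyAcc-meanFreq()-X"),
  stringsAsFactors = FALSE
)

# Assumed pattern: the name must END with -mean() or -std().
# The parentheses are escaped because they are regex metacharacters.
pattern <- "-(mean|std)\\(\\)$"
df.features.stdAndMean <- df.features[grepl(pattern, df.features$V2), ]

print(df.features.stdAndMean)
```

With the $ anchor, names such as tBodyAcc-mean()-X are not selected because they continue with an axis suffix; removing the anchor would select them as well.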
The first variable of df.features.stdAndMean, which holds the required variable indexes in df.all, is used to remove the variables that are not needed from df.all. The resulting data frame is stored in df.meanAndStd.
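Selecting columns by position can be sketched as follows. The vector keep is a hypothetical stand-in for the first variable of df.features.stdAndMean, and the data are made up.

```r
# Stand-in for the combined observations.
df.all <- data.frame(V1 = c(1, 2), V2 = c(3, 4), V3 = c(5, 6), V4 = c(7, 8))

# Hypothetical stand-in for df.features.stdAndMean[, 1]: the positions to keep.
keep <- c(2L, 4L)

# Index the columns by position; drop = FALSE keeps a data frame even if
# only a single column is selected.
df.meanAndStd <- df.all[, keep, drop = FALSE]

print(names(df.meanAndStd))
```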
The activities for the training and test data sets are read from y_train.txt and y_test.txt and stored in
df.train_activity and df.test_activity. The union of both is generated by rbind and stored in df.activity.
The master data labels for the activities are read from activity_labels.txt and stored in df.activity_lables.
Based on the df.activity data frame (which has just one variable), which represents the activities for all the data sets,
the vector all.activities with all the mapped labels is generated by using the activity variable as an index into
df.activity_lables. Note that the code assumes that the data in df.activity_lables are ordered by the activity (i.e. the
numeric id). No explicit ordering is performed.
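The index-based mapping, and why it depends on the row order of the label table, can be sketched like this. The three labels are a shortened sample of the activity labels; the match-based variant is an alternative shown for comparison and is not claimed to be part of the script.

```r
# Shortened stand-in for activity_labels.txt: numeric id (V1) and label (V2).
df.activity_lables <- data.frame(
  V1 = 1:3,
  V2 = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS"),
  stringsAsFactors = FALSE
)

# Stand-in for the combined y_train.txt / y_test.txt activity ids.
df.activity <- data.frame(V1 = c(3L, 1L, 1L, 2L))

# Use the id as a ROW index into the label table. This is only correct
# if row i of the label table holds the label for id i.
all.activities <- df.activity_lables[df.activity[, 1], 2]

# Alternative (an assumption, not the script's approach): look the id up
# explicitly, which does not depend on the row order of the label table.
all.activities.by.match <-
  df.activity_lables$V2[match(df.activity$V1, df.activity_lables$V1)]

print(all.activities)
```

Both variants give the same result here because the label table is sorted by id; only the match-based variant would remain correct if the rows were shuffled.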
The automatically generated column names in df.meanAndStd are replaced by the (filtered) features in df.features.stdAndMean. This assumes that the order in the filtered data frame is still the same as the column order in df.meanAndStd.
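Renaming the columns amounts to a single assignment to names. In this sketch, feature.names is a hypothetical stand-in for the second variable of df.features.stdAndMean, and the data are made up.

```r
# Stand-in for the filtered observations with auto-generated names.
df.meanAndStd <- data.frame(V3 = c(0.1, 0.2), V4 = c(0.3, 0.4))

# Hypothetical stand-in for df.features.stdAndMean[, 2].
feature.names <- c("tBodyAccMag-mean()", "tBodyAccMag-std()")

# Names are assigned by position, so feature.names must be in the same
# order as the columns of df.meanAndStd.
names(df.meanAndStd) <- feature.names

print(names(df.meanAndStd))
```

Since these names contain characters such as "-" and "(", the columns have to be accessed with df.meanAndStd[["tBodyAccMag-mean()"]] or backticks rather than with a plain $ expression.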
The last step computes the average of each variable for each activity and each subject.
The subjects of the training and test data sets are read from subject_train.txt and subject_test.txt
and stored in df.train_subject and df.test_subject. The union of both is generated by rbind, and the first (and only)
column is stored in all.subject as a vector.
The filtered data set in df.meanAndStd is grouped by the activity and subject vectors using the aggregate function. The columns of the data frame are aggregated by the mean function and the result is stored in the data frame df.result. The automatically added columns for the grouping variables are renamed to meaningful names.
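The grouping and averaging step can be sketched with a few made-up observations. The new column names "activity" and "subject" are assumptions, since the codebook only says that the grouping columns are renamed to reasonable names.

```r
# Stand-in for the filtered and renamed observations.
df.meanAndStd <- data.frame(
  "tBodyAccMag-mean()" = c(1, 3, 5, 7),
  "tBodyAccMag-std()"  = c(2, 4, 6, 8),
  check.names = FALSE   # keep the feature names unchanged
)

# One activity label and one subject id per observation (same row order).
all.activities <- c("WALKING", "WALKING", "SITTING", "SITTING")
all.subject    <- c(1L, 1L, 1L, 2L)

# Average every column within each (activity, subject) combination.
# With an unnamed 'by' list, aggregate calls the grouping columns
# Group.1 and Group.2.
df.result <- aggregate(df.meanAndStd,
                       by = list(all.activities, all.subject),
                       FUN = mean)

# Assumed replacement names for the grouping columns.
names(df.result)[1:2] <- c("activity", "subject")

print(df.result)
```

Only combinations that actually occur in the data appear in the result, so this example yields three rows rather than four.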
Finally, the function returns the df.result data frame.