Description of Class Project

The project for the Coursera class "Getting and Cleaning Data" consisted of five tasks:

  1. Merge the training and the test sets to create one data set.
  2. Extract only the measurements on the mean and standard deviation for each measurement.
  3. Use descriptive activity names to name the activities in the data set.
  4. Appropriately label the data set with descriptive variable names.
  5. From the data set in step 4, create a second, independent tidy data set with the average of each variable for each activity and each subject.

Code to perform these tasks was to be provided in a single R script, "run_analysis.R."

====================================================================================

Cleaned and processed data was obtained from https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip. Reference: Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012.

The authors provided processed datasets derived from the outputs of the embedded accelerometer and gyroscope of a smartphone (Samsung Galaxy S II) worn on the waist by 30 volunteers aged 19 to 48 while performing six activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying [sic]. They separated the total accelerometer and gyroscope readings into components due to gravity acting on the sensors and components due to the motion of the body within the gravity field. They further processed the signals to obtain signals in three orthogonal directions, x, y, and z, and computed an overall magnitude and angles between the directions of motion and the primary orthogonal coordinates.

The originally measured signals were preprocessed to remove noise and sampled in fixed-width sliding windows. From each window, time- and frequency-domain features were obtained. Measured and calculated variables were transformed into a 561-feature vector for each observation of a subject performing an activity. The data were split for training and testing of the machine learning models by randomly assigning 70% of the subjects to a training set, leaving 30% of the subjects for the testing set.

For each (training or testing) set, three files were supplied. Subject identifiers were supplied in the "subject_train.txt" and "subject_test.txt" files. Activity identifiers were supplied in the "y_train.txt" and "y_test.txt" files. The "X_train.txt" and "X_test.txt" files contained the processed measurement data. For the class project there was no need to consult the files with individual accelerometer or gyroscope measurements.

Other files provided in the dataset included the "README.txt," "features_info.txt" with general information about the variables used on the feature vector, "features.txt" containing a list of all features, and "activity_labels.txt" that provided a table linking the activity label with the activity ID found in the "y" data tables.

Due to the preprocessing and cleaning, there were no missing or "NA" values in the datasets as read. The preprocessing also rendered the original measurement units moot, and they were not supplied.

====================================================================================

The tidy dataset for Tasks 1 through 4 of the project has been provided as text output in the same format as the original data. File "activityIDtable.txt" corresponds to the original "y" data and lists the activity IDs in the order found in the merged test and training datasets. The "activitytable.txt" file contains the correspondence table to match activity labels to their IDs. It is similar to the original "activity_labels.txt" file, but the activity labels have been modified: changed to lowercase and stripped of underscores. The "datasettable.txt" file contains the complete tidy dataset with activity labels, the subject ID, and named columns containing all (test and training) measurements corresponding to only means and standard deviations from the original "X" datasets; the size is 10299 observations (rows) of 68 variables (columns). The "subjectIDtable.txt" file contains a list of subject IDs in the order of the merged dataset, similar to the original "subject" files. This file is provided for completeness only, as the subject IDs were incorporated into the tidy data set. Names for all the variables, i.e., columns, including activity and subject ID, are provided in the "variablestable.txt" file.

All processing of the data was done in R and is documented in the R script "run_analysis.R." Prior to beginning the tasks, basic housekeeping was performed. A working directory was selected to receive the data, and several packages for working with data and data tables were loaded, along with their dependencies. Packages "data.table," "DBI," "dplyr," and "readr" were used; other choices could have been made.

To obtain the data for the first four tasks, the datafile was downloaded as a .zip file using download.file(), and unzipped using unzip() with the "junkpaths = TRUE" option to place all the files in the same (working) directory. Test and train data tables were read from the corresponding text files using fread(). The data tables were merged individually row-wise using rbind() to create a single dataset from the test and training sets of the original data.
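The row-wise merge described above can be sketched as follows. This is a minimal illustration in base R with tiny made-up tables standing in for the real files; the object names (x_train, x_test, and so on) and the test-before-train stacking order are assumptions for the example, not necessarily what "run_analysis.R" uses.

```r
# Toy stand-ins for the tables that fread() would return from
# "X_train.txt", "y_train.txt", "subject_train.txt" and the test counterparts.
x_train <- data.frame(V1 = c(0.28, 0.27), V2 = c(-0.02, -0.01))
y_train <- data.frame(V1 = c(5L, 4L))
s_train <- data.frame(V1 = c(1L, 3L))

x_test <- data.frame(V1 = 0.25, V2 = -0.03)
y_test <- data.frame(V1 = 2L)
s_test <- data.frame(V1 = 2L)

# Stack each pair row-wise. The same order (here test, then train) must be
# used for all three tables so that row i of x, y and s still describes
# the same observation.
x <- rbind(x_test, x_train)
y <- rbind(y_test, y_train)
s <- rbind(s_test, s_train)

dim(x)  # rows = test rows + train rows; columns unchanged
```

Keeping the stacking order identical across the three tables matters because the files carry no shared key; the rows are linked only by position.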

After consulting the "features_info.txt" and "features.txt" files, the columns containing means and standard deviations were identified and loaded into a selection vector, "msdcolumns." Of the 561 original feature columns, the 66 containing mean and standard deviation measurements were retained using the select() function.
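One way such a selection vector could be built is by pattern matching on the feature names. The sketch below is an assumption, not necessarily how "msdcolumns" was constructed in the script: it uses base R's grep() on a handful of names in the style of "features.txt" and keeps only those containing the literal text "mean()" or "std()".

```r
# A few feature names in the style of "features.txt" (illustrative subset).
feature_names <- c("tBodyAcc-mean()-X",
                   "tBodyAcc-std()-X",
                   "tBodyAcc-mad()-X",
                   "fBodyAcc-meanFreq()-X",
                   "angle(tBodyAccMean,gravity)")

# Indices of names containing the literal text "mean()" or "std()".
# The escaped parentheses deliberately exclude "meanFreq()" and the
# "angle(...)" features.
msdcolumns <- grep("mean\\(\\)|std\\(\\)", feature_names)

# Toy measurement table with one column per feature (2 rows, 5 columns).
x <- as.data.frame(matrix(seq_len(10), nrow = 2))

# Base R column selection; dplyr's select() could be used in the same role.
x_msd <- x[, msdcolumns, drop = FALSE]
```

Whether features such as "meanFreq()" count as means is a judgment call; excluding them is consistent with the 66 columns reported above.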

Activity names were changed to lowercase using tolower(), and underscores in activity names were removed using gsub(). Names were added to the "y" data table and the activitytable for use in joining. Activity IDs were replaced by activity names using a join() operation on the "y" dataset with the activitytable. A name was also added to the subject ID ("s") data table. The measurement dataset was then updated using cbind() to add columns for the activity names and the subject IDs.
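The relabeling and lookup can be illustrated with a small base R sketch. The six labels are those in "activity_labels.txt"; the toy "y", "s" and "x" tables and the column names are assumptions for the example. Note that match() is swapped in here for the join described above, because it keeps the rows in their original order.

```r
# Contents of "activity_labels.txt": ID in the first column, label in the second.
activitytable <- data.frame(
  activityID = 1:6,
  activity = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
               "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Lowercase the labels, then strip the underscores.
activitytable$activity <- gsub("_", "", tolower(activitytable$activity))

# Toy activity-ID ("y"), subject-ID ("s") and measurement ("x") tables.
y <- data.frame(activityID = c(5L, 2L, 5L, 6L))
s <- data.frame(subjectID = c(1L, 1L, 2L, 2L))
x <- data.frame(V1 = c(0.1, 0.2, 0.3, 0.4))

# Look up each ID's name. match() preserves row order, which matters because
# the rows of y, s and x are linked only by position.
activity <- activitytable$activity[match(y$activityID, activitytable$activityID)]

# Prepend the activity names and subject IDs to the measurements.
x <- cbind(activity = activity, s, x)
```

If a join function is used instead, it is worth confirming that it does not reorder rows before the result is bound to the measurement table.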

Names for the remaining columns were obtained by reading "features.txt" into a data table ("features"), selecting only the second column of "features" (containing the feature names), and selecting only the features of interest by using slice() with the msdcolumns selector. The resulting character vector was used to set the names of columns 3 to 68 of the "x" data table. Variable names, including the activity and subjectID as well as the feature names, were assigned to a character vector. Data was written to text files using write.table() to produce the five output tables mentioned previously.
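The naming and output steps might look like the following sketch. The toy tables, the temporary output path, and the write.table() options (in particular row.names = FALSE) are assumptions for illustration; base R indexing stands in for slice().

```r
# Toy "features" table: index in column 1, feature name in column 2.
features <- data.frame(
  V1 = 1:4,
  V2 = c("tBodyAcc-mean()-X", "tBodyAcc-mad()-X",
         "tBodyAcc-std()-X", "tBodyAcc-max()-X"),
  stringsAsFactors = FALSE
)
msdcolumns <- c(1L, 3L)                    # hypothetical selection vector
featurenames <- features$V2[msdcolumns]    # names of the retained features

# Toy dataset: activity, subject ID, then the retained measurement columns.
x <- data.frame(activity = c("walking", "sitting"),
                subjectID = c(1L, 2L),
                V1 = c(0.28, 0.27),
                V3 = c(-0.99, -0.98),
                stringsAsFactors = FALSE)

# Name the measurement columns (columns 3 onward) and record all names.
names(x)[3:ncol(x)] <- featurenames
variables <- names(x)

# Write to a temporary file and read it back to confirm the round trip.
outfile <- file.path(tempdir(), "datasettable.txt")
write.table(x, outfile, row.names = FALSE)
check <- read.table(outfile, header = TRUE, check.names = FALSE,
                    stringsAsFactors = FALSE)
```

Reading the file back with check.names = FALSE keeps feature names such as "tBodyAcc-mean()-X" intact; by default, read.table() would rewrite the parentheses and hyphens.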

=====================================================================

Task 5 of the Class Project was to create a second, independent tidy data set with the average of each variable for each activity and each subject. This was interpreted as meaning that for each activity-subject ID pair a mean was to be taken of each of the 66 remaining measurements, and the results stored in a data table along with corresponding activity names and subject IDs.

After making a copy of the "x" dataset at the end of Task 4, the (character) "activity" column was replaced by an (integer) "activityID" column to facilitate the use of a loop. Although the number of distinct activities and subjects was known, the R code allows for a variable number of each. The range for the subjectID (inner) loop was determined by taking the length of the vector of unique subjectIDs, while that for the activityID (outer) loop was obtained in a similar manner using the unique activityIDs. A tracking variable recorded the number of times the inner loop executed for each value of the outer loop, in case not all subjects had entries for all activities. Averages for the measurements were calculated by applying the colMeans() function to the observations whose activityID and subject ID both matched the current values of the two loops. [Note: It is not clear that averaging standard deviations is a statistically valid operation. Nevertheless, that was the task.] Each pass through the inner loop produced a row vector, which was added to the "xmeans" matrix using rbind(). The tracking variable was used to add subjectIDs and activityIDs to separate columns.
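The nested loop can be sketched on a toy table as follows. This is a simplified illustration under stated assumptions, not the script itself: the names x5, actcol, subcol, m1 and m2 are hypothetical, and empty activity-subject pairs are simply skipped rather than counted with a tracking variable.

```r
# Toy version of the Task 4 table: activityID, subjectID, two measurements.
x5 <- data.frame(
  activityID = c(1L, 1L, 1L, 2L, 2L),
  subjectID  = c(1L, 1L, 2L, 1L, 1L),
  m1 = c(1, 3, 5, 2, 4),
  m2 = c(10, 30, 50, 20, 40)
)

activities <- sort(unique(x5$activityID))
subjects   <- sort(unique(x5$subjectID))

xmeans <- NULL          # matrix of averages, grown one row per pair
actcol <- integer(0)    # activityID for each row of xmeans
subcol <- integer(0)    # subjectID for each row of xmeans

for (a in activities) {        # outer loop over activities
  for (sid in subjects) {      # inner loop over subjects
    rows <- x5$activityID == a & x5$subjectID == sid
    if (!any(rows)) next       # skip pairs with no observations
    # Column means of the measurement columns for this pair.
    xmeans <- rbind(xmeans, colMeans(x5[rows, c("m1", "m2"), drop = FALSE]))
    actcol <- c(actcol, a)
    subcol <- c(subcol, sid)
  }
}

result <- data.frame(activityID = actcol, subjectID = subcol, xmeans)
```

On the full dataset, 6 activities and 30 subjects would give at most 180 rows. Base R's aggregate() or dplyr's group_by() with summarise() can compute the same per-group averages without an explicit loop, though the row order of the result may differ.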

At the end of the loop, the index column vectors and the "xmeans" matrix were reclassified as data frames using data.frame(). The activity and subject ID columns were named and activity names recovered using a join() operation. The full dataset was obtained by using cbind() to add the activity and subjectID columns to the calculated variables. Feature (variable) names were modified to reflect that an average had been taken by prefixing "Avg" to each of the original variable names using gsub(). The dataset was named using the new variable names and the new variable names were added to a character vector.
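The renaming step can be illustrated with a short sketch. The exact pattern passed to gsub() in the script is not stated above, so the one used here (anchoring on the start of the string) is an assumption, as are the toy values.

```r
# Two feature names in the style of "features.txt".
featurenames <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X")

# Prefix "Avg" by replacing the (empty) start-of-string position.
avgnames <- gsub("^", "Avg", featurenames)

# Toy one-row averaged dataset: activity, subject ID, two averages.
xmeans <- data.frame("walking", 1L, 0.27, -0.98, stringsAsFactors = FALSE)
names(xmeans) <- c("activity", "subjectID", avgnames)

# Character vector of all variable names, as written to the names file.
variables <- names(xmeans)
```

The same result could be obtained with paste0("Avg", featurenames), which avoids regular expressions altogether.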

Although the data frame "xmeans" contains the complete dataset, five text files were output for consistency with Task 4. They are:

  1. "activitytable-means.txt" links the activity IDs with their activity names. Its dimensions are 6 rows by 2 columns.
  2. "subjectIDtable-means.txt" lists the subject ID for the subject who performed the activity for each row in the calculated dataset. Its range is 1 to 30.
  3. "activityIDtable-means.txt" provides the activity ID and activity name for each row in the calculated dataset. Its dimensions are 180 rows by 2 columns.
  4. "datasettable-means.txt" contains a data table with the activity, subject ID, and 66 variables selected from the calculated dataset. Its dimensions are 180 rows by 68 columns.
  5. "variablestable-means.txt" provides variable names for each of the 68 columns in the dataset, including the activity, the subject ID, and 66 measurements (features or variables). This file is available in Appendix B of this code book.

See the file "run_analysis.R" for code used to accomplish these tasks along with descriptions of the code's actions.
