This file describes how the script run_analysis.R works and how the code book CodeBook.md is structured.

Source Code Description

The run_analysis.R script loads the dplyr package, which provides the group_by() and summarize_each() functions used in the code.

It then defines the utility functions:

  • getArchivedData(): Checks whether the input data is present and downloads it if it is not.
  • readData(): Reads the specified data files.

The actual execution of the algorithm starts in the section "2. Getting Data". The list archivePaths defines the 8 paths to the input files. The data from these files is read into rawData, a list of data frames, using the readData() function.
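As a rough illustration of this step, the sketch below reads a set of whitespace-separated text files into a named list of data frames. The function name readDataSketch(), its argument, and the temporary example files are assumptions made for illustration; the actual readData() in run_analysis.R may be written differently.

```r
# Hypothetical stand-in for readData(): read each path into a data frame.
readDataSketch <- function(paths) {
  # read.table() splits on whitespace by default, which suits
  # space-separated text files.
  data <- lapply(paths, read.table, header = FALSE, stringsAsFactors = FALSE)
  # Name each list element after its file, without directory or extension.
  names(data) <- tools::file_path_sans_ext(basename(paths))
  data
}

# Small self-contained demonstration with two temporary files.
tmpDir <- tempfile("har_demo_")
dir.create(tmpDir)
examplePaths <- file.path(tmpDir, c("X_demo.txt", "y_demo.txt"))
writeLines(c("0.1 0.2 0.3", "0.4 0.5 0.6"), examplePaths[1])
writeLines(c("1", "2"), examplePaths[2])

rawDataDemo <- readDataSketch(examplePaths)
str(rawDataDemo)
```

Naming the list elements after the files makes later access readable, for example rawDataDemo$X_demo.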

In the section "3. Cleaning data" the test and training data are brought into a tidy format. Subject and activity data are merged with the measurements in both data sets.
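One plausible way to combine the three tables of a data set is a column-wise bind, sketched here on toy values. The use of cbind(), the variable names, and the numbers are assumptions for illustration and are not taken from run_analysis.R.

```r
# Toy stand-ins for the three files of one data set (e.g. the test set).
subject  <- data.frame(subject  = c(1, 1, 2))
activity <- data.frame(activity = c(5, 4, 5))
measurements <- data.frame(V1 = c(0.28, 0.27, 0.29),
                           V2 = c(-0.02, -0.01, -0.03))

# The rows of the three tables are assumed to be aligned by position,
# so the row counts must agree before binding.
stopifnot(nrow(subject) == nrow(measurements),
          nrow(activity) == nrow(measurements))

# Column-wise bind: subject and activity end up next to each measurement row.
testSet <- cbind(subject, activity, measurements)
print(testSet)
```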

The following assignment tasks are carried out one after the other in the section "4. Do the assignment tasks":

  • Merge the training and the test sets to create one data set. This is done by applying rbind().
  • Extract only the mean and standard deviation variables for each measurement. This is done by applying the grep() function with a pattern that selects only the variables whose names contain "subject", "activity", "mean()" or "std()".
  • Use descriptive activity names to name the activities in the data set. This is done by using factors with the labels from activity_labels.txt instead of the activity IDs.
  • Appropriately label the data set with descriptive variable names. This is done by applying the colnames() function.
  • From the data set in step 4, create a second, independent tidy data set with the average of each variable for each activity and each subject. This is done by applying the group_by() and summarize_each() functions.
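The five tasks above can be sketched on a toy data set as follows. All values, the three feature names, and the variable names are made up for illustration. For the final averaging step the sketch swaps in aggregate() from base R instead of group_by() and summarize_each(), so that it runs without extra packages; note also that summarize_each() is deprecated in recent dplyr releases in favour of summarise() combined with across().

```r
# Toy feature names in the style of features.txt (assumed for illustration).
featureNames <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-max()-X")
columnNames  <- c("subject", "activity", featureNames)

train <- data.frame(c(1, 1, 2), c(1, 2, 1), c(0.2, 0.4, 0.6),
                    c(0.1, 0.1, 0.2), c(0.9, 0.8, 0.7))
test  <- data.frame(c(1, 2, 2), c(1, 1, 2), c(0.4, 0.8, 0.5),
                    c(0.3, 0.2, 0.1), c(0.6, 0.5, 0.4))

# Task 4: column names are set up front in this sketch so that
# task 2 can select columns by name.
colnames(train) <- columnNames
colnames(test)  <- columnNames

# Task 1: stack the training and test rows.
merged <- rbind(train, test)

# Task 2: keep subject, activity, and the mean()/std() columns.
# The parentheses are escaped because they are regex metacharacters.
keep <- grep("subject|activity|mean\\(\\)|std\\(\\)", colnames(merged))
selected <- merged[, keep]

# Task 3: replace activity ids with descriptive labels via a factor.
activityLabels <- data.frame(id = c(1, 2),
                             label = c("WALKING", "WALKING_UPSTAIRS"))
selected$activity <- factor(selected$activity,
                            levels = activityLabels$id,
                            labels = activityLabels$label)

# Task 5: average every measurement per subject and activity
# (base R aggregate() used in place of the dplyr functions).
measureCols <- setdiff(colnames(selected), c("subject", "activity"))
tidy <- aggregate(selected[measureCols],
                  by = list(subject = selected$subject,
                            activity = selected$activity),
                  FUN = mean)
print(tidy)
```

The result has one row per subject and activity combination that occurs in the data, and one column per retained measurement.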

Input

The source code expects the directory UCI HAR Dataset to exist. If the directory does not exist, the zip file from the assignment is downloaded and extracted.
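A minimal sketch of such a check-then-download helper is shown below. The function name getArchivedDataSketch(), its arguments, and its return value are assumptions for illustration; the archive URL is deliberately left as a parameter because it is not stated in this README.

```r
# Hypothetical stand-in for getArchivedData(): make sure dataDir exists,
# downloading and unpacking a zip archive from zipUrl when it does not.
getArchivedDataSketch <- function(dataDir, zipUrl, exdir = ".") {
  if (dir.exists(dataDir)) {
    return(invisible(FALSE))  # nothing to do, the data is already present
  }
  zipFile <- tempfile(fileext = ".zip")
  on.exit(unlink(zipFile), add = TRUE)
  # mode = "wb" keeps the binary archive intact on Windows.
  download.file(zipUrl, destfile = zipFile, mode = "wb")
  unzip(zipFile, exdir = exdir)
  if (!dir.exists(dataDir)) {
    stop("Archive did not contain the expected directory: ", dataDir)
  }
  invisible(TRUE)  # TRUE signals that a download took place
}

# Demonstration of the "already present" branch; no network access is needed.
existingDir <- tempfile("UCI_HAR_demo_")
dir.create(existingDir)
downloaded <- getArchivedDataSketch(existingDir,
                                    zipUrl = "unused-in-this-branch")
print(downloaded)
```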

The following file paths are used by the code:

  • UCI HAR Dataset/activity_labels.txt
  • UCI HAR Dataset/features.txt
  • UCI HAR Dataset/test/X_test.txt
  • UCI HAR Dataset/test/y_test.txt
  • UCI HAR Dataset/test/subject_test.txt
  • UCI HAR Dataset/train/X_train.txt
  • UCI HAR Dataset/train/y_train.txt
  • UCI HAR Dataset/train/subject_train.txt

The files are described in the features_info.txt and README.txt files in the UCI HAR Dataset directory.
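The eight paths listed above could be assembled with file.path() and checked before reading, roughly as follows. The variable names dataDir, archivePathsSketch, and missingFiles are assumptions for illustration.

```r
# Root directory of the extracted data set.
dataDir <- "UCI HAR Dataset"

# file.path() joins components with "/" on all platforms.
archivePathsSketch <- c(
  file.path(dataDir, c("activity_labels.txt", "features.txt")),
  file.path(dataDir, "test",
            c("X_test.txt", "y_test.txt", "subject_test.txt")),
  file.path(dataDir, "train",
            c("X_train.txt", "y_train.txt", "subject_train.txt"))
)

# Report any input file that is not present in the working directory.
missingFiles <- archivePathsSketch[!file.exists(archivePathsSketch)]
if (length(missingFiles) > 0) {
  message("Missing input files: ", paste(missingFiles, collapse = ", "))
}
```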

Output

The output.txt file contains the resulting data frame of the assignment. Please see the code book CodeBook.md to learn more about it.
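How the file is written is not described here. Assuming a whitespace-separated text file with a header row, as write.table() produces with row.names = FALSE, a round trip could look like the sketch below. The toy data frame and the temporary file are stand-ins, not the real output.

```r
# Toy stand-in for the tidy result (not the real output).
tidyDemo <- data.frame(subject = c(1, 2),
                       activity = c("WALKING", "SITTING"),
                       meanValue = c(0.3, 0.7))

# Write without row names so the file holds only the header and the data.
outFile <- tempfile(fileext = ".txt")
write.table(tidyDemo, outFile, row.names = FALSE)

# Read it back; header = TRUE restores the column names.
reloaded <- read.table(outFile, header = TRUE, stringsAsFactors = FALSE)
print(reloaded)
```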
