IN498-2: System Specification: Design, implement, and evaluate an analytics-based solution to meet a given set of requirements in the context of the discipline.

Purpose

For this assignment, you will perform linear and logistic regression on a data set. You will apply the models to predict 30-day retention, as a value or as a probability, from the number of installs. You will also use data munging methods to add columns that improve the model results.

You will use a single final CSV file that contains the contents of the nine provided data sets. Using the IN498_Unit3_student.py file and the final CSV file, you will perform linear and logistic regression.

Assignment Instructions

You must have Python and PyCharm installed to perform this assignment. You should use the free edition of each.

If you do not have the above software installed, please perform the required installations. The following documents will assist you with installation of the software:

Python

PyCharm

Complete the following:

Please number the assignment items in your Microsoft Word document.

For items 1–7 below, provide a screenshot of the execution in Python, showing the code and the result set. Be sure to submit the actual .py file. Make sure to also respond to items 8–11. Alternatively, you may copy and paste your code and results into the assignment document.

Start the next action on a new page.

If you have not done so, convert the provided raw data files into one final comma separated value (CSV) file with each data point having its own cell. This will be used for the linear and logistic regression analysis.
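The layout of the raw files is not described here, so the following is only a minimal sketch of the conversion. It assumes the raw files are already comma separated and share the same header row. The file names (`raw_1.csv`, `raw_2.csv`, `final.csv`) and the sample values are hypothetical and are created in a temporary folder so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the provided raw files (the real files may differ).
raw_dir = Path(tempfile.mkdtemp())
(raw_dir / "raw_1.csv").write_text("Date,Installers\n4/1/19,1\n4/1/19,2\n")
(raw_dir / "raw_2.csv").write_text("Date,Installers\n4/2/19,0\n4/2/19,3\n")

# Read every raw file and stack the rows into a single data frame.
frames = [pd.read_csv(path) for path in sorted(raw_dir.glob("raw_*.csv"))]
combined = pd.concat(frames, ignore_index=True)

# Write the combined data to one final CSV file, leaving out the index column
# so that each data point ends up in its own cell.
final_path = raw_dir / "final.csv"
combined.to_csv(final_path, index=False)

print(combined.shape)
```

If the raw files use a different delimiter or have no header row, `pd.read_csv` accepts `sep` and `header` arguments that can be adjusted accordingly.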

You will use the IN498_Unit3_Student.py file for this assignment.

1. Read the final CSV file into a data frame.

2. Explore the data set by performing the following steps:

Print the top 10 rows

Print the shape

Print the description
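Reading and exploring the data might look roughly like the sketch below. The in-memory sample is hypothetical and only stands in for the final CSV file. The column names `Date`, `Installers`, and `Installers_retained_for_30_days` are assumed from the names used in this assignment.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the final CSV file; the real file has
# many more rows and may contain additional columns.
sample_csv = io.StringIO(
    "Date,Installers,Installers_retained_for_30_days\n"
    "4/1/19,1,1\n"
    "4/1/19,2,\n"
    "4/1/19,0,0\n"
    "4/1/19,0,\n"
    "4/1/19,2,1\n"
    "4/1/19,1,0\n"
    "4/2/19,0,0\n"
    "4/2/19,3,2\n"
    "4/2/19,1,\n"
    "4/2/19,0,0\n"
    "4/2/19,4,1\n"
    "4/2/19,2,1\n"
)

# With a real file, pass its path instead of the in-memory sample.
df = pd.read_csv(sample_csv)

print(df.head(10))    # top 10 rows
print(df.shape)       # (number of rows, number of columns)
print(df.describe())  # summary statistics for the numeric columns
```

Empty cells in the CSV are read as NaN, which is why the retained column shows missing values in this sample.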

3. Replace NaN with 0 for Installers_retained_for_30_days.
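One possible way to do this with Pandas is `fillna`. The small data frame below is hypothetical and is used only to make the example runnable.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing retention values.
df = pd.DataFrame({
    "Installers": [1, 2, 0, 4],
    "Installers_retained_for_30_days": [1.0, np.nan, 0.0, np.nan],
})

# Replace NaN with 0 in the retention column only.
df["Installers_retained_for_30_days"] = (
    df["Installers_retained_for_30_days"].fillna(0)
)

print(df)
```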

4. Perform the below steps for linear regression:

Create a linear regression model

Set feature_cols to Installers column data only

Set X to feature_cols

Set y to the Installers_retained_for_30_days column

Print the shape of X

Print the description of X

Print the shape of y

Print the description of y

Fit the linear regression model with X and y

Get predictions for retained for 30 days with 1 install

Get predictions for retained for 30 days with 2 installs

Get predictions for retained for 30 days with 4 installs

Print the intercept

Print the coefficient
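The linear regression steps above could be sketched as follows, assuming scikit-learn is installed. The data values are hypothetical, and the variable name `linreg` is an assumption; the provided .py file may use different names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data standing in for the final CSV file.
df = pd.DataFrame({
    "Installers": [0, 1, 2, 3, 4, 5],
    "Installers_retained_for_30_days": [0, 0, 1, 1, 2, 2],
})

linreg = LinearRegression()

feature_cols = ["Installers"]  # Installers column data only
X = df[feature_cols]
y = df["Installers_retained_for_30_days"]

print(X.shape)
print(X.describe())
print(y.shape)
print(y.describe())

linreg.fit(X, y)

# Predict retention for 1, 2, and 4 installs in a single call.
new_installs = pd.DataFrame({"Installers": [1, 2, 4]})
predictions = linreg.predict(new_installs)
print(predictions)

print(linreg.intercept_)
print(linreg.coef_)
```

Passing the new values as a data frame with the same column name keeps the input consistent with the data used for fitting.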

5. Perform the below steps for logistic regression:

Get logistic regression model

Fit X and y to logistic regression model

Predict classes using X

Print the predictions using X

Get the predicted probabilities of class 1

Print the probabilities using X

Get probability for retained users for 30 days with 1 install

Get probability for retained users for 30 days with 2 installs

Get probability for retained users for 30 days with 4 installs
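A rough sketch of these logistic regression steps is shown below, again with hypothetical data and assumed variable names. Because y holds retention counts here, scikit-learn treats each distinct count as its own class, so the example looks up which column of `predict_proba` belongs to class 1 instead of assuming its position.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data standing in for the final CSV file.
df = pd.DataFrame({
    "Installers": [0, 0, 1, 1, 2, 2, 3, 4, 4, 5],
    "Installers_retained_for_30_days": [0, 0, 0, 1, 1, 0, 1, 2, 1, 2],
})

feature_cols = ["Installers"]
X = df[feature_cols]
# Class labels must be discrete, so cast the counts to integers.
y = df["Installers_retained_for_30_days"].astype(int)

logreg = LogisticRegression()
logreg.fit(X, y)

# Predicted class for every row of X.
pred_class = logreg.predict(X)
print(pred_class)

# predict_proba returns one column per class, ordered as in logreg.classes_.
class_1_col = list(logreg.classes_).index(1)
pred_prob = logreg.predict_proba(X)[:, class_1_col]
print(pred_prob)

# Probability of class 1 for 1, 2, and 4 installs.
new_installs = pd.DataFrame({"Installers": [1, 2, 4]})
prob_new = logreg.predict_proba(new_installs)[:, class_1_col]
print(prob_new)
```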

6. Using data munging techniques, perform the below steps on the data set:

Add a new column based on Installers_retained_for_30_days and call it Install_30. Set it to 1 if the retained value is greater than 0, and to 0 if the retained value is 0

Print the top 10 rows of the new data set

Print the shape of the new data set

Print the description of the new data set
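One way to derive the new column is to compare the retained column with 0 and convert the result to integers. The sketch below uses hypothetical data and assumes NaN values have already been replaced with 0.

```python
import pandas as pd

# Hypothetical data standing in for the final CSV file.
df = pd.DataFrame({
    "Installers": [1, 2, 0, 0, 2, 4],
    "Installers_retained_for_30_days": [1.0, 0.0, 0.0, 0.0, 2.0, 3.0],
})

# True where at least one installer was retained, converted to 1; otherwise 0.
df["Install_30"] = (df["Installers_retained_for_30_days"] > 0).astype(int)

print(df.head(10))
print(df.shape)
print(df.describe())
```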

7. Using the new data set, complete the following for logistic regression:

Perform logistic regression using Install_30 column for y and X = Installers column

Fit X and y for logistic regression

Print the top 10 rows of X

Print the shape of X

Print the description of X

Print the top 10 rows of y

Print the shape of y

Print the description of y

Predict on X and capture the result to assorted_pred_class

Print the predictions using assorted_pred_class (X predictions)

Get the predicted probabilities of class 1 and save to assorted_pred_prob

Print the probabilities using assorted_pred_prob

Get probability for retained users for 30 days with 1 install

Get probability for retained users for 30 days with 2 installs

Get probability for retained users for 30 days with 4 installs

Print the intercept

Print the coefficient
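With a binary Install_30 column, the steps above could be sketched as follows. The data values are hypothetical. The names `assorted_pred_class` and `assorted_pred_prob` come from the steps above, while `logreg` and `prob_new` are assumed names.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data that already contains the Install_30 column.
df = pd.DataFrame({
    "Installers": [0, 0, 1, 1, 2, 2, 3, 4, 4, 5],
    "Install_30": [0, 0, 0, 1, 1, 0, 1, 1, 1, 1],
})

X = df[["Installers"]]
y = df["Install_30"]

print(X.head(10))
print(X.shape)
print(X.describe())
print(y.head(10))
print(y.shape)
print(y.describe())

logreg = LogisticRegression()
logreg.fit(X, y)

# Predicted class (0 or 1) for every row of X.
assorted_pred_class = logreg.predict(X)
print(assorted_pred_class)

# With classes 0 and 1, column 1 of predict_proba is the probability of class 1.
assorted_pred_prob = logreg.predict_proba(X)[:, 1]
print(assorted_pred_prob)

# Probability of retention for 1, 2, and 4 installs.
new_installs = pd.DataFrame({"Installers": [1, 2, 4]})
prob_new = logreg.predict_proba(new_installs)[:, 1]
print(prob_new)

print(logreg.intercept_)
print(logreg.coef_)
```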

8. Compare the probability results of the logistic regression models for both data sets. Explain the results.

9. What do the intercept values mean?

10. What do the coefficient values mean?

11. Summarize what the results tell you based on the linear and logistic regression analysis.
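When reasoning about the intercept and coefficient, it can help to reproduce a prediction by hand. The sketch below uses hypothetical data and assumed variable names. It checks that a linear regression prediction equals intercept + coefficient × installs, and that a binary logistic regression probability equals the logistic (sigmoid) function applied to the same kind of sum.

```python
import math

import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical data standing in for the final CSV file.
df = pd.DataFrame({
    "Installers": [0, 0, 1, 1, 2, 2, 3, 4, 4, 5],
    "Installers_retained_for_30_days": [0, 0, 0, 1, 1, 0, 1, 2, 1, 2],
})
df["Install_30"] = (df["Installers_retained_for_30_days"] > 0).astype(int)
X = df[["Installers"]]

linreg = LinearRegression().fit(X, df["Installers_retained_for_30_days"])
logreg = LogisticRegression().fit(X, df["Install_30"])

installs = 2
one_row = pd.DataFrame({"Installers": [installs]})

# Linear regression: prediction = intercept + coefficient * installs.
linear_by_hand = linreg.intercept_ + linreg.coef_[0] * installs

# Logistic regression: the same sum gives the log-odds of class 1, and the
# sigmoid function converts the log-odds into a probability.
log_odds = logreg.intercept_[0] + logreg.coef_[0][0] * installs
logistic_by_hand = 1 / (1 + math.exp(-log_odds))

print(linear_by_hand, linreg.predict(one_row)[0])
print(logistic_by_hand, logreg.predict_proba(one_row)[0, 1])
```

In both models the intercept is the model's value at 0 installs (a predicted count for linear regression, a log-odds value for logistic regression), and the coefficient is the change in that value for each additional install.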

## Model Selection and Regularization Methods

Do some internet or library research beyond the text and find a credible resource that deals with model selection and regularization. Based on that resource, respond to the following for your initial post:

In paragraph form, summarize model selection and regularization as applied in data analytics. Include in your summary what shrinkage, dimension reduction, and principal component analysis are. Also include actual examples of where each method can be applied. Conclude your discussion with a table comparing the methods used for model selection and regularization.
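To make the terms concrete, the sketch below illustrates shrinkage with ridge and lasso regression and dimension reduction with principal component analysis, using scikit-learn. The data are synthetic and the penalty strengths (`alpha`) are arbitrary assumptions chosen only for illustration, not recommended settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: five features, but only the first two influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Ordinary least squares as the unpenalized baseline.
ols = LinearRegression().fit(X, y)

# Shrinkage: the penalty pulls the coefficients toward zero.
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty, may set some coefficients to 0

print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)
print("Lasso:", lasso.coef_)

# Dimension reduction: PCA projects the five features onto two components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

Comparing the three coefficient vectors shows the effect of each penalty, and `explained_variance_ratio_` shows how much of the variance in X each retained component captures.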

For the initial post:

Cite your source properly using APA in-text citation.

Provide a full APA reference entry at the end of the post.

In responses to at least two others, consider sharing experiences about model selection and regularization or do some light research “beyond” the other student’s initial post (citing your sources, of course).

## Parsing CSV Files Using Python and Pandas

Do some internet or library research beyond the text and find a credible resource that deals with parsing CSV files using Python and Pandas. Based on that resource, respond to the following for your initial post:

In paragraph form, summarize the process of parsing a CSV file using Python. Use the example output below to guide your summary. Include in your summary how the CSV is read into a Pandas data frame and how the top 10 rows of a data frame are printed. Also, include actual Python code that reads a CSV called IN499_Unit2.csv, creates a Pandas data frame, and prints the first 10 rows with the columns shown below. Conclude your discussion with the value of using the Pandas library for data analysis.

         Date  Installers
    0  4/1/19         1.0
    1  4/1/19         2.0
    2  4/1/19         0.0
    3  4/1/19         0.0
    4  4/1/19         2.0
    5  4/1/19         1.0
    6  4/1/19         0.0
    7  4/1/19         0.0
    8  4/1/19         1.0
    9  4/1/19         0.0
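Output like the example above could be produced along the following lines. Because the real IN499_Unit2.csv is not available here, the sketch first writes a small stand-in file with that name to a temporary folder. Its first ten rows copy the example output, while the last two rows and the extra `Installers_retained_for_30_days` column are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the real file, written to a temporary folder.
csv_path = Path(tempfile.mkdtemp()) / "IN499_Unit2.csv"
csv_path.write_text(
    "Date,Installers,Installers_retained_for_30_days\n"
    "4/1/19,1.0,1.0\n"
    "4/1/19,2.0,1.0\n"
    "4/1/19,0.0,0.0\n"
    "4/1/19,0.0,0.0\n"
    "4/1/19,2.0,0.0\n"
    "4/1/19,1.0,1.0\n"
    "4/1/19,0.0,0.0\n"
    "4/1/19,0.0,0.0\n"
    "4/1/19,1.0,0.0\n"
    "4/1/19,0.0,0.0\n"
    "4/2/19,3.0,2.0\n"
    "4/2/19,1.0,0.0\n"
)

# Parse the CSV file into a Pandas data frame.
df = pd.read_csv(csv_path)

# Keep only the two columns of interest and take the first 10 rows.
top10 = df[["Date", "Installers"]].head(10)
print(top10)
```

`head(10)` returns a new data frame, so the full data set in `df` remains available for further analysis.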

For the initial post:

Cite your source properly using APA in-text citation.

Provide a full APA reference entry at the end of the post.

In responses to at least two others, consider sharing experiences about parsing CSV files with Python and Pandas or do some light research “beyond” the other student’s initial post (citing your sources, of course).

## Submission Files

Submit two files in the Assessment Dropbox. Be sure to include the following:

A Word file with the social issue description and the URLs of three articles.

A plain text file (.txt) containing 2,000+ words from all three sources.

## Data Cleaning

Do some internet or library research beyond the text and find a credible resource that deals with data cleaning. Based on that resource, respond to the following for your initial post:

In paragraph form, summarize data cleaning. Include the five characteristics a data set should exhibit after cleaning: correctness, completeness, accuracy, consistency, and uniformity. Also, discuss methods used for data cleaning or munging, covering statistical methods, text parsing, and data transformations. Conclude by discussing which two characteristics and which one method are most important, and why.
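As one illustration of these methods, the sketch below applies text parsing, a simple statistical fill, and a data transformation to a small hypothetical data frame with Pandas. The column names, the sample values, and the choice of the median and a log transform are assumptions made for the example, not prescribed techniques.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: stray spaces, a duplicate row, and a text placeholder.
raw = pd.DataFrame({
    "Date": ["4/1/19", "4/2/19", "4/2/19", "4/3/19", "4/4/19"],
    "Installers": ["1", " 2 ", " 2 ", "n/a", "40"],
})

clean = raw.copy()

# Uniformity and consistency: strip whitespace, then drop exact duplicate rows.
clean["Installers"] = clean["Installers"].str.strip()
clean = clean.drop_duplicates().reset_index(drop=True)

# Text parsing: convert strings to dates and numbers; unparseable numbers
# become NaN instead of raising an error.
clean["Date"] = pd.to_datetime(clean["Date"], format="%m/%d/%y")
clean["Installers"] = pd.to_numeric(clean["Installers"], errors="coerce")

# Completeness via a statistical method: fill missing values with the median.
clean["Installers"] = clean["Installers"].fillna(clean["Installers"].median())

# Data transformation: log(1 + x) reduces the influence of large values.
clean["Installers_log"] = np.log1p(clean["Installers"])

print(clean)
```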

For the initial post:

Cite your source properly using APA in-text citation.

Provide a full APA reference entry at the end of the post.