|
70 | 70 | "| --- | --- | --- |\n", |
71 | 71 | "| survival | Survival | 0 = No, 1 = Yes |\n", |
72 | 72 | "| pclass | Ticket class\t| 1 = 1st, 2 = 2nd, 3 = 3rd |\n", |
73 | | - "| sex | Sex | male/femail |\t\n", |
| 73 | + "| sex | Sex | male/female |\t\n", |
74 | 74 | "| Age | Age | in years |\n", |
75 | 75 | "| sibsp | # of siblings / spouses aboard the Titanic | |\n", |
76 | 76 | "| parch | # of parents / children aboard the Titanic | |\n", |
|
104 | 104 | "source": [ |
105 | 105 | "### Load Data\n", |
106 | 106 | "\n", |
107 | | - "This dataset is in titanic.csv. Make sure the file is in current folder. Please download the file from [here](https://github.com/data-lessons/python-business/tree/gh-pages/data) if you haven't done so yet." |
| 107 | + "This dataset is in titanic.csv. Make sure the file is in current folder." |
108 | 108 | ] |
109 | 109 | }, |
110 | 110 | { |
|
934 | 934 | "cell_type": "markdown", |
935 | 935 | "metadata": {}, |
936 | 936 | "source": [ |
937 | | - "##### Task7: Plot Perished vs. Survived Bar for Male and Femail\n", |
| 937 | + "##### Task7: Plot Perished vs. Survived Bar for Male and Female\n", |
938 | 938 | "We will use seaborn countplot() again, but set argument `hue` to 'Survived'." |
939 | 939 | ] |
940 | 940 | }, |
|
1727 | 1727 | "### Feature Engineering\n", |
1728 | 1728 | "We'll create a new column FamilySize. There are 2 columns related to family size, parch indicates parent or children number, Sibsp indicates sibling and spouse number.\n", |
1729 | 1729 | "\n", |
1730 | | - "Take one name 'Asplund' as example, we can see that total family size is 7(Parch + SibSp + 1), and each family member has same Fare, which means the Fare is for the whole group. So family size will be an important feature to predict Fare. There're only 4 Asplunds out of 7 in the dataset becasue the dataset is only a subset of all passengers." |
| 1730 | + "Take one name 'Asplund' as example, we can see that total family size is 7 (Parch + SibSp + 1), and each family member has same Fare, which means the Fare is for the whole group. So family size will be an important feature to predict Fare. There're only 4 Asplunds out of 7 in the dataset becasue the dataset is only a subset of all passengers." |
1731 | 1731 | ] |
1732 | 1732 | }, |
1733 | 1733 | { |
|
2054 | 2054 | "\n", |
2055 | 2055 | "## Step 4: Modeling\n", |
2056 | 2056 | "\n", |
2057 | | - "Now we have a relatively clean dataset(Except for Cabin column which has many missing values). We can do a classification on Survived to predict whether a passenger could survive the desaster or a regression on Fare to predict ticket fare. This dataset is not a good dataset for regression. But since we don't talk about classification in this workshop we will construct a linear regression on Fare in this exercise." |
2058 | | - ] |
2059 | | - }, |
2060 | | - { |
2061 | | - "cell_type": "markdown", |
2062 | | - "metadata": {}, |
2063 | | - "source": [ |
2064 | | - "##### Task16: Contruct a regresson on Fare\n", |
2065 | | - "Construct regression model with statsmodels.\n", |
2066 | | - "\n", |
2067 | | - "Pick Pclass, Embarked, FamilySize as independent variables." |
| 2057 | + "Now we have a relatively clean dataset (except for the **Cabin** column which has many missing values). We can do a classification on Survived to predict whether a passenger could survive the disaster or a regression on Fare to predict ticket fare. This dataset is not a good dataset for regression. But since we don't talk about classification in this workshop we will construct a linear regression on Fare in this exercise." |
2068 | 2058 | ] |
2069 | 2059 | }, |
2070 | 2060 | { |
|