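From ff372087d9554c42840e87c6d80c55bd325f30d5 Mon Sep 17 00:00:00 2001 From: pmqvrsp4i <2522769846@qq.com> Date: Tue, 10 May 2022 23:06:00 +0800 Subject: [PATCH] ADD file via upload --- Kaggle_tatinic_logistic .ipynb | 3009 ++++++++++++++++++++++++++++++++ 1 file changed, 3009 insertions(+) create mode 100644 Kaggle_tatinic_logistic .ipynb diff --git a/Kaggle_tatinic_logistic .ipynb b/Kaggle_tatinic_logistic .ipynb new file mode 100644 index 0000000..307965e --- /dev/null +++ b/Kaggle_tatinic_logistic .ipynb @@ -0,0 +1,3009 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "2D3940F0AF89478480A8B8B999B95629", + "mdEditEnable": false + }, + "source": [ + "# Kaggle: The Titanic Disaster\n", + "\n", + "\n", + "\n", + "\n", + "### About the Titanic disaster\n", + "\n", + "\n", + "- Background of the Titanic problem\n", + "\n", + "\t- This is the familiar 'Jack and Rose' story: the luxury liner sank and everyone scrambled to escape, but there were not enough lifeboats for all. The first officer called out 'women and children first!', so survival was not random; it followed a priority order based on who the passengers were.\n", + "\n", + "\t- The training and test data contain passengers' personal details and, for the training set, whether they survived. The goal is to build a suitable model from them and predict whether the other passengers survived.\n", + "\n", + "\t- This is a binary classification problem, which falls within the scope of the logistic regression we discussed earlier.\n", + "\t\n", + "## Notes\n", + "\n", + "'There is more than one way to approach a problem.'  \n", + "'No machine learning algorithm is better in general or fastest in absolute terms; there are only algorithms better suited to a particular scenario, data set, and feature set.'\n", + "\n", + "## How to proceed?\n", + "Andrew Ng has reportedly advised on Coursera that when applying machine learning you should not aim for perfection from the start. First put together a baseline model, then analyze and improve it step by step. The follow-up steps may include diagnosing the model's current state (underfitting or overfitting), measuring how much each feature contributes, doing feature selection, and examining the model's bad cases and their causes.\n", + "\n", + "Experienced Kagglers have also shared some lessons:\n", + "\n", + "- 'Understanding the data matters enormously!'\n", + "- 'Analyzing and handling special points and outliers matters enormously!'\n", + "- 'Feature engineering matters enormously!'\n", + "- 'Do model ensembling!'\n", + "\n", + "## A first look at the data\n", + "\n", + "pandas is a widely used Python package for data processing. We read the CSV file into a DataFrame; in the IPython notebook, data_train looks like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "DFC95DE04483465083C0140B0D735524", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "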
" + ], + "text/plain": [ + " PassengerId Survived Pclass \\\n", + "0 1 0 3 \n", + "1 2 1 1 \n", + "2 3 1 3 \n", + "3 4 1 1 \n", + "4 5 0 3 \n", + "\n", + " Name Sex Age SibSp \\\n", + "0 Braund, Mr. Owen Harris male 22.0 1 \n", + "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", + "2 Heikkinen, Miss. Laina female 26.0 0 \n", + "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", + "4 Allen, Mr. William Henry male 35.0 0 \n", + "\n", + " Parch Ticket Fare Cabin Embarked \n", + "0 0 A/5 21171 7.2500 NaN S \n", + "1 0 PC 17599 71.2833 C85 C \n", + "2 0 STON/O2. 3101282 7.9250 NaN S \n", + "3 0 113803 53.1000 C123 S \n", + "4 0 373450 8.0500 NaN S " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import warnings\n", + "warnings.filterwarnings(\"ignore\", message=\"numpy.dtype size changed\")\n", + "warnings.filterwarnings(\"ignore\", message=\"numpy.ufunc size changed\")\n", + "#https://stackoverflow.com/q/40845304/10704205\n", + "#Ignore RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility\n", + "\n", + "import pandas as pd #数据分析\n", + "import numpy as np #科学计算\n", + "from pandas import Series,DataFrame\n", + "\n", + "data_train = pd.read_csv('train.csv',engine = 'python',encoding='UTF-8')\n", + "data_train.head() #dataframe格式" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0C9407007EEC48D78DEA889D00BFFFB5", + "mdEditEnable": false + }, + "source": [ + "\n", + "我们看到,总共有12列,其中Survived字段表示的是该乘客是否获救,其余都是乘客的个人信息,包括:\n", + "\n", + "- PassengerId => 乘客ID\n", + "- Pclass => 乘客等级(1/2/3等舱位)\n", + "- Name => 乘客姓名\n", + "- Sex => 性别\n", + "- Age => 年龄\n", + "- SibSp => 堂兄弟/妹个数\n", + "- Parch => 父母与小孩个数\n", + "- Ticket => 船票信息\n", + "- Fare => 票价\n", + "- Cabin => 客舱\n", + "- Embarked => 登船港口\n", + "\n", + "让dataframe自己告诉我们一些信息,如下所示:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": 
"878F32156AC54EFD832D9C6E816038C8", + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 891 entries, 0 to 890\n", + "Data columns (total 12 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 PassengerId 891 non-null int64 \n", + " 1 Survived 891 non-null int64 \n", + " 2 Pclass 891 non-null int64 \n", + " 3 Name 891 non-null object \n", + " 4 Sex 891 non-null object \n", + " 5 Age 714 non-null float64\n", + " 6 SibSp 891 non-null int64 \n", + " 7 Parch 891 non-null int64 \n", + " 8 Ticket 891 non-null object \n", + " 9 Fare 891 non-null float64\n", + " 10 Cabin 204 non-null object \n", + " 11 Embarked 889 non-null object \n", + "dtypes: float64(2), int64(5), object(5)\n", + "memory usage: 83.7+ KB\n" + ] + } + ], + "source": [ + "data_train.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2E77A7E73003473DBE95365B98118688", + "mdEditEnable": false + }, + "source": [ + "The output shows 891 passengers in the training data, but some attributes are incomplete. For example:\n", + "\n", + "- Age is recorded for only 714 passengers\n", + "- Cabin is known for only 204 passengers\n", + "\n", + "To inspect the actual values, the method below summarizes the distribution of the numeric columns. (Text attributes such as Name and categorical attributes such as Embarked do not appear in its output.)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "5DDB3F0E16054E048EA1C5EDC1BB8C6D", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " PassengerId Survived Pclass Age SibSp \\\n", + "count 891.000000 891.000000 891.000000 714.000000 891.000000 \n", + "mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n", + "std 257.353842 0.486592 0.836071 14.526497 1.102743 \n", + "min 1.000000 0.000000 1.000000 0.420000 0.000000 \n", + "25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n", + "50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n", + "75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n", + "max 891.000000 1.000000 3.000000 80.000000 8.000000 \n", + "\n", + " Parch Fare \n", + "count 891.000000 891.000000 \n", + "mean 0.381594 32.204208 \n", + "std 0.806057 49.693429 \n", + "min 0.000000 0.000000 \n", + "25% 0.000000 7.910400 \n", + "50% 0.000000 14.454200 \n", + "75% 0.000000 31.000000 \n", + "max 6.000000 512.329200 " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_train.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B6D0B24DA7CD46538040DB87EFDEA1BC", + "mdEditEnable": false + }, + "source": [ + " mean字段告诉我们,大概0.383838的人最后获救了,2/3等舱的人数比1等舱要多,平均乘客年龄大概是29.7岁(计算这个时候会略掉无记录的)等等…" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7AAF8DDD496A4E429B47B60927BC95D5", + "mdEditEnable": false + }, + "source": [ + "## 数据初步分析\n", + "\n", + "**- 『对数据的认识太重要了!』**\n", + "\n", + "**- 『对数据的认识太重要了!』**\n", + "\n", + "**- 『对数据的认识太重要了!』**\n", + "\n", + "仅仅最上面的对数据了解,依旧无法给我们提供想法和思路。我们再深入一点来看看我们的数据,看看每个/多个 属性和最后的Survived之间有着什么样的关系呢。\n", + "\n", + "### 乘客各属性分布" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "7D84F87CCD8D4DECA515BA51E3931BDF", + "scrolled": false + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "%matplotlib inline \n", + "import matplotlib.pyplot as plt\n", + "fig = plt.figure()\n", + "fig.set(alpha=0.2) # 设定图表颜色alpha参数\n", + "plt.rcParams['font.sans-serif']=['Microsoft YaHei'] #显示中文标签 plt.rcParams['axes.unicode_minus']=False\n", + "\n", + "plt.subplot2grid((2,3),(0,0)) # 在一张大图里分列几个小图\n", + "data_train.Survived.value_counts().plot(kind='bar')# 柱状图 \n", + "plt.title(u\"获救情况 (1为获救)\") # 标题\n", + "plt.ylabel(u\"人数\") \n", + "\n", + "plt.subplot2grid((2,3),(0,1))\n", + "data_train.Pclass.value_counts().plot(kind=\"bar\")\n", + "plt.ylabel(u\"人数\")\n", + "plt.title(u\"乘客等级分布\")\n", + "\n", + "plt.subplot2grid((2,3),(0,2))\n", + "plt.scatter(data_train.Survived, data_train.Age)\n", + "plt.ylabel(u\"年龄\") # 设定纵坐标名称\n", + "plt.grid(b=True, which='major', axis='y') \n", + "plt.title(u\"按年龄看获救分布 (1为获救)\")\n", + "\n", + "\n", + "plt.subplot2grid((2,3),(1,0), colspan=2)\n", + "data_train.Age[data_train.Pclass == 1].plot(kind='kde') \n", + "data_train.Age[data_train.Pclass == 2].plot(kind='kde')\n", + "data_train.Age[data_train.Pclass == 3].plot(kind='kde')\n", + "plt.xlabel(u\"年龄\")# plots an axis lable\n", + "plt.ylabel(u\"密度\") \n", + "plt.title(u\"各等级的乘客年龄分布\")\n", + "plt.legend((u'头等舱', u'2等舱',u'3等舱'),loc='best') # sets our legend for our graph.\n", + "\n", + "\n", + "plt.subplot2grid((2,3),(1,2))\n", + "data_train.Embarked.value_counts().plot(kind='bar')\n", + "plt.title(u\"各登船口岸上船人数\")\n", + "plt.ylabel(u\"人数\") \n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A329BAF0557149368662B041B9D39B74", + "mdEditEnable": false + }, + "source": [ + "在图上可以看出来,被救的人300多点,不到半数;3等舱乘客灰常多;遇难和获救的人年龄似乎跨度都很广;3个不同的舱年龄总体趋势似乎也一致,2/3等舱乘客20岁多点的人最多,1等舱40岁左右的最多;登船港口人数按照S、C、Q递减,而且S远多于另外俩港口。\n", + "\n", + "我们可能会有一些想法了:\n", + "\n", + "- 不同舱位/乘客等级可能和财富/地位有关系,最后获救概率可能会不一样\n", + "\n", + "- 年龄对获救概率也一定是有影响的,毕竟前面说了,副船长还说『小孩和女士先走』呢\n", + "\n", + 
"- Does the port of embarkation matter? Perhaps passengers boarding at different ports differed in background and status.\n", + "\n", + "Let us compute some statistics and see how these attributes are distributed." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A097A57249B140889E39306B2C44A70B", + "mdEditEnable": false + }, + "source": [ + "### Attributes versus survival" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A8B155CD2A154348A15AB35464861C24", + "mdEditEnable": false + }, + "source": [ + "#### Survival by passenger class" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "2CC0D1CFD0E54BCE8486ADB8FE0E2EAD", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "fig = plt.figure()\n", + "fig.set(alpha=0.2) # 设定图表颜色alpha参数\n", + "\n", + "Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()\n", + "Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()\n", + "df=pd.DataFrame({u'未获救':Survived_0,u'获救':Survived_1})\n", + "df.plot(kind='bar', stacked=True)\n", + "plt.title(u\"各乘客等级的获救情况\")\n", + "plt.xlabel(u\"乘客等级\") \n", + "plt.ylabel(u\"人数\") \n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "613662C50043454394E17B230BE8EA05", + "mdEditEnable": false + }, + "source": [ + "明显等级为1的乘客,获救的概率高很多。恩,这个一定是影响最后获救结果的一个特征。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4A247574DA87475C872ABC0755B32F90", + "mdEditEnable": false + }, + "source": [ + "#### 看看各性别的获救情况" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "CF47A03A505D43C6B34753EF10403A4E", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "fig = plt.figure()\n", + "fig.set(alpha=0.2) # 设定图表颜色alpha参数\n", + "\n", + "Survived_m = data_train.Survived[data_train.Sex == 'male'].value_counts()\n", + "Survived_f = data_train.Survived[data_train.Sex == 'female'].value_counts()\n", + "df=pd.DataFrame({u'男性':Survived_m, u'女性':Survived_f})\n", + "df.plot(kind='bar', stacked=True)\n", + "plt.title(u\"按性别看获救情况\")\n", + "plt.xlabel(u\"性别\") \n", + "plt.ylabel(u\"人数\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C89799DF55AB47B283D543D89B63F3A3", + "mdEditEnable": false + }, + "source": [ + "歪果盆友果然很尊重lady,lady first践行得不错。性别无疑也要作为重要特征加入最后的模型之中。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "159A753C26E24D25845EFCBE173B3D37", + "mdEditEnable": false + }, + "source": [ + "再来个详细版的好了。" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "409F149602FF49C19E127C9C074FE6B4", + "scrolled": false + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "#然后我们再来看看各种舱级别情况下各性别的获救情况\n", + "fig=plt.figure()\n", + "fig.set(alpha=0.65) # 设置图像透明度,无所谓\n", + "plt.title(u\"根据舱等级和性别的获救情况\")\n", + "\n", + "ax1=fig.add_subplot(141)\n", + "data_train.Survived[data_train.Sex == 'female'][data_train.Pclass != 3].value_counts().sort_index().plot(kind='bar', label=\"female highclass\", color='#FA2479')\n", + "ax1.set_xticks([0,1])\n", + "ax1.set_xticklabels([u\"未获救\", u\"获救\"], rotation=0)\n", + "ax1.legend([u\"女性/高级舱\"], loc='best')\n", + "\n", + "ax2=fig.add_subplot(142, sharey=ax1)\n", + "data_train.Survived[data_train.Sex == 'female'][data_train.Pclass == 3].value_counts().sort_index().plot(kind='bar', label='female, low class', color='pink')\n", + "ax2.set_xticklabels([u\"未获救\", u\"获救\"], rotation=0)\n", + "plt.legend([u\"女性/低级舱\"], loc='best')\n", + "\n", + "ax3=fig.add_subplot(143, sharey=ax1)\n", + "data_train.Survived[data_train.Sex == 'male'][data_train.Pclass != 3].value_counts().sort_index().plot(kind='bar', label='male, high class',color='lightblue')\n", + "ax3.set_xticklabels([u\"未获救\", u\"获救\"], rotation=0)\n", + "plt.legend([u\"男性/高级舱\"], loc='best')\n", + "\n", + "ax4=fig.add_subplot(144, sharey=ax1)\n", + "data_train.Survived[data_train.Sex == 'male'][data_train.Pclass == 3].value_counts().sort_index().plot(kind='bar', label='male low class', color='steelblue')\n", + "ax4.set_xticklabels([u\"未获救\", u\"获救\"], rotation=0)\n", + "plt.legend([u\"男性/低级舱\"], loc='best')\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A8D55FBCD7EA43BF8C7CF42851098D95", + "mdEditEnable": false + }, + "source": [ + "#### 我们看看各登船港口的获救情况" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "18C26D1BD0BC41CE8C1CFF996D2225B2", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "fig = plt.figure()\n", + "fig.set(alpha=0.2) # 设定图表颜色alpha参数\n", + "\n", + "Survived_0 = data_train.Embarked[data_train.Survived == 0].value_counts()\n", + "Survived_1 = data_train.Embarked[data_train.Survived == 1].value_counts()\n", + "df=pd.DataFrame({u'未获救':Survived_0,u'获救':Survived_1})\n", + "df.plot(kind='bar', stacked=True)\n", + "plt.title(u\"各登船港口乘客的获救情况\")\n", + "plt.xlabel(u\"登船港口\") \n", + "plt.ylabel(u\"人数\") \n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "908B60F9DD894D929DA6D5728CD1929A", + "mdEditEnable": false + }, + "source": [ + "#### 看看 堂兄弟/妹,孩子/父母有几人,对是否获救的影响" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "D5C98FE13EBB435F85FAF4F21938EA2C", + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " PassengerId\n", + "SibSp Survived \n", + "0 0 398\n", + " 1 210\n", + "1 0 97\n", + " 1 112\n", + "2 0 15\n", + " 1 13\n", + "3 0 12\n", + " 1 4\n", + "4 0 15\n", + " 1 3\n", + "5 0 5\n", + "8 0 7\n", + " PassengerId\n", + "Parch Survived \n", + "0 0 445\n", + " 1 233\n", + "1 0 53\n", + " 1 65\n", + "2 0 40\n", + " 1 40\n", + "3 0 2\n", + " 1 3\n", + "4 0 4\n", + "5 0 4\n", + " 1 1\n", + "6 0 1\n" + ] + } + ], + "source": [ + "gg = data_train.groupby(['SibSp','Survived'])\n", + "df = pd.DataFrame(gg.count()['PassengerId'])\n", + "print(df)\n", + "\n", + "gp = data_train.groupby(['Parch','Survived'])\n", + "df = pd.DataFrame(gp.count()['PassengerId'])\n", + "print(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "630CA44341D84F3D85B265A347016F96", + "mdEditEnable": false + }, + "source": [ + "好吧,没看出特别特别明显的规律(为自己的智商感到捉急…),先作为备选特征,放一放。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "18CB6E8DC7194E9490C41F27B755EBB7", + "mdEditEnable": false + }, + "source": [ + "#### 
Ticket and Cabin analysis\n", + "Ticket is the ticket number, essentially an identifier that should have little bearing on the outcome, so we leave it out of the feature set for now.\n", + "\n", + "Cabin has a value for only 204 passengers; let us look at its distribution first." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "52EFF0E17D9141F28A47633B2969D050", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "B96 B98 4\n", + "G6 4\n", + "C23 C25 C27 4\n", + "C22 C26 3\n", + "F33 3\n", + " ..\n", + "E34 1\n", + "C7 1\n", + "C54 1\n", + "E36 1\n", + "C148 1\n", + "Name: Cabin, Length: 147, dtype: int64" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_train.Cabin.value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AE065A86E10C46A78140AE25C0C9E628", + "mdEditEnable": false + }, + "source": [ + "Cabin should be treated as a categorical attribute, but it has many missing values and the known values are spread thinly over 147 distinct entries.\n", + "Treated directly as a categorical feature it would be too sparse, and each one-hot column would probably receive almost no weight. Given how many values are missing, we first use whether Cabin is present as the condition and look at Survived at that coarse level. (The missing entries may simply have been lost rather than never recorded, so this is not necessarily sound.)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "C46525090059436083CF94D6599456E0", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "fig = plt.figure()\n", + "fig.set(alpha=0.2) # 设定图表颜色alpha参数\n", + "\n", + "Survived_cabin = data_train.Survived[pd.notnull(data_train.Cabin)].value_counts()\n", + "Survived_nocabin = data_train.Survived[pd.isnull(data_train.Cabin)].value_counts()\n", + "df=pd.DataFrame({u'有':Survived_cabin, u'无':Survived_nocabin}).transpose()\n", + "df.plot(kind='bar', stacked=True)\n", + "plt.title(u\"按Cabin有无看获救情况\")\n", + "plt.xlabel(u\"Cabin有无\") \n", + "plt.ylabel(u\"人数\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C0F627D831324F4A8E02278E49A895B6", + "mdEditEnable": false + }, + "source": [ + "有Cabin记录的似乎获救概率稍高一些,先这么着放一放吧。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2BED89CF8A7D45E6832F5D54E84CAF75", + "mdEditEnable": false + }, + "source": [ + "## 简单数据预处理\n", + "\n", + "\n", + "数据预处理,其实就包括了很多Kaggler津津乐道的feature engineering过程。\n", + "\n", + "- **『特征工程(feature engineering)太重要了!』 **\n", + "- **『特征工程(feature engineering)太重要了!』 **\n", + "- **『特征工程(feature engineering)太重要了!』 **\n", + "\n", + "\n", + "先从最突出的数据属性开始Cabin和Age,有丢失数据实在是对下一步工作影响太大。\n", + "\n", + "Cabin,暂时就按照刚才说的,按Cabin有无数据,将这个属性处理成Yes和No两种类型。\n", + "\n", + "Age:\n", + "\n", + "通常遇到缺值的情况,我们会有几种常见的处理方式\n", + "\n", + "如果缺值的样本占总数比例极高,可能就直接舍弃了,作为特征加入的话,可能反倒带入noise,影响最后的结果了\n", + "如果缺值的样本适中,而该属性非连续值特征属性(比如说类目属性),那就把NaN作为一个新类别,加到类别特征中\n", + "如果缺值的样本适中,而该属性为连续值特征属性,有时候我们会考虑给定一个step(比如这里的age,我们可以考虑每隔2/3岁为一个步长),然后把它离散化,之后把NaN作为一个type加到属性类目中。\n", + "有些情况下,缺失的值个数并不是特别多,那我们也可以试着根据已有的值,拟合一下数据,补充上。\n", + "本例中,后两种处理方式应该都是可行的,我们先试试拟合补全吧(虽然说没有特别多的背景可供我们拟合,这不一定是一个多么好的选择)\n", + "\n", + "这里用scikit-learn中的RandomForest来拟合一下缺失的年龄数据(注:RandomForest是一个用在原始数据中做不同采样,建立多颗DecisionTree,再进行average等等来降低过拟合现象,提高结果的机器学习算法,我们之后会介绍到)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "77DB5AD318FB481C8F098E7DC145348D", + "scrolled": false + 
}, + "outputs": [], + "source": [ + "from sklearn.ensemble import RandomForestRegressor\n", + "\n", + "### Fill in missing ages with a RandomForestRegressor\n", + "def set_missing_ages(df):\n", + "\n", + "    # Take the existing numeric features to feed into the random forest regressor\n", + "    age_df = df[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]\n", + "\n", + "    # Split passengers into those with known and unknown age\n", + "    known_age = age_df[age_df.Age.notnull()].values\n", + "    unknown_age = age_df[age_df.Age.isnull()].values\n", + "\n", + "    # y is the target: age\n", + "    y = known_age[:, 0]\n", + "\n", + "    # X holds the feature values\n", + "    X = known_age[:, 1:]\n", + "\n", + "    # Fit the RandomForestRegressor\n", + "    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)\n", + "    rfr.fit(X, y)\n", + "\n", + "    # Predict the unknown ages with the fitted model\n", + "    predictedAges = rfr.predict(unknown_age[:, 1:])\n", + "\n", + "    # Fill the missing entries with the predictions\n", + "    df.loc[ (df.Age.isnull()), 'Age' ] = predictedAges \n", + "\n", + "    return df, rfr\n", + "\n", + "def set_Cabin_type(df):\n", + "    # Collapse Cabin to a presence flag: \"Yes\" if recorded, \"No\" if missing\n", + "    df.loc[ (df.Cabin.notnull()), 'Cabin' ] = \"Yes\"\n", + "    df.loc[ (df.Cabin.isnull()), 'Cabin' ] = \"No\"\n", + "    return df\n", + "\n", + "data_train, rfr = set_missing_ages(data_train)\n", + "data_train = set_Cabin_type(data_train)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "5036CC74FAE84F658FB4EF8FB8998E2A", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " PassengerId Survived Pclass \\\n", + "0 1 0 3 \n", + "1 2 1 1 \n", + "2 3 1 3 \n", + "3 4 1 1 \n", + "4 5 0 3 \n", + "5 6 0 3 \n", + "6 7 0 1 \n", + "7 8 0 3 \n", + "8 9 1 3 \n", + "9 10 1 2 \n", + "\n", + " Name Sex Age \\\n", + "0 Braund, Mr. Owen Harris male 22.000000 \n", + "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.000000 \n", + "2 Heikkinen, Miss. Laina female 26.000000 \n", + "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.000000 \n", + "4 Allen, Mr. William Henry male 35.000000 \n", + "5 Moran, Mr. James male 23.838953 \n", + "6 McCarthy, Mr. Timothy J male 54.000000 \n", + "7 Palsson, Master. Gosta Leonard male 2.000000 \n", + "8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.000000 \n", + "9 Nasser, Mrs. Nicholas (Adele Achem) female 14.000000 \n", + "\n", + " SibSp Parch Ticket Fare Cabin Embarked \n", + "0 1 0 A/5 21171 7.2500 No S \n", + "1 1 0 PC 17599 71.2833 Yes C \n", + "2 0 0 STON/O2. 3101282 7.9250 No S \n", + "3 1 0 113803 53.1000 Yes S \n", + "4 0 0 373450 8.0500 No S \n", + "5 0 0 330877 8.4583 No Q \n", + "6 0 0 17463 51.8625 Yes S \n", + "7 3 1 349909 21.0750 No S \n", + "8 0 2 347742 11.1333 No S \n", + "9 1 0 237736 30.0708 No C " + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_train.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "954507CD225B48F19CC9BC688024355A", + "mdEditEnable": false + }, + "source": [ + "因为逻辑回归建模时,需要输入的特征都是数值型特征,我们通常会先对类目型的特征因子化。 \n", + "什么叫做因子化呢?举个例子:\n", + "\n", + "以Cabin为例,原本一个属性维度,因为其取值可以是[‘yes’,’no’],而将其平展开为’Cabin_yes’,’Cabin_no’两个属性\n", + "\n", + "原本Cabin取值为yes的,在此处的”Cabin_yes”下取值为1,在”Cabin_no”下取值为0\n", + "原本Cabin取值为no的,在此处的”Cabin_yes”下取值为0,在”Cabin_no”下取值为1\n", + "使用pandas的”get_dummies”来完成这个工作,并拼接在原来的”data_train”之上,如下所示。" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "10DE8C414A214AD9842CA4C772644829", + 
"scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " PassengerId Survived Age SibSp Parch Fare Cabin_No Cabin_Yes \\\n", + "0 1 0 22.0 1 0 7.2500 1 0 \n", + "1 2 1 38.0 1 0 71.2833 0 1 \n", + "2 3 1 26.0 0 0 7.9250 1 0 \n", + "3 4 1 35.0 1 0 53.1000 0 1 \n", + "4 5 0 35.0 0 0 8.0500 1 0 \n", + "\n", + " Embarked_C Embarked_Q Embarked_S Sex_female Sex_male Pclass_1 \\\n", + "0 0 0 1 0 1 0 \n", + "1 1 0 0 1 0 1 \n", + "2 0 0 1 1 0 0 \n", + "3 0 0 1 1 0 1 \n", + "4 0 0 1 0 1 0 \n", + "\n", + " Pclass_2 Pclass_3 \n", + "0 0 1 \n", + "1 0 0 \n", + "2 0 1 \n", + "3 0 0 \n", + "4 0 1 " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dummies_Cabin = pd.get_dummies(data_train['Cabin'], prefix= 'Cabin')\n", + "\n", + "dummies_Embarked = pd.get_dummies(data_train['Embarked'], prefix= 'Embarked')\n", + "\n", + "dummies_Sex = pd.get_dummies(data_train['Sex'], prefix= 'Sex')\n", + "\n", + "dummies_Pclass = pd.get_dummies(data_train['Pclass'], prefix= 'Pclass')\n", + "\n", + "df = pd.concat([data_train, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)\n", + "df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DFF9132E01D14C268CD38CB8B1C56419", + "mdEditEnable": false + }, + "source": [ + "成功地把这些类目属性全都转成0,1的数值属性了。\n", + "\n", + "这样,看起来,是不是我们需要的属性值都有了,且它们都是数值型属性呢。\n", + "\n", + "仔细看看Age和Fare两个属性,乘客的数值幅度变化太大了!各属性值之间scale差距太大,将对收敛速度造成影响!甚至不收敛! \n", + "所以先用scikit-learn里面的preprocessing模块对这俩做一个scaling,所谓scaling,其实就是将一些变化幅度较大的特征化到[-1,1]之内。" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "FBA4537B8FDB444DA85DF162F8A28D95", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " PassengerId Survived Age SibSp Parch Fare Cabin_No Cabin_Yes \\\n", + "0 1 0 22.0 1 0 7.2500 1 0 \n", + "1 2 1 38.0 1 0 71.2833 0 1 \n", + "2 3 1 26.0 0 0 7.9250 1 0 \n", + "3 4 1 35.0 1 0 53.1000 0 1 \n", + "4 5 0 35.0 0 0 8.0500 1 0 \n", + "\n", + " Embarked_C Embarked_Q Embarked_S Sex_female Sex_male Pclass_1 \\\n", + "0 0 0 1 0 1 0 \n", + "1 1 0 0 1 0 1 \n", + "2 0 0 1 1 0 0 \n", + "3 0 0 1 1 0 1 \n", + "4 0 0 1 0 1 0 \n", + "\n", + " Pclass_2 Pclass_3 Age_scaled Fare_scaled \n", + "0 0 1 -0.561377 -0.502445 \n", + "1 0 0 0.613173 0.786845 \n", + "2 0 1 -0.267740 -0.488854 \n", + "3 0 0 0.392945 0.420730 \n", + "4 0 1 0.392945 -0.486337 " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import sklearn.preprocessing as preprocessing\n", + "scaler = preprocessing.StandardScaler()\n", + "age_scale_param = scaler.fit(df['Age'].values.reshape(-1,1))\n", + "df['Age_scaled'] = scaler.fit_transform(df['Age'].values.reshape(-1,1), age_scale_param)\n", + "fare_scale_param = scaler.fit(df['Fare'].values.reshape(-1,1))\n", + "df['Fare_scaled'] = scaler.fit_transform(df['Fare'].values.reshape(-1,1), fare_scale_param)\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F6D48EE010B042EC870A17B8FFFD61B5", + "mdEditEnable": false + }, + "source": [ + "恩,好看多了,万事俱备,只欠建模。马上就要看到成效了,哈哈。我们把需要的属性值抽出来,转成scikit-learn里面LogisticRegression可以处理的格式。\n", + "## 逻辑回归建模\n", + "我们把需要的feature字段取出来,转成numpy格式,使用scikit-learn中的LogisticRegression建模。" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "DCE1A3BC05B7443E8810C1EF8B5C14C8", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "LogisticRegression(penalty='l1', solver='liblinear', tol=1e-06)" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn import linear_model\n", + "\n", + "# 
Use a regex to select the columns we need\n", + "train_df = df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')\n", + "train_np = train_df.values\n", + "\n", + "# y is column 0: the Survived label\n", + "y = train_np[:, 0]\n", + "\n", + "# X is column 1 onward: the feature values\n", + "X = train_np[:, 1:]\n", + "\n", + "# Fit a LogisticRegression model\n", + "clf = linear_model.LogisticRegression(solver='liblinear',C=1.0, penalty='l1', tol=1e-6)\n", + "clf.fit(X, y)\n", + "\n", + "clf" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D5A6805C05604A8D8DDB2382248F29D0", + "mdEditEnable": false + }, + "source": [ + "Good, that went smoothly: we now have a model.\n", + "\n", + "But hold on. You cannot just feed test.csv straight into the model and expect results; that would be too naive. Our \"test_data\" must go through exactly the same preprocessing as \"train_data\"." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "E984C35272844F5182912EE3898B3A81", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PassengerIdAgeSibSpParchFareCabin_NoCabin_YesEmbarked_CEmbarked_QEmbarked_SSex_femaleSex_malePclass_1Pclass_2Pclass_3Age_scaledFare_scaled
089234.5007.829210010010010.307526-0.496637
189347.0107.000010001100011.256242-0.511497
289462.0009.687510010010102.394702-0.463335
389527.0008.66251000101001-0.261704-0.481704
489622.01112.28751000110001-0.641190-0.416740
\n", + "
" + ], + "text/plain": [ + " PassengerId Age SibSp Parch Fare Cabin_No Cabin_Yes Embarked_C \\\n", + "0 892 34.5 0 0 7.8292 1 0 0 \n", + "1 893 47.0 1 0 7.0000 1 0 0 \n", + "2 894 62.0 0 0 9.6875 1 0 0 \n", + "3 895 27.0 0 0 8.6625 1 0 0 \n", + "4 896 22.0 1 1 12.2875 1 0 0 \n", + "\n", + " Embarked_Q Embarked_S Sex_female Sex_male Pclass_1 Pclass_2 Pclass_3 \\\n", + "0 1 0 0 1 0 0 1 \n", + "1 0 1 1 0 0 0 1 \n", + "2 1 0 0 1 0 1 0 \n", + "3 0 1 0 1 0 0 1 \n", + "4 0 1 1 0 0 0 1 \n", + "\n", + " Age_scaled Fare_scaled \n", + "0 0.307526 -0.496637 \n", + "1 1.256242 -0.511497 \n", + "2 2.394702 -0.463335 \n", + "3 -0.261704 -0.481704 \n", + "4 -0.641190 -0.416740 " + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_test = pd.read_csv(\"test.csv\")\n", + "data_test.loc[ (data_test.Fare.isnull()), 'Fare' ] = 0\n", + "# 接着我们对test_data做和train_data中一致的特征变换\n", + "# 首先用同样的RandomForestRegressor模型填上丢失的年龄\n", + "tmp_df = data_test[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]\n", + "null_age = tmp_df[data_test.Age.isnull()].values\n", + "# 根据特征属性X预测年龄并补上\n", + "X = null_age[:, 1:]\n", + "predictedAges = rfr.predict(X)\n", + "data_test.loc[ (data_test.Age.isnull()), 'Age' ] = predictedAges\n", + "\n", + "data_test = set_Cabin_type(data_test)\n", + "dummies_Cabin = pd.get_dummies(data_test['Cabin'], prefix= 'Cabin')\n", + "dummies_Embarked = pd.get_dummies(data_test['Embarked'], prefix= 'Embarked')\n", + "dummies_Sex = pd.get_dummies(data_test['Sex'], prefix= 'Sex')\n", + "dummies_Pclass = pd.get_dummies(data_test['Pclass'], prefix= 'Pclass')\n", + "\n", + "\n", + "df_test = pd.concat([data_test, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)\n", + "df_test.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)\n", + "df_test['Age_scaled'] = scaler.fit_transform(df_test['Age'].values.reshape(-1,1), age_scale_param)\n", + "df_test['Fare_scaled'] = 
scaler.fit_transform(df_test['Fare'].values.reshape(-1,1), fare_scale_param)\n", + "df_test.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "25901C4D9F6B47678E87FC4D898D1CDA", + "mdEditEnable": false + }, + "source": [ + "Nice, the data looks fine; one last step to go. \n", + "Now let's make predictions and get the results." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "id": "E140D03943F24EE6BC56983AFC1D8182", + "scrolled": false + }, + "outputs": [], + "source": [ + "test = df_test.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')\n", + "predictions = clf.predict(test)\n", + "result = pd.DataFrame({'PassengerId':data_test['PassengerId'].values, 'Survived':predictions.astype(np.int32)})\n", + "result.to_csv(\"logistic_regression_predictions.csv\", index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "id": "5AD6B7EEB61B420C894E2964A3AB7993", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PassengerIdSurvived
08920
18930
28940
38950
48961
\n", + "
" + ], + "text/plain": [ + " PassengerId Survived\n", + "0 892 0\n", + "1 893 0\n", + "2 894 0\n", + "3 895 0\n", + "4 896 1" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.read_csv(\"logistic_regression_predictions.csv\").head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6D6B7CAACAEF4264824F93794CC731FE", + "mdEditEnable": false + }, + "source": [ + "\n", + "格式正确\n", + "\n", + "在Kaggle的Make a submission页面,提交上结果。如下: \n", + "\n", + "0.7635,这只是简单分析处理过后出的一个baseline模型嘛。\n", + "\n", + "## 逻辑回归系统优化\n", + "### 模型系数关联分析\n", + "\n", + "Andrew Ng老师的machine Learning课程说的,现在应该分析分析模型现在的状态,是过/欠拟合?以确定我们需要更多的特征还是更多数据,或者其他操作。\n", + "\n", + "有一条很著名的learning curves。\n", + "\n", + "不过在现在的场景下先不着急做这个事情这个baseline系统还有些粗糙,再挖掘挖掘。\n", + "\n", + "首先,Name和Ticket两个属性完整舍弃了(因为这俩属性几乎每一条记录都是一个完全不同的值并没有找到很直接的处理方式)。\n", + "\n", + "然后,年龄的拟合本身也未必是一件非常靠谱的事情,依据其余属性,其实并不能很好地拟合预测出未知的年龄。再一个,日常经验,小盆友和老人可能得到的照顾会多一些,这样看的话,年龄作为一个连续值,给一个固定的系数,应该和年龄是一个正相关或者负相关,似乎体现不出两头受照顾的实际情况,所以,把年龄离散化,按区段分作类别属性会更合适一些。\n", + "\n", + "把得到的model系数和feature关联起来看看。\n", + "\n", + "**LR模型系数:**" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "id": "34F8D229D4764822AF8E21C87060A4B0", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
columnscoef
0SibSp[-0.34423350629771865]
1Parch[-0.10491782795293876]
2Cabin_No[0.0]
3Cabin_Yes[0.9020907454858608]
4Embarked_C[0.0]
5Embarked_Q[0.0]
6Embarked_S[-0.4172600741014519]
7Sex_female[1.9565674266199975]
8Sex_male[-0.677419771178524]
9Pclass_1[0.34116840933660547]
10Pclass_2[0.0]
11Pclass_3[-1.1941311197849658]
12Age_scaled[-0.5237628105223407]
13Fare_scaled[0.08443592417790625]
\n", + "
" + ], + "text/plain": [ + " columns coef\n", + "0 SibSp [-0.34423350629771865]\n", + "1 Parch [-0.10491782795293876]\n", + "2 Cabin_No [0.0]\n", + "3 Cabin_Yes [0.9020907454858608]\n", + "4 Embarked_C [0.0]\n", + "5 Embarked_Q [0.0]\n", + "6 Embarked_S [-0.4172600741014519]\n", + "7 Sex_female [1.9565674266199975]\n", + "8 Sex_male [-0.677419771178524]\n", + "9 Pclass_1 [0.34116840933660547]\n", + "10 Pclass_2 [0.0]\n", + "11 Pclass_3 [-1.1941311197849658]\n", + "12 Age_scaled [-0.5237628105223407]\n", + "13 Fare_scaled [0.08443592417790625]" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame({\"columns\":list(train_df.columns)[1:], \"coef\":list(clf.coef_.T)})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CAEA1C315A1D43E0B045C0FBF9C2C4F3", + "mdEditEnable": false, + "scrolled": false + }, + "source": [ + "系数为正的特征,和最后结果是一个正相关,反之为负相关。\n", + "\n", + "那些权重绝对值非常大的feature,在模型上:\n", + "\n", + "- Sex属性,如果是female会极大提高最后获救的概率,而male会很大程度拉低这个概率。\n", + "- Pclass属性,1等舱乘客最后获救的概率会上升,而乘客等级为3会极大地拉低这个概率。\n", + "- 有Cabin值会很大程度拉升最后获救概率(这里似乎能看到了一点端倪,事实上从最上面的有无Cabin记录的Survived分布图上看出,即使有Cabin记录的乘客也有一部分遇难了,估计这个属性上我们挖掘还不够)\n", + "- Age是一个负相关,意味着在我们的模型里,年龄越小,越有获救的优先权(还得回原数据看看这个是否合理)\n", + "- 有一个登船港口S会很大程度拉低获救的概率,另外俩港口压根就没啥作用(这个实际上非常奇怪,因为我们从之前的统计图上并没有看到S港口的获救率非常低,所以也许可以考虑把登船港口这个feature去掉试试)。\n", + "- 船票Fare有小幅度的正相关(并不意味着这个feature作用不大,有可能是我们细化的程度还不够,举个例子,说不定我们得对它离散化,再分至各个乘客等级上?)\n", + "\n", + "噢啦,观察完了,我们现在有一些想法了,但是怎么样才知道,哪些优化的方法是promising的呢?\n", + "\n", + "因为test.csv里面并没有Survived这个字段无法在这份数据上评定我们算法在该场景下的效果…\n", + "\n", + "『每做一次调整就make a submission,然后根据结果来判定这次调整的好坏』是行不通的…\n", + "\n", + "### 交叉验证\n", + "\n", + "\n", + "- **『要做交叉验证(cross validation)!』 **\n", + "- **『要做交叉验证(cross validation)!』 **\n", + "- **『要做交叉验证(cross validation)!』 **\n", + "\n", + "通常情况下,这么做cross validation:把train.csv分成两部分,一部分用于训练我们需要的模型,另外一部分数据上看我们预测算法的效果。\n", + "\n", + "用scikit-learn的cross_validation来帮我们完成小数据集上的这个工作。\n", + "\n", + 
"先简单看看cross validation情况下的打分\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "id": "0399477A313B4B0484FF4E558E6F06A8", + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[0.81564246 0.80898876 0.78651685 0.78651685 0.81460674]\n" + ] + } + ], + "source": [ + "# from sklearn import cross_validation\n", + "# 参考https://blog.csdn.net/cheneyshark/article/details/78640887 , 0.18版本中,cross_validation被废弃\n", + "# 改为下面的从model_selection直接import cross_val_score 和 train_test_split\n", + "from sklearn.model_selection import cross_val_score, train_test_split\n", + "\n", + " #简单看看打分情况\n", + "clf = linear_model.LogisticRegression(solver='liblinear',C=1.0, penalty='l1', tol=1e-6)\n", + "all_data = df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')\n", + "X = all_data.values[:,1:]\n", + "y = all_data.values[:,0]\n", + "# print(cross_validation.cross_val_score(clf, X, y, cv=5))\n", + "print(cross_val_score(clf, X, y, cv=5))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "325C5FFAE140492B829C3750CC0F2C3B", + "mdEditEnable": false + }, + "source": [ + "结果是下面酱紫的: \n", + "[0.81564246 0.81005587 0.78651685 0.78651685 0.81355932]\n", + "\n", + "似乎比Kaggle上的结果略高哈,毕竟用的是不是同一份数据集评估的。\n", + "\n", + "等等,既然我们要做交叉验证,那我们干脆先把交叉验证里面的bad case拿出来看看,看看人眼审核,是否能发现什么蛛丝马迹,是我们忽略了哪些信息,使得这些乘客被判定错了。再把bad case上得到的想法和前头系数分析的合在一起,然后逐个试试。\n", + "\n", + "下面我们做数据分割,并且在原始数据集上瞄一眼bad case:" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "id": "FDB2637F9C574ED88B41F198D3C677C8", + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
232411Sloper, Mr. William Thompsonmale28.000011378835.5000A6S
252613Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...female38.001534707731.3875NaNS
495003Arnold-Franchi, Mrs. Josef (Josefine Franchi)female18.001034923717.8000NaNS
555611Woolner, Mr. HughmaleNaN001994735.5000C52S
656613Moubarek, Master. GeriosmaleNaN11266115.2458NaNC
787912Caldwell, Master. Alden Gatesmale0.830224873829.0000NaNS
818213Sheerlinck, Mr. Jan Baptistmale29.00003457799.5000NaNS
11811901Baxter, Mr. Quigg Edmondmale24.0001PC 17558247.5208B58 B60C
13914001Giglio, Mr. Victormale24.0000PC 1759379.2000B86C
16516613Goldsmith, Master. Frank John William \"Frankie\"male9.000236329120.5250NaNS
\n", + "
" + ], + "text/plain": [ + " PassengerId Survived Pclass \\\n", + "23 24 1 1 \n", + "25 26 1 3 \n", + "49 50 0 3 \n", + "55 56 1 1 \n", + "65 66 1 3 \n", + "78 79 1 2 \n", + "81 82 1 3 \n", + "118 119 0 1 \n", + "139 140 0 1 \n", + "165 166 1 3 \n", + "\n", + " Name Sex Age SibSp \\\n", + "23 Sloper, Mr. William Thompson male 28.00 0 \n", + "25 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.00 1 \n", + "49 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.00 1 \n", + "55 Woolner, Mr. Hugh male NaN 0 \n", + "65 Moubarek, Master. Gerios male NaN 1 \n", + "78 Caldwell, Master. Alden Gates male 0.83 0 \n", + "81 Sheerlinck, Mr. Jan Baptist male 29.00 0 \n", + "118 Baxter, Mr. Quigg Edmond male 24.00 0 \n", + "139 Giglio, Mr. Victor male 24.00 0 \n", + "165 Goldsmith, Master. Frank John William \"Frankie\" male 9.00 0 \n", + "\n", + " Parch Ticket Fare Cabin Embarked \n", + "23 0 113788 35.5000 A6 S \n", + "25 5 347077 31.3875 NaN S \n", + "49 0 349237 17.8000 NaN S \n", + "55 0 19947 35.5000 C52 S \n", + "65 1 2661 15.2458 NaN C \n", + "78 2 248738 29.0000 NaN S \n", + "81 0 345779 9.5000 NaN S \n", + "118 1 PC 17558 247.5208 B58 B60 C \n", + "139 0 PC 17593 79.2000 B86 C \n", + "165 2 363291 20.5250 NaN S " + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# 分割数据,按照 训练数据:cv数据 = 7:3的比例\n", + "# split_train, split_cv = cross_validation.train_test_split(df, test_size=0.3, random_state=0)\n", + "split_train, split_cv = train_test_split(df, test_size=0.3, random_state=42)\n", + "\n", + "train_df = split_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')\n", + "# 生成模型\n", + "clf = linear_model.LogisticRegression(solver='liblinear',C=1.0, penalty='l1', tol=1e-6)\n", + "clf.fit(train_df.values[:,1:], train_df.values[:,0])\n", + "\n", + "# 对cross validation数据进行预测\n", + "\n", + "cv_df = 
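split_cv.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')\n", + "predictions = clf.predict(cv_df.values[:,1:])\n", + "\n", + "origin_data_train = pd.read_csv(\"train.csv\")\n", + "bad_cases = origin_data_train.loc[origin_data_train['PassengerId'].isin(split_cv[predictions != cv_df.values[:,0]]['PassengerId'].values)]\n", + "bad_cases.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FCBECB2990084A9587F17BB87BE509ED", + "mdEditEnable": false + }, + "source": [ + "Try running this yourself and study the bad cases closely; you will come up with guesses and ideas of your own. Some of them may confirm the guesses from the coefficient analysis, and those optimizations deserve higher priority.\n", + "\n", + "We now have two parts of the data, \"train_df\" and \"cv_df\": the former for training the model, the latter for evaluating and selecting models. Time to start experimenting in earnest.\n", + "\n", + "Here are some optimizations we might try:\n", + "\n", + "- Instead of the current regression-based imputation, fill in Age with the mean age of passengers who share the same title in their name (\"Mr\", \"Mrs\", \"Miss\", and so on).\n", + "- Instead of treating Age as a continuous attribute, discretize it with a fixed step into a categorical feature.\n", + "- Refine Cabin: for records that have a Cabin value, split it into the leading letter (my guess is that it encodes location or deck) and the trailing number (presumably the room number; interestingly, if you look closely at the raw data, larger numbers seem to go with a higher chance of survival).\n", + "- Pclass and Sex are very important, so try combining them into an interaction feature, which is another kind of refinement.\n", + "- Add a Child field: 1 if Age<=12, 0 otherwise (look at the data: children really did get high priority).\n", + "- If the name contains \"Mrs\" and Parch>1, she is probably a mother and her chance of survival should be higher, so add a Mother field set to 1 in that case and 0 otherwise.\n", + "- Try dropping the port of embarkation (Q and C have zero weight anyway, and S looks odd).\n", + "- Add SibSp (siblings and spouses), Parch, and the passenger themselves together into a Family_size field (a large family may affect the outcome).\n", + "- Name is an attribute we have not touched so far. We can do some simple processing, for example mapping male passengers whose names contain certain words ('Capt', 'Don', 'Major', 'Sir') to a single Title, and likewise for female passengers.\n", + "\n", + "Keep digging and you will probably think of more. I will stop the list here; we can now use \"train_df\" and \"cv_df\" to test whether these feature engineering tricks actually help.\n", + "\n", + "Experimenting takes a long time and a lot of patience. We often end up in the awkward situation where a feature idea comes in a flash, we are convinced it will work, and the experiment turns out worse than before. It takes persistence, patience, and continued digging.\n", + "\n", + "My best result came from `Survived~C(Pclass)+C(Title)+C(Sex)+C(Age_bucket)+C(Cabin_num_bucket)Mother+Fare+Family_Size` (sorry: my hand slipped and I closed the page when submitting, so I did not get a screenshot. The image below comes from re-submitting that result after I had already reached my highest score; it scored 0.79426, which is not my current best, so the ranking did not change):\n", + "\n", + "![Result after the feature engineering adjustments](https://www.z4a.net/images/2018/11/28/result_3.jpg)\n", + "\n", + "### 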
learning curves\n", + "\n", + "有一个很可能发生的问题是,我们不断地做feature engineering,产生的特征越来越多,用这些特征去训练模型,会对我们的训练集拟合得越来越好,同时也可能在逐步丧失泛化能力,从而在待预测的数据上,表现不佳,也就是发生过拟合问题。\n", + "\n", + "从另一个角度上说,如果模型在待预测的数据上表现不佳,除掉上面说的过拟合问题,也有可能是欠拟合问题,也就是说在训练集上,其实拟合的也不是那么好。\n", + "\n", + "额,这个欠拟合和过拟合怎么解释呢。这么说吧:\n", + "\n", + "- 过拟合就像是你班那个学数学比较刻板的同学,老师讲过的题目,一字不漏全记下来了,于是老师再出一样的题目,分分钟精确出结果。but数学考试,因为总是碰到新题目,所以成绩不咋地。\n", + "- 欠拟合就像是,咳咳,和博主level差不多的差生。连老师讲的练习题也记不住,于是连老师出一样题目复习的周测都做不好,考试更是可想而知了。\n", + "而在机器学习的问题上,对于过拟合和欠拟合两种情形。我们优化的方式是不同的。\n", + "\n", + "对过拟合而言,通常以下策略对结果优化是有用的:\n", + "\n", + "- 做一下feature selection,挑出较好的feature的subset来做training\n", + "- 提供更多的数据,从而弥补原始数据的bias问题,学习到的model也会更准确\n", + "而对于欠拟合而言,我们通常需要更多的feature,更复杂的模型来提高准确度。\n", + "\n", + "著名的learning curve可以帮我们判定我们的模型现在所处的状态。我们以样本数为横坐标,训练和交叉验证集上的错误率作为纵坐标,两种状态分别如下两张图所示:过拟合(overfitting/high variace),欠拟合(underfitting/high bias)\n", + "\n", + "![过拟合](https://www.z4a.net/images/2018/11/28/high_variance.jpg)\n", + "![欠拟合](https://www.z4a.net/images/2018/11/28/10067a39f8c5849405a.jpg)\n", + "\n", + "我们也可以把错误率替换成准确率(得分),得到另一种形式的learning curve(sklearn 里面是这么做的)。\n", + "\n", + "回到我们的问题,我们用scikit-learn里面的learning_curve来帮我们分辨我们模型的状态。举个例子,这里我们一起画一下我们最先得到的baseline model的learning curve。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "id": "C76CD67974F947BC8B7F5D32E2134DD3", + "scrolled": false + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/opt/conda/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.\n", + " warnings.warn(CV_WARNING, FutureWarning)\n" + ] + }, + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "(0.8065696844854024, 0.018258876711338634)" + ] + }, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "# from sklearn.learning_curve import learning_curve 修改以fix learning_curve DeprecationWarning\n", + "from sklearn.model_selection import learning_curve\n", + "\n", + "# 用sklearn的learning_curve得到training_score和cv_score,使用matplotlib画出learning curve\n", + "def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, \n", + " train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):\n", + " \"\"\"\n", + " 画出data在某模型上的learning curve.\n", + " 参数解释\n", + " ----------\n", + " estimator : 你用的分类器。\n", + " title : 表格的标题。\n", + " X : 输入的feature,numpy类型\n", + " y : 输入的target vector\n", + " ylim : tuple格式的(ymin, ymax), 设定图像中纵坐标的最低点和最高点\n", + " cv : 做cross-validation的时候,数据分成的份数,其中一份作为cv集,其余n-1份作为training(默认为3份)\n", + " n_jobs : 并行的的任务数(默认1)\n", + " \"\"\"\n", + " train_sizes, train_scores, test_scores = learning_curve(\n", + " estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)\n", + " \n", + " train_scores_mean = np.mean(train_scores, axis=1)\n", + " train_scores_std = np.std(train_scores, axis=1)\n", + " test_scores_mean = np.mean(test_scores, axis=1)\n", + " test_scores_std = np.std(test_scores, axis=1)\n", + " \n", + " if plot:\n", + " plt.figure()\n", + " plt.title(title)\n", + " if ylim is not None:\n", + " plt.ylim(*ylim)\n", + " plt.xlabel(u\"训练样本数\")\n", + " plt.ylabel(u\"得分\")\n", + " plt.gca().invert_yaxis()\n", + " plt.grid()\n", + " \n", + " plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, \n", + " alpha=0.1, color=\"b\")\n", + " plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, \n", + " 
alpha=0.1, color=\"r\")\n", + "        plt.plot(train_sizes, train_scores_mean, 'o-', color=\"b\", label=u\"Score on training set\")\n", + "        plt.plot(train_sizes, test_scores_mean, 'o-', color=\"r\", label=u\"Score on cross-validation set\")\n", + "    \n", + "        plt.legend(loc=\"best\")\n", + "        \n", + "        plt.draw()\n", + "        plt.gca().invert_yaxis()\n", + "        plt.show()\n", + "    \n", + "    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2\n", + "    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])\n", + "    return midpoint, diff\n", + "\n", + "plot_learning_curve(clf, u\"Learning curve\", X, y)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2B1321FDECB743C78976C4E94B469645", + "mdEditEnable": false + }, + "source": [ + "On real data the learning curve is not as smooth as the theoretical one, but we can roughly see that the score curves on the training and cross-validation sets behave as expected.\n", + "\n", + "Judging from the current curves, our model is not overfitting (overfitting typically shows up as a high score on the training set and a much lower score on the cross-validation set, with a large gap between them). So we can do more feature engineering and add newly derived or combined features to the model.\n", + "\n", + "## Model ensemble\n", + "\n", + "\n", + "- **Model ensembling matters!**\n", + "- **Model ensembling matters!**\n", + "- **Model ensembling matters!**\n", + "\n", + "\n", + "The simplest form of model ensembling, for a classification problem, goes like this: given a set of classifiers trained on the same dataset (say logistic regression, SVM, KNN, random forest, a neural network), let each one make its own prediction, tally the votes, and take the label with the most votes as the final result.\n", + "\n", + "Ensembling can mitigate overfitting that arises during training, which helps improve accuracy to some extent.\n", + "\n", + "So far we have only used logistic regression. What if we still want to use the ensemble idea to improve the result?\n", + "\n", + "Since we have no choice of model here, we vary the data instead.\n", + "\n", + "Rather than using the full training set, we train each time on a subset of it. Although the learning algorithm is the same, the resulting models differ. And because no single subset is complete, any overfitting happens on that subset rather than on the whole dataset, so combining the models may help the final result.\n", + "\n", + "We use Bagging from scikit-learn to implement this idea:" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "id": "0B5FC2631C72456186DCEC6929924625" + }, + "outputs": [], + "source": [ + "from sklearn.ensemble import BaggingRegressor\n", + "\n", + "train_df = 
df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass.*|Mother|Child|Family|Title')\n", + "train_np = train_df.values\n", + "\n", + "# y is the Survived label\n", + "y = train_np[:, 0]\n", + "\n", + "# X holds the feature values\n", + "X = train_np[:, 1:]\n", + "\n", + "# Fit a BaggingRegressor\n", + "#clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6)\n", + "# 'newton-cg', 'lbfgs', 'sag' and 'saga' handle L2 or no penalty\n", + "# 'liblinear' and 'saga' also handle the L1 penalty\n", + "clf = linear_model.LogisticRegression(C=1, penalty='l1', solver='liblinear')\n", + "bagging_clf = BaggingRegressor(clf, n_estimators=20, max_samples=0.8, max_features=1.0, bootstrap=True, bootstrap_features=False, n_jobs=-1)\n", + "bagging_clf.fit(X, y)\n", + "\n", + "test = df_test.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass.*|Mother|Child|Family|Title')\n", + "predictions = bagging_clf.predict(test)\n", + "# BaggingRegressor averages the 0/1 predictions of the base models, so threshold at 0.5 (majority vote);\n", + "# a bare astype(np.int32) would truncate e.g. 0.95 down to 0\n", + "result = pd.DataFrame({'PassengerId':data_test['PassengerId'].values, 'Survived':(predictions >= 0.5).astype(np.int32)})\n", + "result.to_csv(\"logistic_regression_bagging_predictions.csv\", index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "id": "7EABF7D54061460CA6CA37873DC2424F" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PassengerIdSurvived
08920
18930
28940
38950
48960
58970
68981
78990
89001
99010
\n", + "
" + ], + "text/plain": [ + " PassengerId Survived\n", + "0 892 0\n", + "1 893 0\n", + "2 894 0\n", + "3 895 0\n", + "4 896 0\n", + "5 897 0\n", + "6 898 1\n", + "7 899 0\n", + "8 900 1\n", + "9 901 0" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.read_csv(\"logistic_regression_bagging_predictions.csv\").head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4FD550B25C30489B8204F4C788E79F18", + "mdEditEnable": false + }, + "source": [ + "恩,对结果还是有帮助的。\n", + "\n", + "\n", + "## 总结\n", + "\n", + "对于任何的机器学习问题,不要一上来就追求尽善尽美,先用自己会的算法撸一个baseline的model出来,再进行后续的分析步骤,一步步提高。\n", + "\n", + "在问题的结果过程中:\n", + "\n", + "- **『对数据的认识太重要了!』**\n", + "- **『数据中的特殊点/离群点的分析和处理太重要了!』**\n", + "- **『特征工程(feature engineering)太重要了!』**\n", + "- **『模型融合(model ensemble)太重要了!』**\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}