# Hyperparameter Tuning
Many machine learning algorithms have parameters we can tune (i.e. hyperparameters). For example, the random forest we are using requires us to specify how many decision trees the forest contains, the maximum depth of each tree, and so on. All of these hyperparameters affect the model's performance to some degree. So how do we find suitable hyperparameters that bring the model to a reasonably good level of performance? We can use grid search!
Grid search simply means trying every combination of the parameter values we want to test and checking which combination performs best; that combination is then taken as the model's best set of parameters.
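To make the idea concrete, here is a minimal hand-rolled sketch of such an exhaustive search. The candidate values are illustrative, and `X_train`/`Y_train` are assumed to be the training features and labels prepared as in the code further down:
```python
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Try every combination of the candidate values and keep the best one
best_score, best_params = -1.0, None
for n_estimators, max_depth in product([10, 50, 100], [5, 10, 15]):
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    # Evaluate this combination with 5-fold cross-validation
    score = cross_val_score(clf, X_train, Y_train, cv=5).mean()
    if score > best_score:
        best_score = score
        best_params = {'n_estimators': n_estimators, 'max_depth': max_depth}
print(best_params, best_score)
```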
`sklearn` provides a grid-search interface, `GridSearchCV`, so we can run a grid search conveniently without writing a loop like the one above ourselves.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Dictionary of parameters to tune: keys are parameter names, values are the candidate values to try
param_grid = {'n_estimators': [10, 20, 50, 100, 150, 200], 'max_depth': [5, 10, 15, 20, 25, 30]}
# Grid search over the random forest, scoring each combination with 5-fold cross-validation
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
# X_train and Y_train are the training features and labels (prepared as in the block below)
grid_search.fit(X_train, Y_train)
# Print the best parameter combination
print(grid_search.best_params_)
# Print the best cross-validation score achieved with that combination
print(grid_search.best_score_)
```
<div align=center><img src="../img/58.jpg"/></div>
As you can see, after tuning, the performance of our random forest model improves to `0.8323`, a gain of nearly `1%` in accuracy. If we then build a random forest with the best parameters and evaluate it on the test set, the test accuracy reaches `0.8525`:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# data and test_data are the preprocessed training and test DataFrames from the earlier steps
Y_train = data['Survived']
X_train = data.drop(['Survived'], axis=1)
Y_test = test_data['Survived']
X_test = test_data.drop(['Survived'], axis=1)

# Train a random forest with the best parameters found by the grid search
clf = RandomForestClassifier(n_estimators=50, max_depth=5)
clf.fit(X_train, Y_train)
predict = clf.predict(X_test)
print(accuracy_score(Y_test, predict))
```
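As a side note, with its default `refit=True`, `GridSearchCV` also retrains a copy of the model with the best parameters on the whole training set and exposes it as `best_estimator_`, so rebuilding the classifier by hand is optional. A minimal sketch, assuming the `grid_search` object from above has already been fit:
```python
# best_estimator_ is the best model, refit on all of X_train (because refit=True by default)
best_clf = grid_search.best_estimator_
print(accuracy_score(Y_test, best_clf.predict(X_test)))
```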