# 11.1 A First Look at the Amazon Rainforest Data

The first step after getting a dataset is always to check how many fields it contains and what each field means. This dataset is a CSV file, so we can read it with pandas.

```python
import pandas as pd

# Read the CSV file
data = pd.read_csv('./amazon.csv')

# Look at the first 5 rows
data.head(5)
```
![](1.jpg)
As you can see, each row records how many fires occurred in a particular region on a given day. None of the first 5 rows report any fires, which looks encouraging so far.

Next, let's look at some simple summary statistics of the fire counts.

```python
data['number'].describe()
```
![](2.jpg)
Wow — between 1998 and 2019 the dataset holds 6,454 records, and each record reports about 108 fires on average!
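Note that the `count` reported by `describe()` is the number of rows, not a fire total; the total comes from summing the column. A minimal sketch on a synthetic stand-in (the `number` column name is taken from `amazon.csv`, the values are made up) makes the distinction clear:

```python
import pandas as pd

# Tiny stand-in for amazon.csv: one row per (region, day),
# 'number' = fires reported for that row
data = pd.DataFrame({
    'state': ['Acre', 'Acre', 'Para', 'Para'],
    'number': [0.0, 3.0, 10.0, 7.0],
})

stats = data['number'].describe()
print(stats['count'])        # number of records: 4.0
print(stats['mean'])         # average fires per record: 5.0

# The actual total number of fires is the column sum
print(data['number'].sum())  # 20.0
```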
These statistics are still a bit coarse for analysis, so let's break the fire counts down by group — say, by year, month, and state. Before grouping by month, we can first simplify the month names in the data.

```python
# Replace the Portuguese month names with English abbreviations
month_map = {
    'Janeiro': 'Jan', 'Fevereiro': 'Feb', 'Março': 'Mar',
    'Abril': 'Apr',   'Maio': 'May',      'Junho': 'Jun',
    'Julho': 'Jul',   'Agosto': 'Aug',    'Setembro': 'Sep',
    'Outubro': 'Oct', 'Novembro': 'Nov',  'Dezembro': 'Dec',
}
data['month'] = data['month'].replace(month_map)

# Group by year, state, and month, and sum the fire counts
year_mo_state = data.groupby(by=['year', 'state', 'month']).sum().reset_index()

year_mo_state
```
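The same map-then-group steps, run end to end on a tiny synthetic frame (toy values, only two months in the map for brevity, and selecting the `number` column before summing so the aggregation stays numeric):

```python
import pandas as pd

# Toy stand-in with the same columns used above
data = pd.DataFrame({
    'year':   [1998, 1998, 1998, 1999],
    'state':  ['Acre', 'Acre', 'Para', 'Acre'],
    'month':  ['Janeiro', 'Janeiro', 'Fevereiro', 'Janeiro'],
    'number': [1.0, 2.0, 5.0, 4.0],
})

month_map = {'Janeiro': 'Jan', 'Fevereiro': 'Feb'}
data['month'] = data['month'].replace(month_map)

# Sum fire counts within each (year, state, month) group
year_mo_state = (data.groupby(['year', 'state', 'month'])['number']
                     .sum()
                     .reset_index())
print(year_mo_state)
```

The two rows for Acre in January 1998 collapse into a single row with `number == 3.0`, which is exactly what the grouped report in the text does at full scale.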
![](3.jpg)
We have now tallied the number of fires by year, month, and state. But a partial table like this (only part of the results is shown here for reasons of space) probably leaves you cold: people respond to pictures, not piles of numbers. So in the next section we'll draw some interesting charts to give you a much more intuitive feel for this data.
# Chapter 12 Putting It All Together: Credit Card Fraud Detection
A credit card is a certificate of credit issued by a commercial bank or credit card company to consumers with a qualifying credit record. Cardholders can shop or spend at affiliated merchants, with the bank later settling the transactions between merchant and cardholder, and they may overdraw within an agreed limit.

But in any big forest there are strange birds: some criminals will try every trick to run fraudulent transactions through credit cards. The dataset for this chapter, provided by a group of European banks, contains transactions made by European cardholders in September 2013, covering two days of activity. Our goal is to analyze, mine, and model these data in order to detect credit card fraud.
# 13.1 A Preliminary Look at the Data
This step should feel more than familiar by now — painfully familiar, even. But it is also a crucial and indispensable part of data mining.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

datafr = pd.read_csv("./data.csv")

# See what the first 10 rows actually look like
datafr.head(10)
```
![](1.jpg)

![](2.jpg)

![](3.jpg)

![](4.jpg)

![](5.jpg)

![](6.jpg)

![](7.jpg)

![](8.jpg)
Feeling a little lost? There are a lot of features, plus some missing values — this dataset clearly won't be easy to handle. That's right: we need to work through it one step at a time.

As usual, let's first check how many rows and columns the dataset has.

```python
print("Dimension of the dataset is: ", datafr.shape)
```
![](9.jpg)
89 features in total! Next, let's check the missing values.

```python
# Count the missing values in each feature, most missing first
datafr.isnull().sum().sort_values(ascending=False)
```
![](10.jpg)

![](11.jpg)

![](12.jpg)
Of the 89 features, only 13 are fully intact. Fortunately, though, the missing ratios of most features are low; only the Loaned From feature is missing badly. So for now we can assume this feature isn't of much use and simply drop it.

```python
# Drop the Loaned From column
datafr.drop('Loaned From', axis=1, inplace=True)
```
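More generally, the same missing-value check can drive the decision automatically: compute each column's missing ratio and drop whatever exceeds a cutoff. A sketch on toy data (the 50% cutoff and the column names are illustrative choices, not from the text):

```python
import numpy as np
import pandas as pd

# Toy frame: 'mostly_missing' stands in for a column like Loaned From
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [1.0, np.nan, 3.0, 4.0],
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
})

# Per-column missing ratio, largest first
ratio = df.isnull().mean().sort_values(ascending=False)
print(ratio)  # mostly_missing: 0.75, b: 0.25, a: 0.0

# Drop columns missing more than half their values (threshold is arbitrary)
df = df.drop(columns=ratio[ratio > 0.5].index)
print(df.columns.tolist())  # ['a', 'b']
```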