## 5.1 Spark Machine Learning in Practice
### 5.1.1 Data Types
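`MLlib` supports local vectors and local matrices stored on a single machine, as well as distributed matrices backed by one or more `RDD`s. Local vectors and local matrices are simple data models that serve as public interfaces.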
#### 5.1.1.1 Local Vectors
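`MLlib` supports two types of local vectors: dense and sparse. A dense vector is backed by a `double` array that holds every entry, while a sparse vector is backed by two parallel arrays, one of indices and one of values. For example, the vector `(1.0, 0.0, 3.0)` is represented in dense format as `[1.0, 0.0, 3.0]` and in sparse format as `(3, [0, 2], [1.0, 3.0])`.

Note: in the sparse representation, `3` is the size of the vector `(1.0, 0.0, 3.0)`. The array `[0, 2]` holds the indices of the nonzero entries, and the array `[1.0, 3.0]` holds the values at those indices.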
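
To make the sparse layout concrete, here is a minimal pure-Python sketch (it does not use `MLlib`) that derives the `(size, indices, values)` triple from a dense list and expands it back. The helper names `to_sparse` and `to_dense` are hypothetical and exist only for illustration.

```python
def to_sparse(dense):
    """Return (size, indices, values) for a dense list, keeping only nonzero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Expand a (size, indices, values) triple back into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


size, indices, values = to_sparse([1.0, 0.0, 3.0])
print(size, indices, values)            # 3 [0, 2] [1.0, 3.0]
print(to_dense(size, indices, values))  # [1.0, 0.0, 3.0]
```

The sparse form pays off when most entries are zero, because only the nonzero entries and their positions are stored.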

**Dense vector**

Example:
```python
from pyspark.ml.linalg import Vectors

# Build a dense vector; every entry, including zeros, is stored.
dense = Vectors.dense(1.0, 0.0, 3.0)
print(dense)
```
Output:
```
[1.0,0.0,3.0]
```
**Sparse vector**

Example:
```python
from pyspark.ml.linalg import Vectors

# Arguments: vector size, indices of the nonzero entries, values at those indices.
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])
print(sparse)
```
Output:
```
(3,[0,2],[1.0,3.0])
```