
Commit f324935

committed
new doc
1 parent 51b4af0 commit f324935

19 files changed: 2149 additions, 180 deletions

201307/20130731_01.md

Lines changed: 1076 additions & 0 deletions
Large diffs are not rendered by default.

201307/readme.md

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 ### Article list
 ----
+##### 20130731_01.md [《PostgreSQL multiple linear regression - 1 MADLib Installed in PostgreSQL 9.2》](20130731_01.md)
 ##### 20130730_01.md [《PostgreSQL 9.4 Add SQL Standard WITH ORDINALITY support for UNNEST (and any other SRF)》](20130730_01.md)
 ##### 20130727_01.md [《PostgreSQL 9.4 patch : Row-Level Security》](20130727_01.md)
 ##### 20130726_02.md [《How to print function call stack information in PostgreSQL》](20130726_02.md)

201504/20150414_01.md

Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
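## PivotalR between R & PostgreSQL-like Databases(for exp : Greenplum, hadoop access by hawq)

### Author
digoal

### Date
2015-04-14

### Tags
PostgreSQL , MADlib , PivotalR

----

## Background
PivotalR is an R package that translates R code into SQL statements, which makes it suitable for mining large data sets. The data itself is stored in a database such as PostgreSQL or Greenplum.

Users write ordinary R syntax and do not need to access the database directly, because PivotalR translates the R operations into SQL statements and returns the results to R.

The raw data is never transferred to the R side, so PivotalR can complete jobs that plain R cannot (R computes on data held in memory, so it runs into trouble once the data volume exceeds the available memory).

PivotalR also wraps MADlib, which contains a large number of machine learning functions, regression analysis functions, and more.


The package description:
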
```
PivotalR-package
An R front-end to PostgreSQL and Greenplum database, and wrapper
for in-database parallel and distributed machine learning open-source
library MADlib

Description
PivotalR is a package that enables users of R, the most popular open source statistical programming
language and environment to interact with the Pivotal (Greenplum) Database as well as Pivotal
HD/HAWQ for Big Data analytics. It does so by providing an interface to the operations on tables/views
in the database. These operations are almost the same as those of data.frame. Thus the
users of R do not need to learn SQL when they operate on the objects in the database. The latest
code is available at https://github.com/madlib-internal/PivotalR. A training video and a
quick-start guide are available at http://zimmeee.github.io/gp-r/#pivotalr.

Details
Package: PivotalR
Type: Package
Version: 0.1.17
Date: 2014-09-15
License: GPL (>= 2)
Depends: methods, DBI, RPostgreSQL

This package enables R users to easily develop, refine and deploy R scripts that leverage the parallelism
and scalability of the database as well as in-database analytics libraries to operate on big
data sets that would otherwise not fit in R memory - all this without having to learn SQL because
the package provides an interface that they are familiar with.

The package also provides a wrapper for MADlib. MADlib is an open-source library for scalable
in-database analytics. It provides data-parallel implementations of mathematical, statistical and
machine-learning algorithms for structured and unstructured data. The number of machine learning
algorithms that MADlib covers is quickly increasing.

As an R front-end to the PostgreSQL-like databases, this package minimizes the amount of data
transferred between the database and R. All the big data is stored in the database. The user enters
their familiar R syntax, and the package translates it into SQL queries and sends the SQL query into
database for parallel execution. The computation result, which is small (if it is as big as the original
data, what is the point of big data analytics?), is returned to R to the user.

On the other hand, this package also gives the usual SQL users the access of utilizing the powerful
analytics and graphics functionalities of R. Although the database itself has difficulty in plotting,
the result can be analyzed and presented beautifully with R.

This current version of PivotalR provides the core R infrastructure and data frame functions as well
as over 50 analytical functions in R that leverage in-database execution. These include

* Data Connectivity - db.connect, db.disconnect, db.Rquery
* Data Exploration - db.data.frame, subsets
* R language features - dim, names, min, max, nrow, ncol, summary etc
* Reorganization Functions - merge, by (group-by), samples
* Transformations - as.factor, null replacement
* Algorithms - linear regression and logistic regression wrappers for MADlib

Note
This package is different from PL/R, which is another way of using R with PostgreSQL-like
databases. PL/R enables the users to run R scripts from SQL. In the parallel Greenplum database,
one can use PL/R to implement parallel algorithms.

However, PL/R still requires non-trivial knowledge of SQL to use it effectively. It is mostly limited
to explicitly parallel jobs. And for the end user, it is still a SQL interface.

This package does not require any knowledge of SQL, and it works for both explicitly and implicitly
parallel jobs by employing the open-source MADlib library. It is much more scalable. And for the
end user, it is a pure R interface with the conventional R syntax.

Author(s)
Author: Predictive Analytics Team at Pivotal Inc. <[email protected]>, with contributions from
Data Scientist Team at Pivotal Inc.
Maintainer: Caleb Welton, Pivotal Inc. <[email protected]>

References
[1] MADlib website, http://madlib.net
[2] MADlib user docs, http://doc.madlib.net/master
[3] MADlib Wiki page, http://github.com/madlib/madlib/wiki
[4] MADlib contribution guide, https://github.com/madlib/madlib/wiki/Contribution-Guide
[5] MADlib on GitHub, https://github.com/madlib/madlib

See Also
madlib.lm Linear regression
madlib.glm Linear, logistic and multinomial logistic regressions
madlib.summary summary of a table in the database.
```
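
The description above says that the package turns familiar data-frame syntax into SQL, runs it inside the database, and returns only a small result. The following is a minimal Python sketch of that general idea, not PivotalR's actual implementation. The `LazyColumn` class, the `abalone_demo` table, and the use of SQLite in place of PostgreSQL are all assumptions made for illustration.

```python
import sqlite3


class LazyColumn:
    """Hypothetical stand-in for a database-backed column: holds SQL text, not data."""

    def __init__(self, conn, table, expr):
        self.conn = conn
        self.table = table
        self.expr = expr  # SQL expression built up so far

    def __add__(self, other):
        # Arithmetic only extends the SQL expression; nothing is executed yet.
        rhs = other.expr if isinstance(other, LazyColumn) else repr(other)
        return LazyColumn(self.conn, self.table, f"({self.expr} + {rhs})")

    def mean_sql(self):
        # The query that would be sent to the database.
        return f"SELECT avg({self.expr}) FROM {self.table}"

    def mean(self):
        # Only now is a query executed; a single number comes back.
        return self.conn.execute(self.mean_sql()).fetchone()[0]


# SQLite (in memory) stands in for PostgreSQL/Greenplum in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE abalone_demo (id INTEGER, length REAL, height REAL)")
conn.executemany(
    "INSERT INTO abalone_demo VALUES (?, ?, ?)",
    [(1, 1.0, 3.0), (2, 2.0, 4.0), (3, 3.0, 5.0)],
)

length = LazyColumn(conn, "abalone_demo", "length")
height = LazyColumn(conn, "abalone_demo", "height")
total = length + height + 2  # builds SQL text only, no rows are fetched

generated_sql = total.mean_sql()
result = total.mean()
print(generated_sql)
print(result)
```

The design point is that the client holds expressions rather than rows, so the amount of data transferred does not grow with the table size.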

Examples

```
## Not run:
## get the help for the package
help("PivotalR-package")
## get help for a function
help(madlib.lm)
## create multiple connections to different databases
## db.connect returns the connection ID; keep it in cid for later use
cid <- db.connect(port = 5433) # connection 1, use default values for the parameters
db.connect(dbname = "test", user = "qianh1", password = "", host =
           "remote.machine.com", madlib = "madlib07", port = 5432) # connection 2
db.list() # list the info for all the connections
## list all tables/views that have "ornst" in the name
db.objects("ornst")
## list all tables/views
db.objects(conn.id = 1)
## create a table and the R object pointing to the table
## using the example data that comes with this package
delete("abalone", conn.id = cid)
x <- as.db.data.frame(abalone, "abalone")
## OR if the table already exists, you can create the wrapper directly
## x <- db.data.frame("abalone")
dim(x) # dimension of the data table
names(x) # column names of the data table
madlib.summary(x) # look at a summary for each column
lk(x, 20) # look at a sample of the data
## look at a sample sorted by id column
lookat(sort(x, decreasing = FALSE, x$id), 20)
lookat(sort(x, FALSE, NULL), 20) # look at a sample ordered randomly
## linear regression Examples --------
## fit one different model to each group of data with the same sex
fit1 <- madlib.lm(rings ~ . - id | sex, data = x)
fit1 # view the result
lookat(mean((x$rings - predict(fit1, x))^2)) # mean square error
## plot the predicted values v.s. the true values
ap <- x$rings # true values
ap$pred <- predict(fit1, x) # add a column which is the predicted values
## If the data set is very big, you do not want to load all the
## data points into R and plot. We can just plot a random sample.
random.sample <- lk(sort(ap, FALSE, "random"), 1000) # sort randomly
plot(random.sample) # plot a random sample
## fit a single model to all data treating sex as a categorical variable ---------
y <- x # make a copy, y is now a db.data.frame object
y$sex <- as.factor(y$sex) # y becomes a db.Rquery object now
fit2 <- madlib.lm(rings ~ . - id, data = y)
fit2 # view the result
lookat(mean((y$rings - predict(fit2, y))^2)) # mean square error
## logistic regression Examples --------
## fit one different model to each group of data with the same sex
fit3 <- madlib.glm(rings < 10 ~ . - id | sex, data = x, family = "binomial")
fit3 # view the result
## the percentage of correct prediction
lookat(mean((x$rings < 10) == predict(fit3, x)))
## fit a single model to all data treating sex as a categorical variable ----------
y <- x # make a copy, y is now a db.data.frame object
y$sex <- as.factor(y$sex) # y becomes a db.Rquery object now
fit4 <- madlib.glm(rings < 10 ~ . - id, data = y, family = "binomial")
fit4 # view the result
## the percentage of correct prediction
lookat(mean((y$rings < 10) == predict(fit4, y)))
## Group by Examples --------
## mean value of each column except the "id" column
lk(by(x[,-1], x$sex, mean))
## standard deviation of each column except the "id" column
lookat(by(x[,-1], x$sex, sd))
## Merge Examples --------
## create two objects with different rows and columns
key(x) <- "id"
y <- x[1:300, 1:6]
z <- x[201:400, c(1,2,4,5)]
## get 100 rows
m <- merge(y, z, by = c("id", "sex"))
lookat(m, 20)
## operator Examples --------
y <- x$length + x$height + 2.3
z <- x$length * x$height / 3
lk(y < z, 20)
## ------------------------------------------------------------------------
## Deal with NULL values
delete("null_data")
x <- as.db.data.frame(null.data, "null_data")
## OR if the table already exists, you can create the wrapper directly
## x <- db.data.frame("null_data")
dim(x)
names(x)
## ERROR, because of NULL values
fit <- madlib.lm(sf_mrtg_pct_assets ~ ., data = x)
## remove NULL values
y <- x # make a copy
for (i in 1:10) y <- y[!is.na(y[i]),]
dim(y)
fit <- madlib.lm(sf_mrtg_pct_assets ~ ., data = y)
fit
## Or we can replace all NULL values
x[is.na(x)] <- 45
## End(Not run)
```
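
The NULL-handling steps at the end of the example (dropping rows that contain NULLs, or replacing NULLs with a constant) correspond to two familiar SQL patterns: an `IS NOT NULL` filter and `COALESCE`. Here is a minimal Python sketch of both, with SQLite standing in for PostgreSQL. The table `null_demo` and its columns are hypothetical, and this is not necessarily the SQL that PivotalR generates.

```python
import sqlite3

# SQLite (in memory) stands in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE null_demo (a REAL, b REAL)")
conn.executemany(
    "INSERT INTO null_demo VALUES (?, ?)",
    [(1.0, 2.0), (None, 3.0), (4.0, None), (5.0, 6.0)],
)

columns = ["a", "b"]

# Option 1: keep only the rows in which every column is non-NULL.
not_null = " AND ".join(f"{c} IS NOT NULL" for c in columns)
complete_rows = conn.execute(
    f"SELECT a, b FROM null_demo WHERE {not_null} ORDER BY a"
).fetchall()

# Option 2: keep every row, replacing NULL with a constant
# (45, the value used in the example above).
filled = ", ".join(f"coalesce({c}, 45) AS {c}" for c in columns)
filled_rows = conn.execute(
    f"SELECT {filled} FROM null_demo ORDER BY rowid"
).fetchall()

print(complete_rows)
print(filled_rows)
```

Filtering shrinks the data set, while replacement keeps every row at the cost of injecting an arbitrary value, so the choice depends on how the model should treat missing observations.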

Installation and usage:

```
> install.packages("PivotalR")
> library(PivotalR)
Loading required package: Matrix
Attaching package: ‘PivotalR’
The following objects are masked from ‘package:stats’:
sd, var
The following object is masked from ‘package:base’:
cbind
```

## References
1\. http://blog.pivotal.io/data-science-pivotal/products/introducing-r-for-big-data-with-pivotalr

2\. http://cran.r-project.org/web/packages/PivotalR/PivotalR.pdf

3\. https://github.com/pivotalsoftware/PivotalR

201504/readme.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 ##### 20150429_02.md [《PostgreSQL garbage collection internals and how to prevent bloat - How to prevent object bloat in PostgreSQL》](20150429_02.md)
 ##### 20150429_01.md [《PostgreSQL Oracle compatibility - implementing an Oracle-like recycle bin with event triggers》](20150429_01.md)
 ##### 20150419_01.md [《PostgreSQL 9.5 new feature - BRIN (block range index) index》](20150419_01.md)
+##### 20150414_01.md [《PivotalR between R & PostgreSQL-like Databases(for exp : Greenplum, hadoop access by hawq)》](20150414_01.md)
 ##### 20150409_02.md [《PostgreSQL 9.5: creating foreign tables in one step with the import foreign schema syntax》](20150409_02.md)
 ##### 20150409_01.md [《PostgreSQL row security policy - PostgreSQL 9.5 new feature - can define row security policy for table》](20150409_01.md)
 ##### 20150407_02.md [《PostgreSQL aggregate function 4 : Hypothetical-Set Aggregate Functions》](20150407_02.md)

201511/20151111_01.md

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
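## What MADlib can do, explained in one picture

### Author
digoal

### Date
2015-11-11

### Tags
PostgreSQL , MADlib , PivotalR

----

## Background
MADlib is an open-source data mining library originally contributed by a group of data scientists at Pivotal. It has since joined the Apache Incubator.

What can MADlib do? One picture makes it clear. The following is taken from

http://user2014.stat.ucla.edu/files/PivotalR_user2014/userR2014_PivotalR.pdf

![pic](20151111_01_pic_001.png)

Regression analysis, decision trees, random forests, Bayesian classification, support vector machines, risk models, k-means clustering, text mining, data validation, and more.

A linear regression example, corresponding to this path in the picture above:
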
```
supervised learning -> generalized linear models -> linear regression
```
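
To make in-database linear regression concrete: a one-variable least-squares fit can be computed from a handful of SQL aggregates, so only five numbers travel back to the client regardless of the table size. The following is a minimal Python sketch under stated assumptions. SQLite stands in for PostgreSQL, the table `sales_demo` and its data are hypothetical, and MADlib's own implementation is more general and is not reproduced here.

```python
import sqlite3

# SQLite (in memory) stands in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (x REAL, y REAL)")
# Synthetic data that lies exactly on the line y = 2x + 1.
conn.executemany(
    "INSERT INTO sales_demo VALUES (?, ?)",
    [(float(x), 2.0 * x + 1.0) for x in range(10)],
)

# The database computes the sufficient statistics in one pass.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT count(*), sum(x), sum(y), sum(x * x), sum(x * y) FROM sales_demo"
).fetchone()

# Closed-form ordinary least squares for a single predictor.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)
```

The same pattern (aggregate in the database, finish the arithmetic on the client) extends to several predictors by aggregating the cross-product matrix instead of scalar sums.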

![pic](20151111_01_pic_002.png)

If you are a data scientist who works in R and are not comfortable with SQL, you can simply use the PivotalR package. The R code is on the left and the corresponding SQL is on the right.

![pic](20151111_01_pic_003.png)

Incidentally, today we need to forecast 11.11 (Singles' Day) sales at each point in time, and this is a good fit for the job.

PostgreSQL users have a natural advantage when it comes to data mining.

![pic](20151111_01_pic_004.png)

MADlib user manual:

http://doc.madlib.net/latest/index.html

PivotalR user manual:

https://cran.r-project.org/web/packages/PivotalR/PivotalR.pdf

201511/20151111_01_pic_001.png

160 KB

201511/20151111_01_pic_002.png

184 KB

201511/20151111_01_pic_003.png

70.6 KB

201511/20151111_01_pic_004.png

143 KB

201511/readme.md

Lines changed: 1 addition & 0 deletions
@@ -2,4 +2,5 @@
 ----
 ##### 20151130_02.md [《Installing iozone on CentOS 7 x64》](20151130_02.md)
 ##### 20151130_01.md [《PostgreSQL security: the alarm bell keeps ringing》](20151130_01.md)
+##### 20151111_01.md [《What MADlib can do, explained in one picture》](20151111_01.md)
 ##### 20151109_01.md [《PostgreSQL snapshot too old patch, preventing database bloat》](20151109_01.md)
