digoal
diff --git a/‎201307/20130731_01.md‎
Lines changed: 1076 additions & 0 deletions b/‎201307/20130731_01.md‎
diff --git a/‎201307/readme.md‎
Lines changed: 1 addition & 0 deletions b/‎201307/readme.md‎
diff --git a/‎201504/20150414_01.md‎
Lines changed: 226 additions & 0 deletions b/‎201504/20150414_01.md‎
diff --git a/‎201504/readme.md‎
Lines changed: 1 addition & 0 deletions b/‎201504/readme.md‎
diff --git a/‎201511/20151111_01.md‎
Lines changed: 49 additions & 0 deletions b/‎201511/20151111_01.md‎
diff --git a/‎201511/20151111_01_pic_001.png‎
160 KB b/‎201511/20151111_01_pic_001.png‎
diff --git a/‎201511/20151111_01_pic_002.png‎
184 KB b/‎201511/20151111_01_pic_002.png‎
diff --git a/‎201511/20151111_01_pic_003.png‎
70.6 KB b/‎201511/20151111_01_pic_003.png‎
diff --git a/‎201511/20151111_01_pic_004.png‎
143 KB b/‎201511/20151111_01_pic_004.png‎
diff --git a/‎201511/readme.md‎
Lines changed: 1 addition & 0 deletions b/‎201511/readme.md‎
@@ -1,5 +1,6 @@
 ### Article list  
 ----  
+##### 20130731_01.md   [《PostgreSQL multiple linear regression - 1 MADLib Installed in PostgreSQL 9.2》](20130731_01.md)  
 ##### 20130730_01.md   [《PostgreSQL 9.4 Add SQL Standard WITH ORDINALITY support for UNNEST (and any other SRF)》](20130730_01.md)  
 ##### 20130727_01.md   [《PostgreSQL 9.4 patch : Row-Level Security》](20130727_01.md)  
 ##### 20130726_02.md   [《How to print function call stack information in PostgreSQL》](20130726_02.md)  
 
@@ -0,0 +1,226 @@
+## PivotalR between R & PostgreSQL-like Databases(for exp : Greenplum, hadoop access by hawq)  
+                                        
+### Author  
+digoal                                   
+                                    
+### Date  
+2015-04-14                                 
+                                       
+### Tags  
+PostgreSQL , MADlib , PivotalR       
+                                                                                                          
+----                                                                                                    
+                                                                                                             
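+## Background  
+PivotalR is an R package that translates R operations into SQL statements, which makes it suitable for mining large data sets. Users keep the data in a database such as PostgreSQL or Greenplum.  
+  
+Users write ordinary R syntax and do not have to write SQL against the database themselves: PivotalR translates the R code into SQL, runs it in the database, and returns the result to R.  
+  
+The raw data never has to be transferred to the R side, so this approach can handle jobs that plain R cannot (R computes on data held in memory, so a data set larger than memory is a problem).  
+  
+PivotalR also wraps MADlib, which provides a large number of machine learning functions, regression analysis functions, and so on.  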
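The translation idea described above can be pictured with a toy sketch in base R. This is only an illustration of building SQL lazily from R operators: the helper names `lazy_column` and `to_sql_mean` are hypothetical, and nothing here is PivotalR's actual implementation or API.

```r
# Toy illustration only: these helpers are hypothetical, NOT part of PivotalR.
# A "lazy column" records a table name and a pending SQL expression.
lazy_column <- function(table, expr) {
  structure(list(table = table, expr = expr), class = "lazy_column")
}

# Binary arithmetic on a lazy column extends the SQL expression
# instead of computing anything in R.
Ops.lazy_column <- function(e1, e2) {
  lhs <- if (inherits(e1, "lazy_column")) e1$expr else format(e1)
  rhs <- if (inherits(e2, "lazy_column")) e2$expr else format(e2)
  table <- if (inherits(e1, "lazy_column")) e1$table else e2$table
  lazy_column(table, paste0("(", lhs, " ", .Generic, " ", rhs, ")"))
}

# An aggregate turns the pending expression into a complete SQL statement.
# Only this statement would be sent to the database; only one number comes back.
to_sql_mean <- function(col) {
  paste0("SELECT avg(", col$expr, ") FROM ", col$table)
}

len <- lazy_column("abalone", "length")
hgt <- lazy_column("abalone", "height")
sql <- to_sql_mean(len * hgt / 3)
print(sql)
```

The point of the design is that the R session only ever holds a description of the computation and a small result, while the rows stay in the database.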
+  
+  
+The package description:  
+  
+```  
+PivotalR-package   
+An R front-end to PostgreSQL and Greenplum database, and wrapper  
+for in-database parallel and distributed machine learning open-source  
+library MADlib  
+  
+Description  
+PivotalR is a package that enables users of R, the most popular open source statistical programming  
+language and environment to interact with the Pivotal (Greenplum) Database as well as Pivotal  
+HD/HAWQ for Big Data analytics. It does so by providing an interface to the operations on tables/views  
+in the database. These operations are almost the same as those of data.frame. Thus the  
+users of R do not need to learn SQL when they operate on the objects in the database. The latest  
+code is available at https://github.com/madlib-internal/PivotalR. A training video and a  
+quick-start guide are available at http://zimmeee.github.io/gp-r/#pivotalr.  
+  
+Details  
+Package: PivotalR  
+Type: Package  
+Version: 0.1.17  
+Date: 2014-09-15  
+License: GPL (>= 2)  
+Depends: methods, DBI, RPostgreSQL  
+  
+This package enables R users to easily develop, refine and deploy R scripts that leverage the parallelism  
+and scalability of the database as well as in-database analytics libraries to operate on big  
+data sets that would otherwise not fit in R memory - all this without having to learn SQL because  
+the package provides an interface that they are familiar with.  
+  
+The package also provides a wrapper for MADlib. MADlib is an open-source library for scalable  
+in-database analytics. It provides data-parallel implementations of mathematical, statistical and  
+machine-learning algorithms for structured and unstructured data. The number of machine learning  
+algorithms that MADlib covers is quickly increasing.  
+  
+As an R front-end to the PostgreSQL-like databases, this package minimizes the amount of data  
+transferred between the database and R. All the big data is stored in the database. The user enters  
+their familiar R syntax, and the package translates it into SQL queries and sends the SQL query into  
+database for parallel execution. The computation result, which is small (if it is as big as the original  
+data, what is the point of big data analytics?), is returned to R to the user.  
+  
+On the other hand, this package also gives the usual SQL users the access of utilizing the powerful  
+analytics and graphics functionalities of R. Although the database itself has difficulty in plotting,  
+the result can be analyzed and presented beautifully with R.  
+  
+This current version of PivotalR provides the core R infrastructure and data frame functions as well  
+as over 50 analytical functions in R that leverage in-database execution. These include  
+  
+* Data Connectivity - db.connect, db.disconnect, db.Rquery  
+* Data Exploration - db.data.frame, subsets  
+* R language features - dim, names, min, max, nrow, ncol, summary etc  
+* Reorganization Functions - merge, by (group-by), samples  
+* Transformations - as.factor, null replacement  
+* Algorithms - linear regression and logistic regression wrappers for MADlib  
+  
+Note  
+This package is different from PL/R, which is another way of using R with PostgreSQL-like  
+databases. PL/R enables the users to run R scripts from SQL. In the parallel Greenplum database,  
+one can use PL/R to implement parallel algorithms.  
+  
+However, PL/R still requires non-trivial knowledge of SQL to use it effectively. It is mostly limited  
+to explicitly parallel jobs. And for the end user, it is still a SQL interface.  
+  
+This package does not require any knowledge of SQL, and it works for both explicitly and implicitly  
+parallel jobs by employing the open-source MADlib library. It is much more scalable. And for the  
+end user, it is a pure R interface with the conventional R syntax.  
+  
+Author(s)  
+Author: Predictive Analytics Team at Pivotal Inc. <[email protected]>, with contributions from  
+Data Scientist Team at Pivotal Inc.  
+Maintainer: Caleb Welton, Pivotal Inc. <[email protected]>  
+  
+References  
+[1] MADlib website, http://madlib.net  
+[2] MADlib user docs, http://doc.madlib.net/master  
+[3] MADlib Wiki page, http://github.com/madlib/madlib/wiki  
+[4] MADlib contribution guide, https://github.com/madlib/madlib/wiki/Contribution-Guide  
+[5] MADlib on GitHub, https://github.com/madlib/madlib  
+  
+See Also  
+madlib.lm Linear regression  
+madlib.glm Linear, logistic and multinomial logistic regressions  
+madlib.summary summary of a table in the database.  
+```  
+  
+Examples  
+  
+```  
+## Not run:  
+## get the help for the package  
+help("PivotalR-package")  
+## get help for a function  
+help(madlib.lm)  
+## create multiple connections to different databases  
+db.connect(port = 5433) # connection 1, use default values for the parameters  
+db.connect(dbname = "test", user = "qianh1", password = "", host =  
+"remote.machine.com", madlib = "madlib07", port = 5432) # connection 2  
+db.list() # list the info for all the connections  
+## list all tables/views that has "ornst" in the name  
+db.objects("ornst")  
+## list all tables/views  
+db.objects(conn.id = 1)  
+## create a table and the R object pointing to the table  
+## using the example data that comes with this package  
+delete("abalone", conn.id = 1) # drop the table on connection 1 if it exists  
+x <- as.db.data.frame(abalone, "abalone")  
+## OR if the table already exists, you can create the wrapper directly  
+## x <- db.data.frame("abalone")  
+dim(x) # dimension of the data table  
+names(x) # column names of the data table  
+madlib.summary(x) # look at a summary for each column  
+lk(x, 20) # look at a sample of the data  
+## look at a sample sorted by id column  
+lookat(sort(x, decreasing = FALSE, x$id), 20)  
+lookat(sort(x, FALSE, NULL), 20) # look at a sample ordered randomly  
+## linear regression Examples --------  
+## fit one different model to each group of data with the same sex  
+fit1 <- madlib.lm(rings ~ . - id | sex, data = x)  
+fit1 # view the result  
+lookat(mean((x$rings - predict(fit1, x))^2)) # mean square error  
+## plot the predicted values v.s. the true values  
+ap <- x$rings # true values  
+ap$pred <- predict(fit1, x) # add a column which is the predicted values  
+## If the data set is very big, you do not want to load all the  
+## data points into R and plot. We can just plot a random sample.  
+random.sample <- lk(sort(ap, FALSE, "random"), 1000) # sort randomly  
+plot(random.sample) # plot a random sample  
+## fit a single model to all data treating sex as a categorical variable ---------  
+y <- x # make a copy, y is now a db.data.frame object  
+y$sex <- as.factor(y$sex) # y becomes a db.Rquery object now  
+fit2 <- madlib.lm(rings ~ . - id, data = y)  
+fit2 # view the result  
+lookat(mean((y$rings - predict(fit2, y))^2)) # mean square error  
+## logistic regression Examples --------  
+## fit one different model to each group of data with the same sex  
+fit3 <- madlib.glm(rings < 10 ~ . - id | sex, data = x, family = "binomial")  
+fit3 # view the result  
+## the percentage of correct prediction  
+lookat(mean((x$rings < 10) == predict(fit3, x)))  
+## fit a single model to all data treating sex as a categorical variable ----------  
+y <- x # make a copy, y is now a db.data.frame object  
+y$sex <- as.factor(y$sex) # y becomes a db.Rquery object now  
+fit4 <- madlib.glm(rings < 10 ~ . - id, data = y, family = "binomial")  
+fit4 # view the result  
+## the percentage of correct prediction  
+lookat(mean((y$rings < 10) == predict(fit4, y)))  
+## Group by Examples --------  
+## mean value of each column except the "id" column  
+lk(by(x[,-1], x$sex, mean))  
+## standard deviation of each column except the "id" column  
+lookat(by(x[,-1], x$sex, sd))  
+## Merge Examples --------  
+## create two objects with different rows and columns  
+key(x) <- "id"  
+y <- x[1:300, 1:6]  
+z <- x[201:400, c(1,2,4,5)]  
+## get 100 rows  
+m <- merge(y, z, by = c("id", "sex"))  
+lookat(m, 20)  
+## operator Examples --------  
+y <- x$length + x$height + 2.3  
+z <- x$length * x$height / 3  
+lk(y < z, 20)  
+## ------------------------------------------------------------------------  
+## Deal with NULL values  
+delete("null_data")  
+x <- as.db.data.frame(null.data, "null_data")  
+## OR if the table already exists, you can create the wrapper directly  
+## x <- db.data.frame("null_data")  
+dim(x)  
+names(x)  
+## ERROR, because of NULL values  
+fit <- madlib.lm(sf_mrtg_pct_assets ~ ., data = x)  
+## remove NULL values  
+y <- x # make a copy  
+for (i in 1:10) y <- y[!is.na(y[i]),]  
+dim(y)  
+fit <- madlib.lm(sf_mrtg_pct_assets ~ ., data = y)  
+fit  
+## Or we can replace all NULL values  
+x[is.na(x)] <- 45  
+## End(Not run)  
+```  
+  
+Installation and usage:  
+  
+```  
+> install.packages("PivotalR")  
+> library(PivotalR)  
+Loading required package: Matrix  
+Attaching package: ‘PivotalR’  
+The following objects are masked from ‘package:stats’:  
+    sd, var  
+The following object is masked from ‘package:base’:  
+    cbind  
+```  
+  
+## References  
+1\. http://blog.pivotal.io/data-science-pivotal/products/introducing-r-for-big-data-with-pivotalr  
+  
+2\. http://cran.r-project.org/web/packages/PivotalR/PivotalR.pdf  
+  
+3\. https://github.com/pivotalsoftware/PivotalR  
@@ -4,6 +4,7 @@
 ##### 20150429_02.md   [《PostgreSQL garbage collection principles and how to prevent bloat - How to prevent object bloat in PostgreSQL》](20150429_02.md)  
 ##### 20150429_01.md   [《PostgreSQL Oracle compatibility - using event triggers to implement an Oracle-like recycle bin》](20150429_01.md)  
 ##### 20150419_01.md   [《PostgreSQL 9.5 new feature - BRIN (block range index) index》](20150419_01.md)  
+##### 20150414_01.md   [《PivotalR between R & PostgreSQL-like Databases(for exp : Greenplum, hadoop access by hawq)》](20150414_01.md)  
 ##### 20150409_02.md   [《PostgreSQL 9.5 - creating foreign tables in one step with the import foreign schema syntax》](20150409_02.md)  
 ##### 20150409_01.md   [《PostgreSQL row security policy - PostgreSQL 9.5 new feature - can define row security policy for table》](20150409_01.md)  
 ##### 20150407_02.md   [《PostgreSQL aggregate function 4 : Hypothetical-Set Aggregate Functions》](20150407_02.md)  
 
@@ -0,0 +1,49 @@
+## What MADlib can do, in one picture  
+                                          
+### Author  
+digoal                                     
+                                      
+### Date  
+2015-11-11                                   
+                                         
+### Tags  
+PostgreSQL , MADlib , PivotalR         
+                                                                                                            
+----                                                                                                      
+                                                                                                               
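+## Background  
+MADlib is an open-source data mining library originally contributed by data scientists at Pivotal; it has since joined the Apache Incubator.  
+  
+What can MADlib do? One picture makes it clear. The following is taken from  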
+  
+http://user2014.stat.ucla.edu/files/PivotalR_user2014/userR2014_PivotalR.pdf  
+  
+![pic](20151111_01_pic_001.png)  
+  
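+Regression analysis, decision trees, random forests, Bayesian classification, support vector machines, risk models, k-means clustering, text mining, data validation, and more.  
+  
+A linear regression example, corresponding to this path in the picture above:  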
+  
+```  
+supervised learning -> generalized linear models -> linear regression  
+```  
+  
+![pic](20151111_01_pic_002.png)  
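To see why linear regression suits in-database execution, here is a minimal sketch in base R. It assumes the fit is computed from per-chunk sums (`X'X` and `X'y`) that are merged afterwards, which is the general shape of an aggregate-style computation; it is not MADlib's actual code, and the simulated data and the four-way split are made up for illustration.

```r
# Sketch: ordinary least squares from per-chunk sums, the kind of computation
# that can run as an aggregate inside a database. Not MADlib's actual code.
set.seed(42)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(n, sd = 0.1)

# Split rows into 4 chunks, standing in for data spread across segments.
chunks <- split(d, rep(1:4, length.out = n))

# Each chunk contributes only small matrices: X'X (3x3) and X'y (3x1).
partial <- lapply(chunks, function(ch) {
  X <- cbind(1, ch$x1, ch$x2)
  list(xtx = crossprod(X), xty = crossprod(X, ch$y))
})

# Merge step: add the partial sums, then solve the normal equations once.
xtx <- Reduce(`+`, lapply(partial, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partial, `[[`, "xty"))
beta <- unname(drop(solve(xtx, xty)))

# Reference fit with all rows in memory at once.
ref <- unname(coef(lm(y ~ x1 + x2, data = d)))
print(all.equal(beta, ref))
```

Only the small partial matrices need to move between workers, regardless of how many rows each chunk holds.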
+   
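+If you are a data scientist who works in R and are not comfortable with SQL, you can simply use the PivotalR package. The left side shows the R code; the right side shows the corresponding SQL.  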
+  
+![pic](20151111_01_pic_003.png)  
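The R-to-SQL correspondence for a model formula can be sketched as follows. `formula_to_madlib_sql` is a hypothetical helper written for this note, not a PivotalR function, and the table names `houses` and `houses_linregr` are placeholders. The generated statement follows the four-argument form of MADlib's `linregr_train` (source table, output table, dependent variable, independent variables); whether PivotalR emits SQL of exactly this shape is an assumption.

```r
# Hypothetical helper, NOT part of PivotalR: render an R model formula
# as a MADlib linear regression training call.
formula_to_madlib_sql <- function(formula, source_table, out_table) {
  # The first variable in the formula is the response.
  response <- all.vars(formula)[1]
  # The term labels are the predictors on the right-hand side.
  predictors <- attr(terms(formula), "term.labels")
  sprintf(
    "SELECT madlib.linregr_train('%s', '%s', '%s', 'ARRAY[1, %s]');",
    source_table, out_table, response, paste(predictors, collapse = ", ")
  )
}

# Table names below are placeholders for illustration.
sql <- formula_to_madlib_sql(price ~ tax + bath + size, "houses", "houses_linregr")
cat(sql, "\n")
```

The leading `1` in the array stands for the intercept term, which an R formula includes implicitly.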
+  
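+Incidentally, today's task of forecasting 11.11 (Singles' Day) sales at each point in time is a good use for it.  
+  
+PostgreSQL users have a natural advantage when it comes to data mining.  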
+  
+![pic](20151111_01_pic_004.png)  
+  
+MADlib user manual:  
+  
+http://doc.madlib.net/latest/index.html  
+  
+PivotalR user manual:  
+  
+https://cran.r-project.org/web/packages/PivotalR/PivotalR.pdf  
@@ -2,4 +2,5 @@
 ----  
 ##### 20151130_02.md   [《Installing iozone on CentOS 7 x64》](20151130_02.md)  
 ##### 20151130_01.md   [《PostgreSQL security: keep the alarm bells ringing》](20151130_01.md)  
+##### 20151111_01.md   [《What MADlib can do, in one picture》](20151111_01.md)  
 ##### 20151109_01.md   [《PostgreSQL snapshot too old patch, preventing database bloat》](20151109_01.md)