| 1 | +## PivotalR: Bridging R and PostgreSQL-like Databases (e.g. Greenplum, or Hadoop accessed through HAWQ) |
| 2 | + |
| 3 | +### Author |
| 4 | +digoal |
| 5 | + |
| 6 | +### Date |
| 7 | +2015-04-14 |
| 8 | + |
| 9 | +### Tags |
| 10 | +PostgreSQL , MADlib , PivotalR |
| 11 | + |
| 12 | +---- |
| 13 | + |
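| 14 | +## Background |
| 15 | +PivotalR is an R package that translates R code into SQL statements, which makes it suitable for mining large data sets. The data itself stays in a database such as PostgreSQL or Greenplum. |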
| 16 | + |
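| 17 | +Users simply write ordinary R syntax and never need to query the database directly: PivotalR translates the R code into SQL statements and returns the results to R. |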
| 18 | + |
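| 19 | +The raw data is never transferred to the R side, so PivotalR can handle jobs that plain R cannot (R computes on data held in memory, so it runs into trouble once the data set is larger than the available memory). |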
| 20 | + |
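To make the translation step concrete, here is a toy sketch in base R that turns a simple arithmetic expression into a SQL string. It only illustrates the idea and is not how PivotalR is implemented: the helper `to_sql` is hypothetical, and the `abalone` table with its `length` and `height` columns is borrowed from the example data that ships with the package.

```r
# Toy illustration only: NOT PivotalR's actual implementation.
# to_sql() is a hypothetical helper that walks an unevaluated R
# expression and emits the equivalent SQL text.
sql_ops <- c("+" = "+", "-" = "-", "*" = "*", "/" = "/",
             "<" = "<", ">" = ">", "==" = "=")

to_sql <- function(expr) {
  if (is.numeric(expr)) return(format(expr))      # numeric literal
  if (is.name(expr)) return(as.character(expr))   # column name
  if (is.call(expr)) {
    op <- as.character(expr[[1]])
    if (identical(op, "(")) {
      # Explicit parentheses in the R source
      return(paste0("(", to_sql(expr[[2]]), ")"))
    }
    if (length(op) == 1 && op %in% names(sql_ops) && length(expr) == 3) {
      # Binary operator: translate both operands recursively
      return(paste0("(", to_sql(expr[[2]]), " ", sql_ops[[op]], " ",
                    to_sql(expr[[3]]), ")"))
    }
  }
  stop("unsupported expression")
}

# The R expression is never evaluated locally; only SQL text is built.
query <- paste("SELECT", to_sql(quote(length * height / 3)),
               "AS z FROM abalone")
cat(query, "\n")
# SELECT ((length * height) / 3) AS z FROM abalone
```

The point of the sketch is that the R side only builds a query string. In a real setup the database would run that query, and only the (usually small) result would travel back to R.
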
| 21 | +PivotalR also wraps MADlib, which provides a large collection of machine learning functions, regression analysis functions, and more. |
| 22 | + |
| 23 | + |
| 24 | +The package description: |
| 25 | + |
| 26 | +``` |
| 27 | +PivotalR-package |
| 28 | +An R front-end to PostgreSQL and Greenplum database, and wrapper |
| 29 | +for in-database parallel and distributed machine learning open-source |
| 30 | +library MADlib |
| 31 | + |
| 32 | +Description |
| 33 | +PivotalR is a package that enables users of R, the most popular open source statistical programming |
| 34 | +language and environment to interact with the Pivotal (Greenplum) Database as well as Pivotal |
| 35 | +HD/HAWQ for Big Data analytics. It does so by providing an interface to the operations on tables/views |
| 36 | +in the database. These operations are almost the same as those of data.frame. Thus the |
| 37 | +users of R do not need to learn SQL when they operate on the objects in the database. The latest |
| 38 | +code is available at https://github.com/madlib-internal/PivotalR. A training video and a |
| 39 | +quick-start guide are available at http://zimmeee.github.io/gp-r/#pivotalr. |
| 40 | + |
| 41 | +Details |
| 42 | +Package: PivotalR |
| 43 | +Type: Package |
| 44 | +Version: 0.1.17 |
| 45 | +Date: 2014-09-15 |
| 46 | +License: GPL (>= 2) |
| 47 | +Depends: methods, DBI, RPostgreSQL |
| 48 | + |
| 49 | +This package enables R users to easily develop, refine and deploy R scripts that leverage the parallelism |
| 50 | +and scalability of the database as well as in-database analytics libraries to operate on big |
| 51 | +data sets that would otherwise not fit in R memory - all this without having to learn SQL because |
| 52 | +the package provides an interface that they are familiar with. |
| 53 | + |
| 54 | +The package also provides a wrapper for MADlib. MADlib is an open-source library for scalable |
| 55 | +in-database analytics. It provides data-parallel implementations of mathematical, statistical and |
| 56 | +machine-learning algorithms for structured and unstructured data. The number of machine learning |
| 57 | +algorithms that MADlib covers is quickly increasing. |
| 58 | + |
| 59 | +As an R front-end to the PostgreSQL-like databases, this package minimizes the amount of data |
| 60 | +transferred between the database and R. All the big data is stored in the database. The user enters |
| 61 | +their familiar R syntax, and the package translates it into SQL queries and sends the SQL query into |
| 62 | +database for parallel execution. The computation result, which is small (if it is as big as the original |
| 63 | +data, what is the point of big data analytics?), is returned to R to the user. |
| 64 | + |
| 65 | +On the other hand, this package also gives the usual SQL users the access of utilizing the powerful |
| 66 | +analytics and graphics functionalities of R. Although the database itself has difficulty in plotting, |
| 67 | +the result can be analyzed and presented beautifully with R. |
| 68 | + |
| 69 | +This current version of PivotalR provides the core R infrastructure and data frame functions as well |
| 70 | +as over 50 analytical functions in R that leverage in-database execution. These include |
| 71 | + |
| 72 | +* Data Connectivity - db.connect, db.disconnect, db.Rquery |
| 73 | +* Data Exploration - db.data.frame, subsets |
| 74 | +* R language features - dim, names, min, max, nrow, ncol, summary etc |
| 75 | +* Reorganization Functions - merge, by (group-by), samples |
| 76 | +* Transformations - as.factor, null replacement |
| 77 | +* Algorithms - linear regression and logistic regression wrappers for MADlib |
| 78 | + |
| 79 | +Note |
| 80 | +This package is different from PL/R, which is another way of using R with PostgreSQL-like |
| 81 | +databases. PL/R enables the users to run R scripts from SQL. In the parallel Greenplum database, |
| 82 | +one can use PL/R to implement parallel algorithms. |
| 83 | + |
| 84 | +However, PL/R still requires non-trivial knowledge of SQL to use it effectively. It is mostly limited |
| 85 | +to explicitly parallel jobs. And for the end user, it is still a SQL interface. |
| 86 | + |
| 87 | +This package does not require any knowledge of SQL, and it works for both explicitly and implicitly |
| 88 | +parallel jobs by employing the open-source MADlib library. It is much more scalable. And for the |
| 89 | +end user, it is a pure R interface with the conventional R syntax. |
| 90 | + |
| 91 | +Author(s) |
| 92 | +Author: Predictive Analytics Team at Pivotal Inc. <[email protected]>, with contributions from |
| 93 | +Data Scientist Team at Pivotal Inc. |
| 94 | +Maintainer: Caleb Welton, Pivotal Inc. <[email protected]> |
| 95 | + |
| 96 | +References |
| 97 | +[1] MADlib website, http://madlib.net |
| 98 | +[2] MADlib user docs, http://doc.madlib.net/master |
| 99 | +[3] MADlib Wiki page, http://github.com/madlib/madlib/wiki |
| 100 | +[4] MADlib contribution guide, https://github.com/madlib/madlib/wiki/Contribution-Guide |
| 101 | +[5] MADlib on GitHub, https://github.com/madlib/madlib |
| 102 | + |
| 103 | +See Also |
| 104 | +madlib.lm Linear regression |
| 105 | +madlib.glm Linear, logistic and multinomial logistic regressions |
| 106 | +madlib.summary summary of a table in the database. |
| 107 | +``` |
| 108 | + |
| 109 | +Examples |
| 110 | + |
| 111 | +``` |
| 112 | +## Not run: |
| 113 | +## get the help for the package |
| 114 | +help("PivotalR-package") |
| 115 | +## get help for a function |
| 116 | +help(madlib.lm) |
| 117 | +## create multiple connections to different databases |
| 118 | +db.connect(port = 5433) # connection 1, use default values for the parameters |
| 119 | +db.connect(dbname = "test", user = "qianh1", password = "", host = |
| 120 | +"remote.machine.com", madlib = "madlib07", port = 5432) # connection 2 |
| 121 | +db.list() # list the info for all the connections |
| 122 | +## list all tables/views that have "ornst" in the name |
| 123 | +db.objects("ornst") |
| 124 | +## list all tables/views |
| 125 | +db.objects(conn.id = 1) |
| 126 | +## create a table and the R object pointing to the table |
| 127 | +## using the example data that comes with this package |
| 128 | +delete("abalone", conn.id = 1) # cid was never defined; use connection 1 as above |
| 129 | +x <- as.db.data.frame(abalone, "abalone") |
| 130 | +## OR if the table already exists, you can create the wrapper directly |
| 131 | +## x <- db.data.frame("abalone") |
| 132 | +dim(x) # dimension of the data table |
| 133 | +names(x) # column names of the data table |
| 134 | +madlib.summary(x) # look at a summary for each column |
| 135 | +lk(x, 20) # look at a sample of the data |
| 136 | +## look at a sample sorted by id column |
| 137 | +lookat(sort(x, decreasing = FALSE, x$id), 20) |
| 138 | +lookat(sort(x, FALSE, NULL), 20) # look at a sample ordered randomly |
| 139 | +## linear regression Examples -------- |
| 140 | +## fit one different model to each group of data with the same sex |
| 141 | +fit1 <- madlib.lm(rings ~ . - id | sex, data = x) |
| 142 | +fit1 # view the result |
| 143 | +lookat(mean((x$rings - predict(fit1, x))^2)) # mean square error |
| 144 | +## plot the predicted values v.s. the true values |
| 145 | +ap <- x$rings # true values |
| 146 | +ap$pred <- predict(fit1, x) # add a column which is the predicted values |
| 147 | +## If the data set is very big, you do not want to load all the |
| 148 | +## data points into R and plot. We can just plot a random sample. |
| 149 | +random.sample <- lk(sort(ap, FALSE, "random"), 1000) # sort randomly |
| 150 | +plot(random.sample) # plot a random sample |
| 151 | +## fit a single model to all data treating sex as a categorical variable --------- |
| 152 | +y <- x # make a copy, y is now a db.data.frame object |
| 153 | +y$sex <- as.factor(y$sex) # y becomes a db.Rquery object now |
| 154 | +fit2 <- madlib.lm(rings ~ . - id, data = y) |
| 155 | +fit2 # view the result |
| 156 | +lookat(mean((y$rings - predict(fit2, y))^2)) # mean square error |
| 157 | +## logistic regression Examples -------- |
| 158 | +## fit one different model to each group of data with the same sex |
| 159 | +fit3 <- madlib.glm(rings < 10 ~ . - id | sex, data = x, family = "binomial") |
| 160 | +fit3 # view the result |
| 161 | +## the percentage of correct prediction |
| 162 | +lookat(mean((x$rings < 10) == predict(fit3, x))) |
| 163 | +## fit a single model to all data treating sex as a categorical variable ---------- |
| 164 | +y <- x # make a copy, y is now a db.data.frame object |
| 165 | +y$sex <- as.factor(y$sex) # y becomes a db.Rquery object now |
| 166 | +fit4 <- madlib.glm(rings < 10 ~ . - id, data = y, family = "binomial") |
| 167 | +fit4 # view the result |
| 168 | +## the percentage of correct prediction |
| 169 | +lookat(mean((y$rings < 10) == predict(fit4, y))) |
| 170 | +## Group by Examples -------- |
| 171 | +## mean value of each column except the "id" column |
| 172 | +lk(by(x[,-1], x$sex, mean)) |
| 173 | +## standard deviation of each column except the "id" column |
| 174 | +lookat(by(x[,-1], x$sex, sd)) |
| 175 | +## Merge Examples -------- |
| 176 | +## create two objects with different rows and columns |
| 177 | +key(x) <- "id" |
| 178 | +y <- x[1:300, 1:6] |
| 179 | +z <- x[201:400, c(1,2,4,5)] |
| 180 | +## get 100 rows |
| 181 | +m <- merge(y, z, by = c("id", "sex")) |
| 182 | +lookat(m, 20) |
| 183 | +## operator Examples -------- |
| 184 | +y <- x$length + x$height + 2.3 |
| 185 | +z <- x$length * x$height / 3 |
| 186 | +lk(y < z, 20) |
| 187 | +## ------------------------------------------------------------------------ |
| 188 | +## Deal with NULL values |
| 189 | +delete("null_data") |
| 190 | +x <- as.db.data.frame(null.data, "null_data") |
| 191 | +## OR if the table already exists, you can create the wrapper directly |
| 192 | +## x <- db.data.frame("null_data") |
| 193 | +dim(x) |
| 194 | +names(x) |
| 195 | +## ERROR, because of NULL values |
| 196 | +fit <- madlib.lm(sf_mrtg_pct_assets ~ ., data = x) |
| 197 | +## remove NULL values |
| 198 | +y <- x # make a copy |
| 199 | +for (i in 1:10) y <- y[!is.na(y[i]),] |
| 200 | +dim(y) |
| 201 | +fit <- madlib.lm(sf_mrtg_pct_assets ~ ., data = y) |
| 202 | +fit |
| 203 | +## Or we can replace all NULL values |
| 204 | +x[is.na(x)] <- 45 |
| 205 | +## End(Not run) |
| 206 | +``` |
| 207 | + |
| 208 | +Installation and usage: |
| 209 | + |
| 210 | +``` |
| 211 | +> install.packages("PivotalR") |
| 212 | +> library(PivotalR) |
| 213 | +Loading required package: Matrix |
| 214 | +Attaching package: ‘PivotalR’ |
| 215 | +The following objects are masked from ‘package:stats’: |
| 216 | + sd, var |
| 217 | +The following object is masked from ‘package:base’: |
| 218 | + cbind |
| 219 | +``` |
| 220 | + |
| 221 | +## References |
| 222 | +1\. http://blog.pivotal.io/data-science-pivotal/products/introducing-r-for-big-data-with-pivotalr |
| 223 | + |
| 224 | +2\. http://cran.r-project.org/web/packages/PivotalR/PivotalR.pdf |
| 225 | + |
| 226 | +3\. https://github.com/pivotalsoftware/PivotalR |