Reproducibility #22
Comments from earlier thread:

@jhollist says: The reproducibility problem strikes home for me. I am a recent convert to the R Markdown/knitr/pandoc/makefile tool set and am quite enamored of it; however, many of my colleagues often point out that I am the unusual one, and in spite of my proselytizing they are very unlikely to switch away from using Word. We could certainly make some progress by continuing to encourage others to try R Markdown/knitr/... and by incorporating the same into undergraduate and graduate education, but that means we are at least a generation away from seeing significant changes.

I wonder if this group could make some progress towards making the existing tool set used by most scientists more reproducible. It seems that tackling reproducibility from the MS Word side could have the greatest impact. This would be similar to the way the DataUp project approached the problem of trying to get more scientists to better manage data and submit it to DataONE: they worked with Microsoft Research to develop DataUp to work directly with Excel, and have since moved most of that directly to the cloud. I am not sure I am suggesting that route, just using it as a somewhat relevant example.

Given that I am most certainly on the extreme novice side of development (even more so with this group!), I have no ideas on how we might develop something for Word that could make it part of a reproducible workflow, or even whether that is possible. But seeing that Word and Office in general are moving to the cloud, it seems like incorporating reproducible analysis via something like OpenCPU is more feasible than it ever has been (forgive me if I am talking nonsense here). In any event, having been reminded multiple times by several of my co-workers that they just aren't going to spend the time hacking that I do, it seems that to really increase the reproducibility of science we need to address the problem where much of that science is actually happening. And unfortunately, that isn't exclusively with tools we think of as reproducible (i.e., R, Python, etc.).
From @Edild: I personally don't like knitr or sweave for writing research papers. Some thoughts: I like LaTeX more than markdown. It's a little more complicated, but it gives you much more flexibility. Also, some journals accept LaTeX format, though many insist on submitting a .doc file (grr...). For a research paper, the code changes a lot and develops over time, with much of it not used in the final paper. My workflow/structure/setup is something like this:

/data -> raw data files
/src/load.R -> R code; produces figures which are stored in /report/fig and then included in the LaTeX file

Generally I first develop the code and then write the paper. If I publish, I just put my folder in the supplement, and reproducers only need to change one path in load.R (this is also explained in a README file). So this is my workflow explained in a few lines; I hope it is understandable. I am interested in your thoughts on my workflow and your experience writing scientific papers with markdown...
From @mfenner: I'm not very interested in extending Microsoft Word or Excel - only to the extent that I can import/export from the tools I use. For me, reproducibility is very much linked to automation, and I just don't see how that can be done easily in those applications. Markdown, GitHub, Pandoc, Travis, etc. might look geeky now, but I'm happy to go in that direction.
I would like to add that reproducibility is not limited to generating reports. In all likelihood, data analysis will soon move to cloud-based infrastructures where tools and principles from markdown/knitr will become a more accessible, natural part of the analysis process, even for users relying on a GUI. However, much more challenging than weaving results into a document is software versioning. My experience is that, in practice, almost no R script or document more than two years old can be reproduced with the current versions of R and its packages. I personally think addressing this problem is at least as important as the weaving tools.
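A minimal sketch of one small step in that direction: recording the exact environment alongside the analysis, so a future reader at least knows which R and package versions to hunt down (the file name below is arbitrary, just for illustration):

## Record the R version and package versions used for this analysis, so a
## future reader knows what to install when trying to reproduce it.
## (Minimal sketch; "sessionInfo.txt" is an arbitrary file name.)
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")

Of course this only documents the environment; actually restoring those old versions years later is the harder part of the problem.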
Echoing some themes here:
I believe the major obstacles to reproducibility are (a) people not sharing any code to begin with, (b) software versioning, and (c) hard-coded local paths. As much as I love knitr/sweave, I don't think it solves any of these issues. It does not address (a) well because it is simply not the easiest route to code sharing for most users; it is much easier to ask folks to provide a script, preferably as a plain text file at a permanent URL with the minimal metadata discussed in mozillascience/code-research-object#2. The software versioning issue was one of the primary challenges the NESCent informatics team identified when trying to replicate my paper (see cboettig/prosecutors-fallacy#1 and cboettig/prosecutors-fallacy#2), and I believe other papers as well in their exercise. So even if knitr were widely used in publications (which admittedly would solve (a)), this problem would remain. I don't have many ideas on how to address this one effectively, but I would love to hear more thoughts. The local paths problem is really just a data archiving issue, and one I think we can address well with APIs to data publication repositories, particularly those with good API support for pre-release/private sharing of data. This should mean that a user can replace the local paths with remote paths to archived data well before publication, as sketched below.
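For illustration, a minimal sketch of that local-to-remote swap (the path and URL are placeholders, not a real archive):

## Instead of a path that only resolves on the author's machine...
# dat <- read.csv("/home/me/projects/example/data/counts.csv")

## ...point the script at an archived copy that resolves for everyone.
## (Placeholder URL; a real deposit in a data repository would go here.)
dat <- read.csv("https://example.org/archive/counts.csv")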
@cboettig
b) Minimum information would be
c) Yes, a fixed url would be a solution.

# load.R
#####
### Setup project structure
#####
## Project Path
## You have to change this!
prj <- "/home/edisz/Documents/Uni/Projects/mesocosm_methods/review/"
## Subfolder paths
datadir <- file.path(prj, "data") # data
srcdir <- file.path(prj, "src") # source code
cachedir <- file.path(prj, "cache") # caching objects
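A possible variation on the block above (just a sketch, not how @Edild's actual supplement works): if the script is always run from the project root, the one path that reproducers must edit could be inferred instead of hard-coded.

## Assume the script is run from the project root, so no path needs editing.
prj <- getwd()
datadir  <- file.path(prj, "data")   # data
srcdir   <- file.path(prj, "src")    # source code
cachedir <- file.path(prj, "cache")  # caching objects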
I'm not sure if this was touched upon recently, but one key revelation I had when making projects reproducible was the different "levels" of reproducibility. Here is what I mean by that: a workflow broken into separate steps (these can obviously be collapsed at different points, etc.). I have a personal workflow that separates out these different steps, so I can then reproduce a file from different points: for example, I might want to play with the code for graphics but don't want to re-pull the data every time I do it (a rough sketch of that caching idea follows below). I've played around with the idea of there being different "levels" of reproducibility (i.e. level I, level II, etc.). I think this is a concept that a lot of scientists haven't teased apart yet, and having this framework standardized could be really helpful. Also, a lot of my thinking on this was formed by using the ProjectTemplate package (https://github.com/johnmyleswhite/ProjectTemplate). Could we leverage that?
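To make the "reproduce from different points" idea concrete, here is a rough sketch (hypothetical file names and data source, not ProjectTemplate's actual conventions) of caching the pulled data so the downstream steps can be re-run on their own:

## Cache the expensive "pull the data" step so downstream steps
## (munging, analysis, graphics) can be re-run without re-pulling.
dir.create("cache", showWarnings = FALSE)
cache_file <- "cache/raw_data.rds"

if (file.exists(cache_file)) {
  raw <- readRDS(cache_file)                           # fast path: reuse the cached pull
} else {
  raw <- read.csv("https://example.org/raw_data.csv")  # placeholder data source
  saveRDS(raw, cache_file)
}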
Great observation; we are musing on exactly those thoughts here: ropensci-archive/reproducibility-guide#5 and ropensci-archive/reproducibility-guide#4
I think one thing we should really tackle, if possible, is the issue of reproducibility. Outside of our expert/super-user bubble, regular scientists rarely use the suite of tools that we rely on every day. There are hardly any papers that ship as a simple .Rmd that reproduces the entire paper (sans the journal's own style).
What are those roadblocks, and what parts of that pipeline can we streamline with the higher-level tools that Hadley is known to write?
Moving from #18