Goals
The near-term goal is to provide an exploration space where individuals can try out our R offerings without installing any software. A user spins up an RStudio Server browser session that is already logged in to the Synapse client and downloads the entity they were previously viewing on the web. They can then explore the SageBio-curated data and run simple training tutorials. The RStudio session runs in a sandbox that is wiped after each user session.
A longer-term goal is to use the lessons we learn from this prototype to decide whether it could be part of the larger solution for computation via Synapse.
Assumptions
- The RStudio folks are willing to help a little. They might help more than a little if any of the work we propose makes RStudio better for everyone.
- We don't have sprint time for this; it would be a 15% project.
- If we can create a compelling new user experience, it might also help promote the Bioconductor Cloud AMI, which would be a nice thing to do.
Constraints and Other Info
- A bonus feature: if sessions use a shared Synapse cache, downloads should take less time. However, we have to think through the implications of users seeing other datasets already in the cache and thereby bypassing the use agreement logic.
- The crux of the issue is that, without major changes, each "RStudio login" on RStudio Server has to map one-to-one to a UNIX account on the machine (if five users want to use RStudio Server on the same machine, there need to be five separate UNIX accounts).
- When you use RStudio Server, you can write to your home directory, so it's important to clean up afterwards if someone else is going to reuse that home directory. We can't do a simple round-robin over a fixed set of accounts without cleaning them up between uses.
- RStudio Server is written in C++. We don't want to write or support a Synapse C++ client, so we should look for solutions that minimize the amount of Synapse-specific functionality we add to RStudio Server as C++ code.
- RStudio Server also uses GWT. Need more detail here, but it may be the case that some of our web functionality could transfer over.
- RStudio Server has no tests. Hundreds of users download the development version on a regular basis, and that is how they currently test: they wait for bug reports from users.
- The RStudio folks have received only one code contribution from an outside developer, and they rejected it because the developer did not really understand the inner workings of RStudio Server. The takeaway is that if we have a contribution that we think should become part of the core project, we should have a design review with them before we code it.
Proposal
Here's an idea for a quick and dirty approach to web-based R for Synapse. If we make the implementation of the auth-plugin for RStudio sound simple and generic enough, perhaps they would be willing to write it for us?
- we have a pool of EC2 hosts running RStudio Server behind a load balancer with CNAME rstudio.sagebase.org (early on we start with just a single host and no load balancer, but still use the CNAME)
- our services and the custom RStudio auth plugin share a secret key for use in HMAC-SHA1 computation
- if someone on Synapse wants to use RStudio, we redirect them to https://rstudio.sagebase.org:8787/auth-sign-in?securityToken=XXXXXXXX&expires=1317916890&signature=rucSbH0yNEcP9oM2XNlouVI3BH4%3D (assumption: not sure whether the securityToken is the Synapse sessionToken or batch API key, either way we may want to re-encrypt it if we are displaying it in clear text in the https url)
- this url must be used within 5 minutes to start an RStudio session
- the RStudio auth plugin re-computes the signature on securityToken=XXXXXXXX&expires=1317916890
  - if it doesn't match, the user gets a helpful error message
  - if it does match, but the url has expired, the user gets a helpful error message
  - if it does match and the url is not expired
    - we use a stable method to compute a UNIX username, no longer than 32 characters, from the Synapse securityToken
    - run useradd and create the home directory (TODO: RStudio Server doesn't run as root, so it may need to contact a daemon of ours running on the box to do the things requiring root privilege)
    - put a default .Rprofile in it to
      - load and log them into the R Synapse client
        library(synapseClient); sessionToken(securityToken)
      - add a package installation directory under the user's home dir if they want to install their own R packages
      - perhaps configure use of the shared cache of Synapse data (if we can sort out the data security issues)
- When the user logs out, nuke that UNIX account and its home directory.
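
The signing and checking steps in the list above could be sketched roughly as follows. This is in Python purely for illustration; in practice the Synapse services would do the signing and the check would live in the RStudio auth plugin. The secret value, the function names, the choice to sign the URL-encoded securityToken=...&expires=... string and Base64-encode the digest, and the SHA-1-based username scheme are all assumptions made for the sketch, not decisions recorded in this proposal.

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical values: the real secret would be shared out of band.
SHARED_SECRET = b"example-shared-secret"
BASE_URL = "https://rstudio.sagebase.org:8787/auth-sign-in"
URL_LIFETIME_SECONDS = 5 * 60  # "must be used within 5 minutes"


def _signature(security_token, expires, secret):
    """HMAC-SHA1 over 'securityToken=...&expires=...', Base64-encoded."""
    message = urlencode({"securityToken": security_token, "expires": expires})
    digest = hmac.new(secret, message.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def make_signed_url(security_token, secret=SHARED_SECRET, now=None):
    """Build the redirect URL that the Synapse side would hand to the user."""
    now = int(time.time()) if now is None else int(now)
    expires = now + URL_LIFETIME_SECONDS
    query = urlencode({
        "securityToken": security_token,
        "expires": expires,
        "signature": _signature(security_token, expires, secret),
    })
    return f"{BASE_URL}?{query}"


def check_signed_url(url, secret=SHARED_SECRET, now=None):
    """Return (ok, message), mirroring the plugin's three outcomes."""
    now = int(time.time()) if now is None else int(now)
    params = parse_qs(urlsplit(url).query)
    try:
        token = params["securityToken"][0]
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False, "malformed sign-in link"
    expected = _signature(token, expires, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode("utf-8"),
                               signature.encode("utf-8")):
        return False, "signature mismatch"
    if now > expires:
        return False, "link expired"
    return True, "ok"


def unix_username(security_token):
    """Stable 32-character name: letter prefix plus a truncated SHA-1 hex."""
    digest = hashlib.sha1(security_token.encode("utf-8")).hexdigest()
    return "syn" + digest[:29]
```

Hashing the token rather than embedding it keeps the secret out of the account name, and the "syn" prefix guarantees the name starts with a letter. Whether a truncated hash is unique enough, and whether the token should be re-encrypted before it appears in the URL, are still open questions from the list above.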