Using renv to handle R package dependencies

Jake · October 5, 2021, 5:30pm

There is a third option! Nix eats most package management systems and provides pinned alternatives to minimum-version requirements. In any case, is this a good subject for a day-long hackathon? I feel like I would need a bit of collaboration describing the requirements of study packages before providing a good example of how Nix can help.

Mark · October 5, 2021, 6:02pm

Also true in the organization I work in, as the security staff has not signed off on any containerized application. At this point I do not need it; I am just showing that this is not an isolated incident.

For the lucky few that get to work in a *nix environment.

Christian_Reich · October 5, 2021, 6:47pm

Friends:

I know I will make myself unpopular here, but I will repeat saying what I think the issue is:

I don’t think the problem is whether some data scientist schmock is defaulting on our expectations. The problem is that we are living in a federated network, where individual sites have to receive a call to conduct a certain analytic, and that call comes from the outside. If that is an R package, there is no control over what that package does. Even a good R coder can’t easily reverse engineer its content, and it would be a horrible job to begin with. From a security perspective this is not feasible. Individual docker images per study: the same thing. The reason security departments don’t accept docker is that they can’t control what’s inside. renv or nix don’t address this problem; on the contrary, they make packages assemble at run time, requiring code to be downloaded from outside sources again.

For now, we have been flying under the radar, essentially exploiting the trust network among OHDSI collaborators. When a package originates from @schuemie, I can rely on it not being a virus working on behalf of the North Korean government. But as we mature and get more traction from impactful institutions, we will be less and less able to do that. And the moment our informal system, God forbid, gets compromised, the loss of trust will be instantaneous and significant.

I can see only two solutions to that problem:

We create some kind of trusted source from which all executables originate, and which is under tight quality control and protection.
We implement an execution engine like Arachne, or some other core functionality at the site, that takes its instructions from a configuration file but otherwise has all functionality built in.

Nobody wants to be limited in creating solutions. But the question is: Can we not successively add all these eventualities? What are examples of specifications that can only be handled with a completely open computer language?

Jake · October 5, 2021, 7:10pm

Oops, scratch this quoted paragraph below. I didn’t read the above post I’m replying to closely enough the first time to realize that access to virtualization /is/ a serious limitation, regardless of any actual improvements in security gained by building in this method.

-Thanks to virtualization, I don’t see this as a serious limitation. It is possible to build nix projects on windows by mounting the project in a virtual volume, building inside of a basic ephemeral nix container, and copying the result to the output. The project would be able to create executables, executable within a *nix environment, possibly a docker image which is ALSO a build output of the project.-

I also don’t want to squash Christian’s important contribution here, so here is a plug to go back up and read his post after you read mine, to focus on the important larger-picture issue. I think a trusted source /has/ to be auditable, which means its builds must be binary reproducible locally.
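
To make the auditability point a bit more concrete, here is a minimal Python sketch of the check a site could run: hash a locally rebuilt artifact and compare it with a digest published by the trusted source. The function names and the demo bytes are hypothetical and stand in for a real package tarball or image; this only illustrates the idea and is not how Nix itself verifies builds.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_published_digest(local_artifact: bytes, published_digest: str) -> bool:
    """True only if the locally built artifact is bit-for-bit what was published."""
    # compare_digest does a constant-time comparison of the two hex strings.
    return hmac.compare_digest(sha256_hex(local_artifact), published_digest.lower())


# Hypothetical demo: the "published" digest is computed from the same bytes
# that a reproducible build would produce locally.
published = sha256_hex(b"study-package-1.0.0")
print(matches_published_digest(b"study-package-1.0.0", published))   # identical build
print(matches_published_digest(b"study-package-1.0.0+", published))  # altered build
```

The check is only meaningful if two independent builds of the same source really do produce identical bytes, which is the "binary reproducible" requirement.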

Mark · October 5, 2021, 7:17pm

It’s not a problem with virtualization, it is a problem with opsec plans. If one works in an organization where the security team only has experience with Windows security, they can be very resistant to allowing even a virtualized *nix to run anywhere on the server.

schuemie · October 6, 2021, 6:41am

My experience is: no. The list of modifications truly appears endless. Just some recent examples:

A cohort study with an unusual follow-up time definition (on-treatment, but truncated at 365 days)
A study where exposure cohorts are programmatically derived (because there are too many to generate them manually)
A study using custom covariates (e.g. prior outcomes, defined through cohorts).

And yes, we could add all of these to our skeletons (again, having more developers would be helpful), but then the next required modifications will be different from these, so it’s an endless game, with the software getting more and more complex.

With more developers, we might get to a situation where we could cover 80% of studies with our standardized approach (e.g. as implemented in the skeletons). For those studies, you could do a single trusted install, and exchange specifications only (e.g. in JSON format).
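
A minimal Python sketch of what "exchange specifications only" could look like on the receiving site: the trusted install accepts a JSON specification, checks it against a fixed whitelist, and never executes code from the outside. The analysis name, parameter names, and `validate_spec` function are all hypothetical and do not reflect the format of any existing OHDSI tool.

```python
import json

# Hypothetical whitelist: the only analyses (and parameters) the trusted
# install knows how to run.
ALLOWED_ANALYSES = {
    "cohort_study": {"target_cohort_id", "outcome_cohort_id", "max_follow_up_days"},
}


def validate_spec(spec_text: str) -> dict:
    """Parse a JSON study specification and reject anything outside the whitelist."""
    spec = json.loads(spec_text)
    if not isinstance(spec, dict):
        raise ValueError("specification must be a JSON object")
    analysis = spec.get("analysis")
    if not isinstance(analysis, str) or analysis not in ALLOWED_ANALYSES:
        raise ValueError(f"unsupported analysis type: {analysis!r}")
    params = spec.get("parameters", {})
    if not isinstance(params, dict):
        raise ValueError("parameters must be a JSON object")
    unknown = set(params) - ALLOWED_ANALYSES[analysis]
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"analysis": analysis, "parameters": params}


# Hypothetical specification, e.g. follow-up truncated at 365 days.
spec_text = """
{"analysis": "cohort_study",
 "parameters": {"target_cohort_id": 1, "outcome_cohort_id": 2, "max_follow_up_days": 365}}
"""
print(validate_spec(spec_text))
```

Anything the whitelist does not cover is rejected rather than improvised, which is exactly why this approach can only serve the share of studies that fit the standardized design.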

Did I mention we need more developers?

Mark · October 6, 2021, 1:59pm

One could always build an official Maven or Ivy repository of all of the executable code. As many independent build tools can access repositories of this style, it should not be an issue for build-side security.

jposada · October 6, 2021, 6:49pm

Hi Christian,

Thank you for your thoughtful comments. I do not think they are unpopular. We have very different environments.

Let me try to answer some of the implicit questions you posed in your comment.

Regarding this

I already showed examples in this thread where we can see that that is not an issue anymore.

There are plenty of tools to scan images to reveal and evaluate what is inside from a security perspective. The tools are not that different from the ones used to evaluate any other software package.

Regarding this

There are many ways for an organization to be verified and become a trusted source for publishing docker images. We can have an official docker image on Docker Hub and have all the studies use that as the base image. Every study makes the Dockerfile explicit in the code repository. It is a very transparent process. Examples of organizations that are verified and publish verified docker images are Oracle, Google, etc…

https://hub.docker.com/_/oracle-database-enterprise-edition

Here you can see that there is a check that says verified publisher = no trojan horse

Hi @Mark,

Regarding this

Docker image repositories act exactly like Maven repos, but with all the goodies of a fully reproducible environment, not only code.

Mark · October 7, 2021, 8:05pm

Not true for the organization I work for: no Docker images are allowed. I understand the benefits, and detriments, of containers, but this is not my call.

We are not in any federated studies, so maybe our needs are meaningless in this conversation at this time.

jposada · October 8, 2021, 8:10pm

And that is the very reason that, for now, we should support both renv and Docker. And you are right, I should have said that it is not a technical issue anymore; it is more an organizational issue at this point.

SCYou · October 9, 2021, 1:50am

I second this opinion.