Continuing the discussion from Need some short video tutorials on ETLs:
Martijn,
Which issues would you say are top priority? It seems like a couple of the unmarked ones are really just enhancements, e.g. “Consider sorting columns alphabetically”.
Well, actually a lot of them are important:
- Not being able to delete arrows is the most annoying thing by far.
- We’re already running into backward-compatibility issues, so changing the file format from binary to something like XML would be next on my list.
- Being able to switch from CDMv5 to CDMv4 should be fairly easy to implement (both models are already there under the hood).
- Updating the scan report and/or CDM data would be very helpful in a lot of situations, although having a readable file format would already help.
- Handling data sets with a large number of tables/fields better would help some folks (e.g. allowing filtering, searching, and sorting of table/field names).
Might be helpful to break down some of these and add a priority label on GitHub.
Also, some of these might not require much Java experience, while others would. So perhaps helping people get a sense of what they can work on without being a Java ninja-coder would be helpful.
I cleaned up the issues a bit and added ‘help wanted’ and ‘priority’ labels. I’m still learning all this stuff. Thanks, Jon!
@schuemie, I’d love to help hack on WhiteRabbit a bit, especially since I just added 3 new issues to GitHub, but my Java skills are a bit rusty.
I see a build.xml file in the repo, but when I run ant on it, it only spits out WhiteRabbit and not RabbitInAHat.
If you could update the Readme on how to get started with hacking on WhiteRabbit and building both targets, I’d be happy to try to help with this project.
Thanks.
That would be great! I just modified the Ant file; it should now create both jars.
I’m a bit unsure as to what you need to get started. Do you also use Eclipse? Basically, both applications use the Swing GUI (and you’ll see I’m a bit rusty with that as well). The main class for WhiteRabbit is org.ohdsi.whiteRabbit.WhiteRabbitMain; for RabbitInAHat it’s org.ohdsi.rabbitInAHat.RabbitInAHatMain. Both applications have some application-specific code, but share a lot of code as well.
Could you take a look at the code, and let me know where you need input from me?
Hi there, I am jumping in on this a little late. I am new to WhiteRabbit/RiaH and trying to figure out the different tools. Along with being able to generate an ETL document as a template in RiaH, there is also the option to generate an ETL test framework in R.
@schuemie, I am wondering if you have any clarification on the purpose/use of this feature. Also, has anyone else been using it as a tool?
Previously there was mention of video tutorials for WhiteRabbit/RiaH; have those been created and shared?
Thanks for the help!
Video tutorials are still on our to-do list. For now, the only documentation of WhiteRabbit and RiaH is on the Wiki.
The ETL test framework is still under development. I’m currently using it myself, and others in my team will start using it soon. The framework is for efficiently writing unit tests for the ETL, to make sure the ETL implementation is doing what it is supposed to do. It lets you quickly write tests that both create data in the source schema and check whether the right data is created in the CDM schema.
I will start to document this feature somewhere in the next few weeks (on the Wiki).
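Roughly, writing a test looks something like the sketch below. Treat the function and file names as illustrative placeholders: the real functions are generated by RabbitInAHat from your own source and CDM table definitions, so they will look different for your ETL.

```r
# Illustrative sketch only: function and file names here are hypothetical;
# the real ones are generated by RabbitInAHat from your ETL specification.
source("EtlTestFramework.R")  # the R file exported by RabbitInAHat

# Each test declares a bit of source data plus the CDM records the ETL
# should produce from it:
declareTest(id = 1, description = "Female gender is mapped correctly")
add_patient(patient_id = 1000, sex = "F")                  # row for the Test Source DB
expect_person(person_id = 1000, gender_concept_id = 8532)  # expected row in the Test CDM DB

# Once all tests are declared, write out the two SQL scripts:
# one to fill the Test Source DB, one to check the Test CDM DB.
writeLines(generateInsertSql(), "insert.sql")
writeLines(generateTestSql(), "test.sql")
```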
Thank you for the timely response. We have read through the documentation and started to play with a few of the functions. We need clarification on a few points. The instructions for generating the test data state: “The SQL assumes that the data schema already exists, and will first remove any records that might be in the tables”.
Does this mean the source schema has to already exist?
Does there have to be data in the source schema initially? If not, where is the test data generated from? If so, do we understand correctly that the initial data is removed from the tables and replaced by the test data?
Here’s what I’m supposing is your situation: You have your data in source format somewhere in a database. You have created an ETL process that will extract from the source database, transform it into CDM format, and load it into a CDM schema.
To make sure that ETL process is doing what it is supposed to do, you can use the unit test framework. This means you will need to create a new, empty database with exactly the same structure as your source database, and a new, empty database to hold the test CDM. Let’s call these the Test Source DB and the Test CDM DB.
Using the framework, you can populate the Test Source DB (that is what the `insertSql` generated by your R script is for). You can then run your ETL process exactly as you would on your real source data, but populating the Test CDM DB instead. On the Test CDM DB you can then run the `testSql` that is also generated by the R script. This will create a new table called `test_results` with the results for each of the tests.
Obviously, you do not want to run `insertSql` or `testSql` on your real source and CDM databases.
Hope this helps. I will try to clarify the instructions a bit on this point.
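To make the workflow concrete, here is one way to run the generated SQL from R, using the OHDSI DatabaseConnector package (any other SQL client works just as well). The server names, credentials, and file names below are placeholders for your own setup.

```r
# Sketch of running the generated SQL against the two test databases.
# All connection details and file names are placeholders.
library(DatabaseConnector)

# 1. Populate the empty Test Source DB with the generated insert statements:
sourceDetails <- createConnectionDetails(dbms = "postgresql",
                                         server = "localhost/test_source",
                                         user = "test", password = "secret")
conn <- connect(sourceDetails)
executeSql(conn, paste(readLines("insert.sql"), collapse = "\n"))
disconnect(conn)

# 2. Run your ETL from the Test Source DB into the Test CDM DB
#    (however your ETL is normally run; not shown here).

# 3. Run the generated tests against the Test CDM DB and inspect the results:
cdmDetails <- createConnectionDetails(dbms = "postgresql",
                                      server = "localhost/test_cdm",
                                      user = "test", password = "secret")
conn <- connect(cdmDetails)
executeSql(conn, paste(readLines("test.sql"), collapse = "\n"))
results <- querySql(conn, "SELECT * FROM test_results")  # one row per test
print(results)
disconnect(conn)
```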