Here you can browse the code of some of the projects I have worked on.
The MIRACUM Project
MIRACUM stands for Medical Informatics in Research and Care in University Medicine. It is one of four consortia funded by the German Federal Ministry of Education and Research as part of the Medical Informatics Initiative (MII). From 2018 to 2022, the focus was on establishing data integration centers at German university hospitals. The objective was to build a common data model across all participating university hospitals in Germany so that their data could be brought together for federated analyses. As a research assistant at the university medical center in Freiburg, my role in this project was to carry out a federated analysis on a platform called DataSHIELD as part of a use case. This analysis resulted in a published scientific paper.
The 4CE Project
4CE stands for Consortium for Clinical Characterization of COVID-19 by EHR. It is an international consortium for electronic health record (EHR) data-driven studies of the COVID-19 pandemic. The goal of this effort, led by the i2b2 international academics users group, is to inform doctors, epidemiologists and the public about COVID-19 patients with data acquired through the health care process. The project was initiated by Harvard Medical School and brought together many university hospitals worldwide to pool their COVID-19 data and carry out medical analyses. As a research assistant at the university medical center in Freiburg, I was responsible for extracting data from the hospital information system and converting it to a common data model validated by the whole consortium, so that the same analyses could run on the data of every participating member. The specification of the data to be prepared was made available as an Excel file, and each member was responsible for providing the expected data in the specified format. Many scientific papers were published as a result of this project. For confidentiality reasons, I do not show the Python and SQL code that first extracted the data from the hospital information system, only the code that afterwards transformed the extracted data into the required format. On the right-hand side, you can download the file that describes how the data had to be prepared, as well as the Python code I wrote to do it.
Example of Processing of Unstructured dataset
The goal of this project is threefold. First, generate a structured dataset from the unstructured file university_towns.txt. Second, convert the data in the file City_Zhvi_AllHomes.csv, which is currently given on a monthly basis, to a quarterly basis. The transformed dataframe should have columns running from 2000q1 to 2016q3 and a multi-index of the form ["State","RegionName"]. Finally, run a t-test to check the hypothesis that mean housing prices in university towns are less affected by recessions.
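The monthly-to-quarterly step could look roughly like the sketch below. It is not the code from the repository: it assumes that the monthly columns are labelled "YYYY-MM" and that a quarter's value is the mean of its three months, and the function names are made up for illustration.

```python
import re

import pandas as pd

# Assumption: monthly columns are labelled "YYYY-MM", e.g. "2000-01".
MONTH_COLUMN = re.compile(r"^(\d{4})-(\d{2})$")


def month_to_quarter(label):
    """Map a "YYYY-MM" label to a quarter label such as "2000q1"."""
    year, month = MONTH_COLUMN.match(label).groups()
    return f"{year}q{(int(month) - 1) // 3 + 1}"


def monthly_to_quarterly(housing, first="2000q1", last="2016q3"):
    """Average monthly columns per quarter (assumed aggregation rule)."""
    indexed = housing.set_index(["State", "RegionName"])

    # Group the monthly column names by the quarter they belong to.
    months_by_quarter = {}
    for column in indexed.columns:
        if MONTH_COLUMN.match(str(column)):
            months_by_quarter.setdefault(month_to_quarter(column), []).append(column)

    # One column per quarter, kept only inside the requested range.
    # Labels of the form "YYYYqN" compare chronologically as strings.
    return pd.DataFrame({
        quarter: indexed[months].mean(axis=1)
        for quarter, months in sorted(months_by_quarter.items())
        if first <= quarter <= last
    })
```

Calling `monthly_to_quarterly(pd.read_csv("City_Zhvi_AllHomes.csv"))` would then return a dataframe indexed by ["State","RegionName"]; non-date columns are simply ignored.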
Hypothesis: University towns have their mean housing prices less affected by recessions. Run a t-test to compare the ratio of the mean price of houses in university towns in the quarter before the recession starts to the price at the recession bottom (price_ratio = quarter_before_recession / recession_bottom).
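A minimal sketch of such a test is shown below, assuming a quarterly dataframe indexed by ["State","RegionName"] and a list of (state, town) pairs for the university towns. The function name, the comparison against all other towns, and the 0.05 significance level are illustrative assumptions; the two quarter labels have to be determined from the GDP data beforehand.

```python
import pandas as pd
from scipy import stats


def compare_price_ratios(quarterly, university_towns, quarter_before, bottom,
                         alpha=0.05):
    """Compare price ratios of university towns with all other towns.

    quarterly        -- dataframe indexed by (State, RegionName), one column per quarter
    university_towns -- iterable of (State, RegionName) tuples
    quarter_before   -- label of the quarter before the recession starts
    bottom           -- label of the recession bottom quarter
    alpha            -- significance level (assumed value, not from the project)
    """
    # price_ratio = quarter_before_recession / recession_bottom
    ratio = (quarterly[quarter_before] / quarterly[bottom]).dropna()

    is_university = ratio.index.isin(list(university_towns))
    university = ratio[is_university]
    other = ratio[~is_university]

    # Two-sample t-test with scipy's default of equal variances.
    result = stats.ttest_ind(university, other)

    # A lower ratio means prices fell less between the two quarters.
    if university.mean() < other.mean():
        better = "university town"
    else:
        better = "non-university town"
    return bool(result.pvalue < alpha), float(result.pvalue), better
```

The function returns whether the two groups differ significantly, the p-value, and which group had the lower mean price ratio.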
The following data files are available for this assignment:
From the Zillow research data site, there is housing data for the United States. In particular, the data file for all homes at the city level, City_Zhvi_AllHomes.csv, has median home sale prices at a fine-grained level.
From the Wikipedia page on college towns, there is a list of university towns in the United States, which has been copied and pasted into the file university_towns.txt.
From the Bureau of Economic Analysis, US Department of Commerce, there is the GDP of the United States over time in current dollars (use the chained value in 2009 dollars), in quarterly intervals, in the file gdplev.xls. For this assignment, only look at GDP data from the first quarter of 2000 onward.
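As an illustration of the first step, the sketch below turns text lines into (state, town) pairs. The layout of university_towns.txt is not described here, so the sketch assumes the usual shape of a pasted Wikipedia list: state lines end in "[edit]", and town lines carry the university name in parentheses, possibly followed by footnote markers such as "[1]". If the actual file differs, the rules need adjusting.

```python
import re


def parse_university_towns(lines):
    """Return a list of (state, town) tuples from the raw text lines.

    Assumed layout (not confirmed by the project description):
      "Alabama[edit]"                  -> starts a new state
      "Auburn (Auburn University)[1]"  -> a town in the current state
    """
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith("[edit]"):
            state = line[: -len("[edit]")].strip()
            continue
        # Drop footnote markers, then everything from the first " (" onward.
        town = re.sub(r"\[\d+\]", "", line).split(" (")[0].strip()
        rows.append((state, town))
    return rows
```

With pandas, `pd.DataFrame(rows, columns=["State", "RegionName"])` would turn the result into a structured dataset.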
Implementing a Data Warehouse Based on the Data Vault Approach
This project is an example implementation of a data warehouse that follows the Data Vault approach.
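For readers unfamiliar with Data Vault, the sketch below shows the three table types the approach is built on: hubs (business keys), links (relationships between hubs) and satellites (descriptive attributes with load history). It uses an in-memory SQLite database, and all table and column names (customer, order, and so on) are hypothetical; they are not the schema of this project.

```python
import hashlib
import sqlite3

# Hypothetical Data Vault schema: two hubs, one link, one satellite.
SCHEMA = """
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,      -- hash of the business key
    customer_id   TEXT NOT NULL UNIQUE,  -- business key
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_order (
    order_hk      TEXT PRIMARY KEY,
    order_id      TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE link_customer_order (
    customer_order_hk TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    order_hk      TEXT NOT NULL REFERENCES hub_order (order_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_date)  -- one row per change over time
);
"""


def hash_key(*business_keys):
    """Build a deterministic hash key from one or more business keys."""
    joined = "|".join(str(key).strip().upper() for key in business_keys)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def create_vault(conn):
    """Create the hub, link and satellite tables."""
    conn.executescript(SCHEMA)


def load_customer(conn, customer_id, name, city, load_date, source):
    """Insert the hub row once and append a new satellite row."""
    key = hash_key(customer_id)
    conn.execute(
        "INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?, ?)",
        (key, customer_id, load_date, source),
    )
    conn.execute(
        "INSERT INTO sat_customer_details VALUES (?, ?, ?, ?, ?)",
        (key, load_date, name, city, source),
    )
    return key
```

Loading the same customer twice with different attributes leaves a single hub row and two satellite rows, which is how the approach keeps history without updating existing records.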
The Python files have to be executed in this order:
. 01_Create_Database: creates a relational database corresponding to the data at hand
. 02_Extract_Clean_Data: extracts the raw data and does some preprocessing
. 03_Load_Data: once the preprocessing is done, loads the data into the relational database created in step 1
. 04_Load_business_data: does some calculations that feed the table containing business information, which can be used for quick insights from a managerial perspective
. 05_Performance insights: shows quick performance insights
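The execution order above can also be automated. The following helper is a sketch, not part of the repository: it assumes the scripts carry a ".py" extension and a two-digit numeric prefix, and it runs each one from the scripts folder so that relative paths inside the scripts keep working.

```python
import subprocess
import sys
from pathlib import Path


def ordered_scripts(folder):
    """Return the numbered scripts in the folder, sorted by their prefix."""
    return sorted(Path(folder).glob("[0-9][0-9]_*.py"), key=lambda p: p.name)


def run_pipeline(folder="."):
    """Run every numbered script in order and stop at the first failure."""
    for script in ordered_scripts(folder):
        print(f"Running {script.name} ...")
        # cwd=folder keeps relative paths in the scripts valid.
        subprocess.run([sys.executable, script.name], cwd=folder, check=True)
```

Calling `run_pipeline()` from the repository folder would then execute the five steps one after another.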
You can have a look at the resulting relational data warehouse in the file Datawarehouse diagramm.png
The input folder contains the initial data. Because of its size, the data has been zipped, so it must be unzipped before use. The Cleaned Data folder contains the processed data produced by running the 02_Extract_Clean_Data Python code. For the same reason, one of these files has been provided as a zip file.
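Unzipping can be done by hand or with a few lines of Python. The helper below is only a convenience sketch; the function name is made up, and it simply extracts every zip archive it finds next to the archive itself.

```python
import zipfile
from pathlib import Path


def unzip_all(folder):
    """Extract every .zip archive in the folder into that same folder."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
    return extracted
```

For example, `unzip_all("input")` and `unzip_all("Cleaned Data")` would prepare both folders before running the scripts.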
All paths in the code are written as relative paths, assuming that all the scripts are located in the same folder, as they are in the repository.