Commit d04bcbb

init MIMIC-IV-ED build duckdb
1 parent e398252 commit d04bcbb

2 files changed

Lines changed: 214 additions & 0 deletions

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# DuckDB

The script in this folder creates the schema for MIMIC-IV-ED and
loads the data into the appropriate tables for
[DuckDB](https://duckdb.org/).
DuckDB, like SQLite, is serverless and
stores all information in a single file.
Unlike SQLite, an OLTP database,
DuckDB is an OLAP database, and is therefore optimized for analytical queries.
For researchers using MIMIC-IV-ED, this typically means faster analytical queries
with DuckDB than with SQLite.
To learn more, please read DuckDB's ["why duckdb"](https://duckdb.org/docs/why_duckdb)
page.

The instructions to load MIMIC-IV-ED into a DuckDB database
require only:
1. DuckDB to be installed and
2. Your computer to have a POSIX-compliant terminal shell,
   which is already found by default on any macOS, Linux, or BSD installation.

To use these instructions on Windows,
you need a Unix command line environment,
which you can obtain by either installing
[Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
or [Cygwin](https://www.cygwin.com/).

## Set-up

### Quick overview

1. [Install](https://duckdb.org/docs/installation/) the CLI version of DuckDB
2. [Download](https://physionet.org/content/mimic-iv-ed/2.2/) the MIMIC-IV-ED files
3. Create DuckDB database and load data

### Install DuckDB

Follow the instructions on the DuckDB website to
[install](https://duckdb.org/docs/installation/)
the CLI version of DuckDB.

You will need to place the `duckdb` binary in a folder on your environment path,
e.g. `/usr/local/bin`.

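To confirm that the binary is reachable before continuing, a quick check along these lines may help. This is only a sketch: `have_cmd` is a hypothetical helper name and is not part of `import_duckdb.sh`.

```sh
#!/bin/sh
# Hypothetical helper (not part of import_duckdb.sh): succeeds if the
# named command can be found on the current PATH.
have_cmd () { command -v "$1" >/dev/null 2>&1; }

if have_cmd duckdb; then
    echo "duckdb found at: $(command -v duckdb)"
else
    echo "duckdb not found on PATH; move the binary to a PATH folder such as /usr/local/bin" >&2
fi
```

`command -v` is specified by POSIX, so the check should behave the same in any POSIX-compliant shell.
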
### Download MIMIC-IV-ED files

Download the CSV files for [MIMIC-IV-ED](https://physionet.org/content/mimic-iv-ed/2.2/)
by any method you wish.
These instructions were tested with MIMIC-IV-ED v2.2.

The CSV files should be in a folder structure as follows:

```
mimic_data_dir
    ed
        diagnosis.csv.gz
        ...
        vitalsign.csv.gz
```

The CSV files can be uncompressed (end in `.csv`) or compressed (end in `.csv.gz`).

The easiest way to download them is to open a terminal and run:

```
wget -r -N -c -np --user YOURUSERNAME --ask-password https://physionet.org/files/mimic-iv-ed/2.2/
```

Replace `YOURUSERNAME` with your PhysioNet username.

This will make your `mimic_data_dir` be `physionet.org/files/mimic-iv-ed/2.2`.

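Before loading, it can be useful to verify that the download produced the expected layout. The sketch below is a hypothetical pre-flight check, not part of `import_duckdb.sh`. It assumes the `ed` folder holds the six tables `diagnosis`, `edstays`, `medrecon`, `pyxis`, `triage`, and `vitalsign`; adjust the list if your release differs.

```sh
#!/bin/sh
# Hypothetical pre-flight check (not part of import_duckdb.sh).
# Prints the name of each expected table that has neither a .csv nor a
# .csv.gz file under <mimic_data_dir>/ed, and returns non-zero if any
# table is missing.
# Assumption: the six table names below match your MIMIC-IV-ED release.
check_ed_dir () {
    dir=$1
    missing=0
    for table in diagnosis edstays medrecon pyxis triage vitalsign; do
        if [ ! -f "$dir/ed/$table.csv" ] && [ ! -f "$dir/ed/$table.csv.gz" ]; then
            echo "missing: $table"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using the wget destination described above:
#   check_ed_dir physionet.org/files/mimic-iv-ed/2.2
```

The function accepts either file extension per table, matching the statement that both compressed and uncompressed CSV files are allowed.
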
### Create DuckDB database and load data

The last step requires creating a DuckDB database and
loading the data into it.

You can do all of this with one shell script, `import_duckdb.sh`,
located in this repository.
Run it from this folder, because it reads the schema from the relative path `../postgres/create.sql`.

See the help for it below:

```sh
$ ./import_duckdb.sh -h
./import_duckdb.sh:
USAGE: ./import_duckdb.sh mimic_data_dir [output_db]
WHERE:
    mimic_data_dir      directory that contains csv.gz or csv files
    output_db: optional filename for duckdb file (default: mimic4_ed.db)
$
```

The script will print out progress as it goes. It should only take a few seconds to load.

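As it loads each file, the script derives the target table name from the file's path: the parent folder becomes the schema suffix and the file name, minus its extensions, becomes the table. The sketch below mirrors that logic using POSIX parameter expansion; `table_name_for` is a hypothetical name used here for illustration only.

```sh
#!/bin/sh
# Sketch of the path-to-table mapping, using POSIX parameter expansion only.
# table_name_for is a hypothetical name used for illustration.
table_name_for () {
    base=${1##*/}        # ./ed/triage.csv.gz -> triage.csv.gz
    table=${base%%.*}    # triage.csv.gz -> triage
    path=${1%/*}         # ./ed/triage.csv.gz -> ./ed
    module=${path##*/}   # ./ed -> ed
    echo "mimiciv_$module.$table"
}

table_name_for "physionet.org/files/mimic-iv-ed/2.2/ed/triage.csv.gz"
# prints: mimiciv_ed.triage
```

Because only the immediate parent folder is used, the files must sit directly inside a folder named `ed` for the resulting name to match a table in the schema.
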
## Help

Please see the [issues page](https://github.com/MIT-LCP/mimic-code/issues) to report or discuss any other issues you may be having.
Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
#!/bin/sh

# Copyright (c) 2023 MIT Laboratory for Computational Physiology
# Copyright (c) 2021 Thomas Ward <thomas@thomasward.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

yell () { echo "$0: $*" >&2; }
die () { yell "$*"; exit 111; }
try () { "$@" || die "Exiting. Failed to run: \"$*\""; }

usage () {
    die "
USAGE: ./import_duckdb.sh mimic_data_dir [output_db]
WHERE:
    mimic_data_dir      directory that contains csv.gz or csv files
    output_db: optional filename for duckdb file (default: mimic4_ed.db)\
"
}

# Print help if requested
echo "$0 $* " | grep -Eq " -h | --help " && usage

# rename CLI positional args to more friendly variable names
MIMIC_DIR=$1
# allow optional specification of duckdb name, otherwise default to mimic4_ed.db
OUTFILE=mimic4_ed.db
if [ -n "$2" ]; then
    OUTFILE=$2
fi


# basic error checking before running
if [ -z "$MIMIC_DIR" ]; then
    yell "Please specify a mimic data directory"
    die "Usage: ./import_duckdb.sh mimic_data_dir [output_db]"
elif [ ! -d "$MIMIC_DIR" ]; then
    yell "Specified directory \"$MIMIC_DIR\" does not exist."
    die "Usage: ./import_duckdb.sh mimic_data_dir [output_db]"
elif [ -n "$3" ]; then
    yell "import_duckdb.sh takes a maximum of two arguments."
    die "Usage: ./import_duckdb.sh mimic_data_dir [output_db]"
elif [ -s "$OUTFILE" ]; then
    yell "File \"$OUTFILE\" already exists."
    # 'read -p' is not POSIX, so print the prompt with printf instead
    printf "Continue? (y/d/n) 'y' continues, 'd' deletes original file, 'n' stops: "
    read -r yn
    case $yn in
        [Yy]* ) ;; # OK
        [Nn]* ) exit;;
        [Dd]* ) rm "$OUTFILE";;
        * ) die "Unrecognized input.";;
    esac
fi

# we will copy the postgresql create.sql file, and apply regex
# to fix the following issues:
#   1. Remove optional precision value from TIMESTAMP(NN) -> TIMESTAMP
#      duckdb does not support this.
export REGEX_TIMESTAMP='s/TIMESTAMP\([0-9]+\)/TIMESTAMP/g'
#   2. Remove NOT NULL constraint from mimiciv_hosp.microbiologyevents.spec_type_desc
#      as there is one (!) zero-length string which is treated as a NULL by the import.
export REGEX_SPEC_TYPE='s/spec_type_desc(.+)NOT NULL/spec_type_desc\1/g'
#   3. Remove NOT NULL constraint from mimiciv_hosp.prescriptions.drug
#      as there are zero-length strings which are treated as NULLs by the import.
export REGEX_DRUG='s/drug +(VARCHAR.+)NOT NULL/drug \1/g'
# NOTE: rules 2 and 3 target mimiciv_hosp tables; they change nothing if
# those columns are absent from the create.sql being processed.

# use sed + above regex to create tables within db
# (-E is the portable spelling of GNU sed's -r and also works with BSD/macOS sed)
sed -E -e "${REGEX_TIMESTAMP}" ../postgres/create.sql | \
    sed -E -e "${REGEX_SPEC_TYPE}" | \
    sed -E -e "${REGEX_DRUG}" | \
    duckdb "$OUTFILE"

# goal: take a path from find, e.g., ./2.2/ed/triage.csv.gz
# and set TABLE_NAME to its database table name, e.g., mimiciv_ed.triage
# (also sets DIRNAME, which the load loop uses to skip unexpected folders)
make_table_name () {
    # strip leading directories (e.g., ./ed/hello.csv.gz -> hello.csv.gz)
    BASENAME=${1##*/}
    # strip suffix (e.g., hello.csv.gz -> hello; hello.csv -> hello)
    TABLE_NAME=${BASENAME%%.*}
    # strip basename (e.g., ./ed/hello.csv.gz -> ./ed)
    PATHNAME=${1%/*}
    # strip leading directories from PATHNAME (e.g. ./ed -> ed)
    DIRNAME=${PATHNAME##*/}
    TABLE_NAME="mimiciv_$DIRNAME.$TABLE_NAME"
}


# load data into database
# match both compressed (.csv.gz) and uncompressed (.csv) files;
# the previous pattern '*.csv???' only matched names ending in .csv.gz
find "$MIMIC_DIR" -type f \( -name '*.csv' -o -name '*.csv.gz' \) | sort | while IFS= read -r FILE; do
    make_table_name "$FILE"

    # skip directories which we do not expect in mimic-iv-ed
    # avoids syntax errors if mimic-iv is in the same dir
    case $DIRNAME in
        (ed) ;; # OK
        (*) continue ;;
    esac
    # printf instead of echo "...\c", which is not portable across shells
    printf 'Loading %s .. ' "$FILE"
    try duckdb "$OUTFILE" <<EOSQL
COPY $TABLE_NAME FROM '$FILE' (HEADER);
EOSQL
    echo "done!"
done && echo "Successfully finished loading data into $OUTFILE."
