auc evaluation

staskh · staskh · commit a196cc5ed155 · 2025-06-15T13:42:12.000+03:00
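
The synthetic test added to the notebook in this commit expects an hourly AUC of 100 mg.h/dL for one hour of 5-minute samples alternating between 80 and 120. A minimal sketch of where that number comes from, assuming "hourly AUC" means the trapezoidal area under the curve divided by the elapsed time in hours; the helper name `hourly_auc` is hypothetical and is not part of iglu or iglu_python:

```python
def hourly_auc(gl, dt_minutes):
    """Trapezoidal area under the glucose curve divided by elapsed hours.

    Hypothetical helper for illustration only; assumes evenly spaced samples.
    """
    dt_hours = dt_minutes / 60.0
    # Area of each trapezoid between consecutive samples (mg*h/dL)
    areas = [(a + b) / 2.0 * dt_hours for a, b in zip(gl[:-1], gl[1:])]
    # Time spanned by the samples, in hours
    elapsed_hours = dt_hours * (len(gl) - 1)
    return sum(areas) / elapsed_hours


# Same synthetic series as in the notebook: 1 h of 5-min samples, 80/120 alternating
dt0 = 5
samples = int(1 * 60 / dt0)
glucose_values = [80, 120] * (samples // 2)

expected = hourly_auc(glucose_values, dt0)
print(round(expected, 6))
```

Every pair of neighbouring samples averages to 100, so the result is 100 up to floating-point rounding, whatever the number of intervals.
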
diff --git a/R_REVIEW.md b/R_REVIEW.md
@@ -7,6 +7,13 @@
       (length(na.omit(diffs))*n/60)
 ```
 
+## AUC
+
+```
+        day = rep(data_ip[[2]], 1440/dt0),
+```
+This generates the whole sequence of days repeated 1440/dt0 times, whereas each day has to be repeated 1440/dt0 times and only then followed by the next day
+
 ## CGMS2DayByDay
 
 [ndays = ceiling(as.double(difftime(max(tr), min(tr), units = "days")) + 1)](https://github.com/irinagain/iglu/blob/82e4d1a39901847881d5402d1ac61b3e678d2a5e/R/utils.R#L208) has to be ndays = ceiling(as.double(difftime(max(tr), min(tr), units = "days")))`
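
The AUC note in the R_REVIEW.md hunk above contrasts two ways of repeating the day labels. A small sketch of the difference in plain Python, using two made-up dates and 3 samples per day as stand-ins for `data_ip[[2]]` and `1440/dt0` (both values are assumptions chosen for illustration):

```python
# Hypothetical day labels and samples-per-day count, for illustration only
days = ["2020-01-01", "2020-01-02"]
samples_per_day = 3

# Behaves like R's rep(days, samples_per_day): the whole vector is repeated,
# so consecutive samples are labelled with alternating days
tiled = days * samples_per_day

# Behaves like R's rep(days, each = samples_per_day): each day is repeated
# for all of its samples before the next day starts
repeated = [day for day in days for _ in range(samples_per_day)]

print(tiled)
print(repeated)
```

If the glucose vector is laid out day by day, only the second labelling lines up with it. In NumPy the same two behaviours correspond to `np.tile` and `np.repeat`.
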
diff --git a/notebooks/auc_evaluation.ipynb b/notebooks/auc_evaluation.ipynb
@@ -18,7 +18,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -51,7 +51,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 2,
    "metadata": {},
    "outputs": [
     {
@@ -153,7 +153,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -167,7 +167,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [
     {
@@ -194,6 +194,104 @@
     "print(f\"rpy2 version: {version('rpy2')}\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Test on synthetic data\n",
+    "\n",
+    "- Samples - every 5 min\n",
+    "- duration - 1h\n",
+    "- values [80,120] repeated for sampling duration\n",
+    "\n",
+    "Expected hourly AUC = 100 mg.h/dL"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>id</th>\n",
+       "      <th>hourly_auc</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>subject1</td>\n",
+       "      <td>102.222222</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "         id  hourly_auc\n",
+       "1  subject1  102.222222"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "hours = 1\n",
+    "dt0 = 5\n",
+    "samples = int(hours*60/dt0)\n",
+    "times = pd.date_range('2020-01-01', periods=samples, freq=f\"{dt0}min\")\n",
+    "glucose_values = [80,120]* int(samples/2)\n",
+    "\n",
+    "syntheticdata = pd.DataFrame({\n",
+    "    'id': ['subject1'] * samples,\n",
+    "    'time': times,\n",
+    "    'gl': glucose_values\n",
+    "})\n",
+    "\n",
+    "synthetic_iglu_auc_results  = iglu_py.auc(syntheticdata)\n",
+    "synthetic_iglu_auc_results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note:** The incorrect AUC value is a result of bugs in the CGMS2DayByDay function:\n",
+    "- a one-sample shift in interpolation, which yields 11 samples instead of 12\n",
+    "- actual_dates returns 2 dates instead of one\n",
+    "\n",
+    "The AUC code itself also looks suspicious: `day = rep(data_ip[[2]], 1440/dt0),` appears to assign sequential gl values to alternating days, instead of labelling all samples of one day before moving on to the next\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Test on example data  "
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 6,
@@ -280,6 +378,7 @@
     }
    ],
    "source": [
+    "test_data = \"../tests/data/example_data_5_subject.csv\"\n",
     "# load test data into DF\n",
     "df = pd.read_csv(test_data, index_col=0)\n",
     "\n",
@@ -298,12 +397,41 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Lets try to run AUC on simulated data with easily calculatable AUC"
+    "## Conclusions \n",
+    "IGLU AUC calculations differ substantially from the expected ranges suggested by ChatGPT\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# IGLU_PYTHON results"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Add project directory to PYTHONPATH\n",
+    "import os\n",
+    "import sys\n",
+    "import pandas as pd\n",
+    "sys.path.append(os.path.abspath('..'))\n",
+    "import iglu_python\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Test on synthetic data"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 18,
+   "execution_count": 8,
    "metadata": {},
    "outputs": [
     {
@@ -333,72 +461,46 @@
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
-       "      <th>1</th>\n",
+       "      <th>0</th>\n",
        "      <td>subject1</td>\n",
-       "      <td>102.222222</td>\n",
+       "      <td>100.0</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
        "         id  hourly_auc\n",
-       "1  subject1  102.222222"
+       "0  subject1       100.0"
       ]
      },
-     "execution_count": 18,
+     "execution_count": 8,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "hours = 1\n",
-    "dt0 = 5\n",
-    "samples = int(hours*60/dt0)\n",
-    "times = pd.date_range('2020-01-01', periods=samples, freq=f\"{dt0}min\")\n",
-    "glucose_values = [80,120]* int(samples/2)\n",
-    "\n",
-    "data = pd.DataFrame({\n",
-    "    'id': ['subject1'] * samples,\n",
-    "    'time': times,\n",
-    "    'gl': glucose_values\n",
-    "})\n",
-    "\n",
-    "iglu_auc_results  = iglu_py.auc(data)\n",
-    "iglu_auc_results"
+    "synthetic_iglu_auc_results  = iglu_python.auc(syntheticdata)\n",
+    "synthetic_iglu_auc_results"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Conclusions \n",
-    "IGLU AUC calculations are substantially differ from expected ranges suggested by ChatGPT\n"
+    "**Note:** The result matches the expected value"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# IGLU_PYTHON results"
+    "## Test on Example data"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Add project directory to PYTHONPATH\n",
-    "import os\n",
-    "import sys\n",
-    "\n",
-    "sys.path.append(os.path.abspath('..'))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [
     {
@@ -501,14 +603,9 @@
     }
    ],
    "source": [
-    "import pandas as pd\n",
-    "\n",
-    "import iglu_python\n",
-    "\n",
     "# load test data into DF\n",
     "df = pd.read_csv(test_data, index_col=0)\n",
     "\n",
-    "iglu_python.IGLU_R_COMPATIBLE = False\n",
     "iglu_python_auc_results = iglu_python.auc(df)\n",
     "iglu_python_auc_results = iglu_python_auc_results.round(0)\n",
     "\n",
@@ -518,88 +615,15 @@
     "iglu_python_auc_results['Difference to IGLU(%)'] = ((iglu_python_auc_results['IGLU PYTHON AUC (mg*h/dL)'] - iglu_python_auc_results['IGLU AUC (mg*h/dL)']) / iglu_python_auc_results['IGLU AUC (mg*h/dL)'] * 100).round(1)\n",
     "iglu_python_auc_results['Difference to ChatGPt(%)'] = ((iglu_python_auc_results['IGLU PYTHON AUC (mg*h/dL)'] - iglu_python_auc_results['ChatGPT AUC (mg*h/dL)']) / iglu_python_auc_results['ChatGPT AUC (mg*h/dL)'] * 100).round(1)\n",
     "\n",
-    "\n",
-    "\n",
-    "display(iglu_python_auc_results)\n",
-    "\n",
-    "\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 21,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "<div>\n",
-       "<style scoped>\n",
-       "    .dataframe tbody tr th:only-of-type {\n",
-       "        vertical-align: middle;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe tbody tr th {\n",
-       "        vertical-align: top;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe thead th {\n",
-       "        text-align: right;\n",
-       "    }\n",
-       "</style>\n",
-       "<table border=\"1\" class=\"dataframe\">\n",
-       "  <thead>\n",
-       "    <tr style=\"text-align: right;\">\n",
-       "      <th></th>\n",
-       "      <th>id</th>\n",
-       "      <th>hourly_auc</th>\n",
-       "    </tr>\n",
-       "  </thead>\n",
-       "  <tbody>\n",
-       "    <tr>\n",
-       "      <th>0</th>\n",
-       "      <td>subject1</td>\n",
-       "      <td>100.0</td>\n",
-       "    </tr>\n",
-       "  </tbody>\n",
-       "</table>\n",
-       "</div>"
-      ],
-      "text/plain": [
-       "         id  hourly_auc\n",
-       "0  subject1       100.0"
-      ]
-     },
-     "execution_count": 21,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "hours = 1\n",
-    "dt0 = 5\n",
-    "samples = int(hours*60/dt0)\n",
-    "times = pd.date_range('2020-01-01', periods=samples, freq=f\"{dt0}min\")\n",
-    "glucose_values = [80,120]* int(samples/2)\n",
-    "\n",
-    "data = pd.DataFrame({\n",
-    "    'id': ['subject1'] * samples,\n",
-    "    'time': times,\n",
-    "    'gl': glucose_values\n",
-    "})\n",
-    "\n",
-    "iglu_python.IGLU_R_COMPATIBLE = True\n",
-    "iglu_python_auc_results = iglu_python.auc(data)\n",
-    "iglu_python_auc_results"
+    "display(iglu_python_auc_results)\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Conclusions  \n",
-    "IGLU_PYTHON AUC calculations are close to IGLU calculations (-5%), and closer to  suggested by ChatGPT\n",
+    "IGLU_PYTHON AUC calculations are close to IGLU calculations (-0.5%)\n",
     "\n"
    ]
   }