Add more details to the Discrepancies notebook

staskh · staskh · commit 96f9c5efc34d · 2025-06-15T15:01:00.000+03:00
diff --git a/iglu_r_discrepancies.ipynb b/iglu_r_discrepancies.ipynb
@@ -19,6 +19,7 @@
     "\n",
     "import pandas as pd\n",
     "import rpy2.robjects as ro\n",
+    "import iglu_py\n",
     "from iglu_py import bridge"
    ]
   },
@@ -80,6 +81,13 @@
     "    return result\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Simple test "
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -182,18 +190,19 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 6,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "(2, 288)\n",
-      "[Timestamp('2020-01-01 00:00:00'), Timestamp('2020-01-02 00:00:00')]\n",
-      "5.0\n",
+      "gd2d.shape=(2, 288)         \t/ expected (1,288)\n",
+      "actual_dates=[Timestamp('2020-01-01 00:00:00'), Timestamp('2020-01-02 00:00:00')]     \t/ expected [Timestamp('2020-01-01 00:00:00')]\n",
+      "dt0=5.0\n",
+      "gd2d[:,0:5]=\n",
       "[[155. 160. 165.  nan  nan]\n",
-      " [ nan  nan  nan  nan  nan]]\n"
+      " [ nan  nan  nan  nan  nan]]      \t/ expected [[150. 155. 160. 165. nan]]\n"
      ]
     }
    ],
@@ -204,11 +213,10 @@
     "actual_dates = r_result['actual_dates']\n",
     "dt0 = r_result['dt0']\n",
     "\n",
-    "print(gd2d.shape)       # expected (1,288)\n",
-    "print(actual_dates)     # expected [datetime.date(2020, 1, 1)]\n",
-    "print(dt0)              # expected 5\n",
-    "\n",
-    "print(gd2d[:,0:5])      # expected [[150. 155. 160. 165. nan]]\n",
+    "print(f\"gd2d.shape={gd2d.shape}         \\t/ expected (1,288)\")\n",
+    "print(f\"actual_dates={actual_dates}     \\t/ expected [Timestamp('2020-01-01 00:00:00')]\")\n",
+    "print(f\"dt0={dt0}\")              # expected 5\n",
+    "print(f\"gd2d[:,0:5]=\\n{gd2d[:,0:5]}      \\t/ expected [[150. 155. 160. 165. nan]]\")\n",
     "\n",
     "\n",
     "\n"
@@ -218,14 +226,99 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Note:** gd2d.shape is (2, 288) instead of (1, 288) and gd2d[0,:] has only 3 non-nan values instead of expected 4\n",
+    "**Note:** gd2d.shape is (2, 288) instead of the expected (1, 288), and gd2d[0,:] has only 3 non-NaN values instead of the expected 4"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Impact  \n",
+    "\n",
+    "While these discrepancies may appear minor, they can significantly impact certain metric calculations.\n",
+    "\n",
+    "For example, when calculating AUC on synthetic data that alternates between 80 and 120 (shown below), we expect a result of 100, \n",
+    "but the AUC metric returns 102.2222 due to these interpolation differences."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>id</th>\n",
+       "      <th>hourly_auc</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>subject1</td>\n",
+       "      <td>102.222222</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "         id  hourly_auc\n",
+       "1  subject1  102.222222"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "hours = 1\n",
+    "dt0 = 5\n",
+    "samples = int(hours*60/dt0)\n",
+    "times = pd.date_range('2020-01-01', periods=samples, freq=f\"{dt0}min\")\n",
+    "glucose_values = [80,120]* int(samples/2)\n",
+    "\n",
+    "syntheticdata = pd.DataFrame({\n",
+    "    'id': ['subject1'] * samples,\n",
+    "    'time': times,\n",
+    "    'gl': glucose_values\n",
+    "})\n",
     "\n",
+    "synthetic_iglu_auc_results  = iglu_py.auc(syntheticdata)\n",
+    "synthetic_iglu_auc_results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## UTC timezone \n",
     "Now, lets try to localize to UTC timezone. "
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 8,
    "metadata": {},
    "outputs": [
     {
@@ -265,12 +358,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Midday test\n",
     "Lets try with a 4 measurement at 10am. On 5 min grid, 10am measurement has to be 10*(60/5)=120 position. "
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [
     {
@@ -356,7 +450,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 10,
    "metadata": {},
    "outputs": [
     {
@@ -402,12 +496,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Midnight test with UTC\n",
+    "\n",
     "Lets look now on data that spans two consecutive days"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 19,
    "metadata": {},
    "outputs": [
     {
@@ -525,7 +621,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 22,
    "metadata": {},
    "outputs": [
     {
@@ -537,6 +633,9 @@
       "5.0\n",
       "[[155. 160. 165.  nan  nan]\n",
       " [155. 160. 165.  nan  nan]\n",
+      " [ nan  nan  nan  nan  nan]]\n",
+      "[[ nan  nan  nan  nan 150.]\n",
+      " [ nan  nan  nan  nan  nan]\n",
       " [ nan  nan  nan  nan  nan]]\n"
      ]
     }
@@ -552,26 +651,28 @@
     "print(actual_dates)     # expected [datetime.date(2020, 1, 1)]\n",
     "print(dt0)              # expected 5\n",
     "\n",
-    "print(gd2d[:,0:5])      # expected [[150. 155. 160. 165. nan]]"
+    "print(gd2d[:,0:5])      # expected [[150. 155. 160. 165. nan]]\n",
+    "print(gd2d[:,283:])"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Note:** gd2d.shape is (3,288) instead of expected (2,288) and start date shifted to 2019-12-31"
+    "**Note:** gd2d.shape is (3,288) instead of the expected (2,288), the second-day sample is shifted to the first day, and the start date is shifted to 2019-12-31"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Cross over midnight with UTC\n",
     "Lets test two-days records that cross over midnight  "
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 17,
    "metadata": {},
    "outputs": [
     {
@@ -689,7 +790,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 18,
    "metadata": {},
    "outputs": [
     {
@@ -699,8 +800,10 @@
       "(2, 288)\n",
       "[Timestamp('2019-12-31 00:00:00'), Timestamp('2020-01-01 00:00:00')]\n",
       "5.0\n",
-      "[[ nan  nan  nan  nan  nan]\n",
-      " [175. 180. 185.  nan  nan]]\n"
+      "[[ nan  nan  nan 150. 155. 160. 165. 170.]\n",
+      " [ nan  nan  nan  nan  nan  nan  nan  nan]]\n",
+      "[[ nan  nan  nan  nan  nan  nan  nan  nan]\n",
+      " [175. 180. 185.  nan  nan  nan  nan  nan]]\n"
      ]
     }
    ],
@@ -715,14 +818,15 @@
     "print(actual_dates)     # expected [datetime.date(2020, 1, 1)]\n",
     "print(dt0)              # expected 5\n",
     "\n",
-    "print(gd2d[:,0:5])      # expected [[150. 155. 160. 165. nan]]"
+    "print(gd2d[:,280:]) \n",
+    "print(gd2d[:,:8])      # expected [[150. 155. 160. 165. nan]]"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Note:** Now we have (as expected) gd2d.shape==(2, 288), but midnight measurement shifted to a previous day."
+    "**Note:** Now we have (as expected) gd2d.shape==(2, 288), but the midnight measurement is shifted to the previous day and 2020-01-02 is missing from actual_dates"
    ]
   },
   {
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "iglu_python"
-version = "0.1.5"
+version = "0.1.6"
 description = "Python implementation of the iglu package for continuous glucose monitoring data analysis"
 readme = "README.md"
 requires-python = ">=3.11"