| Title: | Impute Missing Glucose Values in CGM Data |
|---|---|
| Description: | Imputes missing glucose values in repeated-measures continuous glucose monitoring (CGM) data. Workflows create time-series features from raw timestamps, support model selection, and return the user's original columns plus an imputed glucose column. Methods include multiple imputation by chained equations (MICE; Azur et al. (2011) <doi:10.1002/mpr.329>), Random Forest regression (Breiman (2001) <doi:10.1023/A:1010933404324>), k-nearest-neighbor regression (Zhang (2016) <doi:10.21037/atm.2016.03.37>), XGBoost (Chen and Guestrin (2016) <doi:10.1145/2939672.2939785>), LightGBM (Ke et al. (2017) <https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision>), and ARIMA forecasting with the forecast framework (Hyndman and Khandakar (2008) <doi:10.18637/jss.v027.i03>). A Python-compatible backend uses 'reticulate' to call 'pandas', 'scikit-learn', 'statsmodels', Python 'xgboost', and optional Python 'lightgbm'. |
| Authors: | Shubh Saraswat [cre, aut, cph] (ORCID: <https://orcid.org/0009-0009-2359-1484>), Hasin Shahed Shad [aut], Xiaohua Douglas Zhang [aut] (ORCID: <https://orcid.org/0000-0002-2486-7931>) |
| Maintainer: | Shubh Saraswat <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.0.2 |
| Built: | 2026-05-29 21:56:16 UTC |
| Source: | https://github.com/zhanglabuky/cgmmissingdatar |
A small multi-subject CGM dataset intended for real missing-value imputation examples. It contains 50 deterministic missing glucose values.
CGMExmplDat10Pct
A data frame with 500 rows and 5 variables:
Numeric subject identifier.
Synthetic sex of the subject.
Laboratory Observed Result for Glucose (numeric), with deterministic missing values.
Raw timestamp in yyyy:mm:dd:hh:nn format.
Synthetic age in years.
Synthetic HbA1c value.
data("CGMExmplDat10Pct")
A small multi-subject CGM dataset intended for real missing-value imputation examples. It contains 50 deterministic missing glucose values.
CGMExmplDat5Pct
A data frame with 500 rows and 5 variables:
Numeric subject identifier.
Synthetic sex of the subject.
Laboratory Observed Result for Glucose (numeric), with deterministic missing values.
Raw timestamp in yyyy:mm:dd:hh:nn format.
Synthetic age in years.
Synthetic HbA1c value.
data("CGMExmplDat5Pct")
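The raw timestamp format used by both example datasets can be parsed in base R. The following is a minimal sketch, assuming that "nn" in yyyy:mm:dd:hh:nn denotes minutes; the timestamp strings are made up for illustration and are not taken from the package data.

```r
# Hypothetical raw timestamps in yyyy:mm:dd:hh:nn form (illustrative only).
raw_time <- c("2024:01:15:08:00", "2024:01:15:08:05", "2024:01:15:08:15")

# Assumption: "nn" means minutes, so the strptime-style format is %Y:%m:%d:%H:%M.
parsed <- as.POSIXct(raw_time, format = "%Y:%m:%d:%H:%M", tz = "UTC")

# Spacing between consecutive readings, in minutes.
gap_minutes <- as.numeric(diff(parsed), units = "mins")
print(gap_minutes)
```

A gap larger than the expected reading interval indicates an implicit missing reading between two observed timestamps.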
Launches a Shiny app for uploading a CGM data file, selecting the target,
subject, timestamp, and feature columns, running
run_missing_glucose_imputation(), previewing the imputed data, and
downloading the completed data as a CSV file.
run_app()
Invisibly returns the result of shiny::runApp().
## Not run:
# Run the CGMmissingDataR Shiny app
run_app()
## End(Not run)
Imputes missing glucose values in continuous glucose monitoring (CGM) data.
The function handles both explicit missing glucose values already coded as
NA and implicit missing readings caused by timestamp gaps. Before
imputation, each subject is regularized to an equal interval_minutes
timestamp grid; missing timestamp gaps are converted into explicit rows with
target_col = NA, then imputed using the selected backend and final
imputation method.
run_missing_glucose_imputation(
  data,
  target_col,
  feature_cols = NULL,
  id_col = "USUBJID",
  time_col = "Time",
  time_format = "yyyy:mm:dd:hh:nn",
  time_unit = "minute",
  models = "auto",
  rf_n_estimators = 200,
  knn_k = 7,
  xgb_nrounds = 300,
  lgb_nrounds = 400,
  n_threads = 1L,
  arima_order = c(4L, 1L, 0L),
  seed = 42,
  lag_k = c(1L, 2L, 3L),
  add_rollmean = TRUE,
  roll_window = 3L,
  interval_minutes = 5L,
  missing_warning_threshold = 0.2,
  study_start = NULL,
  study_end = NULL,
  use_arima_if_missing_leq = 0.05,
  arima_min_history = 20L,
  imputer_backend = c("mice", "sklearn"),
  export = FALSE
)
data |
A data.frame, an object coercible to data.frame, or a path to a CSV file. |
target_col |
Single character string: target glucose column with
missing values to impute. Python default name is |
feature_cols |
Optional character vector of base feature columns. If
|
id_col |
Character string: subject identifier column. Python default
name is |
time_col |
Character string: raw timestamp column. Python default name
is |
time_format |
Retained for compatibility with the old R function. The Python-engine path uses pandas timestamp parsing. |
time_unit |
Retained for compatibility with the old R function and not used by the strict Python-engine path. |
models |
Final real-imputation method selector. Use |
rf_n_estimators |
Integer number of Random Forest trees. Used when
|
knn_k |
Integer number of nearest neighbors. Used when
|
xgb_nrounds |
Integer number of XGBoost boosting rounds. Used when
|
lgb_nrounds |
Integer number of LightGBM boosting rounds. Used when
|
n_threads |
Integer number of model-fitting threads for engines that
support thread controls. The default |
arima_order |
Integer vector of length 3. Python default is
|
seed |
Integer seed for reproducible MICE, tree-based models, and the Python-compatible backend. Default is 42. |
lag_k |
Integer vector of target lags to compute. Python default is
|
add_rollmean |
Logical: add rolling mean of prior target values. Python
always adds this; setting |
roll_window |
Integer rolling mean window. Python default is 3. |
interval_minutes |
Expected spacing, in minutes, between consecutive CGM
readings. The default is |
missing_warning_threshold |
Numeric value between 0 and 1. If the
missingness rate in |
study_start |
Optional study start timestamp. If supplied, the function reports subjects whose first observed CGM timestamp occurs after this time. Leading study time is not imputed. |
study_end |
Optional study end timestamp. If supplied, the function reports subjects whose last observed CGM timestamp occurs before this time. Trailing study time is not imputed. |
use_arima_if_missing_leq |
Numeric missing-rate threshold used only when
|
arima_min_history |
Minimum number of prior observations required before fitting ARIMA for a missing segment. Python default is 20. |
imputer_backend |
One of |
export |
Logical; if |
The imputation workflow first parses and sorts timestamps within each subject.
Each subject is regularized to an equal interval_minutes grid. If a reading
is missing because the timestamp is absent from the input data, a new row is
inserted and the target glucose value is set to NA. These inserted missing
values are then imputed using the same workflow as explicit NA values. The
deterministic interval grid is controlled by this package; CGManalyzer's
equal-interval helper is called internally for workflow consistency.
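The regularization step can be illustrated in base R. This is a rough sketch of the idea only, not the package's internal code: the helper name regularize_subject is hypothetical, the toy data are invented, and the grid is assumed to run from each subject's first to last observed timestamp.

```r
# Toy input (hypothetical): subject 1 has no reading at 08:10.
cgm <- data.frame(
  USUBJID = c(1, 1, 1, 2, 2),
  Time = as.POSIXct(
    c("2024-01-15 08:00", "2024-01-15 08:05", "2024-01-15 08:15",
      "2024-01-15 08:00", "2024-01-15 08:05"),
    tz = "UTC"
  ),
  LBORRES = c(110, 112, 118, 95, 97)
)
interval_minutes <- 5

# Hypothetical helper: expand one subject onto an equal-interval grid.
regularize_subject <- function(df, interval_minutes) {
  grid <- seq(min(df$Time), max(df$Time), by = interval_minutes * 60)
  # NA where a grid timestamp has no matching reading in the input.
  idx <- match(as.numeric(grid), as.numeric(df$Time))
  data.frame(USUBJID = df$USUBJID[1], Time = grid, LBORRES = df$LBORRES[idx])
}

regular <- do.call(
  rbind,
  lapply(split(cgm, cgm$USUBJID), regularize_subject,
         interval_minutes = interval_minutes)
)
rownames(regular) <- NULL
print(regular)
```

In this toy case a row for 08:10 is inserted for subject 1 with LBORRES set to NA, while subject 2 is unchanged.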
Internally, the function creates time features, lag features, and rolling-mean
features to support imputation. MICE first completes the target and feature
matrix. The selected final method then fills the missing glucose positions in
imputed_glucose_value: either by segmentwise ARIMA or by a supervised model
trained on observed glucose values and the MICE-completed feature matrix.
These engineered columns are used only during model fitting and are removed
from the returned data frame.
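To make the engineered features concrete, the sketch below builds lag features and a rolling mean of prior target values for a single subject in base R. The helper names lag_vec and roll_prior_mean are hypothetical and the values are invented; the package's own feature code may differ in detail.

```r
# Hypothetical glucose series for one subject, already sorted by time.
glucose <- c(100, 104, 108, 110, 115, 120)

# Value k steps earlier, padded with NA at the start (assumes k < length(x)).
lag_vec <- function(x, k) c(rep(NA_real_, k), head(x, -k))

lag_k <- c(1L, 2L, 3L)
lags <- sapply(lag_k, function(k) lag_vec(glucose, k))
colnames(lags) <- paste0("lag", lag_k)

# Mean of the `window` values strictly before position i (current value excluded).
roll_prior_mean <- function(x, window) {
  vapply(seq_along(x), function(i) {
    if (i <= window) return(NA_real_)
    mean(x[(i - window):(i - 1)])
  }, numeric(1))
}

roll_window <- 3L
features <- data.frame(glucose, lags,
                       rollmean_prior = roll_prior_mean(glucose, roll_window))
print(features)
```

With several subjects, these computations would be applied within each subject so that lags never cross subject boundaries.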
imputed_glucose_value is returned as a continuous numeric model estimate.
Users who require whole-number glucose values for reporting can round this
column after imputation.
Missingness warnings are based on the data after timestamp-gap
regularization, so both explicit NA glucose values and rows created from
timestamp gaps contribute to the reported missingness rate. The function also
warns when long contiguous missing blocks of at least 12 or 24 hours are
detected. If study_start or study_end is supplied, leading or trailing
study-period coverage gaps are reported but are not imputed.
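The checks described above can be approximated with base R run-length encoding. This is a sketch under assumptions: the series is invented, and whether the package compares the missingness rate with a strict or non-strict inequality is not stated here, so the comparison below is only illustrative.

```r
# Hypothetical regularized series at 5-minute spacing.
interval_minutes <- 5
glucose <- c(rep(100, 50), rep(NA, 150), rep(105, 100))  # 150 gaps = 12.5 hours

# Missingness rate after regularization (explicit NA and inserted rows alike).
missing_rate <- mean(is.na(glucose))

# Lengths of contiguous missing blocks, converted to hours.
runs <- rle(is.na(glucose))
block_hours <- runs$lengths[runs$values] * interval_minutes / 60
n_blocks_12h <- sum(block_hours >= 12)
n_blocks_24h <- sum(block_hours >= 24)

missing_warning_threshold <- 0.2
# Assumption: a strict ">" comparison; the package may use ">=" instead.
if (missing_rate > missing_warning_threshold) {
  message("Missingness rate ", missing_rate, " exceeds the threshold.")
}
```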
A data.frame containing the original user-supplied columns plus
imputed_glucose_value, the completed glucose column. The original target
column is left unchanged, so values that were originally missing or created
from timestamp gaps remain NA in target_col, while their completed
values are stored in imputed_glucose_value.
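Because the original target column is left unchanged, the rows that were imputed can be identified by comparing the two columns. The sketch below uses a small invented data frame shaped like the returned value (it does not call the package), and also shows optional rounding for reporting; the column name imputed_rounded is hypothetical.

```r
# Hypothetical result shaped like the returned data frame.
out <- data.frame(
  USUBJID = 1,
  LBORRES = c(110, NA, 118),
  imputed_glucose_value = c(110, 114.3, 118)
)

# Rows whose glucose value was missing in the input or created from a gap.
was_imputed <- is.na(out$LBORRES)

# Optional: whole-number values for reporting (hypothetical column name).
out$imputed_rounded <- round(out$imputed_glucose_value)

print(out[was_imputed, ])
```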
data("CGMExmplDat5Pct")
out <- run_missing_glucose_imputation(
  CGMExmplDat5Pct,
  target_col = "LBORRES",
  feature_cols = c("AGE", "hba1c"),
  id_col = "USUBJID",
  time_col = "Time",
  imputer_backend = "mice"
)
head(subset(out, is.na(LBORRES)))
This function is deprecated. Use
run_missing_glucose_imputation() for real missing glucose values.
This function benchmarks imputation under artificial missingness: it masks the target column at several rates and evaluates the imputation and predictive performance of the MICE, Random Forest, and KNN methods. It also adds LAG features of the target variable so that their effect on imputation and prediction can be assessed. The function returns a data.frame summarizing the Mask Rate, Method, MRD (Mean Relative Difference), and Masked Count for each method and mask rate.
run_missingness_benchmark(
  data,
  target_col,
  feature_cols = NULL,
  id_col = "USUBJID",
  time_col = "TimeSeries",
  mask_rates = c(0.05, 0.1, 0.2, 0.3, 0.4),
  mask_type = c("random", "block"),
  rf_n_estimators = 400,
  knn_k = 7,
  seed = 42,
  lag_k = c(1, 2, 3),
  add_rollmean = TRUE,
  roll_window = 3
)
data |
A data.frame (or object coercible to data.frame), OR a path to a CSV file. |
target_col |
Single character string: name of the outcome column to mask/impute (e.g., "LBORRES", "Glucose"). |
feature_cols |
Character vector of base feature columns (excluding the target).
If NULL, uses all columns except |
id_col |
Character string: subject identifier column used for LAG features (default "USUBJID"). |
time_col |
Character string: time-ordering column used for LAG features (default "TimeSeries"). |
mask_rates |
Numeric vector in (0, 1): fraction of rows to mask (default 0.05, 0.10, 0.20, 0.30, 0.40). |
mask_type |
One of |
rf_n_estimators |
Integer: number of trees for random forest (default 400). |
knn_k |
Integer: number of neighbors for kNN (default 7). |
seed |
Integer: random seed used for MICE and models (default 42). |
lag_k |
Integer vector of lags to compute on the target (default c(1,2,3)). |
add_rollmean |
Logical: add rolling mean feature of prior target values (default TRUE). |
roll_window |
Integer: rolling window length for rollmean (default 3). |
LAG features are computed using data.table::shift() (fast lag/lead). The rolling mean
is computed with data.table::frollmean() using align="right" and fill=NA.
A data.frame with columns: MaskRate, Method, MRD, MaskedCount.
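The shape of this result can be illustrated with a small base R sketch of random masking. Two assumptions are made loudly here: MRD is taken to be the mean of |imputed - actual| / |actual| over the masked positions, which this documentation does not define precisely, and a simple mean fill stands in for the MICE, Random Forest, and KNN methods. The data are synthetic.

```r
set.seed(42)

# Synthetic glucose-like series with no missing values.
truth <- 100 + 20 * sin(seq_len(200) / 10)
mask_rates <- c(0.05, 0.10, 0.20)

# Assumed definition of MRD: mean relative absolute difference.
mrd <- function(imputed, actual) mean(abs(imputed - actual) / abs(actual))

results <- do.call(rbind, lapply(mask_rates, function(rate) {
  n_mask <- round(rate * length(truth))
  masked_idx <- sample(length(truth), n_mask)  # random masking

  observed <- truth
  observed[masked_idx] <- NA

  # Stand-in imputer (not one of the package's methods): fill with the observed mean.
  fill_value <- mean(observed, na.rm = TRUE)
  imputed_at_mask <- rep(fill_value, n_mask)

  data.frame(
    MaskRate = rate,
    Method = "mean fill (stand-in)",
    MRD = mrd(imputed_at_mask, truth[masked_idx]),
    MaskedCount = n_mask
  )
}))
print(results)
```

Block masking would replace the call to sample() with the selection of contiguous index ranges, which typically makes the task harder for methods that rely on lag features.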