| Title: | National Institutes of Health Brain Development Cohorts Data Hub Tools |
|---|---|
| Description: | A suite of functions to work with data from the National Institutes of Health Brain Development Cohorts Data Hub. The package provides tools to create, clean, process, and filter datasets and associated metadata. These utilities are intended to simplify reproducible data-preparation for future research. |
| Authors: | Le Zhang [aut, cre] (ORCID: <https://orcid.org/0009-0008-0205-2150>), Janosch Linkersdoerfer [aut] (ORCID: <https://orcid.org/0000-0002-1577-1233>) |
| Maintainer: | Le Zhang <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.1.0 |
| Built: | 2026-05-23 09:26:20 UTC |
| Source: | https://github.com/cran/NBDCtools |
This function allows users to add custom metadata (data dictionary, levels table, sessions table) to the package environment. This can be useful for users who want to use their own metadata instead of the ones provided by the package, or for testing and development purposes.
add_custom_metadata(dd = NULL, levels = NULL, sessions = NULL)
dd |
data frame. Custom data dictionary. Should have the same structure as the data dictionary provided by the package. |
levels |
data frame. Custom levels table. |
sessions |
data frame. Custom sessions table. |
The custom metadata will be stored in the package environment and can be
accessed with any function that has a release argument by setting
release = "custom". For example,
get_dd(study = "abcd", release = "custom").
The default value for dd, levels, and sessions is NULL.
If any of them is not NULL, it will be added to the package environment.
If the argument is NULL, the corresponding metadata will not be added.
Invisibly returns TRUE.
add_custom_metadata(
  dd = tibble::tibble(
    name = "var1",
    table_name = "table1",
    identifier_columns = "participant_id"
  )
)
get_dd(study = "abcd", release = "custom")
This function renames columns in a data frame to another type of column name specified in the data dictionary.
For example, this can be used to convert the ABCD column names introduced in
the 6.0 release to the previously used column names. If you instead want to
convert the column names in a file, use convert_names_file().
Note: Please use this function with caution and make sure that the data in
the converted column is equivalent to the data in the original column. Also,
please make sure that the names can be mapped one-to-one. Some variables in
the ABCD data dictionary have been collapsed from previous releases and thus
might have multiple names in the name_to column that map to a single name
(see skip_sep_check argument below).
convert_names_data(
  data,
  dd,
  name_from = "name",
  name_to,
  ignore_cols = union(get_id_cols_abcd(), get_id_cols_hbcd()),
  skip_sep_check = FALSE
)
data |
tibble. The input data frame with columns to be renamed. |
dd |
tibble. The data dictionary table. One can use |
name_from |
character. The column name type in the data dictionary that
the columns in |
name_to |
character. The column name type in the data dictionary
that the columns in |
ignore_cols |
character vector. The columns to ignore (Default: identifier columns used in ABCD and HBCD). |
skip_sep_check |
logical. Whether to skip the check for
For columns with multiple names, it is recommended to use functions like
If |
tibble. The data with renamed column names.
## Not run:
# rename columns to previous ABCD names used by NDA
convert_names_data(
  data,
  dd = get_dd("abcd"),
  name_from = "name",
  name_to = "name_nda"
)

# rename columns to Stata names
convert_names_data(
  data,
  dd = get_dd("abcd"),
  name_from = "name",
  name_to = "name_stata"
)
## End(Not run)
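The renaming idea and the one-to-one caveat described above can be illustrated with a small base R sketch. This is not the package's implementation: `rename_from_dd()`, the toy dictionary, and the `name_old` column are hypothetical and exist only for illustration.

```r
# Hypothetical helper (not part of NBDCtools): rename columns using a
# lookup built from two columns of a dictionary-like data frame.
rename_from_dd <- function(data, dd, name_from, name_to,
                           ignore_cols = character()) {
  # Only consider dictionary rows whose source name occurs in the data
  cols <- setdiff(names(data), ignore_cols)
  map <- dd[dd[[name_from]] %in% cols, c(name_from, name_to)]
  # Refuse ambiguous mappings where names collide on either side
  if (anyDuplicated(map[[name_from]]) > 0 || anyDuplicated(map[[name_to]]) > 0) {
    stop("Names cannot be mapped one-to-one.")
  }
  idx <- match(map[[name_from]], names(data))
  names(data)[idx] <- map[[name_to]]
  data
}

# Toy dictionary and data (made-up names)
dd_toy <- data.frame(
  name = c("var_a", "var_b"),
  name_old = c("old_a", "old_b"),
  stringsAsFactors = FALSE
)
data_toy <- data.frame(participant_id = "sub-001", var_a = 1, var_b = 2,
                       stringsAsFactors = FALSE)

renamed <- rename_from_dd(data_toy, dd_toy, "name", "name_old",
                          ignore_cols = "participant_id")
names(renamed)
```

The explicit duplicate check is one way to enforce the one-to-one requirement; how the package itself handles collapsed variables is governed by its skip_sep_check argument.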
This function replaces all matched column names in a file with another type of column name specified in the data dictionary.
For example, this function can be used to convert script files that specified
previously used column names to the ABCD column names introduced in the
6.0 release. If you instead want to convert the column names in a data frame,
use convert_names_data().
Note: Please use this function with caution and make sure that the data in
the converted column is equivalent to the data in the original column. Also,
please make sure that the names can be mapped one-to-one. Some variables in
the ABCD data dictionary have been collapsed from previous releases and thus
might have multiple names in the name_from column that map to a single name
(see skip_sep_check argument below).
convert_names_file(
  file_in,
  file_out = NULL,
  dd,
  name_from,
  name_to,
  skip_sep_check = FALSE
)
file_in |
character. The input file path. |
file_out |
character. The output file path. If not provided, defaults to the input file path with a "_converted" suffix. |
dd |
tibble. The data dictionary table. One can use |
name_from |
character. The column name type in the data dictionary that
the columns in |
name_to |
character. The column name type in the data dictionary
that the columns in |
skip_sep_check |
logical. Whether to skip the check for
For columns with multiple names, it is recommended to use functions like
If |
The function uses regex word boundaries (\\b) to match the names in the
file. This ensures exact word matches and prevents partial matches within
larger words. For example, matching "age" will not match "cage" or "page".
The data dictionary returned by get_dd() is large, and the function loops
through all the names it contains. If only a few names need to be replaced,
it is best to trim the data dictionary to those names before using this
function.
character. The path to the output file with converted names (returned invisibly).
## Not run:
convert_names_file(
  file_in = "analysis_script.R",
  dd = get_dd("abcd"),
  name_from = "name_nda",
  name_to = "name"
)

# Specify custom output file
convert_names_file(
  file_in = "analysis_script.py",
  file_out = "analysis_script_new.py",
  dd = get_dd("abcd"),
  name_from = "name_nda",
  name_to = "name"
)
## End(Not run)
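The word-boundary matching described for convert_names_file() can be sketched in base R. This is an illustration under assumptions, not the package's code: the `mapping` vector, the script lines, and `convert_lines()` are hypothetical.

```r
# Hypothetical mapping from old names to new names (not a real dictionary)
mapping <- c(age = "visit_age", sex = "participant_sex")

script_lines <- c(
  "model <- lm(score ~ age + sex, data = cage_data)",
  "page_age <- data$age"
)

convert_lines <- function(lines, mapping) {
  for (old in names(mapping)) {
    # \\b anchors the pattern at word boundaries, so "age" inside
    # "cage_data" or "page_age" is left untouched
    pattern <- paste0("\\b", old, "\\b")
    lines <- gsub(pattern, mapping[[old]], lines, perl = TRUE)
  }
  lines
}

converted <- convert_lines(script_lines, mapping)
converted
```

Because the loop runs once per name, the cost grows with the number of names in the mapping, which is why trimming the dictionary first helps. Note that an underscore counts as a word character, so names embedded in longer snake_case identifiers are not replaced.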
Creates a Brain Imaging Data Structure (BIDS) JSON sidecar file from the metadata (data dictionary and levels table). Returns the JSON object or writes it to a file.
create_bids_sidecar_data(
  data,
  study,
  release = "latest",
  var_coding = "values",
  metadata_description = "Dataset exported using NBDCtools",
  path_out = NULL,
  pretty = TRUE
)
data |
tibble. The raw data or data with labels, see
|
study |
character. NBDC study (One of |
release |
character. Release version (Default: |
var_coding |
character. the variable coding, one of "values", "labels".
If the data is processed with |
metadata_description |
string, the description of the metadata |
path_out |
character. the path to the output file.
If |
pretty |
logical. Whether to pretty print the json. |
If you have a labelled dataset, and want to create data specific
BIDS sidecar with variable levels from the data, please use
create_bids_sidecar_data().
If you want to create a BIDS sidecar without the underlying data,
please use create_bids_sidecar_metadata().
The JSON object or the path to the JSON file.
create_bids_sidecar_metadata()
## Not run:
data |>
  create_bids_sidecar_data()

data |>
  create_bids_sidecar_data(path_out = "data.json")
## End(Not run)
Generates a Brain Imaging Data Structure (BIDS) JSON sidecar using metadata tables (data dictionary and levels) without requiring the underlying data.
create_bids_sidecar_metadata(
  dd,
  levels,
  vars = NULL,
  tables = NULL,
  metadata_description = "Dataset exported using NBDCtools",
  path_out = NULL,
  pretty = TRUE
)
dd |
tibble, Data dictionary metadata, see |
levels |
tibble, Levels metadata corresponding to |
vars |
character vector, variable names to include. |
tables |
character vector, table names to include. |
metadata_description |
string, the description of the metadata |
path_out |
character. the path to the output file.
If |
pretty |
logical. Whether to pretty print the json. |
If you have a labelled dataset, and want to create data specific
BIDS sidecar with variable levels from the data, please use
create_bids_sidecar_data().
If you want to create a BIDS sidecar without the underlying data,
please use create_bids_sidecar_metadata().
Either a JSON string (when path_out is NULL) or the output
file path (invisibly) after writing the JSON sidecar to disk.
create_bids_sidecar_metadata(
  dd = get_dd("abcd"),
  levels = get_levels("abcd"),
  tables = c("ph_y_mctq")
)
This high-level function simplifies the process of creating a dataset from
the ABCD or HBCD Study data by allowing users to create an analysis-ready
dataset in a single step. It executes the lower-level functions provided in
the NBDCtools package in sequence to load, join, and transform the data.
The function expects study data to be stored as one .parquet or .tsv file
per database table within a specified directory, provided as dir_data.
Variables specified in vars and tables will be full-joined together,
while variables specified in vars_add and tables_add will be left-joined
to these variables. For more details, see join_tabulated().
In addition to the main create_dataset() function, there are two
study-specific variations:
create_dataset_abcd(): for the ABCD study.
create_dataset_hbcd(): for the HBCD study.
They have the same arguments as the create_dataset() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
create_dataset(
  dir_data,
  study,
  vars = NULL,
  tables = NULL,
  vars_add = NULL,
  tables_add = NULL,
  release = "latest",
  format = "parquet",
  bypass_ram_check = FALSE,
  ignore_version_mismatch = FALSE,
  categ_to_factor = TRUE,
  add_labels = TRUE,
  value_to_label = FALSE,
  value_to_na = FALSE,
  time_to_hms = FALSE,
  bind_shadow = FALSE,
  ...
)

create_dataset_abcd(...)

create_dataset_hbcd(...)
dir_data |
character. Path to the directory with the data files in
|
study |
character. NBDC study (One of |
vars |
character (vector). Name(s) of variable(s) to be joined.
(Default: |
tables |
character (vector). Name(s) of table(s) to be joined (Default:
|
vars_add |
character (vector). Name(s) of additional variable(s) to be
left-joined to the variables selected in |
tables_add |
character (vector). Name(s) of additional table(s) to be
left-joined to the variables selected in |
release |
character. Release version (Default: |
format |
character. Data format (One of |
bypass_ram_check |
logical. If This argument is only used for the ABCD study, as the HBCD data is small enough to be loaded without RAM issues with most personal computers. As HBCD data grows in the future, this may change. |
ignore_version_mismatch |
logical. Whether to ignore version mismatch
between data files and metadata (dd, levels, etc)
and proceed with joining anyway (default: The function performs a version check by reading a specific file from the
|
categ_to_factor |
logical. Whether to convert categorical
variables to the factor class, see |
add_labels |
logical. Whether to add variable and value labels to the
variables, see |
value_to_label |
logical. Whether to convert the categorical
variables' numeric values to labels, see |
value_to_na |
logical. Whether to convert categorical
missingness/non-response codes to |
time_to_hms |
logical. Whether to convert time variables to
|
bind_shadow |
logical. Whether to bind the shadow matrix to the
dataset (Default: |
... |
additional arguments passed to downstream functions after
the |
This high-level function executes the different steps in the following order:
1. Read the data/shadow matrix using join_tabulated().
2. Convert categorical variables to factors using transf_factor().
3. Add labels to the variables and values using transf_label().
4. Convert categorical variables' numeric values to labels using
   transf_value_to_label().
5. Convert categorical missingness/non-response codes to NA using
   transf_value_to_na().
6. Convert time variables to hms class using transf_time_to_hms().
7. If bind_shadow is TRUE and the study is "HBCD", replace the missing
   values in the shadow matrix that result from joining multiple datasets
   using shadow_replace_binding_missing().
8. Bind the shadow matrix to the data using shadow_bind_data().
Not all steps are executed by default. The above order represents the maximal order of execution.
If bind_shadow is TRUE, the shadow matrix will be added to the data using
shadow_bind_data().
HBCD study: For the HBCD study, this function uses the shadow matrix
from the dir_data directory by default (the HBCD Study releases a
_shadow.parquet/_shadow.tsv file per table that accompanies the data).
Alternatively, one can set naniar_shadow = TRUE as part of the ...
arguments to use naniar::as_shadow() to create a shadow matrix from the
data.
ABCD study: The ABCD Study does not currently release shadow
matrices. If bind_shadow is set to TRUE, the function will create the
shadow matrix from the data using naniar::as_shadow(); no extra
naniar_shadow = TRUE argument is needed.
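To make the idea of a shadow matrix concrete, here is a minimal base R sketch that derives one from the data and binds it to the original columns. It only approximates what naniar::as_shadow() produces: the `_NA` suffix and the "!NA"/"NA" levels are assumed to mirror naniar's conventions, and `make_shadow()` is a hypothetical helper. It does not reproduce the shadow files released by the HBCD Study.

```r
data_toy <- data.frame(
  participant_id = c("sub-001", "sub-002", "sub-003"),
  var1 = c(1, NA, 3),
  var2 = c(NA, "b", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one shadow column per variable, marking each
# cell as missing ("NA") or observed ("!NA")
make_shadow <- function(data) {
  shadow <- lapply(data, function(x) {
    factor(ifelse(is.na(x), "NA", "!NA"), levels = c("!NA", "NA"))
  })
  names(shadow) <- paste0(names(data), "_NA")
  as.data.frame(shadow, stringsAsFactors = FALSE)
}

shadow <- make_shadow(data_toy)

# Binding places the shadow columns next to the data columns
bound <- cbind(data_toy, shadow)
names(bound)
```

A shadow derived this way can only say whether a value is missing; a shadow file shipped with the data may carry more information about why it is missing.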
A tibble with the analysis-ready dataset.
## Not run:
# most common use case
create_dataset(
  dir_data = "6_0/data",
  study = "abcd",
  vars = c("var1", "var2", "var3")
)

# to handle tagged missingness
create_dataset(
  dir_data = "1_0/data",
  study = "hbcd",
  vars = c("var1", "var2", "var3"),
  value_to_na = TRUE
)

# to bind shadow matrices to the data
create_dataset(
  dir_data = "1_0/data/",
  study = "hbcd",
  vars = c("var1", "var2", "var3"),
  bind_shadow = TRUE
)

# to use the additional arguments
# for example in `value_to_na` option, the underlying function
# `transf_value_to_na()` has 2 more arguments,
# which can be passed to the `create_dataset()` function
create_dataset(
  dir_data = "6_0/data",
  study = "abcd",
  vars = c("var1", "var2", "var3"),
  value_to_na = TRUE,
  missing_codes = c("999", "888", "777", "666", "555", "444", "333", "222"),
  ignore_col_pattern = "__dk$|__dk__l$"
)

# use study specific functions
create_dataset_abcd(
  dir_data = "6_0/data",
  vars = c("var1", "var2", "var3")
)
## End(Not run)
This function filters out columns that are empty.
filter_empty_cols(
  data,
  id_cols = union(get_id_cols_abcd(), get_id_cols_hbcd())
)
data |
tibble. The data to be filtered. |
id_cols |
character (vector). The names of the ID columns to be excluded from the filtering (Default: identifier columns used in ABCD and HBCD). |
A tibble with the filtered data.
data <- tibble::tibble(
  participant_id = c("sub-001", "sub-002", "sub-003"),
  session_id = c("ses-001", "ses-001", "ses-002"),
  var1 = c(NA, NA, NA),
  var2 = c(NA, NA, 2),
  var3 = c(NA, NA, 3)
)
filter_empty_cols(data)
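For readers who want to see the logic spelled out, the following base R sketch drops columns whose values are all NA while always keeping the ID columns. It is an assumption about the behavior described above, not the package's implementation; `drop_empty_cols()` and its default `id_cols` are hypothetical.

```r
# Hypothetical helper: drop columns that contain only NA values,
# never dropping the ID columns
drop_empty_cols <- function(data, id_cols = c("participant_id", "session_id")) {
  is_empty <- vapply(data, function(x) all(is.na(x)), logical(1))
  # ID columns are excluded from the filtering
  is_empty[names(data) %in% id_cols] <- FALSE
  data[, !is_empty, drop = FALSE]
}

toy_cols <- data.frame(
  participant_id = c("sub-001", "sub-002", "sub-003"),
  session_id = c("ses-001", "ses-001", "ses-002"),
  var1 = c(NA, NA, NA),
  var2 = c(NA, NA, 2),
  stringsAsFactors = FALSE
)

kept_cols <- drop_empty_cols(toy_cols)
names(kept_cols)
```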
This function filters out rows that are empty.
filter_empty_rows(
  data,
  id_cols = union(get_id_cols_abcd(), get_id_cols_hbcd())
)
data |
tibble. The data to be filtered. |
id_cols |
character (vector). The names of the ID columns to be excluded from the filtering (Default: identifier columns used in ABCD and HBCD). |
A tibble with the filtered data.
data <- tibble::tibble(
  participant_id = c("sub-001", "sub-002", "sub-003"),
  session_id = c("ses-001", "ses-001", "ses-002"),
  var1 = c(NA, NA, 1),
  var2 = c(NA, NA, 2),
  var3 = c(NA, NA, 3)
)
filter_empty_rows(data)
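The row-wise counterpart can be sketched the same way: a row counts as empty when every non-ID column is NA. Again, this is a base R illustration under assumptions, with `drop_empty_rows()` and its default `id_cols` being hypothetical names.

```r
# Hypothetical helper: drop rows in which all non-ID columns are NA
drop_empty_rows <- function(data, id_cols = c("participant_id", "session_id")) {
  value_cols <- setdiff(names(data), id_cols)
  # Count the non-missing values per row across the value columns
  n_observed <- rowSums(!is.na(data[, value_cols, drop = FALSE]))
  data[n_observed > 0, , drop = FALSE]
}

toy_rows <- data.frame(
  participant_id = c("sub-001", "sub-002", "sub-003"),
  session_id = c("ses-001", "ses-001", "ses-002"),
  var1 = c(NA, NA, 1),
  var2 = c(NA, NA, 2),
  stringsAsFactors = FALSE
)

kept_rows <- drop_empty_rows(toy_rows)
kept_rows$participant_id
```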
Given a (set of) condition(s), filters the events included in an ABCD dataset. Conditions can be specified as a vector of strings, where each string can be one of the following conditions:
"core": events for the ABCD core study
"annual": annual events for the ABCD core study
"mid_year": mid-year events for the ABCD core study
"substudy": events for ABCD substudies
"covid": events for the COVID substudy
"sdev": events for the Social Development substudy
"even": even-numbered events
"odd": odd-numbered events
numerical expressions like >2 or <=5 to filter events by number
any other string to be used as filter for the session_id column
The conditions can be combined with logical "and" or "or".
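How several conditions might be evaluated and then connected with "and" or "or" can be sketched in base R. This is a simplified illustration, not the package's logic: `eval_condition()` and `filter_sessions()` are hypothetical, only a few condition types are handled, and the session ID patterns ("ses-04A" for annual, "ses-04M" for mid-year) are assumptions.

```r
sessions <- c("ses-00A", "ses-01M", "ses-02A", "ses-04M", "ses-05A", "ses-C01")

# Hypothetical helper: turn one condition into a logical vector
eval_condition <- function(session_id, condition) {
  # Event number for IDs like "ses-04A"; NA for IDs such as "ses-C01"
  number <- suppressWarnings(
    as.integer(sub("^ses-([0-9]{2})[A-Z]$", "\\1", session_id))
  )
  if (condition == "annual") {
    grepl("^ses-[0-9]{2}A$", session_id)
  } else if (condition == "mid_year") {
    grepl("^ses-[0-9]{2}M$", session_id)
  } else if (grepl("^(<=|>=|<|>)[0-9]+$", condition)) {
    op <- sub("[0-9]+$", "", condition)
    value <- as.integer(sub("^[^0-9]+", "", condition))
    # match.fun() turns the operator string into the comparison function
    !is.na(number) & match.fun(op)(number, value)
  } else {
    # Fallback: treat the condition as a literal substring of session_id
    grepl(condition, session_id, fixed = TRUE)
  }
}

# Hypothetical helper: connect all conditions with "and" or "or"
filter_sessions <- function(session_id, conditions, connect = "and") {
  masks <- lapply(conditions, function(cond) eval_condition(session_id, cond))
  combine <- if (connect == "and") `&` else `|`
  session_id[Reduce(combine, masks)]
}

mid_before_5 <- filter_sessions(sessions, c("mid_year", "<5"))
annual_or_c01 <- filter_sessions(sessions, c("annual", "C01"), connect = "or")
```

Reducing a list of logical vectors keeps each condition independent, so adding further condition types only requires another branch in the evaluator.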
filter_events_abcd(data, conditions, connect = "and")
data |
tibble. The data to be filtered. |
conditions |
character (vector). The events to keep. |
connect |
character. Whether to connect the conditions with |
A tibble with the filtered data.
data <- tibble::tribble(
  ~session_id, ~study, ~type,
  "ses-00S", "core", "screener",
  "ses-00M", "core", "mid-year",
  "ses-00A", "core", "even",
  "ses-01M", "core", "mid-year",
  "ses-01A", "core", "odd",
  "ses-02M", "core", "mid-year",
  "ses-02A", "core", "even",
  "ses-03M", "core", "mid-year",
  "ses-03A", "core", "odd",
  "ses-04M", "core", "mid-year",
  "ses-04A", "core", "even",
  "ses-05M", "core", "mid-year",
  "ses-05A", "core", "odd",
  "ses-06M", "core", "mid-year",
  "ses-06A", "core", "even",
  "ses-C01", "substudy", "covid",
  "ses-C02", "substudy", "covid",
  "ses-C03", "substudy", "covid",
  "ses-C04", "substudy", "covid",
  "ses-C05", "substudy", "covid",
  "ses-C06", "substudy", "covid",
  "ses-C07", "substudy", "covid",
  "ses-S01", "substudy", "sdev",
  "ses-S02", "substudy", "sdev",
  "ses-S03", "substudy", "sdev",
  "ses-S04", "substudy", "sdev",
  "ses-S05", "substudy", "sdev"
)

# ABCD core study events
filter_events_abcd(data, c("core"))

# COVID substudy events
filter_events_abcd(data, c("covid"))

# imaging events
filter_events_abcd(data, c("annual", "even"))

# mid-years before year 5
filter_events_abcd(data, c("mid_year", "<5"))

# COVID or Social Development substudy events
filter_events_abcd(data, c("covid", "sdev"), connect = "or")
Given a vector of ID/events (concatenated like
"{participant_id}_{session_id}"), or a dataframe
with participant_id and session_id columns,
this function filters the data to keep or alternatively
remove the rows for the given ID/events.
filter_id_events(data, id_events, revert = FALSE)
data |
tibble. The data to be filtered. |
id_events |
character (vector) or dataframe. (Vector of) ID/event(s)
or a dataframe with |
revert |
logical. Whether to revert the filter, i.e., to keep only rows
NOT matching the |
A tibble with the filtered data.
data <- tibble::tribble(
  ~participant_id, ~session_id,
  "sub-001", "ses-001",
  "sub-001", "ses-002",
  "sub-002", "ses-001",
  "sub-002", "ses-002",
  "sub-003", "ses-001",
  "sub-003", "ses-002"
)

# filter using a vector of ID/events
filter_id_events(
  data,
  id_events = c("sub-001_ses-001", "sub-003_ses-002")
)

# filter using a dataframe with participant_id and session_id
data_filter <- tibble::tibble(
  participant_id = c("sub-001", "sub-003"),
  session_id = c("ses-001", "ses-002")
)
filter_id_events(
  data,
  id_events = data_filter
)

# revert filter
filter_id_events(
  data,
  id_events = c("sub-001_ses-001", "sub-003_ses-002"),
  revert = TRUE
)
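The matching on concatenated "{participant_id}_{session_id}" keys can be sketched in base R as follows. This is an assumption about the described behavior rather than the package's code; `keep_id_events()` is a hypothetical helper.

```r
# Hypothetical helper: keep (or, with revert = TRUE, remove) rows whose
# "{participant_id}_{session_id}" key is listed in id_events
keep_id_events <- function(data, id_events, revert = FALSE) {
  # A data frame of IDs is collapsed to the same key format first
  if (is.data.frame(id_events)) {
    id_events <- paste(id_events$participant_id, id_events$session_id, sep = "_")
  }
  keys <- paste(data$participant_id, data$session_id, sep = "_")
  keep <- keys %in% id_events
  if (revert) keep <- !keep
  data[keep, , drop = FALSE]
}

toy_events <- data.frame(
  participant_id = c("sub-001", "sub-001", "sub-002"),
  session_id = c("ses-001", "ses-002", "ses-001"),
  stringsAsFactors = FALSE
)

kept <- keep_id_events(toy_events, "sub-001_ses-002")
removed <- keep_id_events(toy_events, "sub-001_ses-002", revert = TRUE)
```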
Retrieves data dictionary for a given study and release version. Allows for
filtering by variables and tables. Wrapper around
get_metadata().
In addition to the main get_dd() function, there are two
study-specific variations:
get_dd_abcd(): for the ABCD study.
get_dd_hbcd(): for the HBCD study.
They have the same arguments as the get_dd() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
get_dd(study, release = "latest", vars = NULL, tables = NULL)

get_dd_abcd(...)

get_dd_hbcd(...)
study |
character. The study name. One of "abcd" or "hbcd". |
release |
character. Release version (Default: |
vars |
character (vector). Vector with the names of variables to be included. |
tables |
character (vector). Vector with the names of tables to be included. |
... |
Additional arguments passed to the underlying
|
Data frame with the data dictionary.
get_dd("abcd")

get_dd("hbcd", release = "1.0")

get_dd("abcd", vars = c("ab_g_dyn__visit_dtt", "ab_g_dyn__visit_age"))

get_dd("abcd", tables = "ab_g_dyn")

get_dd_abcd()

get_dd_hbcd(release = "1.0")
Retrieves the identifier columns for a given study and release version.
In addition to the main get_id_cols() function, there are two
study-specific variations:
get_id_cols_abcd(): for the ABCD study.
get_id_cols_hbcd(): for the HBCD study.
They have the same arguments as the get_id_cols() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
get_id_cols(study, release = "latest")

get_id_cols_abcd(...)

get_id_cols_hbcd(...)
study |
character. The study name. One of "abcd" or "hbcd". |
release |
character. Release version (Default: |
... |
Additional arguments passed to the underlying
|
character vector with the identifier columns.
get_id_cols("abcd")

get_id_cols("hbcd")

get_id_cols_abcd(release = "6.0")

get_id_cols_hbcd(release = "1.0")
Retrieves levels table for a given study and release version. Allows for
filtering by variables and tables. Wrapper around
get_metadata().
In addition to the main get_levels() function, there are two
study-specific variations:
get_levels_abcd(): for the ABCD study.
get_levels_hbcd(): for the HBCD study.
They have the same arguments as the get_levels() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
get_levels(study, release = "latest", vars = NULL, tables = NULL)

get_levels_abcd(...)

get_levels_hbcd(...)
study |
character. The study name. One of "abcd" or "hbcd". |
release |
character. Release version (Default: |
vars |
character (vector). Vector with the names of variables to be included. |
tables |
character (vector). Vector with the names of tables to be included. |
... |
Additional arguments passed to the underlying
|
Data frame with the levels table.
get_levels("abcd")

get_levels("hbcd", release = "1.0")

get_levels("abcd", vars = c("ab_g_dyn__visit_type"))

get_levels("abcd", tables = "ab_g_dyn")

get_levels_abcd(release = "6.0")

get_levels_hbcd()
Retrieves metadata (data dictionary, levels table, event map) for a given study and release version. Allows for filtering by variables and tables.
get_metadata(
  study,
  release = "latest",
  vars = NULL,
  tables = NULL,
  type = "dd"
)
study |
character. The study name. One of "abcd" or "hbcd". |
release |
character. Release version (Default: |
vars |
character (vector). Vector with the names of variables to be included. |
tables |
character (vector). Vector with the names of tables to be included. |
type |
character. Type of metadata to retrieve. One of |
Data frame with the metadata.
get_metadata("abcd", type = "levels")

get_metadata("hbcd", release = "1.0")

get_metadata("abcd", vars = c("ab_g_dyn__visit_dtt", "ab_g_dyn__visit_age"))

get_metadata("abcd", tables = "ab_g_dyn")

get_metadata("abcd", type = "sessions")
These functions retrieve all release version number(s) for a given study, or the latest release version number.
get_releases(study)

get_releases_abcd()

get_releases_hbcd()

get_latest_release(study)

get_latest_release_abcd()

get_latest_release_hbcd()
study |
character. The study name. One of "abcd" or "hbcd". |
character. The release version number(s), or the latest release version number, of the specified study.
get_releases("abcd")

get_releases("hbcd")

get_latest_release("abcd")

get_latest_release("hbcd")
Retrieves the sessions table for a given study and release version. Wrapper
around get_metadata().
In addition to the main get_sessions() function, there are two
study-specific variations:
get_sessions_abcd(): for the ABCD study.
get_sessions_hbcd(): for the HBCD study.
They have the same arguments as the get_sessions() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
get_sessions(study, release = "latest")

get_sessions_abcd(...)

get_sessions_hbcd(...)
study |
character. The study name. One of "abcd" or "hbcd". |
release |
character. Release version (Default: |
... |
Additional arguments passed to the underlying
|
Data frame with the sessions table.
get_sessions("abcd")

get_sessions("hbcd")

get_sessions_abcd(release = "6.0")

get_sessions_hbcd(release = "1.0")
Joins selected variables and/or whole tables from the tabulated data/shadow
files into a single data frame. Expects the data files to be stored in one
directory in .parquet or .tsv format, with one file per table following
the naming convention of the respective NBDC dataset (from the ABCD or HBCD
studies). Typically, this will be the rawdata/phenotype/ directory within
a BIDS dataset downloaded from the NBDC Data Hub.
Variables specified in vars and tables will be full-joined together,
i.e., all rows will be kept, even if they do not have a value for all
columns. Variables specified in vars_add and tables_add will be left-joined to the
variables selected in vars and tables, i.e., only the values for already
existing rows will be added and no new rows will be created. This is useful
for adding variables to the dataset that are important for a given analysis
but are not the main variables of interest (e.g., design/nesting or
demographic information). By left-joining these variables, one avoids
creating new rows that contain only missing values for the main variables of
interest selected using vars and tables. If the same variables are
specified in vars/tables and vars_add/tables_add, the variables in
vars_add/tables_add will be ignored.
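The difference between the full join used for the main variables and the left join used for the additional variables can be seen in a small base R sketch. The toy tables and column names below are made up, and base merge() stands in for whatever join machinery the package uses internally.

```r
# Two "main" tables and one "additional" table (toy data)
main_a <- data.frame(participant_id = c("sub-001", "sub-002"),
                     var1 = c(1, 2), stringsAsFactors = FALSE)
main_b <- data.frame(participant_id = c("sub-002", "sub-003"),
                     var2 = c(20, 30), stringsAsFactors = FALSE)
extra <- data.frame(participant_id = c("sub-001", "sub-004"),
                    site = c("A", "D"), stringsAsFactors = FALSE)

# Full join: every participant from either main table is kept,
# with NA where a table has no value for that participant
joined_main <- merge(main_a, main_b, by = "participant_id", all = TRUE)

# Left join: `site` is added for existing rows only, so "sub-004",
# which appears only in the additional table, creates no new row
joined_all <- merge(joined_main, extra, by = "participant_id", all.x = TRUE)
joined_all
```

Here the result has three rows: the additional table contributes a value for "sub-001" and nothing else, which is the behavior that avoids rows containing only missing values for the main variables.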
In addition to the main join_tabulated() function, there are two
study-specific variations:
join_tabulated_abcd(): for the ABCD study.
join_tabulated_hbcd(): for the HBCD study.
They have the same arguments as the join_tabulated() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
join_tabulated(
  dir_data,
  study,
  vars = NULL,
  tables = NULL,
  vars_add = NULL,
  tables_add = NULL,
  release = "latest",
  format = "parquet",
  shadow = FALSE,
  remove_empty_rows = TRUE,
  bypass_ram_check = FALSE,
  ignore_version_mismatch = FALSE
)
join_tabulated_abcd(...)
join_tabulated_hbcd(...)
dir_data |
character. Path to the directory with the data files in
|
study |
character. NBDC study (One of "abcd" or "hbcd"). |
vars |
character (vector). Name(s) of variable(s) to be joined
(Default: NULL). |
tables |
character (vector). Name(s) of table(s) to be joined (Default:
NULL). |
vars_add |
character (vector). Name(s) of additional variable(s) to be
left-joined to the variables selected in |
tables_add |
character (vector). Name(s) of additional table(s) to be
left-joined to the variables selected in |
release |
character. Release version (Default: "latest"). |
format |
character. Data format (One of |
shadow |
logical. Whether to join the shadow matrix
instead of the data table (default: FALSE). |
remove_empty_rows |
logical. Whether to filter out rows that have
all values missing in the joined variables, except for the
ID columns (default: TRUE). |
bypass_ram_check |
logical. If This argument is only used for the ABCD study, as the HBCD data is small enough to be loaded without RAM issues with most personal computers. As HBCD data grows in the future, this may change. |
ignore_version_mismatch |
logical. Whether to ignore version mismatch
between data files and metadata (dd, levels, etc)
and proceed with joining anyway (default: The function performs a version check by reading a specific file from the
|
... |
Additional arguments passed to the underlying function
Note: Turning this parameter to |
A tibble of data or shadow matrix with the joined variables.
## Not run:
join_tabulated(
  dir_data = "path/to/data/",
  vars = c("var_1", "var_2", "var_3"),
  tables = c("table_1", "table_2"),
  study = "abcd",
  release = "6.0"
)
## End(Not run)
Reads in a .tsv or .csv file with correctly formatted column types.
Uses readr::read_tsv()/readr::read_csv() internally and specifies the
column types explicitly using the col_types argument utilizing information
from the data dictionary. Returns only the identifier columns and the columns
specified in the data dictionary, i.e., all columns in the file that are not
specified in the data dictionary are ignored.
read_dsv_formatted(file, dd, action = "warn")
file |
character. Path to the |
dd |
tibble. Data dictionary specifying the column types. Only columns specified in the data dictionary are read. |
action |
character. What to do if there are columns in the file that are
not specified in the data dictionary (One of |
WHY THIS IS IMPORTANT: readr::read_tsv()/readr::read_csv() (like
other commands to load text files in R or other programming languages) by
default infers the column types from the data. This doesn't always work
perfectly. For example, it may interpret a column with only integers as a
double, or a column with only dates as a character. Sometimes a column may
even be read in completely empty because, by default,
readr::read_tsv()/readr::read_csv() only considers the first 1000 rows
when inferring the data type and interprets the column as an empty logical
vector if those rows are all empty. The NBDC datasets store categorical
data as integers formatted as character. By default,
readr::read_tsv()/readr::read_csv() may interpret them as numeric. By
specifying the column types explicitly based on what is defined in the
data dictionary, we can avoid these issues.
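The effect of explicit column types can be sketched with base R alone. The example below uses `read.delim()` with `colClasses` as a stand-in for the `readr` calls described above; the column names and the `col_types` vector are assumptions made for illustration, whereas in practice the types would be derived from the data dictionary.

```r
# A tab-separated snippet in which `sex_code` is a categorical variable
# stored as integer codes (illustrative column names, not real NBDC variables).
tsv_text <- "participant_id\tsex_code\tage\nP01\t1\t9\nP02\t2\t10\n"

# Type inference: the categorical codes are read as integers.
inferred <- read.delim(text = tsv_text)

# Explicit types: the categorical codes stay character, as intended.
col_types <- c(participant_id = "character", sex_code = "character", age = "integer")
explicit <- read.delim(text = tsv_text, colClasses = col_types)

str(inferred$sex_code)
str(explicit$sex_code)
```

With inference, `sex_code` comes back as an integer vector; with explicit types it remains a character vector that can later be converted to a factor.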
GENERAL RECOMMENDATION: Other file formats like .parquet correctly
store the column types and don't need to be handled explicitly. They also
offer other advantages like faster reading speed and smaller file sizes. As
such, these formats should generally be preferred over .tsv/.csv files.
However, if you have to work with .tsv/.csv files, this function can help
you avoid common pitfalls.
A tibble with the data/shadow matrix read from the .tsv
or .csv file.
## Not run:
dd <- NBDCtools::get_dd("abcd", "6.0")
read_dsv_formatted("path/to/file.tsv", dd)
## End(Not run)
This function binds the shadow matrix to the data.
shadow_bind_data(
  data,
  shadow = NULL,
  naniar_shadow = FALSE,
  id_cols = union(get_id_cols_abcd(), get_id_cols_hbcd()),
  suffix = "_shadow"
)
data |
tibble. The data. |
shadow |
tibble. The shadow matrix. If |
naniar_shadow |
logical. Whether to use |
id_cols |
character. The columns to join by (the identifier column(s))
in the data and shadow matrices (Default: identifier columns used in ABCD and
HBCD).
In |
suffix |
character. The suffix to add to the shadow columns.
Default is If |
If naniar_shadow = FALSE and shadow is provided, the two data frames
must have the same columns. The order of the columns does not matter, but
the ID columns must be the same in both data frames. If there are extra
rows in the shadow matrix, they will be ignored.
NBDC releases HBCD data with shadow matrices, which can be used for
the shadow argument. To work with ABCD data, the option for
now is to use naniar_shadow = TRUE, which will create a shadow matrix
from the data using naniar::as_shadow().
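As a rough illustration of what binding a provided shadow matrix to the data amounts to, the following base R sketch joins the two tables on the identifier columns and suffixes the shadow columns. It is not the package's implementation; it only assumes the behaviour described above (matching columns, extra shadow rows ignored).

```r
data <- data.frame(
  participant_id = c("1", "2", "3"),
  session_id = c("1", "2", "3"),
  var1 = c(NA, NA, 1),
  stringsAsFactors = FALSE
)
# The shadow matrix has one extra row ("4") that has no match in `data`.
shadow <- data.frame(
  participant_id = c("1", "2", "3", "4"),
  session_id = c("1", "2", "3", "4"),
  var1 = c("Unknown", NA, NA, "Unknown"),
  stringsAsFactors = FALSE
)

id_cols <- c("participant_id", "session_id")

# Left join on the ID columns: rows that exist only in `shadow` are dropped,
# and the shadow columns receive the "_shadow" suffix.
bound <- merge(data, shadow, by = id_cols, all.x = TRUE,
               suffixes = c("", "_shadow"))
print(bound)
```

The result keeps the three rows of `data` and gains a `var1_shadow` column next to `var1`.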
A data frame containing the data matrix with the shadow columns appended. It will be twice the size of the original data matrix.
shadow <- tibble::tibble(
  participant_id = c("1", "2", "3"),
  session_id = c("1", "2", "3"),
  var1 = c("Unknown", NA, NA),
  var2 = c("Wish not to answer", NA, NA)
)
data <- tibble::tibble(
  participant_id = c("1", "2", "3"),
  session_id = c("1", "2", "3"),
  var1 = c(NA, NA, 1),
  var2 = c(NA, 2, NA)
)
shadow_bind_data(data, shadow)
if (requireNamespace("naniar", quietly = TRUE)) {
  shadow_bind_data(data, naniar_shadow = TRUE)
}
This function replaces the missing values in the shadow matrices.
This is done by checking if the values in the data and the
shadow matrices are both NA. If they are, the value in the shadow
matrix is replaced with "Missing due to joining".
shadow_replace_binding_missing(
  data,
  shadow,
  id_cols = union(get_id_cols_abcd(), get_id_cols_hbcd()),
  replacement = "Missing due to joining"
)
data |
tibble. The data. |
shadow |
tibble. The shadow matrix. |
id_cols |
character (vector). The possible unique identifier columns.
The data does not need to have all of these columns, but if they are
present, they will be used to identify unique rows (Default: identifier
columns used in ABCD and HBCD).
For example, the ABCD data usually has only |
replacement |
character. The value to replace the missing values with. |
Data and shadow requirements: The two dataframes must have the same columns and the same number of rows. They must have the same column names, but the order of the columns does not matter. It is recommended to use the same column order and the same row order (by ID columns) in both dataframes, which saves some processing time.
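The replacement rule can be sketched in a few lines of base R. The helper `replace_both_missing()` below is hypothetical and is not the package's implementation; it assumes that `data` and `shadow` have identical column names and identical row order.

```r
data <- data.frame(
  participant_id = c("1", "2", "3"),
  var1 = c(NA, NA, 1),
  var2 = c(NA, 2, NA),
  stringsAsFactors = FALSE
)
shadow <- data.frame(
  participant_id = c("1", "2", "3"),
  var1 = c("Unknown", NA, NA),
  var2 = c("Wish not to answer", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: where a cell is NA in both the data and the shadow
# matrix, the shadow cell is set to `replacement`.
replace_both_missing <- function(data, shadow, id_cols,
                                 replacement = "Missing due to joining") {
  for (col in setdiff(names(data), id_cols)) {
    both_na <- is.na(data[[col]]) & is.na(shadow[[col]])
    shadow[[col]][both_na] <- replacement
  }
  shadow
}

result <- replace_both_missing(data, shadow, id_cols = "participant_id")
print(result)
```

Cells that have a value in the data (and therefore a legitimate NA in the shadow matrix) are left untouched, as are cells that already carry a missingness reason.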
A tibble of the shadow matrix with missing values replaced.
shadow <- tibble::tibble(
  participant_id = c("1", "2", "3"),
  session_id = c("1", "2", "3"),
  var1 = c("Unknown", NA, NA),
  var2 = c("Wish not to answer", NA, NA)
)
data <- tibble::tibble(
  participant_id = c("1", "2", "3"),
  session_id = c("1", "2", "3"),
  var1 = c(NA, NA, 1),
  var2 = c(NA, 2, NA)
)
shadow_replace_binding_missing(data, shadow)
Based on the specifications in the data dictionary, transforms all categorical columns to factor.
transf_factor(data, study, release = "latest")
data |
tibble. The data to be transformed. Columns are expected to be in the data dictionary. If not, they will be skipped. |
study |
character. NBDC study (One of "abcd" or "hbcd"). |
release |
character. Release version (Default: "latest"). |
A tibble with the transformed data.
## Not run:
transf_factor(data, study = "abcd")
## End(Not run)
This function can add variable labels and value labels to the data. The variable labels are descriptive information about the column, and the value labels are the levels of the factor variables.
transf_label(
  data,
  study,
  release = "latest",
  add_var_label = TRUE,
  add_value_label = TRUE,
  id_cols_labels = c(participant_id = "Participant identifier", session_id = "Event identifier", run_id = "Run identifier")
)
data |
tibble. The data to be transformed. |
study |
character. NBDC study (One of "abcd" or "hbcd"). |
release |
character. Release version (Default: "latest"). |
add_var_label |
logical. Whether to add variable labels (Default:
TRUE). |
add_value_label |
logical. Whether to add value labels (Default:
TRUE). |
id_cols_labels |
named character vector. A named vector of labels for the identifier columns, with the names being the column names and the values being the labels. |
At least one of add_var_label or add_value_label must be set to TRUE.
If both are FALSE, an error will be raised.
The transf_factor() function has a convert_text argument,
which converts text columns to unordered factors. When labels are added
to data that has been type-transformed in this way, these text-derived
factor columns will not have labels at the variable level.
A tibble with the labelled data.
transf_factor() for transforming categorical columns to factors.
## Not run:
transf_label(data)
## End(Not run)
This function converts time columns to hms format.
transf_time_to_hms(data, study, release = "latest")
data |
tibble. The data to be converted. |
study |
character. NBDC study (One of "abcd" or "hbcd"). |
release |
character. Release version (Default: "latest"). |
Time columns in the input data are expected to be character strings in
the format "HH:MM:SS". If a value is not in this format, the function
returns NA for that row.
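The format rule can be made concrete with a small base R sketch. The helper `parse_hms_seconds()` is hypothetical and is not part of the package; it returns seconds since midnight rather than an hms object, purely to show which inputs are accepted and which become NA.

```r
# Hypothetical helper: converts "HH:MM:SS" strings to seconds since midnight
# and returns NA for anything that does not match the format.
parse_hms_seconds <- function(x) {
  ok <- !is.na(x) & grepl("^[0-9]{2}:[0-9]{2}:[0-9]{2}$", x)
  out <- rep(NA_real_, length(x))
  parts <- strsplit(x[ok], ":", fixed = TRUE)
  out[ok] <- vapply(
    parts,
    function(p) sum(as.numeric(p) * c(3600, 60, 1)),
    numeric(1)
  )
  out
}

times <- c("08:30:00", "23:59:59", "8:30", NA, "noon")
seconds <- parse_hms_seconds(times)
print(seconds)
```

Only the first two values match the expected pattern; the remaining three yield NA, mirroring the behaviour described above.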
A tibble with time columns converted to hms format.
## Not run:
transf_time_to_hms(data)
## End(Not run)
Converts the values of categorical/factor columns (e.g., "1", "2") to
their labels (e.g., "Male", "Female"). The value labels will be set to
the values.
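For a single factor column, the conversion can be pictured with the following base R sketch. The lookup vector `value_labels` is an assumption standing in for the value labels that the labelled dataset would carry; this is not the package's implementation.

```r
# A factor whose levels are the stored codes.
sex <- factor(c("1", "2", "2", NA), levels = c("1", "2"))

# Illustrative lookup from code to label.
value_labels <- c("1" = "Male", "2" = "Female")

# Replace each level (code) with its label; NA values stay NA.
sex_labelled <- sex
levels(sex_labelled) <- unname(value_labels[levels(sex_labelled)])
print(sex_labelled)
```

After the conversion the factor levels are "Male" and "Female" instead of "1" and "2", and the underlying observations are unchanged.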
transf_value_to_label(data, transf_sess_id = FALSE)
data |
tibble. The labelled dataset |
transf_sess_id |
logical. Whether to transform the |
The data must be type transformed and labelled. See
transf_factor() and transf_label() for details.
data <- data |> transf_factor() |> transf_label()
A tibble with factor columns transformed to labels.
## Not run:
transf_value_to_label(data)
transf_value_to_label(data, value_to_na = TRUE)
## End(Not run)
This function converts the missing codes in the dataset to NA
in all factor columns. Examples of missing codes are 999, 888, and 777.
transf_value_to_na(
  data,
  missing_codes = c("999", "888", "777", "666", "555", "444", "333", "222"),
  ignore_col_pattern = "__dk$|__dk__l$",
  id_cols = union(get_id_cols_abcd(), get_id_cols_hbcd())
)
data |
tibble. The labelled and type-converted dataset. |
missing_codes |
character vector. The missing codes to be converted to NA |
ignore_col_pattern |
character. A regex pattern to ignore columns that should not be converted to NA. |
id_cols |
character vector. The names of the ID columns to be excluded from the conversion (Default: identifier columns used in ABCD and HBCD). |
This function works best with ABCD data, where the missing codes
are strictly defined. For HBCD data, the missing codes are still
under discussion. The function may work, but it may not behave as
expected for missing codes that are decided in the future.
For HBCD data, or for other arbitrary missing codes that one wishes
to convert to NA, it is recommended to use the
sjmisc::set_na_if() function instead.
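The general idea of converting missing codes in factor columns can be sketched in base R as follows. The helper `codes_to_na()` and the toy columns are hypothetical, and dropping the now-unused factor levels is a choice made for this sketch, not a documented behaviour of the package.

```r
df <- data.frame(
  participant_id = c("p1", "p2", "p3"),
  mood = factor(c("1", "999", "2")),
  mood__dk = factor(c("0", "999", "1")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sets missing codes to NA in factor columns, skipping
# ID columns and columns whose name matches `ignore_col_pattern`.
codes_to_na <- function(data, missing_codes, id_cols, ignore_col_pattern) {
  for (col in setdiff(names(data), id_cols)) {
    if (!is.factor(data[[col]]) || grepl(ignore_col_pattern, col)) next
    x <- data[[col]]
    x[x %in% missing_codes] <- NA
    data[[col]] <- droplevels(x)  # sketch-specific: remove unused code levels
  }
  data
}

out <- codes_to_na(
  df,
  missing_codes = c("999", "888", "777"),
  id_cols = "participant_id",
  ignore_col_pattern = "__dk$|__dk__l$"
)
print(out)
```

Here `mood` loses its "999" value, while `mood__dk` is left as-is because its name matches the ignore pattern.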
The data must be type transformed and labelled. See
transf_factor() and transf_label() for details.
data <- data |> transf_factor() |> transf_label()
A tibble of the dataset with missing codes converted to NA
## Not run:
data <- data |> transf_factor() |> transf_label()
transf_value_to_na(data)
## End(Not run)