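Browse R package metadata and collection signals in one place.
The collection signals that need to be judged at first glance are placed first.
Backend-related packages detected in DESCRIPTION.
Basic metadata is condensed into small cards and tokens.
Rcpp, RcppArmadillo

| Package | Type | Spec |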
|---|---|---|
| geosphere CRAN · 0.1.1 · 2026-05-30 | Imports | geosphere |
| igraph CRAN · 0.1.1 · 2026-05-30 | Imports | igraph |
| mcclust CRAN · 0.1.1 · 2026-05-30 | Imports | mcclust |
| Rcpp CRAN · 0.1.1 · 2026-05-30 | Imports | Rcpp |
| RecordLinkage CRAN · 0.1.1 · 2026-05-30 | Imports | RecordLinkage |
| stringr CRAN · 0.1.1 · 2026-05-30 | Imports | stringr |
| utils CRAN · 0.1.1 · 2026-05-30 | Imports | utils |
| Rcpp CRAN · 0.1.1 · 2026-05-30 | LinkingTo | Rcpp |
| RcppArmadillo CRAN · 0.1.1 · 2026-05-30 | LinkingTo | RcppArmadillo |

| Package | Type | Spec |
|---|---|---|
| No dependency edges to display. | | |

NEWS

multilink 0.1.1

Updated package for submission to CRAN.

README

multilink

multilink is an R package which implements methodology presented in the manuscript “Multifile Partitioning for Record Linkage and Duplicate Detection” by Serge Aleshin-Guendel and Mauricio Sadinle, published in the Journal of the American Statistical Association and available on arXiv. It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates.

Installation

You can install the development version of multilink from GitHub with:

```r
install.packages("devtools")
devtools::install_github("aleshing/multilink")
```

Help for package multilink

Package {multilink}

Contents: create_comparison_data, dup_data, dup_data_small, find_bayes_estimate, gibbs_sampler, initialize_partition, multilink, no_dup_data, no_dup_data_small, reduce_comparison_data, relabel_bayes_estimate, specify_prior

| Field | Value |
|---|---|
| Title | Multifile Record Linkage and Duplicate Detection |
| Version | 0.1.1 |
| Description | Implementation of the methodology of Aleshin-Guendel & Sadinle (2022) <doi:10.1080/01621459.2021.2013242>. It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates. |
| Depends | R (≥ 3.5.0) |
| License | GPL-3 |
| Encoding | UTF-8 |
| LazyData | true |
| RoxygenNote | 7.1.2 |
| URL | https://github.com/aleshing/multilink |
| BugReports | https://github.com/aleshing/multilink/issues |
| Imports | igraph, RecordLinkage, Rcpp, utils, mcclust, geosphere, stringr |
| LinkingTo | Rcpp, RcppArmadillo |
| NeedsCompilation | yes |
| Packaged | 2023-06-08 20:25:20 UTC; sergealeshin-guendel |
| Author | Serge Aleshin-Guendel [aut, cre] |
| Maintainer | Serge Aleshin-Guendel <saleshinguendel@gmail.com> |
| Repository | CRAN |
| Date/Publication | 2023-06-09 14:20:07 UTC |

Create Comparison Data

Description

Create comparison data for all pairs of records, except for those records in files which are assumed to have no duplicates.

Usage

```r
create_comparison_data(
  records,
  types,
  breaks,
  file_sizes,
  duplicates,
  verbose = TRUE
)
```

Arguments

`records`: A `data.frame` containing the records to be linked, where each column of `records` is a field to be compared. If there are multiple files, `records` should be obtained by stacking the files on top of each other so that `records[1:file_sizes[1], ]` contains the records for file 1, `records[(file_sizes[1] + 1):(file_sizes[1] + file_sizes[2]), ]` contains the records for file 2, and so on. Missing values should be coded as `NA`.

`types`: A `character` vector, indicating the comparison to be used for each field (i.e. each column of `records`). The options are: `"bi"` for binary comparisons, `"nu"` for numeric comparisons (absolute difference), `"lv"` for string comparisons (normalized Levenshtein distance), and `"lv_sep"` for string comparisons (normalized Levenshtein distance) where each string may contain multiple spellings separated by the `"|"` character. We assume that fields using options `"bi"`, `"lv"`, and `"lv_sep"` are of class `character`, and fields using the `"nu"` option are of class `numeric`. For fields using the `"lv_sep"` option, for each record pair the normalized Levenshtein distance is computed between each possible spelling, and the minimum normalized Levenshtein distance between spellings is then used as the comparison for that record pair.

`breaks`: A `list`, the same length as `types`, indicating the break points used to compute disagreement levels for each field's comparisons. If `types[f]="bi"`, `breaks[[f]]` is ignored (and thus can be set to `NA`). See Details for more information on specifying this argument.

`file_sizes`: A `numeric` vector indicating the size of each file.

`duplicates`: A `numeric` vector indicating which files are assumed to have duplicates. `duplicates[k]` should be `1` if file `k` has duplicates, and `duplicates[k]` should be `0` if file `k` has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.

`verbose`: A `logical` indicator of whether progress messages should be printed (default `TRUE`).

Details

The purpose of this function is to construct comparison vectors for each pair of records. In order to construct these vectors, one needs to specify the `types` and `breaks` arguments. The `types` argument specifies how each field should be compared, and the `breaks` argument specifies how to discretize these comparisons.

Currently, the `types` argument supports three types of field comparisons: binary, absolute difference, and the normalized Levenshtein distance. Please contact the package maintainer if you need a new type of comparison to be supported.

The `breaks` argument should be a `list`, with one element for each field. If a field is being compared with a binary comparison, i.e. `types[f]="bi"`, then the corresponding element of `breaks` should be `NA`, i.e. `breaks[[f]]=NA`. If a field is being compared with a numeric or string comparison, then the corresponding element of `breaks` should be a vector of cut points used to discretize the comparisons. To give more detail, suppose you pass in cut points `breaks[[f]]=c(cut_1, ..., cut_L)`. These cut points discretize the range of the comparisons into $L+1$ intervals: $I_0=(-\infty, \mathrm{cut}_1]$, $I_1=(\mathrm{cut}_1, \mathrm{cut}_2]$, ..., $I_L=(\mathrm{cut}_L, \infty]$. The raw comparisons, which lie in $[0,\infty)$ for numeric comparisons and $[0,1]$ for string comparisons, are then replaced with indicators of which interval the comparisons lie in. The interval $I_0$ corresponds to the lowest level of disagreement for a comparison, while the interval $I_L$ corresponds to the highest level of disagreement for a comparison.

Value

A list containing:

`record_pairs`: A `data.frame`, where each row contains the pair of records being compared in the corresponding row of `comparisons`. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. `record_pairs[i, 1] < record_pairs[i, 2]` for each row `i`.

`comparisons`: A `logical` matrix, where each row contains the comparisons for the record pair in the corresponding row of `record_pairs`. Comparisons are in the same order as the columns of `records`, and are represented by `L + 1` columns of `TRUE/FALSE` indicators, where `L + 1` is the number of disagreement levels for the field based on `breaks`.

`K`: The number of files, assumed to be of class `numeric`.

`file_sizes`: A `numeric` vector of length `K`, indicating the size of each file.

`duplicates`: A `numeric` vector of length `K`, indicating which files are assumed to have duplicates. `duplicates[k]` should be `1` if file `k` has duplicates, and `duplicates[k]` should be `0` if file `k` has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.

`field_levels`: A `numeric` vector indicating the number of disagreement levels for each field.

`file_labels`: An `integer` vector of length `sum(file_sizes)`, where `file_labels[i]` indicates which file record `i` is in.

`fp_matrix`: An `integer` matrix, where `fp_matrix[k1, k2]` is a label for the file pair `(k1, k2)`. Note that `fp_matrix[k1, k2] = fp_matrix[k2, k1]`.

`rp_to_fp`: A `logical` matrix that indicates which record pairs belong to which file pairs. `rp_to_fp[fp, rp]` is `TRUE` if the records `record_pairs[rp, ]` belong to the file pair `fp`, and is `FALSE` otherwise. Note that `fp` is given by the labeling in `fp_matrix`.

`ab`: An `integer` vector, of length `ncol(comparisons) * K * (K + 1) / 2`, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.

`file_sizes_not_included`: A `numeric` vector of `0`s. This element is non-zero when `reduce_comparison_data` is used.

`ab_not_included`: A `numeric` vector of `0`s. This element is non-zero when `reduce_comparison_data` is used.

`labels`: `NA`. This element is not `NA` when `reduce_comparison_data` is used.

`pairs_to_keep`: `NA`. This element is not `NA` when `reduce_comparison_data` is used.

`cc`: `0`. This element is non-zero when `reduce_comparison_data` is used.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242] [arXiv]

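The `"lv_sep"` rule and the `breaks` discretization described for `create_comparison_data` can be pictured with a small base R sketch. This is only an illustration, not the package's internal code: the helpers `min_norm_lv` and `disagreement_level` are hypothetical names, and normalizing the edit distance by the length of the longer spelling is an assumption made here.

```r
# Hypothetical helper: minimum normalized Levenshtein distance between two
# fields that may each hold several spellings separated by "|".
# Assumption: distances are normalized by the length of the longer spelling.
min_norm_lv <- function(a, b) {
  spell_a <- strsplit(a, "|", fixed = TRUE)[[1]]
  spell_b <- strsplit(b, "|", fixed = TRUE)[[1]]
  d <- utils::adist(spell_a, spell_b)                     # raw edit distances
  longest <- outer(nchar(spell_a), nchar(spell_b), pmax)  # normalizing lengths
  min(d / longest)
}

# Hypothetical helper: map raw comparisons to disagreement levels 1..(L + 1),
# where level 1 is the interval I_0 and level L + 1 is the interval I_L.
disagreement_level <- function(x, cuts) {
  cut(x, breaks = c(-Inf, cuts, Inf), right = TRUE, labels = FALSE)
}

cuts <- c(0, 0.25, 0.5)                 # same cut points as the package examples
raw <- c(0, 0.1, 0.25, 0.3, 0.5, 0.9)   # toy raw string comparisons
levels_raw <- disagreement_level(raw, cuts)

d_name <- min_norm_lv("jon|john", "john")  # best spelling pair is an exact match
level_name <- disagreement_level(d_name, cuts)
```

With three cut points there are four disagreement levels, which is consistent with the `*_DL_3` column names used in the package examples if levels are counted from 0, though that naming is inferred from the examples rather than stated in the documentation.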
Examples

```r
## Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

## Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))
```

dup_data

A dataset containing 867 simulated records from 3 files with no duplicate records in each file.

Usage

```r
dup_data
```

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group: https://dmm.anu.edu.au/geco/index.php.

Examples

```r
data(dup_data)

# There are 500 entities represented in the records
length(unique(dup_data$IDs))
```

dup_data_small

A dataset containing 96 simulated records from 3 files with no duplicate records in each file, subset from dup_data.

Usage

```r
dup_data_small
```

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group: https://dmm.anu.edu.au/geco/index.php.

Examples

```r
data(dup_data_small)

# There are 96 entities represented in the records
length(unique(dup_data_small$IDs))
```

find_bayes_estimate

Find the (approximate) Bayes estimate of a partition based on MCMC samples of the partition and a specified loss function.

Usage

```r
find_bayes_estimate(
  partitions,
  burn_in,
  L_FNM = 1,
  L_FM1 = 1,
  L_FM2 = 2,
  L_A = Inf,
  max_cc_size = nrow(partitions),
  verbose = TRUE
)
```

Examples

```r
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
  Z_init = Z_init, seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)
```

gibbs_sampler

Run a Gibbs sampler to explore the posterior distribution of partitions of records.

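To make "posterior samples of partitions" a little more concrete, here is a toy base R sketch of discarding burn-in iterations and summarizing how often two records share a cluster. The matrix layout (one row per record, one column per iteration) is an assumption made for this illustration, and `coclustering_prob` is a hypothetical helper, not a function from the package.

```r
# Toy samples: 4 records (rows) by 6 iterations (columns); entries are
# cluster labels. Assumption: this row/column layout is illustrative only.
partitions <- cbind(
  c(1, 2, 3, 4), c(1, 2, 3, 4),   # treated as burn-in below
  c(1, 1, 3, 4), c(1, 1, 3, 4),
  c(1, 1, 3, 3), c(1, 2, 3, 4)
)

burn_in <- 2
kept <- partitions[, -seq_len(burn_in), drop = FALSE]  # drop burn-in columns

# Hypothetical helper: share of kept iterations in which records i and j
# carry the same cluster label.
coclustering_prob <- function(samples, i, j) {
  mean(samples[i, ] == samples[j, ])
}

p12 <- coclustering_prob(kept, 1, 2)
p34 <- coclustering_prob(kept, 3, 4)
```

Pairwise summaries like these are only a simple diagnostic; the Bayes estimate in the package is defined through a loss function over whole partitions.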
Usage

```r
gibbs_sampler(
  comparison_list,
  prior_list,
  n_iter = 2000,
  Z_init = 1:sum(comparison_list$file_sizes),
  seed = 70,
  single_likelihood = FALSE,
  chaperones_info = NA,
  verbose = TRUE
)
```

Examples

```r
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
  Z_init = Z_init, seed = 42)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)
```

initialize_partition

Generate an initialization for the partition in the case when it is assumed there are no duplicates in all files (so that the partition is a matching).

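A partition is a matching, in the sense used above, when no cluster contains two records from the same file. The following base R sketch checks that property on toy labels. The helper `is_matching` and all toy vectors are hypothetical; the file labels are built from the stacking order that `create_comparison_data` expects for `records`.

```r
# Toy setup: three files stacked in order, with 3, 2 and 2 records.
file_sizes <- c(3, 2, 2)
file_labels <- rep(seq_along(file_sizes), times = file_sizes)

# Hypothetical helper: TRUE when no cluster holds two records from one file.
is_matching <- function(Z, file_labels) {
  all(table(Z, file_labels) <= 1)
}

Z_singletons <- seq_len(sum(file_sizes))  # every record in its own cluster
Z_linked     <- c(1, 2, 3, 1, 4, 2, 5)    # links only across different files
Z_duplicate  <- c(1, 1, 3, 4, 5, 6, 7)    # records 1 and 2 are both in file 1

ok_singletons <- is_matching(Z_singletons, file_labels)
ok_linked     <- is_matching(Z_linked, file_labels)
ok_duplicate  <- is_matching(Z_duplicate, file_labels)
```

The all-singletons labeling has the same form as the default `Z_init = 1:sum(comparison_list$file_sizes)` shown in the `gibbs_sampler` usage.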
Usage

```r
initialize_partition(comparison_list, pairs_to_keep, seed = NA)
```

Examples

```r
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Find initialization for the matching
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)
```

multilink

The multilink package implements the methodology of Aleshin-Guendel & Sadinle (2022). It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates.

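The example workflow for this package scores an estimated partition by comparing pairwise links against known entity IDs. As a minimal sketch of that idea on toy data (the vectors `true_ids` and `estimate` are made up for illustration and do not come from the package datasets):

```r
# Toy ground truth: 3 entities, each represented by 2 records.
true_ids <- c(1, 2, 3, 1, 2, 3)
# Toy estimate: misses the link between records 3 and 6.
estimate <- c(1, 2, 3, 1, 2, 4)

# All record pairs (i, j) with i < j.
pairs <- t(combn(length(true_ids), 2))

# A pair is a link when both records carry the same label.
true_links <- true_ids[pairs[, 1]] == true_ids[pairs[, 2]]
est_links  <- estimate[pairs[, 1]] == estimate[pairs[, 2]]

true_matches <- sum(est_links & true_links)
precision <- true_matches / sum(est_links)   # share of estimated links that are true
recall    <- true_matches / sum(true_links)  # share of true links that were found
```

Enumerating every pair with `combn` is fine for toy data but grows quadratically with the number of records, which is one reason the package examples restrict attention to `record_pairs`.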
Examples

```r
# Here we demonstrate an example workflow with the small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
  Z_init = Z_init, seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# The number of clusters in the full estimate
length(unique(full_estimate))

# The number of entities represented in the records
length(unique(no_dup_data_small$IDs))

# Find which record pairs are truly coreferent based on IDs
true_links <- no_dup_data_small$IDs[comparison_list$record_pairs[, 1]] ==
  no_dup_data_small$IDs[comparison_list$record_pairs[, 2]]

# Find which record pairs are in the same clusters in the full estimate
full_estimate_links <- full_estimate[comparison_list$record_pairs[, 1]] ==
  full_estimate[comparison_list$record_pairs[, 2]]

# Find the number of true matches in the full estimate
true_matches <- sum(full_estimate_links & true_links)

# Precision of the full estimate
true_matches / sum(full_estimate_links)

# Recall of the full estimate
true_matches / sum(true_links)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# The partial estimate abstains from making decisions for how many records?
sum(partial_estimate == -1)

# For the records which decisions were made for in the partial estimate,
# there are how many clusters?
length(unique(partial_estimate))

# Abstain rate of partial_estimate
sum(partial_estimate == -1) / length(partial_estimate)

# Relabel records where we abstained
partial_estimate[which(partial_estimate == -1)] <-
  length(partial_estimate) + which(partial_estimate == -1)

# Find which record pairs are in the same clusters in the partial estimate
partial_estimate_links <-
  partial_estimate[comparison_list$record_pairs[, 1]] ==
  partial_estimate[comparison_list$record_pairs[, 2]]

# Find the number of true matches in the partial estimate
true_matches_A <- sum(partial_estimate_links & true_links)

# Precision of the partial estimate
true_matches_A / sum(partial_estimate_links)
```

```r
# Here we demonstrate an example workflow with the small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# The number of clusters in the full estimate (including records
# determined not to be candidate matches to any other records using
# reduce_comparison_data)
length(unique(full_estimate)) +
  sum(reduced_comparison_list$file_sizes_not_included)

# The number of entities represented in the records
length(unique(dup_data_small$IDs))

# Find which record pairs are truly coreferent based on IDs
true_links <- dup_data_small$IDs[comparison_list$record_pairs[, 1]] ==
  dup_data_small$IDs[comparison_list$record_pairs[, 2]]

# Focus on the record pairs that were candidate matches
true_links_reduced <- true_links[reduced_comparison_list$pairs_to_keep]

# Calculate the number of prior false non-matches based on the indexing
# scheme used
prior_fnm <- nrow(comparison_list$record_pairs[true_links &
  (!reduced_comparison_list$pairs_to_keep), ])

# Find which record pairs are in the same clusters in the full estimate
full_estimate_links <-
  full_estimate[reduced_comparison_list$record_pairs[, 1]] ==
  full_estimate[reduced_comparison_list$record_pairs[, 2]]

# Find the number of true matches in the full estimate
true_matches <- sum(full_estimate_links & true_links_reduced)

# Precision of the full estimate
true_matches / sum(full_estimate_links)

# Recall of the full estimate
true_matches / (sum(true_links_reduced) + prior_fnm)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# The partial estimate abstains from making decisions for how many records?
sum(partial_estimate == -1)

# For the records which decisions were made for in the partial estimate,
# there are how many clusters? (including records determined not to be
# candidate matches to any other records using reduce_comparison_data)
length(unique(partial_estimate)) +
  sum(reduced_comparison_list$file_sizes_not_included)

# Abstain rate of partial_estimate (excluding records determined not
# to be candidate matches to any other records using reduce_comparison_data)
sum(partial_estimate == -1) / length(partial_estimate)

# Relabel records where we abstained
partial_estimate[which(partial_estimate == -1)] <-
  length(partial_estimate) + which(partial_estimate == -1)

# Find which record pairs are in the same clusters in the partial estimate
partial_estimate_links <-
  partial_estimate[reduced_comparison_list$record_pairs[, 1]] ==
  partial_estimate[reduced_comparison_list$record_pairs[, 2]]

# Find the number of true matches in the partial estimate
true_matches_A <- sum(partial_estimate_links & true_links_reduced)

# Precision of the partial estimate
true_matches_A / sum(partial_estimate_links)

# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  full_estimate)
partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  partial_estimate)

# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
  full_estimate_id = full_estimate_relabel$link_id,
  partial_estimate_id = partial_estimate_relabel$link_id)
```

no_dup_data

A dataset containing 730 simulated records from 3 files with no duplicate records in each file.

Usage

```r
no_dup_data
```

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group: https://dmm.anu.edu.au/geco/index.php.

Examples

```r
data(no_dup_data)

# There are 500 entities represented in the records
length(unique(no_dup_data$IDs))
```

no_dup_data_small

A dataset containing 71 simulated records from 3 files with no duplicate records in each file, subset from no_dup_data.

Usage

```r
no_dup_data_small
```

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group: https://dmm.anu.edu.au/geco/index.php.

Examples

```r
data(no_dup_data_small)

# There are 71 entities represented in the records
length(unique(no_dup_data_small$IDs))
```

reduce_comparison_data

Use indexing to reduce the number of record pairs that are potential matches.

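Indexing of this kind can be pictured as a logical mask with one entry per record pair. The sketch below builds such a mask from a toy indicator matrix. The column names follow the package examples, but the matrix itself is made up for illustration and `reduce_comparison_data` is not called here.

```r
# Toy comparison matrix: one row per record pair, TRUE/FALSE indicators for
# "disagrees at the highest level" on two name fields.
comparisons <- cbind(
  gname_DL_3 = c(FALSE, TRUE, FALSE, FALSE),
  fname_DL_3 = c(FALSE, FALSE, TRUE, FALSE)
)

# Keep a pair only when neither field disagrees at the highest level.
pairs_to_keep <- !comparisons[, "gname_DL_3"] & !comparisons[, "fname_DL_3"]

n_kept <- sum(pairs_to_keep)   # number of candidate pairs left after indexing
```

Writing the mask with `!` is equivalent to the `!= TRUE` comparisons used in the package examples when the columns hold no `NA` values.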
Usage

```r
reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)
```

Examples

```r
# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)
```

relabel_bayes_estimate

Relabel the Bayes estimate of a partition, for use after using indexing to reduce the number of record pairs that are potential matches.

Usage

```r
relabel_bayes_estimate(reduced_comparison_list, bayes_estimate)
```

Examples

```r
# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  full_estimate)
partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  partial_estimate)

# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
  full_estimate_id = full_estimate_relabel$link_id,
  partial_estimate_id = partial_estimate_relabel$link_id)
```

specify_prior

Specify the prior distributions for the m and u parameters of the models for comparison data among matches and non-matches, and the partition.

Usage

```r
specify_prior(
  comparison_list,
  mus = NA,
  nus = NA,
  flat = 0,
  alphas = NA,
  dup_upper_bound = NA,
  dup_count_prior_family = NA,
  dup_count_prior_pars = NA,
  n_prior_family = NA,
  n_prior_pars = NA
)
```

Examples

```r
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)
```

| Repository | Version | Published | First seen | Last seen | Docs |
|---|---|---|---|---|---|
| CRAN | 0.1.1 | 2026-05-29 | 2026-05-30 |
No OSV data to display.
No OpenAlex data to display.