multilink

A collected view of R package metadata and collection signals.

v0.1.1
Repository CRAN · License GPL-3 · Lifecycle active · Needs compilation yes
DOI
10.32614/CRAN.package.multilink

Core Signals

Collection signals that should be assessed at first glance are placed first.

0
No core signals to display.

Supported Backends

Backend-related packages detected in DESCRIPTION.

0
No backend package signals.

Quick Facts

Basic metadata condensed into small cards and tokens.

profile
Repository
CRAN
Version
0.1.1
License
GPL-3
Lifecycle
active
Needs compilation
yes
Last observed
2026-05-30
CRAN
cran.r-project.org/package=multilink

Build fields

LinkingTo
2
Rcpp, RcppArmadillo

Package information by collection source

1 source
CRAN
0.1.1
2026-05-30
License
GPL-3
Depends
R (>= 3.5.0)
Imports
igraph, RecordLinkage, Rcpp, utils, mcclust, geosphere, stringr
LinkingTo
Rcpp, RcppArmadillo
Needs compilation
yes
Lifecycle
active
Last observed
2026-05-30 10:45:11

Packages this package depends on

Showing 5 of 9
Package · Type · Spec
geosphere
CRAN · 0.1.1 · 2026-05-30
Imports · geosphere
igraph
CRAN · 0.1.1 · 2026-05-30
Imports · igraph
mcclust
CRAN · 0.1.1 · 2026-05-30
Imports · mcclust
Rcpp
CRAN · 0.1.1 · 2026-05-30
Imports · Rcpp
RecordLinkage
CRAN · 0.1.1 · 2026-05-30
Imports · RecordLinkage
1 / 2

Packages that use this package

Showing 0 of 0
Package · Type · Spec
No dependency edges to display.
1 / 1

Package page

All links
27
Repository
CRAN
Version
0.1.1
Collected
2026-05-29 04:11:04
Package page
https://cran.r-project.org/web/packages/multilink/index.html
DOI
10.32614/CRAN.package.multilink
CRAN checks
https://cran.r-project.org/web/checks/check_results_multilink.html
README
https://cran.r-project.org/web/packages/multilink/readme/README.html
NEWS
https://cran.r-project.org/web/packages/multilink/news/news.html
Reference HTML
https://cran.r-project.org/web/packages/multilink/refman/multilink.html
Reference PDF
https://cran.r-project.org/web/packages/multilink/multilink.pdf
Source package
https://cran.r-project.org/src/contrib/multilink_0.1.1.tar.gz
Archive
https://CRAN.R-project.org/src/contrib/Archive/multilink
Page fields
Author
Serge Aleshin-Guendel [aut, cre]
BugReports
https://github.com/aleshing/multilink/issues
CRAN Checks
multilink results
DOI
10.32614/CRAN.package.multilink
License
GPL-3
LinkingTo
Rcpp, RcppArmadillo
Maintainer
Serge Aleshin-Guendel <saleshinguendel at gmail.com>
Materials
README, NEWS
NeedsCompilation
yes
Old Sources
multilink archive
Package Source
multilink_0.1.1.tar.gz
Published
2023-06-09
Reference Manual
multilink.html, multilink.pdf
URL
https://github.com/aleshing/multilink
Version
0.1.1
Windows Binaries
r-devel: multilink_0.1.1.zip, r-release: multilink_0.1.1.zip, r-oldrel: multilink_0.1.1.zip
macOS Binaries
r-release (arm64): multilink_0.1.1.tgz, r-oldrel (arm64): multilink_0.1.1.tgz, r-release (x86_64): multilink_0.1.1.tgz, r-oldrel (x86_64): multilink_0.1.1.tgz
Page sections 3
Documentation
Heading
Documentation
Links
[{"label":"multilink.html","section":"","type":"","url":"https://cran.r-project.org/web/packages/multilink/refman/multilink.html"},{"label":"multilink.pdf","section":"","type":"","url":"https://cran.r-project.org/web/packages/multilink/multilink.pdf"}]
Text
Reference manual: multilink.html , multilink.pdf
Downloads
Heading
Downloads
Links
[{"label":"multilink_0.1.1.tar.gz","section":"","type":"","url":"https://cran.r-project.org/src/contrib/multilink_0.1.1.tar.gz"},{"label":"multilink_0.1.1.zip","section":"","type":"","url":"https://cran.r-project.org/bin/windows/contrib/4.7/multilink_0.1.1.zip"},{"label":"multilink_0.1.1.zip","section":"","type":"","url":"https://cran.r-project.org/bin/windows/contrib/4.6/multilink_0.1.1.zip"},{"label":"multilink_0.1.1.zip","section":"","type":"","url":"https://cran.r-project.org/bin/windows/contrib/4.5/multilink_0.1.1.zip"},{"label":"multilink_0.1.1.tgz","section":"","type":"","url":"https://cran.r-project.org/bin/macosx/sonoma-arm64/contrib/4.6/multilink_0.1.1.tgz"},{"label":"multilink_0.1.1.tgz","section":"","type":"","url":"https://cran.r-project.org/bin/macosx/big-sur-arm64/contrib/4.5/multilink_0.1.1.tgz"},{"label":"multilink_0.1.1.tgz","section":"","type":"","url":"https://cran.r-project.org/bin/macosx/big-sur-x86_64/contrib/4.6/multilink_0.1.1.tgz"},{"label":"multilink_0.1.1.tgz","section":"","type":"","url":"https://cran.r-project.org/bin/macosx/big-sur-x86_64/contrib/4.5/multilink_0.1.1.tgz"},{"label":"multilink archive","section":"","type":"","url":"https://CRAN.R-project.org/src/contrib/Archive/multilink"}]
Text
Package source: multilink_0.1.1.tar.gz Windows binaries: r-devel: multilink_0.1.1.zip , r-release: multilink_0.1.1.zip , r-oldrel: multilink_0.1.1.zip macOS binaries: r-release (arm64): multilink_0.1.1.tgz , r-oldrel (arm64): multilink_0.1.1.tgz , r-release (x86_64): multilink_0.1.1.tgz , r-oldrel (x86_64): multilink_0.1.1.tgz Old sources: multilink archive
Linking
Heading
Linking
Links
[{"label":"https://CRAN.R-project.org/package=multilink","section":"","type":"","url":"https://CRAN.R-project.org/package=multilink"}]
Text
Please use the canonical form https://CRAN.R-project.org/package=multilink to link to this page.
Materials 2
Documentation 2
Downloads 9
All page links 27

Package document source text

4 artifacts
field
NEWS
CRAN · 0.1.1 · Materials · text/html · 790 · 2026-05-07
Title
NEWS
Label
NEWS
Text content
multilink 0.1.1
Updated package for submission to CRAN.
field
README
CRAN · 0.1.1 · Materials · text/html · 5,505 · 2026-05-07
Title
README
Label
README
Text content
multilink

multilink is an R package which implements methodology presented in the manuscript “Multifile Partitioning for Record Linkage and Duplicate Detection” by Serge Aleshin-Guendel and Mauricio Sadinle, published in the Journal of the American Statistical Association and available on arXiv. It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates.

Installation

You can install the development version of multilink from GitHub with:

install.packages("devtools")
devtools::install_github("aleshing/multilink")
reference_manual_html
Reference manual HTML
CRAN · 0.1.1 · Documentation · text/html · 79,175 · 2026-05-07
Title
Help for package multilink
Label
Reference manual HTML
Text content
Help for package multilink

Contents: create_comparison_data, dup_data, dup_data_small, find_bayes_estimate, gibbs_sampler, initialize_partition, multilink, no_dup_data, no_dup_data_small, reduce_comparison_data, relabel_bayes_estimate, specify_prior

Title: Multifile Record Linkage and Duplicate Detection
Version: 0.1.1
Description: Implementation of the methodology of Aleshin-Guendel & Sadinle (2022) <doi:10.1080/01621459.2021.2013242>. It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates.
Depends: R (≥ 3.5.0)
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.2
URL: https://github.com/aleshing/multilink
BugReports: https://github.com/aleshing/multilink/issues
Imports: igraph, RecordLinkage, Rcpp, utils, mcclust, geosphere, stringr
LinkingTo: Rcpp, RcppArmadillo
NeedsCompilation: yes
Packaged: 2023-06-08 20:25:20 UTC; sergealeshin-guendel
Author: Serge Aleshin-Guendel [aut, cre]
Maintainer: Serge Aleshin-Guendel <saleshinguendel@gmail.com>
Repository: CRAN
Date/Publication: 2023-06-09 14:20:07 UTC
section
multilink.pdf
CRAN · 0.1.1 · Documentation · application/pdf · 192,595 · 2026-05-07
Title
multilink.pdf
Label
multilink.pdf

Reference for multilink (0.1.1)

12 topics
create_comparison_data
Create Comparison Data
CRAN · 0.1.1 · multilink/man/create_comparison_data.Rd · 2026-05-07

Create comparison data for all pairs of records, except for those records in files which are assumed to have no duplicates.

Aliases
create_comparison_data
Usage
create_comparison_data( records, types, breaks, file_sizes, duplicates, verbose = TRUE )
Arguments
records
A data.frame containing the records to be linked, where each column of records is a field to be compared. If there are multiple files, records should be obtained by stacking the files on top of each other so that records[1:file_sizes[1], ] contains the records for file 1, records[(file_sizes[1] + 1):(file_sizes[1] + file_sizes[2]), ] contains the records for file 2, and so on. Missing values should be coded as NA.
types
A character vector, indicating the comparison to be used for each field (i.e. each column of records). The options are: "bi" for binary comparisons, "nu" for numeric comparisons (absolute difference), "lv" for string comparisons (normalized Levenshtein distance), "lv_sep" for string comparisons (normalized Levenshtein distance) where each string may contain multiple spellings separated by the "|" character. We assume that fields using options "bi", "lv", and "lv_sep" are of class character, and fields using the "nu" option are of class numeric. For fields using the "lv_sep" option, for each record pair the normalized Levenshtein distance is computed between each possible spelling, and the minimum normalized Levenshtein distance between spellings is then used as the comparison for that record pair.
breaks
A list, the same length as types, indicating the break points used to compute disagreement levels for each field's comparisons. If types[f]="bi", breaks[[f]] is ignored (and thus can be set to NA). See Details for more information on specifying this argument.
file_sizes
A numeric vector indicating the size of each file.
duplicates
A numeric vector indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.
verbose
A logical indicator of whether progress messages should be printed (default TRUE).
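The layout that records, file_sizes, and duplicates are expected to follow can be sketched in base R. This is a minimal illustration with made-up data: the data frames file1 to file3 and their column names are hypothetical, and nothing here calls multilink itself.

```r
# Three hypothetical files sharing the same columns (names are illustrative).
file1 <- data.frame(gname = c("ana", "ben"), age = c(31, 45),
                    stringsAsFactors = FALSE)
file2 <- data.frame(gname = c("anna", "carl", "dee"), age = c(31, 52, NA),
                    stringsAsFactors = FALSE)
file3 <- data.frame(gname = "ben", age = 45, stringsAsFactors = FALSE)

files <- list(file1, file2, file3)

# Stack the files on top of each other, in file order.
records <- do.call(rbind, files)

# One size per file, in the same order as the stacking.
file_sizes <- vapply(files, nrow, integer(1))

# Row ranges implied by file_sizes: file k occupies rows starts[k]:ends[k].
ends <- cumsum(file_sizes)
starts <- ends - file_sizes + 1

# 0 = file assumed to have no duplicates, 1 = file assumed to have duplicates.
# The largest no-duplicate file (file1 here) is placed first, as recommended.
duplicates <- c(0, 1, 0)
```

Missing values stay coded as NA (see the age of the last record in file2), and records[starts[2]:ends[2], ] recovers the rows of the second file.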
Details
The purpose of this function is to construct comparison vectors for each pair of records. In order to construct these vectors, one needs to specify the types and breaks arguments. The types argument specifies how each field should be compared, and the breaks argument specifies how to discretize these comparisons.
Currently, the types argument supports three types of field comparisons: binary, absolute difference, and the normalized Levenshtein distance. Please contact the package maintainer if you need a new type of comparison to be supported.
The breaks argument should be a list, with one element for each field. If a field is being compared with a binary comparison, i.e. types[f]="bi", then the corresponding element of breaks should be NA, i.e. breaks[[f]]=NA. If a field is being compared with a numeric or string comparison, then the corresponding element of breaks should be a vector of cut points used to discretize the comparisons. To give more detail, suppose you pass in cut points breaks[[f]]=c(cut_1, ..., cut_L). These cut points discretize the range of the comparisons into L+1 intervals: I_0=(-∞, cut_1], I_1=(cut_1, cut_2], ..., I_L=(cut_L, ∞). The raw comparisons, which lie in [0, ∞) for numeric comparisons and [0, 1] for string comparisons, are then replaced with indicators of which interval the comparisons lie in. The interval I_0 corresponds to the lowest level of disagreement for a comparison, while the interval I_L corresponds to the highest level of disagreement for a comparison.
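The discretization described above can be sketched in base R. This is an illustration only, not the package's implementation. The helper names norm_lv, norm_lv_sep, and disagreement_level are hypothetical, and dividing the edit distance by the length of the longer string is an assumption about how the Levenshtein distance is normalized.

```r
# Normalized Levenshtein distance between two single strings.
# Assumption: normalize by the length of the longer string.
norm_lv <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(0)
  as.numeric(adist(a, b)) / longest
}

# "lv_sep"-style comparison: minimum distance over all pairs of spellings,
# where alternative spellings are separated by "|".
norm_lv_sep <- function(a, b) {
  spell_a <- strsplit(a, "|", fixed = TRUE)[[1]]
  spell_b <- strsplit(b, "|", fixed = TRUE)[[1]]
  min(outer(spell_a, spell_b, Vectorize(norm_lv)))
}

# Map raw comparisons to disagreement levels 0..L using the intervals
# (-Inf, cut_1], (cut_1, cut_2], ..., (cut_L, Inf).
disagreement_level <- function(x, cuts) {
  findInterval(x, cuts, left.open = TRUE)
}

cuts <- c(0, 0.25, 0.5)
raw <- c(0, 0.1, 0.25, 0.3, 0.5, 0.9)
lvl <- disagreement_level(raw, cuts)

# One TRUE/FALSE indicator column per level, L + 1 columns in total.
indicators <- outer(lvl, 0:length(cuts), "==")
```

With left.open = TRUE, findInterval treats every interval as open on the left and closed on the right, which matches the I_0, ..., I_L definition: an exact match (raw comparison 0) falls in level 0, and a value equal to a cut point stays in the lower level.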
Value
A list containing:
record_pairs: A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i.
comparisons: A logical matrix, where each row contains the comparisons for the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.
K: The number of files, assumed to be of class numeric.
file_sizes: A numeric vector of length K, indicating the size of each file.
duplicates: A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.
field_levels: A numeric vector indicating the number of disagreement levels for each field.
file_labels: An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.
fp_matrix: An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
rp_to_fp: A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.
ab: An integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
file_sizes_not_included: A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.
ab_not_included: A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.
labels: NA. This element is not NA when reduce_comparison_data is used.
pairs_to_keep: NA. This element is not NA when reduce_comparison_data is used.
cc: 0. This element is non-zero when reduce_comparison_data is used.
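To make the shapes of record_pairs, file_labels, and fp_matrix concrete, here is a small base R sketch on toy inputs. It only mirrors the properties stated above (pairs with i < j, sorted by first then second column, symmetric file-pair labels). The specific numbering of file pairs is an assumption, and the code is not taken from the package.

```r
# Toy setup: two files of two records each; only file 2 may have duplicates.
file_sizes <- c(2, 2)
duplicates <- c(0, 1)

K <- length(file_sizes)
n <- sum(file_sizes)

# file_labels[i] is the file that record i belongs to.
file_labels <- rep(seq_len(K), times = file_sizes)

# All pairs (i, j) with i < j; combn() emits them sorted by i, then j.
all_pairs <- t(combn(n, 2))

# Drop pairs that fall inside a single file assumed to have no duplicates.
f1 <- file_labels[all_pairs[, 1]]
f2 <- file_labels[all_pairs[, 2]]
same_clean_file <- (f1 == f2) & (duplicates[f1] == 0)
record_pairs <- all_pairs[!same_clean_file, , drop = FALSE]

# Symmetric labelling of unordered file pairs: K * (K + 1) / 2 labels.
# The numbering order used here is an assumption for illustration.
fp_matrix <- matrix(0L, K, K)
fp_matrix[upper.tri(fp_matrix, diag = TRUE)] <- seq_len(K * (K + 1) / 2)
fp_matrix[lower.tri(fp_matrix)] <- t(fp_matrix)[lower.tri(fp_matrix)]
```

Here the pair (1, 2) is dropped because both records sit in file 1, which is assumed duplicate-free, while (3, 4) is kept because file 2 may contain duplicates.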
Examples
## Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

## Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: https://doi.org/10.1080/01621459.2021.2013242; arXiv: https://arxiv.org/abs/2110.03839
dup_data
Duplicate Dataset
CRAN · 0.1.1 · data · multilink/man/dup_data.Rd · 2026-05-07

A dataset containing 867 simulated records from 3 files, each of which may contain duplicate records.

Aliases
dup_data
Keywords
datasets
Usage
dup_data
Format
A list with three elements:
records: A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.
file_sizes: The size of each file.
IDs: The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
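The partition encoding used by IDs can be explored with a few lines of base R. The sketch below uses a made-up file_sizes and IDs vector rather than the packaged data, so the numbers are purely illustrative.

```r
# Hypothetical toy data: 3 files of sizes 2, 3, 2 and made-up entity labels.
file_sizes <- c(2, 3, 2)
IDs <- c(10, 11, 10, 12, 12, 11, 13)

# Recover which file each record came from.
file_labels <- rep(seq_along(file_sizes), times = file_sizes)

# Labels are arbitrary: only which records share a label matters.
n_entities <- length(unique(IDs))
cluster_sizes <- table(IDs)

# A within-file duplicate is a label that repeats inside the same file.
dup_within_file <- duplicated(data.frame(file = file_labels, id = IDs))
files_with_duplicates <- unique(file_labels[dup_within_file])
```

In this toy partition there are four entities, and only file 2 contains a duplicate (label 12 appears twice in it). Labels shared across different files, such as 10 and 11, are links rather than duplicates.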
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.
Examples
data(dup_data)

# There are 500 entities represented in the records
length(unique(dup_data$IDs))
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: https://doi.org/10.1080/01621459.2021.2013242; arXiv: https://arxiv.org/abs/2110.03839
dup_data_small
Small Duplicate Dataset
CRAN · 0.1.1 · data · multilink/man/dup_data_small.Rd · 2026-05-07

A dataset containing 96 simulated records from 3 files, each of which may contain duplicate records, subset from dup_data.

Aliases
dup_data_small
Keywords
datasets
Usage
dup_data_small
Format
A list with three elements:
records: A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.
file_sizes: The size of each file.
IDs: The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.
Examples
data(dup_data_small)

# There are 96 entities represented in the records
length(unique(dup_data_small$IDs))
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: https://doi.org/10.1080/01621459.2021.2013242; arXiv: https://arxiv.org/abs/2110.03839
find_bayes_estimate
Find the Bayes Estimate of a Partition
CRAN · 0.1.1 · multilink/man/find_bayes_estimate.Rd · 2026-05-07

Find the (approximate) Bayes estimate of a partition based on MCMC samples of the partition and a specified loss function.

Aliases
find_bayes_estimate
Usage
find_bayes_estimate( partitions, burn_in, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = nrow(partitions), verbose = TRUE )
Arguments
partitions
Posterior samples of the partition, where each column is one sample and the partition is represented as an integer vector of arbitrary labels, as produced by the output of a call to gibbs_sampler.
burn_in
The number of samples to discard for burn in.
L_FNM
Positive loss for a false non-match. Default is 1.
L_FM1
Positive loss for a type 1 false match. Default is 1.
L_FM2
Positive loss for a type 2 false match. Default is 2.
L_A
Positive loss for abstaining from making a decision for a record. Default is Inf, i.e. decisions are made for all records.
max_cc_size
The maximum allowable connected component size over which the posterior expected loss is minimized. Default is nrow(partitions), i.e. no approximation is used. When is.infinite(L_A), we recommend setting this argument to 50, then increasing based on a computational budget. When !is.infinite(L_A), we recommend setting this argument to 10-12, then increasing based on a computational budget (although an increase of 1 in this argument can in the worst case lead to a doubling in computation time).
verbose
A logical indicator of whether progress messages should be printed (default TRUE).
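To make the layout of partitions and the role of burn_in concrete, the sketch below builds a small made-up matrix of posterior samples and computes how often two records share a cluster after burn-in. This is a simple summary for intuition only. It is not the loss-based Bayes estimate that find_bayes_estimate computes, and co_cluster_prob is a hypothetical helper.

```r
# Hypothetical posterior samples: 4 records (rows), 5 samples (columns).
partitions <- cbind(c(1, 2, 3, 4),
                    c(1, 1, 3, 4),
                    c(2, 2, 3, 4),
                    c(5, 5, 3, 3),
                    c(1, 1, 2, 3))
burn_in <- 1

# Discard the first burn_in columns (this form also works for burn_in = 0).
kept <- partitions[, seq_len(ncol(partitions)) > burn_in, drop = FALSE]

# Labels are arbitrary, so only equality of labels within a sample matters.
co_cluster_prob <- function(i, j, samples) {
  mean(samples[i, ] == samples[j, ])
}

p12 <- co_cluster_prob(1, 2, kept)  # records 1 and 2
p34 <- co_cluster_prob(3, 4, kept)  # records 3 and 4
```

Records 1 and 2 share a label in every retained sample, while records 3 and 4 do so in only one of the four, even though the label values themselves change from sample to sample.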
Value
A vector, the same length as a column of partitions, containing the (approximate) Bayes estimate of the partition. If !is.infinite(L_A) the output may be a partial estimate. A positive number l in index i indicates that record i is in the same cluster as every other record j with l in index j. A value of -1 in index i indicates that the Bayes estimate abstained from making a decision for record i.
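The returned vector can be unpacked into clusters and abstentions with base R. The estimate below is a made-up example of a partial estimate, not output from the package.

```r
# Hypothetical partial estimate for 6 records; -1 marks an abstention.
estimate <- c(3, 3, -1, 7, 3, -1)

abstained <- which(estimate == -1)
decided <- which(estimate != -1)

# Group the decided record indices by their cluster label.
clusters <- unname(split(decided, estimate[decided]))

# Linked record pairs implied by clusters with more than one record.
linked_pairs <- do.call(rbind, lapply(clusters, function(idx) {
  if (length(idx) < 2) return(NULL)  # singleton clusters imply no links
  t(combn(idx, 2))
}))
```

Here records 1, 2, and 5 form one cluster, record 4 is a singleton, and records 3 and 6 are left undecided. The length check also guards against combn() interpreting a single index as the range 1:idx.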
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
  Z_init = Z_init, seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242 (arXiv: https://arxiv.org/abs/2110.03839)
gibbs_sampler
Gibbs Sampler for Posterior Inference
CRAN · 0.1.1 · multilink/man/gibbs_sampler.Rd · 2026-05-07

Run a Gibbs sampler to explore the posterior distribution of partitions of records.

Aliases
gibbs_sampler
Usage
gibbs_sampler(
  comparison_list,
  prior_list,
  n_iter = 2000,
  Z_init = 1:sum(comparison_list$file_sizes),
  seed = 70,
  single_likelihood = FALSE,
  chaperones_info = NA,
  verbose = TRUE
)
Arguments
comparison_list
The output from a call to create_comparison_data or reduce_comparison_data.
prior_list
The output from a call to specify_prior.
n_iter
The number of iterations of the Gibbs sampler to run.
Z_init
Initialization of the partition of records, represented as an integer vector of arbitrary labels of length sum(comparison_list$file_sizes). The default initialization places each record in its own cluster. See initialize_partition for an alternative initialization when there are no duplicates in each file.
seed
The seed to use while running the Gibbs sampler.
single_likelihood
A logical indicator of whether to use a single likelihood for comparisons for all file pairs, or whether to use a separate likelihood for comparisons for each file pair. When single_likelihood=TRUE, a single likelihood is used, and the prior hyperparameters for m and u from the first file pair are used. We do not recommend using a single likelihood in general.
chaperones_info
If chaperones_info is set to NA, then Gibbs updates to the partition are used during the Gibbs sampler, as described in Aleshin-Guendel & Sadinle (2022). Else, Chaperones updates, as described in Miller et al. (2015) and Betancourt et al. (2016), are used and chaperones_info should be a list with five elements controlling Chaperones updates to the partition during the Gibbs sampler: chap_type, num_chap_iter, nonuniform_chap_type, extra_gibbs, num_restrict. chap_type is 0 if using a uniform Chaperones distribution, and 1 if using a nonuniform Chaperones distribution. num_chap_iter is the number of Chaperones updates to the partition that are made during each iteration of the Gibbs sampler. When using a nonuniform Chaperones distribution, nonuniform_chap_type is 0 if using the exact version, or 1 if using the partial version. extra_gibbs is a logical indicator of whether a Gibbs update to the partition should be done after the Chaperones updates, at each iteration of the Gibbs sampler. num_restrict is the number of restricted Gibbs steps to take during each Chaperones update to the partition.
verbose
A logical indicator of whether progress messages should be printed (default TRUE).
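The five elements of chaperones_info described above can be collected in an ordinary named list. The following is only a sketch: the element names are taken from the argument description, but every value is an arbitrary placeholder chosen for illustration rather than a recommended setting, and whether the sampler matches elements by name or by position is not stated here, so the sketch keeps both the names and the documented order.

```r
# Illustrative chaperones_info list; all values are placeholder assumptions.
chaperones_info <- list(
  chap_type = 0,            # 0 = uniform Chaperones distribution, 1 = nonuniform
  num_chap_iter = 10,       # Chaperones updates per Gibbs iteration (arbitrary)
  nonuniform_chap_type = 0, # 0 = exact, 1 = partial (matters when chap_type = 1)
  extra_gibbs = TRUE,       # also do a Gibbs update after the Chaperones updates
  num_restrict = 10         # restricted Gibbs steps per Chaperones update (arbitrary)
)

# Structural check before handing the list to the sampler
expected_names <- c("chap_type", "num_chap_iter", "nonuniform_chap_type",
                    "extra_gibbs", "num_restrict")
stopifnot(identical(names(chaperones_info), expected_names))
```

A list like this would be supplied as the chaperones_info argument of gibbs_sampler; leaving the argument at NA keeps the default Gibbs updates.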
Details
Given the prior specified using specify_prior, this function runs a Gibbs sampler to explore the posterior distribution of partitions of records, conditional on the comparison data created using create_comparison_data or reduce_comparison_data.
Value
A list containing:
m
Posterior samples of the m parameters. Each column is one sample.
u
Posterior samples of the u parameters. Each column is one sample.
partitions
Posterior samples of the partition. Each column is one sample. Note that the partition is represented as an integer vector of arbitrary labels of length sum(comparison_list$file_sizes).
contingency_tables
Posterior samples of the overlap table. Each column is one sample. This incorporates counts of records determined not to be candidate matches to any other records using reduce_comparison_data.
cluster_sizes
Posterior samples of the size of each cluster (associated with an arbitrary label from 1 to sum(comparison_list$file_sizes)). Each column is one sample.
sampling_time
The time in seconds it took to run the sampler.
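Because each posterior sample of the partition is a vector of arbitrary labels, only equality of labels carries meaning. The sketch below uses a small made-up matrix as a stand-in for the partitions element (one column per sample) and shows one way to count clusters per sample and to estimate how often two records are clustered together. The toy values and the burn_in choice are assumptions for illustration.

```r
# Toy stand-in for the partitions element: 5 records (rows), 4 posterior
# samples (columns). The values are made up for illustration.
partitions <- cbind(
  c(1, 2, 3, 4, 5),  # every record in its own cluster
  c(1, 1, 3, 4, 5),  # records 1 and 2 linked
  c(2, 2, 3, 4, 4),  # labels are arbitrary: only equality matters
  c(7, 7, 3, 9, 9)   # same partition as the previous column
)

burn_in <- 1  # hypothetical number of initial samples to discard
kept <- partitions[, (burn_in + 1):ncol(partitions), drop = FALSE]

# Number of clusters in each retained sample
n_clusters <- apply(kept, 2, function(z) length(unique(z)))

# Co-clustering frequency: share of retained samples in which records
# i and j carry the same label
co_cluster <- matrix(0, nrow(kept), nrow(kept))
for (s in seq_len(ncol(kept))) {
  co_cluster <- co_cluster + outer(kept[, s], kept[, s], "==")
}
co_cluster <- co_cluster / ncol(kept)
```

Summaries like these are only diagnostics; a point estimate of the partition is obtained with find_bayes_estimate.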
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
  Z_init = Z_init, seed = 42)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242 (arXiv: https://arxiv.org/abs/2110.03839)
Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, & Rebecca C. Steorts (2015). Microclustering: When the cluster sizes grow sublinearly with the size of the data set. NeurIPS Bayesian Nonparametrics: The Next Generation Workshop Series. arXiv: https://arxiv.org/abs/1512.00792
Brenda Betancourt, Giacomo Zanella, Jeffrey Miller, Hanna Wallach, Abbas Zaidi, & Rebecca C. Steorts (2016). Flexible Models for Microclustering with Application to Entity Resolution. Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper/2016/hash/670e8a43b246801ca1eaca97b3e19189-Abstract.html (arXiv: https://arxiv.org/abs/1610.09780)
initialize_partition
Initialize the Partition
CRAN · 0.1.1 · multilink/man/initialize_partition.Rd · 2026-05-07

Generate an initialization for the partition in the case when it is assumed there are no duplicates in all files (so that the partition is a matching).

Aliases
initialize_partition
Usage
initialize_partition(comparison_list, pairs_to_keep, seed = NA)
Arguments
comparison_list
The output from a call to create_comparison_data or reduce_comparison_data. Note that in order to correctly specify the initialization, if reduce_comparison_data is used to reduce the number of record pairs that are candidate matches, then the output of reduce_comparison_data (not create_comparison_data) should be used for this argument.
pairs_to_keep
A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs are potential matches in the initialization.
seed
The seed to use to generate the initialization.
Details
When it is assumed that there are no duplicates in all files, and reduce_comparison_data is not used to reduce the number of potential matches, the Gibbs sampler used for posterior inference may experience slow mixing when using an initialization for the partition where each record is in its own cluster (the default option for the Gibbs sampler). The purpose of this function is to provide an alternative initialization scheme. To use this initialization scheme, the user passes in a logical vector that indicates which record pairs are potential matches according to an indexing method (as in reduce_comparison_data). Note that this indexing is only used to generate the initialization; it is not used for inference.
The initialization scheme first finds the transitive closure of the potential matches, which partitions the records into blocks. Within each block of records, the scheme randomly selects a record from each file, and these selected records are then placed in the same cluster for the partition initialization. All other records are placed in their own clusters.
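The scheme described above can be sketched in base R on a toy example. This is not the package's implementation: the data, the union-find helper find_root, and the choice of the smallest selected record index as the shared label are all assumptions made for illustration.

```r
# Toy setup: 6 records from 3 files (sizes 2, 2, 2); all values are made up.
file_sizes <- c(2L, 2L, 2L)
file_labels <- rep(seq_along(file_sizes), times = file_sizes)  # file of each record
n <- sum(file_sizes)

# Potential matches flagged by a hypothetical indexing step
candidate_pairs <- rbind(c(1L, 3L), c(3L, 5L), c(2L, 3L))

# Step 1: transitive closure via a simple union-find over record indices
parent <- seq_len(n)
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}
for (r in seq_len(nrow(candidate_pairs))) {
  a <- find_root(candidate_pairs[r, 1])
  b <- find_root(candidate_pairs[r, 2])
  if (a != b) parent[max(a, b)] <- min(a, b)
}
block <- vapply(seq_len(n), find_root, integer(1))

# Step 2: within each block, pick one record per file at random and give
# the picked records a common label; everything else stays a singleton
set.seed(42)  # arbitrary seed so the sketch is reproducible
Z_init <- seq_len(n)
for (b in unique(block)) {
  members <- which(block == b)
  chosen <- vapply(split(members, file_labels[members]),
                   function(idx) idx[sample.int(length(idx), 1)],
                   integer(1))
  Z_init[chosen] <- min(chosen)
}
```

Here records 1, 2, 3 and 5 form one block, so records 3 and 5 always share a cluster with exactly one of records 1 and 2, and no cluster ever contains two records from the same file.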
Value
an integer vector of arbitrary labels of length sum(comparison_list$file_sizes), giving an initialization for the partition.
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Find initialization for the matching
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242 (arXiv: https://arxiv.org/abs/2110.03839)
no_dup_data
No Duplicate Dataset
CRAN · 0.1.1 · data · multilink/man/no_dup_data.Rd · 2026-05-07

A dataset containing 730 simulated records from 3 files with no duplicate records in each file.

Aliases
no_dup_data
Keywords
datasets
Usage
no_dup_data
Format
A list with three elements:
records
A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.
file_sizes
The size of each file.
IDs
The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.
Examples
data(no_dup_data)

# There are 500 entities represented in the records
length(unique(no_dup_data$IDs))
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242 (arXiv: https://arxiv.org/abs/2110.03839)
no_dup_data_small
Small No Duplicate Dataset
CRAN · 0.1.1 · data · multilink/man/no_dup_data_small.Rd · 2026-05-07

A dataset containing 71 simulated records from 3 files with no duplicate records in each file, subset from no_dup_data.

Aliases
no_dup_data_small
Keywords
datasets
Usage
no_dup_data_small
Format
A list with three elements:
records
A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.
file_sizes
The size of each file.
IDs
The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.
Examples
data(no_dup_data_small)

# There are 71 entities represented in the records
length(unique(no_dup_data_small$IDs))
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242 (arXiv: https://arxiv.org/abs/2110.03839)
reduce_comparison_data
Reduce Comparison Data Size
CRAN · 0.1.1 · multilink/man/reduce_comparison_data.Rd · 2026-05-07

Use indexing to reduce the number of record pairs that are potential matches.

Aliases
reduce_comparison_data
Usage
reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)
Arguments
comparison_list
The output of a call to create_comparison_data.
pairs_to_keep
A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs should be kept as potential matches. These potential matches do not have to be transitive (see the argument cc).
cc
A numeric indicator of whether to find the transitive closure of pairs_to_keep, and use these potential matches instead of just those from pairs_to_keep. cc should be 1 if the transitive closure is being used, and cc should be 0 if the transitive closure is not being used. We recommend setting cc to 1.
Details
When using comparison-based record linkage methods, scalability is a concern, as the number of record pairs is quadratic in the number of records. To address this, it is common to declare certain record pairs to not be potential matches a priori, using indexing methods. The user is free to index using any method they like, as long as they can produce a logical vector that indicates which record pairs are potential matches according to their indexing method.
If the user-chosen indexing method does not output potential matches that are transitive, we recommend setting the cc argument to 1. By transitive we mean that, for any three records i, j, and k, if i and j are potential matches, and j and k are potential matches, then i and k are potential matches. Non-transitive indexing schemes can lead to poor mixing of the Gibbs sampler used for posterior inference, and suggest that the indexing method used may have been too stringent.
If indexing is used, it may be the case that some records are declared to not be potential matches to any other records. In this case, the indexing method has made the decision that these records have no matches, and thus we can remove them from the data set and relabel the remaining records; see the documentation for labels for information on how to go between the original labeling and the new labeling.
If indexing is used, comparisons for record pairs that are not potential matches are still used during inference, where they inform the distribution of comparisons for non-matches.
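To make the transitivity requirement concrete, the sketch below takes a non-transitive set of kept pairs on a toy data set and closes it under transitivity by labelling connected components. It is written in base R for illustration only and is not the package's implementation; the toy pairs and the label-propagation loop are assumptions.

```r
# Toy data: 5 records, hence choose(5, 2) = 10 record pairs. Values are made up.
n <- 5L
record_pairs <- t(combn(n, 2))  # one row per pair, first column < second column

# Hypothetical indexing result: keep only the pairs (1, 2) and (2, 3).
# This is not transitive, because (1, 3) is missing.
pairs_to_keep <- (record_pairs[, 1] == 1 & record_pairs[, 2] == 2) |
  (record_pairs[, 1] == 2 & record_pairs[, 2] == 3)

# Label connected components by propagating the smallest label along kept
# pairs until nothing changes
component <- seq_len(n)
repeat {
  old <- component
  for (r in which(pairs_to_keep)) {
    i <- record_pairs[r, 1]
    j <- record_pairs[r, 2]
    component[c(i, j)] <- min(component[i], component[j])
  }
  if (identical(old, component)) break
}

# Transitive closure: keep every pair whose two records share a component
closed_pairs <- component[record_pairs[, 1]] == component[record_pairs[, 2]]

# Records that are not potential matches to any other record
isolated <- setdiff(seq_len(n), unique(as.vector(record_pairs[closed_pairs, ])))
```

In this toy case the closure adds the pair (1, 3), and records 4 and 5 end up with no potential matches, which is the situation in which the remaining records are relabeled.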
Value
A list containing:
record_pairs
A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels).
comparisons
A logical matrix, where each row contains the comparisons between the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.
K
The number of files, assumed to be of class numeric.
file_sizes
A numeric vector of length K, indicating the size of each file. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes now represents the sizes of each file after removing such records.
duplicates
A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates.
field_levels
A numeric vector indicating the number of disagreement levels for each field.
file_labels
An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.
fp_matrix
An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
rp_to_fp
A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.
ab
An integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
file_sizes_not_included
If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes_not_included indicates, for each file, the number of such records that were removed.
ab_not_included
For record pairs not included according to pairs_to_keep, this is an integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
labels
If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled. labels provides a dictionary that indicates, for each of the new labels, which record in the original labeling the new label corresponds to. In particular, the first column indicates the record in the original labeling, and the second column indicates the new labeling.
pairs_to_keep
A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs were kept as potential matches. This may not be the same as the input pairs_to_keep if cc was set to 1.
cc
A numeric indicator of whether the connected components of the potential matches are closed under transitivity.
Examples
# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242 (arXiv: https://arxiv.org/abs/2110.03839)
relabel_bayes_estimate
Relabel the Bayes Estimate of a Partition
CRAN · 0.1.1 · multilink/man/relabel_bayes_estimate.Rd · 2026-05-07

Relabel the Bayes estimate of a partition, for use after using indexing to reduce the number of record pairs that are potential matches.

Aliases
relabel_bayes_estimate
Usage
relabel_bayes_estimate(reduced_comparison_list, bayes_estimate)
Arguments
reduced_comparison_list
The output from a call to reduce_comparison_data.
bayes_estimate
The output from a call to find_bayes_estimate.
Details
When the function reduce_comparison_data is used to reduce the number of record pairs that are potential matches, it may be the case that some records are declared to not be potential matches to any other records. In this case, the indexing method has made the decision that these records have no matches, and thus we can remove them from the data set and relabel the remaining records; see the documentation for labels in reduce_comparison_data for information on how to go between the original labeling and the new labeling. The purpose of this function is to relabel the output of find_bayes_estimate when the function reduce_comparison_data is used, so that the user doesn't have to do this relabeling themselves.
Value
A data.frame, with as many rows as sum(reduced_comparison_list$file_sizes + reduced_comparison_list$file_sizes_not_included), i.e. the number of records originally input to create_comparison_data, before indexing occurred. This data.frame has two columns, "original_labels" and "link_id". Given row i of records originally input to create_comparison_data, the linkage id according to bayes_estimate is given by the ith row of the link_id column. See the documentation for find_bayes_estimate for information on how to interpret this linkage id.
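The mapping back to the original labeling can be pictured with a small dictionary in the layout documented for labels (original label in the first column, new label in the second). The sketch below is a hand-rolled illustration on made-up values, not the body of relabel_bayes_estimate. In particular, records dropped by indexing are left as NA here purely as a neutral placeholder; how the package fills link_id for such records is not described in this section.

```r
# Toy dictionary: column 1 is the record's original label, column 2 its
# label after indexing. Values are made up for illustration.
labels <- cbind(original = c(1L, 2L, 4L, 6L), new = 1:4)
n_original <- 6L  # records 3 and 5 were dropped by the hypothetical indexing

# Hypothetical cluster ids for the 4 retained records, in the new labeling
estimate_new <- c(10L, 10L, 11L, 12L)

# Place each id at the position of the record's original label
link_id <- rep(NA_integer_, n_original)
link_id[labels[, "original"]] <- estimate_new[labels[, "new"]]

relabeled <- data.frame(original_labels = seq_len(n_original),
                        link_id = link_id)
```

The resulting data.frame has one row per originally supplied record, which is what allows it to be bound column-wise to the original records.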
Examples
# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  full_estimate)
partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  partial_estimate)

# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
  full_estimate_id = full_estimate_relabel$link_id,
  partial_estimate_id = partial_estimate_relabel$link_id)
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242 (arXiv: https://arxiv.org/abs/2110.03839)
specify_prior
Specify the Prior Distributions
CRAN · 0.1.1 · multilink/man/specify_prior.Rd · 2026-05-07

Specify the prior distributions for the m and u parameters of the models for comparison data among matches and non-matches, and the partition.

Aliases
specify_prior
Usage
specify_prior(
  comparison_list,
  mus = NA,
  nus = NA,
  flat = 0,
  alphas = NA,
  dup_upper_bound = NA,
  dup_count_prior_family = NA,
  dup_count_prior_pars = NA,
  n_prior_family = NA,
  n_prior_pars = NA
)
Arguments
comparison_list
The output from a call to create_comparison_data or reduce_comparison_data. Note that in order to correctly specify the prior, if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used for this argument.
mus, nus
The hyperparameters of the Dirichlet priors for the m and u parameters for the comparisons among matches and non-matches, respectively. These are positive numeric vectors which have length equal to the number of columns of comparison_list$comparisons times the number of file pairs (comparison_list$K * (comparison_list$K + 1) / 2). If set to NA, flat priors are used. We recommend using flat priors for m and u.
flat
A numeric indicator of whether a flat prior for partitions should be used. flat should be 1 if a flat prior is used, and flat should be 0 if a structured prior is used. If a flat prior is used, the remaining arguments should be set to NA. Otherwise, the remaining arguments should be specified. We do not recommend using a flat prior for partitions in general.
alphas
The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K - 1. The indexing of these hyperparameters is based on the comparison_list$K-bit binary representation of the inclusion patterns of the overlap table. To give a few examples, suppose comparison_list$K is 3. 1 in 3-bit binary is 001, so alphas[1] is the hyperparameter for the 001 cell of the overlap table, representing clusters containing only records from the third file. 2 in 3-bit binary is 010, so alphas[2] is the hyperparameter for the 010 cell of the overlap table, representing clusters containing only records from the second file. 3 in 3-bit binary is 011, so alphas[3] is the hyperparameter for the 011 cell of the overlap table, representing clusters containing only records from the second and third files. If set to NA, the hyperparameters will all be set to 1.
dup_upper_bound
A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file. If set to NA, the upper bound for file k will be set to 1 if no duplicates are allowed for that file, or comparison_list$file_sizes[k] if duplicates are allowed for that file.
dup_count_prior_family
A character vector indicating the prior distribution family used for the number of duplicates in each cluster, for each file. Currently the only option is "Poisson" for a Poisson prior, truncated to lie between 1 and dup_upper_bound[k]. The mean parameter of the Poisson distribution is specified using the dup_count_prior_pars argument. If set to NA, a Poisson prior with mean 1 will be used.
dup_count_prior_pars
A list containing the parameters for the prior distribution for the number of duplicates in each cluster, for each file. For file k, when dup_count_prior_family[k]="Poisson", dup_count_prior_pars[[k]] is a positive constant representing the mean of the Poisson prior.
n_prior_family
A character indicating the prior distribution family used for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently there are two options: "uniform" for a uniform prior for n, i.e. p(n) ∝ 1, and "scale" for a scale prior for n, i.e. p(n) ∝ 1/n. If set to NA, a uniform prior will be used.
n_prior_pars
Currently set to NA. When more prior distribution families for n are implemented, this will be a vector of parameters for those priors.
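The binary indexing of alphas described above can be tabulated for K = 3 with a few lines of base R. This is only an illustrative sketch; the object names pattern_bits and overlap_cells are made up, and nothing here calls the package.

```r
K <- 3  # number of files, as in the example given for alphas

# For each index 1, ..., 2^K - 1, recover the K-bit inclusion pattern.
# intToBits() returns the least significant bit first, so the first K bits
# are reversed to read the pattern from file 1 (left) to file K (right).
pattern_bits <- t(sapply(seq_len(2^K - 1), function(i) {
  rev(as.integer(intToBits(i))[seq_len(K)])
}))

overlap_cells <- data.frame(
  alpha_index = seq_len(2^K - 1),
  pattern = apply(pattern_bits, 1, paste, collapse = ""),
  files = apply(pattern_bits, 1,
                function(b) paste(which(b == 1), collapse = ","))
)
print(overlap_cells)
```

Reading the table row by row reproduces the examples in the text: index 1 is pattern 001 (third file only), index 2 is 010 (second file only), and index 3 is 011 (second and third files).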
Details
The purpose of this function is to specify prior distributions for all parameters of the model. Please note that if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used as input.
For the hyperparameters of the Dirichlet priors for the m and u parameters for the comparisons among matches and non-matches, respectively, we recommend using a flat prior. This is accomplished by setting mus=NA and nus=NA. Informative prior specifications are possible, but in practice they will be overwhelmed by the large number of comparisons.
For the prior for partitions, we do not recommend using a flat prior. Instead we recommend using our structured prior for partitions. By setting flat=0 and the remaining arguments to NA, one obtains the default specification for the structured prior that we have found to perform well in simulation studies. The structured prior for partitions is specified as follows:
Specify a prior for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently, a uniform prior and a scale prior for n are supported. Our default specification uses a uniform prior.
Specify a prior for the overlap table (see the documentation for alphas for more information). Currently a Dirichlet-multinomial prior is supported. Our default specification sets all hyperparameters of the Dirichlet-multinomial prior to 1.
For each file, specify a prior for the number of duplicates in each cluster. As a part of this prior, we specify the maximum number of records in a cluster for each file, through dup_upper_bound. When there are assumed to be no duplicates in a file, the maximum number of records in a cluster for that file is set to 1. When there are assumed to be duplicates in a file, we recommend setting the maximum number of records in a cluster for that file to be less than the file size, if prior knowledge allows. Currently, a Poisson prior for the number of duplicates in each cluster is supported. Our default specification uses a Poisson prior with mean 1.
Please contact the package maintainer if you need new prior families for n or the number of duplicates in each cluster to be supported.
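As a rough illustration of two of the ingredients named above, the sketch below computes the log density of a Poisson distribution truncated to lie between 1 and an upper bound, and the unnormalized log prior for n under the uniform and scale families. Both helper functions are hypothetical and written for this illustration; they read the truncation literally from the argument descriptions and are not claimed to match how the package stores log_dup_count_prior or log_n_prior.

```r
# Log density of a Poisson(mean) distribution truncated to 1, ..., upper_bound.
# Hypothetical helper, not a package function.
truncated_poisson_log_density <- function(upper_bound, mean = 1) {
  support <- seq_len(upper_bound)
  log_unnormalized <- dpois(support, lambda = mean, log = TRUE)
  # Renormalize so the probabilities over 1, ..., upper_bound sum to 1
  log_unnormalized - log(sum(exp(log_unnormalized)))
}

# Unnormalized log prior for n, the number of clusters:
# "uniform" is proportional to 1, "scale" is proportional to 1 / n.
# Hypothetical helper, not a package function.
log_n_prior_unnormalized <- function(n, family = c("uniform", "scale")) {
  family <- match.arg(family)
  if (family == "uniform") rep(0, length(n)) else -log(n)
}

dup_prior <- truncated_poisson_log_density(upper_bound = 10, mean = 1)
n_prior <- log_n_prior_unnormalized(1:5, family = "scale")
```

With an upper bound of 1, the truncated distribution puts all its mass on a single record per cluster, which corresponds to the no-duplicates case.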
Value
A list containing:
mus
The hyperparameters of the Dirichlet priors for the m parameters for the comparisons among matches.
nus
The hyperparameters of the Dirichlet priors for the u parameters for the comparisons among non-matches. Includes data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data.
flat
A numeric indicator of whether a flat prior for partitions should be used. flat is 1 if a flat prior is used, and flat is 0 if a structured prior is used.
no_dups
A numeric indicator of whether no duplicates are allowed in all of the files.
alphas
The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K, where the first element is 0.
alpha_0
The sum of alphas.
dup_upper_bound
A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file.
log_dup_count_prior
A list containing the log density of the prior distribution for the number of duplicates in each cluster, for each file.
log_n_prior
A numeric vector containing the log density of the prior distribution for the number of clusters represented in the records.
nus_specified
The nus before data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data are added. Used for input checking.
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242 (arXiv: https://arxiv.org/abs/2110.03839)

Version history

Repository | Version | Published | First seen | Last seen | Docs
CRAN | 0.1.1 | | 2026-05-29 | 2026-05-30 |

Security

No OSV data to display.

Literature signals

No OpenAlex data to display.