multilink

Collects R package metadata and crawl signals.

v0.1.1

Repository: CRAN · License: GPL-3 · Lifecycle: active · Needs compilation: yes

DOI

10.32614/CRAN.package.multilink

Core Signals

The crawl signals that matter most for a first-glance assessment are listed first.

No core signals to display.

Supported Backends

Backend-related packages detected in DESCRIPTION.

No backend package signals.

Quick Facts

Basic metadata condensed into small cards and tokens.

profile

Repository

CRAN

Version

0.1.1

License

GPL-3

Lifecycle

active

Needs compilation

yes

Last observed

2026-07-10

CRAN: cran.r-project.org/package=multilink

Build fields

LinkingTo

Rcpp, RcppArmadillo

Package information by crawl source

1 source

CRAN

0.1.1

2026-07-10

License

GPL-3

Depends

R (>= 3.5.0)

Imports

igraph, RecordLinkage, Rcpp, utils, mcclust, geosphere, stringr

LinkingTo

Rcpp, RcppArmadillo

Needs compilation

yes

Lifecycle

active

Last observed

2026-07-10 02:43:42
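The Imports and LinkingTo fields above are comma-separated DESCRIPTION fields. As a minimal sketch, assuming the field values shown on this page are pasted into a character vector (the `desc_text` vector and the `split_field` helper are illustrative names, not part of any package), base R's `read.dcf` can parse them and recover the individual dependency names:

```r
# DESCRIPTION-style fields copied from the listing above (illustrative input).
desc_text <- c(
  "Package: multilink",
  "Version: 0.1.1",
  "Depends: R (>= 3.5.0)",
  "Imports: igraph, RecordLinkage, Rcpp, utils, mcclust, geosphere, stringr",
  "LinkingTo: Rcpp, RcppArmadillo"
)

# read.dcf accepts a connection, so a text connection avoids a temporary file.
con <- textConnection(desc_text)
desc <- read.dcf(con)
close(con)

# Hypothetical helper: split a comma-separated field into trimmed names.
split_field <- function(x) trimws(strsplit(x, ",", fixed = TRUE)[[1]])

imports <- split_field(desc[1, "Imports"])
linking_to <- split_field(desc[1, "LinkingTo"])

# Total number of dependency edges of type Imports or LinkingTo.
n_edges <- length(imports) + length(linking_to)
```

Counting Imports and LinkingTo entries separately means Rcpp contributes two edges, one per field.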

Packages this package depends on

Showing 5 of 9

Package	Type	Spec
geosphere CRAN · 0.1.1 · 2026-07-25	Imports	geosphere
igraph CRAN · 0.1.1 · 2026-07-25	Imports	igraph
mcclust CRAN · 0.1.1 · 2026-07-25	Imports	mcclust
Rcpp CRAN · 0.1.1 · 2026-07-25	Imports	Rcpp
RecordLinkage CRAN · 0.1.1 · 2026-07-25	Imports	RecordLinkage
stringr CRAN · 0.1.1 · 2026-07-25	Imports	stringr
utils CRAN · 0.1.1 · 2026-07-25	Imports	utils
Rcpp CRAN · 0.1.1 · 2026-07-25	LinkingTo	Rcpp
RcppArmadillo CRAN · 0.1.1 · 2026-07-25	LinkingTo	RcppArmadillo

Packages that use this package

Showing 0 of 0

Package	Type	Spec
No dependency edges to display.

Package page

All links

Repository: CRAN
Version: 0.1.1
Collected: 2026-05-29 04:11:04
Package page: https://cran.r-project.org/web/packages/multilink/index.html
DOI: 10.32614/CRAN.package.multilink
CRAN checks: https://cran.r-project.org/web/checks/check_results_multilink.html
README: https://cran.r-project.org/web/packages/multilink/readme/README.html
NEWS: https://cran.r-project.org/web/packages/multilink/news/news.html
Reference HTML: https://cran.r-project.org/web/packages/multilink/refman/multilink.html
Reference PDF: https://cran.r-project.org/web/packages/multilink/multilink.pdf
Source package: https://cran.r-project.org/src/contrib/multilink_0.1.1.tar.gz
Archive: https://CRAN.R-project.org/src/contrib/Archive/multilink

Page fields

Author: Serge Aleshin-Guendel [aut, cre]
BugReports: https://github.com/aleshing/multilink/issues
CRAN Checks: multilink results
DOI: 10.32614/CRAN.package.multilink
License: GPL-3
LinkingTo: Rcpp, RcppArmadillo
Maintainer: Serge Aleshin-Guendel <saleshinguendel at gmail.com>
Materials: README, NEWS
NeedsCompilation: yes
Old Sources: multilink archive
Package Source: multilink_0.1.1.tar.gz
Published: 2023-06-09
Reference Manual: multilink.html, multilink.pdf
URL: https://github.com/aleshing/multilink
Version: 0.1.1
Windows Binaries: r-devel: multilink_0.1.1.zip, r-release: multilink_0.1.1.zip, r-oldrel: multilink_0.1.1.zip
MacOS Binaries: r-release (arm64): multilink_0.1.1.tgz, r-oldrel (arm64): multilink_0.1.1.tgz, r-release (x86_64): multilink_0.1.1.tgz, r-oldrel (x86_64): multilink_0.1.1.tgz

Page sections 3

Documentation

Heading

Documentation

Links

[{"label":"multilink.html","section":"","type":"","url":"https://cran.r-project.org/web/packages/multilink/refman/multilink.html"},{"label":"multilink.pdf","section":"","type":"","url":"https://cran.r-project.org/web/packages/multilink/multilink.pdf"}]

Text

Reference manual: multilink.html, multilink.pdf

Downloads

Heading

Downloads

Links

[{"label":"multilink_0.1.1.tar.gz","section":"","type":"","url":"https://cran.r-project.org/src/contrib/multilink_0.1.1.tar.gz"},{"label":"multilink_0.1.1.zip","section":"","type":"","url":"https://cran.r-project.org/bin/windows/contrib/4.7/multilink_0.1.1.zip"},{"label":"multilink_0.1.1.zip","section":"","type":"","url":"https://cran.r-project.org/bin/windows/contrib/4.6/multilink_0.1.1.zip"},{"label":"multilink_0.1.1.zip","section":"","type":"","url":"https://cran.r-project.org/bin/windows/contrib/4.5/multilink_0.1.1.zip"},{"label":"multilink_0.1.1.tgz","section":"","type":"","url":"https://cran.r-project.org/bin/macosx/sonoma-arm64/contrib/4.6/multilink_0.1.1.tgz"},{"label":"multilink_0.1.1.tgz","section":"","type":"","url":"https://cran.r-project.org/bin/macosx/big-sur-arm64/contrib/4.5/multilink_0.1.1.tgz"},{"label":"multilink_0.1.1.tgz","section":"","type":"","url":"https://cran.r-project.org/bin/macosx/big-sur-x86_64/contrib/4.6/multilink_0.1.1.tgz"},{"label":"multilink_0.1.1.tgz","section":"","type":"","url":"https://cran.r-project.org/bin/macosx/big-sur-x86_64/contrib/4.5/multilink_0.1.1.tgz"},{"label":"multilink archive","section":"","type":"","url":"https://CRAN.R-project.org/src/contrib/Archive/multilink"}]

Text

Package source: multilink_0.1.1.tar.gz
Windows binaries: r-devel: multilink_0.1.1.zip, r-release: multilink_0.1.1.zip, r-oldrel: multilink_0.1.1.zip
macOS binaries: r-release (arm64): multilink_0.1.1.tgz, r-oldrel (arm64): multilink_0.1.1.tgz, r-release (x86_64): multilink_0.1.1.tgz, r-oldrel (x86_64): multilink_0.1.1.tgz
Old sources: multilink archive

Linking

Heading

Linking

Links

[{"label":"https://CRAN.R-project.org/package=multilink","section":"","type":"","url":"https://CRAN.R-project.org/package=multilink"}]

Text

Please use the canonical form https://CRAN.R-project.org/package=multilink to link to this page.

Materials 2

README NEWS

Documentation 2

multilink.html multilink.pdf

Downloads 9

multilink_0.1.1.tar.gz multilink_0.1.1.zip multilink_0.1.1.zip multilink_0.1.1.zip multilink_0.1.1.tgz multilink_0.1.1.tgz multilink_0.1.1.tgz multilink_0.1.1.tgz multilink archive

All page links 27

Original package documentation text

4 artifacts

field

NEWS

CRAN · 0.1.1 · Materials · text/html · 790 · 2026-05-07

Title

NEWS

Label

NEWS

Text content

multilink 0.1.1

Updated package for submission to CRAN.

field

README

CRAN · 0.1.1 · Materials · text/html · 5,505 · 2026-05-07

Title

README

Label

README

Text content

multilink

multilink is an R package which implements methodology presented in the manuscript “Multifile Partitioning for Record Linkage and Duplicate Detection” by Serge Aleshin-Guendel and Mauricio Sadinle, published in the Journal of the American Statistical Association and available on arXiv. It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates.

Installation

You can install the development version of multilink from GitHub with:

install.packages("devtools")
devtools::install_github("aleshing/multilink")

reference_manual_html

Reference manual HTML

CRAN · 0.1.1 · Documentation · text/html · 79,175 · 2026-05-07

Title

Help for package multilink

Label

Reference manual HTML

Text content

Help for package multilink

Package {multilink}

Contents: create_comparison_data, dup_data, dup_data_small, find_bayes_estimate, gibbs_sampler, initialize_partition, multilink, no_dup_data, no_dup_data_small, reduce_comparison_data, relabel_bayes_estimate, specify_prior

Title: Multifile Record Linkage and Duplicate Detection
Version: 0.1.1
Description: Implementation of the methodology of Aleshin-Guendel & Sadinle (2022) <doi:10.1080/01621459.2021.2013242>. It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates.
Depends: R (≥ 3.5.0)
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.2
URL: https://github.com/aleshing/multilink
BugReports: https://github.com/aleshing/multilink/issues
Imports: igraph, RecordLinkage, Rcpp, utils, mcclust, geosphere, stringr
LinkingTo: Rcpp, RcppArmadillo
NeedsCompilation: yes
Packaged: 2023-06-08 20:25:20 UTC; sergealeshin-guendel
Author: Serge Aleshin-Guendel [aut, cre]
Maintainer: Serge Aleshin-Guendel <saleshinguendel@gmail.com>
Repository: CRAN
Date/Publication: 2023-06-09 14:20:07 UTC

Create Comparison Data

Description

Create comparison data for all pairs of records, except for those records in files which are assumed to have no duplicates.

Usage

create_comparison_data( records, types, breaks, file_sizes, duplicates, verbose = TRUE )

Arguments

records: A data.frame containing the records to be linked, where each column of records is a field to be compared. If there are multiple files, records should be obtained by stacking the files on top of each other so that records[1:file_sizes[1], ] contains the records for file 1, records[(file_sizes[1] + 1):(file_sizes[1] + file_sizes[2]), ] contains the records for file 2, and so on. Missing values should be coded as NA.
types: A character vector, indicating the comparison to be used for each field (i.e. each column of records). The options are: "bi" for binary comparisons, "nu" for numeric comparisons (absolute difference), "lv" for string comparisons (normalized Levenshtein distance), "lv_sep" for string comparisons (normalized Levenshtein distance) where each string may contain multiple spellings separated by the "|" character. We assume that fields using options "bi", "lv", and "lv_sep" are of class character, and fields using the "nu" option are of class numeric. For fields using the "lv_sep" option, for each record pair the normalized Levenshtein distance is computed between each possible spelling, and the minimum normalized Levenshtein distance between spellings is then used as the comparison for that record pair.
breaks: A list, the same length as types, indicating the break points used to compute disagreement levels for each field's comparisons. If types[f]="bi", breaks[[f]] is ignored (and thus can be set to NA). See Details for more information on specifying this argument.
file_sizes: A numeric vector indicating the size of each file.
duplicates: A numeric vector indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.
verbose: A logical indicator of whether progress messages should be printed (default TRUE).

Details

The purpose of this function is to construct comparison vectors for each pair of records. In order to construct these vectors, one needs to specify the types and breaks arguments. The types argument specifies how each field should be compared, and the breaks argument specifies how to discretize these comparisons.

Currently, the types argument supports three types of field comparisons: binary, absolute difference, and the normalized Levenshtein distance. Please contact the package maintainer if you need a new type of comparison to be supported.

The breaks argument should be a list, with one element for each field. If a field is being compared with a binary comparison, i.e. types[f]="bi", then the corresponding element of breaks should be NA, i.e. breaks[[f]]=NA. If a field is being compared with a numeric or string comparison, then the corresponding element of breaks should be a vector of cut points used to discretize the comparisons. To give more detail, suppose you pass in cut points breaks[[f]]=c(cut_1, ..., cut_L). These cut points discretize the range of the comparisons into L+1 intervals: I_0=(-\infty, cut_1], I_1=(cut_1, cut_2], ..., I_L=(cut_L, \infty]. The raw comparisons, which lie in [0,\infty) for numeric comparisons and [0,1] for string comparisons, are then replaced with indicators of which interval the comparisons lie in. The interval I_0 corresponds to the lowest level of disagreement for a comparison, while the interval I_L corresponds to the highest level of disagreement for a comparison.

Value

a list containing:

record_pairs: A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i.
comparisons: A logical matrix, where each row contains the comparisons for the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.
K: The number of files, assumed to be of class numeric.
file_sizes: A numeric vector of length K, indicating the size of each file.
duplicates: A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.
field_levels: A numeric vector indicating the number of disagreement levels for each field.
file_labels: An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.
fp_matrix: An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
rp_to_fp: A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.
ab: An integer vector, of length ncol(comparisons) * K * (K + 1) / 2 that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
file_sizes_not_included: A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.
ab_not_included: A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.
labels: NA. This element is not NA when reduce_comparison_data is used.
pairs_to_keep: NA. This element is not NA when reduce_comparison_data is used.
cc: 0. This element is non-zero when reduce_comparison_data is used.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242] [arXiv]

Examples

## Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <

section

multilink.pdf

CRAN · 0.1.1 · Documentation · application/pdf · 192,595 · 2026-05-07

Title

multilink.pdf

Label

multilink.pdf

Reference for multilink (0.1.1)

12 topics

create_comparison_data

Create Comparison Data

CRAN · 0.1.1 · multilink/man/create_comparison_data.Rd · 2026-05-07

Create comparison data for all pairs of records, except for those records in files which are assumed to have no duplicates.

Aliases

create_comparison_data

Usage

create_comparison_data( records, types, breaks, file_sizes, duplicates, verbose = TRUE )

Arguments

records

A data.frame containing the records to be linked, where each column of records is a field to be compared. If there are multiple files, records should be obtained by stacking the files on top of each other so that records[1:file_sizes[1], ] contains the records for file 1, records[(file_sizes[1] + 1):(file_sizes[1] + file_sizes[2]), ] contains the records for file 2, and so on. Missing values should be coded as NA.

types

A character vector, indicating the comparison to be used for each field (i.e. each column of records). The options are: "bi" for binary comparisons, "nu" for numeric comparisons (absolute difference), "lv" for string comparisons (normalized Levenshtein distance), "lv_sep" for string comparisons (normalized Levenshtein distance) where each string may contain multiple spellings separated by the "|" character. We assume that fields using options "bi", "lv", and "lv_sep" are of class character, and fields using the "nu" option are of class numeric. For fields using the "lv_sep" option, for each record pair the normalized Levenshtein distance is computed between each possible spelling, and the minimum normalized Levenshtein distance between spellings is then used as the comparison for that record pair.

breaks

A list, the same length as types, indicating the break points used to compute disagreement levels for each field's comparisons. If types[f]="bi", breaks[[f]] is ignored (and thus can be set to NA). See Details for more information on specifying this argument.

file_sizes

A numeric vector indicating the size of each file.

duplicates

A numeric vector indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.

verbose

A logical indicator of whether progress messages should be printed (default TRUE).
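The "lv" and "lv_sep" options described under types can be illustrated in base R. This is a minimal sketch, not the package's own code: it assumes the Levenshtein distance is normalized by the length of the longer string, and `norm_lev` and `norm_lev_sep` are hypothetical helper names.

```r
# Normalized Levenshtein distance between two strings.
# Assumption: divide by the length of the longer string, so values lie in [0, 1].
norm_lev <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]
  max_len <- max(nchar(a), nchar(b))
  if (max_len == 0) 0 else d / max_len
}

# "lv_sep"-style comparison: split each string on "|" and take the minimum
# normalized distance over every pair of spellings.
norm_lev_sep <- function(a, b) {
  spellings_a <- strsplit(a, "|", fixed = TRUE)[[1]]
  spellings_b <- strsplit(b, "|", fixed = TRUE)[[1]]
  dists <- outer(spellings_a, spellings_b, Vectorize(norm_lev))
  min(dists)
}

d_single <- norm_lev("smith", "smyth")              # one substitution in five characters
d_multi <- norm_lev_sep("jon|john", "john|johnny")  # "john" appears on both sides
```

Taking the minimum means a record pair counts as close whenever any pair of alternative spellings is close.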

Details

The purpose of this function is to construct comparison vectors for each pair of records. In order to construct these vectors, one needs to specify the types and breaks arguments. The types argument specifies how each field should be compared, and the breaks argument specifies how to discretize these comparisons. Currently, the types argument supports three types of field comparisons: binary, absolute difference, and the normalized Levenshtein distance. Please contact the package maintainer if you need a new type of comparison to be supported. The breaks argument should be a list, with one element for each field. If a field is being compared with a binary comparison, i.e. types[f]="bi", then the corresponding element of breaks should be NA, i.e. breaks[[f]]=NA. If a field is being compared with a numeric or string comparison, then the corresponding element of breaks should be a vector of cut points used to discretize the comparisons. To give more detail, suppose you pass in cut points breaks[[f]]=c(cut_1, ..., cut_L). These cut points discretize the range of the comparisons into L+1 intervals: I_0=(-\infty, cut_1], I_1=(cut_1, cut_2], ..., I_L=(cut_L, \infty]. The raw comparisons, which lie in [0,\infty) for numeric comparisons and [0,1] for string comparisons, are then replaced with indicators of which interval the comparisons lie in. The interval I_0 corresponds to the lowest level of disagreement for a comparison, while the interval I_L corresponds to the highest level of disagreement for a comparison.
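The discretization described in Details can be reproduced with base R's `findInterval`. The sketch below is illustrative only: the raw comparison values are made up, and the `DL_` column names merely mirror the naming that appears in the package examples (such as "gname_DL_3").

```r
# Cut points as in the documented examples: L = 3 cut points give L + 1 = 4 levels.
cuts <- c(0, 0.25, 0.5)

# Made-up raw comparisons in [0, 1], e.g. normalized Levenshtein distances.
raw <- c(0, 0.1, 0.25, 0.3, 0.6, 1)

# left.open = TRUE yields intervals of the form (lower, upper], so level 0 is
# (-Inf, 0], level 1 is (0, 0.25], level 2 is (0.25, 0.5], level 3 is (0.5, Inf).
level <- findInterval(raw, cuts, left.open = TRUE)

# One TRUE/FALSE indicator column per disagreement level.
indicators <- outer(level, 0:length(cuts), "==")
colnames(indicators) <- paste0("DL_", 0:length(cuts))
```

Each row of `indicators` has exactly one TRUE, marking the interval that the raw comparison falls into.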

Value

Examples

## Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

## Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

References

dup_data

Duplicate Dataset

CRAN · 0.1.1 · data · multilink/man/dup_data.Rd · 2026-05-07

A dataset containing 867 simulated records from 3 files with duplicate records in each file.

Aliases

dup_data

Keywords

datasets

Usage

dup_data

Format

A list with three elements:
records: A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.
file_sizes: The size of each file.
IDs: The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
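Because the files are stacked in order, file_sizes alone determines which rows of records belong to which file. A small base R sketch, using hypothetical sizes rather than the actual sizes of dup_data:

```r
# Hypothetical sizes for three stacked files (not the sizes of dup_data).
file_sizes <- c(3, 2, 4)

# Record i belongs to the file whose block of rows contains i.
file_labels <- rep(seq_along(file_sizes), times = file_sizes)

# Row ranges of each file in the stacked data.frame.
ends <- cumsum(file_sizes)
starts <- ends - file_sizes + 1

# Rows holding the records of file 2.
rows_file_2 <- starts[2]:ends[2]
```

The same indexing applies to IDs, since it has one label per stacked record.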

Source

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.

Examples

data(dup_data)

# There are 500 entities represented in the records
length(unique(dup_data$IDs))

References

dup_data_small

Small Duplicate Dataset

CRAN · 0.1.1 · data · multilink/man/dup_data_small.Rd · 2026-05-07

A dataset containing 96 simulated records from 3 files with duplicate records in each file, subset from dup_data.

Aliases

dup_data_small

Keywords

datasets

Usage

dup_data_small

Format

Source

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.

Examples

data(dup_data_small)

# There are 96 entities represented in the records
length(unique(dup_data_small$IDs))

References

find_bayes_estimate

Find the Bayes Estimate of a Partition

CRAN · 0.1.1 · multilink/man/find_bayes_estimate.Rd · 2026-05-07

Find the (approximate) Bayes estimate of a partition based on MCMC samples of the partition and a specified loss function.

Aliases

find_bayes_estimate

Usage

find_bayes_estimate( partitions, burn_in, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = nrow(partitions), verbose = TRUE )

Arguments

partitions

Posterior samples of the partition, where each column is one sample and the partition is represented as an integer vector of arbitrary labels, as produced by the output of a call to gibbs_sampler.

burn_in

The number of samples to discard for burn in.

L_FNM

Positive loss for a false non-match. Default is 1.

L_FM1

Positive loss for a type 1 false match. Default is 1.

L_FM2

Positive loss for a type 2 false match. Default is 2.

L_A

Positive loss for abstaining from making a decision for a record. Default is Inf, i.e. decisions are made for all records.

max_cc_size

The maximum allowable connected component size over which the posterior expected loss is minimized. Default is nrow(partitions), i.e. no approximation is used. When is.infinite(L_A), we recommend setting this argument to 50, then increasing based on a computational budget. When !is.infinite(L_A), we recommend setting this argument to 10-12, then increasing based on a computational budget (although an increase of 1 in this argument can in the worst case lead to a doubling in computation time).

verbose

A logical indicator of whether progress messages should be printed (default TRUE).

Value

A vector, the same length as a column of partitions, containing the (approximate) Bayes estimate of the partition. If !is.infinite(L_A) the output may be a partial estimate. A positive number l in index i indicates that record i is in the same cluster as every other record j with l in index j. A value of -1 in index i indicates that the Bayes estimate abstained from making a decision for record i.
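The label encoding described in Value can be unpacked with base R. The estimate vector below is made up for illustration; a real one would come from find_bayes_estimate.

```r
# Hypothetical partial estimate for six records; -1 marks an abstention.
estimate <- c(1, 1, 2, -1, 3, 2)

# Records for which a decision was made.
decided <- which(estimate > 0)

# Group record indices by their (arbitrary) cluster label.
clusters <- split(decided, estimate[decided])

# Records for which the estimate abstained.
abstained <- which(estimate == -1)
```

Here records 1 and 2 are linked, records 3 and 6 are linked, record 5 is a singleton, and record 4 is left undecided.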

Examples

# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
  Z_init = Z_init, seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

References

gibbs_sampler

Gibbs Sampler for Posterior Inference

CRAN · 0.1.1 · multilink/man/gibbs_sampler.Rd · 2026-05-07

Run a Gibbs sampler to explore the posterior distribution of partitions of records.

Aliases

gibbs_sampler

Usage

gibbs_sampler( comparison_list, prior_list, n_iter = 2000, Z_init = 1:sum(comparison_list$file_sizes), seed = 70, single_likelihood = FALSE, chaperones_info = NA, verbose = TRUE )

Arguments

comparison_list

The output from a call to create_comparison_data or reduce_comparison_data.

prior_list

The output from a call to specify_prior.

n_iter

The number of iterations of the Gibbs sampler to run.

Z_init

Initialization of the partition of records, represented as an integer vector of arbitrary labels of length sum(comparison_list$file_sizes). The default initialization places each record in its own cluster. See initialize_partition for an alternative initialization when there are no duplicates in each file.

seed

The seed to use while running the Gibbs sampler.

single_likelihood

A logical indicator of whether to use a single likelihood for comparisons for all file pairs, or whether to use a separate likelihood for comparisons for each file pair. When single_likelihood=TRUE, a single likelihood is used, and the prior hyperparameters for m and u from the first file pair are used. We do not recommend using a single likelihood in general.

chaperones_info

If chaperones_info is set to NA, then Gibbs updates to the partition are used during the Gibbs sampler, as described in Aleshin-Guendel & Sadinle (2022). Else, Chaperones updates, as described in Miller et al. (2015) and Betancourt et al. (2016), are used and chaperones_info should be a list with five elements controlling Chaperones updates to the partition during the Gibbs sampler: chap_type, num_chap_iter, nonuniform_chap_type, extra_gibbs, num_restrict. chap_type is 0 if using a uniform Chaperones distribution, and 1 if using a nonuniform Chaperones distribution. num_chap_iter is the number of Chaperones updates to the partition that are made during each iteration of the Gibbs sampler. When using a nonuniform Chaperones distribution, nonuniform_chap_type is 0 if using the exact version, or 1 if using the partial version. extra_gibbs is a logical indicator of whether a Gibbs update to the partition should be done after the Chaperones updates, at each iteration of the Gibbs sampler. num_restrict is the number of restricted Gibbs steps to take during each Chaperones update to the partition.
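As a minimal sketch of the five-element list described above, the following builds a chaperones_info value in base R. The element names come from the argument description; the numeric values are purely illustrative assumptions, not recommended settings, and whether the elements must be named or given in this order is not stated in this documentation.

```r
# Hypothetical settings for Chaperones updates; values are illustrative only.
chaperones_info <- list(
  chap_type = 0,             # 0 = uniform Chaperones distribution, 1 = nonuniform
  num_chap_iter = 10,        # Chaperones updates per Gibbs iteration
  nonuniform_chap_type = 0,  # 0 = exact, 1 = partial (used when chap_type = 1)
  extra_gibbs = TRUE,        # also do a Gibbs update after the Chaperones updates
  num_restrict = 10          # restricted Gibbs steps per Chaperones update
)
str(chaperones_info)
```

A list like this would then be passed as the chaperones_info argument of gibbs_sampler in place of the default NA.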

verbose

A logical indicator of whether progress messages should be printed (default TRUE).

Details

Given the prior specified using specify_prior, this function runs a Gibbs sampler to explore the posterior distribution of partitions of records, conditional on the comparison data created using create_comparison_data or reduce_comparison_data.

Value

a list containing:

m

Posterior samples of the m parameters. Each column is one sample.

u

Posterior samples of the u parameters. Each column is one sample.

partitions

Posterior samples of the partition. Each column is one sample. Note that the partition is represented as an integer vector of arbitrary labels of length sum(comparison_list$file_sizes).

contingency_tables

Posterior samples of the overlap table. Each column is one sample. This incorporates counts of records determined not to be candidate matches to any other records using reduce_comparison_data.

cluster_sizes

Posterior samples of the size of each cluster (associated with an arbitrary label from 1 to sum(comparison_list$file_sizes)). Each column is one sample.

sampling_time

The time in seconds it took to run the sampler.
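Because each column of partitions is one posterior sample and the labels are arbitrary, summaries are usually built from label equality within a column rather than from the label values themselves. The sketch below uses a small hand-made matrix as a stand-in for results$partitions; the matrix, the burn_in value, and the helper name coclustering are assumptions for illustration only.

```r
# Toy stand-in for results$partitions: 4 records (rows), 5 samples (columns).
partitions <- cbind(
  c(1, 1, 3, 4),
  c(1, 1, 3, 4),
  c(2, 2, 3, 3),
  c(1, 2, 3, 4),
  c(4, 4, 1, 2)
)

burn_in <- 1  # hypothetical number of initial samples to discard
kept <- partitions[, -seq_len(burn_in), drop = FALSE]

# Number of clusters in each retained sample.
n_clusters <- apply(kept, 2, function(z) length(unique(z)))

# Share of retained samples in which records i and j carry the same label.
coclustering <- function(i, j) mean(kept[i, ] == kept[j, ])

n_clusters
coclustering(1, 2)
coclustering(3, 4)
```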

Examples

# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
    c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
  Z_init = Z_init, seed = 42)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
    c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [Published: https://doi.org/10.1080/01621459.2021.2013242] [arXiv: https://arxiv.org/abs/2110.03839]

Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, & Rebecca C. Steorts (2015). Microclustering: When the cluster sizes grow sublinearly with the size of the data set. NeurIPS Bayesian Nonparametrics: The Next Generation Workshop Series. [arXiv: https://arxiv.org/abs/1512.00792]

Brenda Betancourt, Giacomo Zanella, Jeffrey Miller, Hanna Wallach, Abbas Zaidi, & Rebecca C. Steorts (2016). Flexible Models for Microclustering with Application to Entity Resolution. Advances in Neural Information Processing Systems. [Published: https://proceedings.neurips.cc/paper/2016/hash/670e8a43b246801ca1eaca97b3e19189-Abstract.html] [arXiv: https://arxiv.org/abs/1610.09780]

initialize_partition

Initialize the Partition

CRAN · 0.1.1 · multilink/man/initialize_partition.Rd · 2026-05-07

Generate an initialization for the partition in the case when it is assumed that none of the files contain duplicates (so that the partition is a matching).

Aliases

initialize_partition

Usage

initialize_partition(comparison_list, pairs_to_keep, seed = NA)

Arguments

comparison_list

The output from a call to create_comparison_data or reduce_comparison_data. Note that in order to correctly specify the initialization, if reduce_comparison_data is used to reduce the number of record pairs that are candidate matches, then the output of reduce_comparison_data (not create_comparison_data) should be used for this argument.

pairs_to_keep

A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs are potential matches in the initialization.

seed

The seed to use to generate the initialization.

Details

When it is assumed that none of the files contain duplicates, and reduce_comparison_data is not used to reduce the number of potential matches, the Gibbs sampler used for posterior inference may mix slowly when the partition is initialized with each record in its own cluster (the default option for the Gibbs sampler). The purpose of this function is to provide an alternative initialization scheme.

To use this initialization scheme, the user passes in a logical vector that indicates which record pairs are potential matches according to an indexing method (as in reduce_comparison_data). Note that this indexing is only used to generate the initialization; it is not used for inference.

The initialization scheme first finds the transitive closure of the potential matches, which partitions the records into blocks. Within each block of records, the scheme randomly selects a record from each file, and these selected records are then placed in the same cluster for the partition initialization. All other records are placed in their own clusters.
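The two steps of the scheme described above (transitive closure into blocks, then one randomly chosen record per file within each block) can be sketched in base R. This is an illustration of the description only, not the package's implementation; the toy file_labels and pairs objects and the label-propagation loop are assumptions.

```r
# Toy setup (hypothetical): 6 records from 3 files, two records per file.
file_labels <- c(1, 1, 2, 2, 3, 3)
# Record pairs flagged as potential matches by some indexing step.
pairs <- rbind(c(1, 3), c(2, 3), c(3, 5), c(4, 6))

# Step 1: transitive closure by propagating the smallest label along each
# pair until nothing changes; records sharing a label form a block.
block <- seq_along(file_labels)
repeat {
  changed <- FALSE
  for (r in seq_len(nrow(pairs))) {
    lowest <- min(block[pairs[r, ]])
    if (any(block[pairs[r, ]] != lowest)) {
      block[pairs[r, ]] <- lowest
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Step 2: within each block pick one record per file at random and put the
# picked records in one cluster; every other record stays a singleton.
set.seed(42)
Z_init <- seq_along(file_labels)
for (b in unique(block)) {
  members <- which(block == b)
  chosen <- vapply(split(members, file_labels[members]),
                   function(idx) idx[sample.int(length(idx), 1)],
                   integer(1))
  Z_init[chosen] <- min(chosen)
}

block
Z_init
```

Here records 1 and 2 both come from the first file and fall in the same block, so only one of them joins the cluster containing records 3 and 5; the result is a matching, with at most one record per file in any cluster.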

Value

an integer vector of arbitrary labels of length sum(comparison_list$file_sizes), giving an initialization for the partition.

Examples

# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
    c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Find initialization for the matching
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

References

multilink

Multifile Record Linkage and Duplicate Detection

CRAN · 0.1.1 · package · multilink/man/multilink.Rd · 2026-05-07

The multilink package implements the methodology of Aleshin-Guendel & Sadinle (2022). It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates.

Aliases

multilink

Examples

# Here we demonstrate an example workflow with the small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
    c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
  Z_init = Z_init, seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# The number of clusters in the full estimate
length(unique(full_estimate))

# The number of entities represented in the records
length(unique(no_dup_data_small$IDs))

# Find which record pairs are truly coreferent based on IDs
true_links <- no_dup_data_small$IDs[comparison_list$record_pairs[, 1]] ==
  no_dup_data_small$IDs[comparison_list$record_pairs[, 2]]

# Find which record pairs are in the same clusters in the full estimate
full_estimate_links <- full_estimate[comparison_list$record_pairs[, 1]] ==
  full_estimate[comparison_list$record_pairs[, 2]]

# Find the number of true matches in the full estimate
true_matches <- sum(full_estimate_links & true_links)

# Precision of the full estimate
true_matches / sum(full_estimate_links)

# Recall of the full estimate
true_matches / sum(true_links)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# The partial estimate abstains from making decisions for how many records?
sum(partial_estimate == -1)

# For the records which decisions were made for in the partial estimate,
# there are how many clusters?
length(unique(partial_estimate))

# Abstain rate of partial_estimate
sum(partial_estimate == -1) / length(partial_estimate)

# Relabel records where we abstained
partial_estimate[which(partial_estimate == -1)] <-
  length(partial_estimate) + which(partial_estimate == -1)

# Find which record pairs are in the same clusters in the partial estimate
partial_estimate_links <-
  partial_estimate[comparison_list$record_pairs[, 1]] ==
  partial_estimate[comparison_list$record_pairs[, 2]]

# Find the number of true matches in the partial estimate
true_matches_A <- sum(partial_estimate_links & true_links)

# Precision of the partial estimate
true_matches_A / sum(partial_estimate_links)

# Here we demonstrate an example workflow with the small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
    c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# The number of clusters in the full estimate (including records
# determined not to be candidate matches to any other records using
# reduce_comparison_data)
length(unique(full_estimate)) +
  sum(reduced_comparison_list$file_sizes_not_included)

# The number of entities represented in the records
length(unique(dup_data_small$IDs))

# Find which record pairs are truly coreferent based on IDs
true_links <- dup_data_small$IDs[comparison_list$record_pairs[, 1]] ==
  dup_data_small$IDs[comparison_list$record_pairs[, 2]]

# Focus on the record pairs that were candidate matches
true_links_reduced <- true_links[reduced_comparison_list$pairs_to_keep]

# Calculate the number of prior false non-matches based on the indexing
# scheme used
prior_fnm <- nrow(comparison_list$record_pairs[true_links &
  (!reduced_comparison_list$pairs_to_keep), ])

# Find which record pairs are in the same clusters in the full estimate
full_estimate_links <-
  full_estimate[reduced_comparison_list$record_pairs[, 1]] ==
  full_estimate[reduced_comparison_list$record_pairs[, 2]]

# Find the number of true matches in the full estimate
true_matches <- sum(full_estimate_links & true_links_reduced)

# Precision of the full estimate
true_matches / sum(full_estimate_links)

# Recall of the full estimate
true_matches / (sum(true_links_reduced) + prior_fnm)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# The partial estimate abstains from making decisions for how many records?
sum(partial_estimate == -1)

# For the records which decisions were made for in the partial estimate,
# there are how many clusters? (including records determined not to be
# candidate matches to any other records using reduce_comparison_data)
length(unique(partial_estimate)) +
  sum(reduced_comparison_list$file_sizes_not_included)

# Abstain rate of partial_estimate (excluding records determined not
# to be candidate matches to any other records using reduce_comparison_data)
sum(partial_estimate == -1) / length(partial_estimate)

# Relabel records where we abstained
partial_estimate[which(partial_estimate == -1)] <-
  length(partial_estimate) + which(partial_estimate == -1)

# Find which record pairs are in the same clusters in the partial estimate
partial_estimate_links <-
  partial_estimate[reduced_comparison_list$record_pairs[, 1]] ==
  partial_estimate[reduced_comparison_list$record_pairs[, 2]]

# Find the number of true matches in the partial estimate
true_matches_A <- sum(partial_estimate_links & true_links_reduced)

# Precision of the partial estimate
true_matches_A / sum(partial_estimate_links)

# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  full_estimate)
partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  partial_estimate)

# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
  full_estimate_id = full_estimate_relabel$link_id,
  partial_estimate_id = partial_estimate_relabel$link_id)
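The pairwise precision and recall calculations used in the workflows above can be checked by hand on a toy partition. The sketch below needs only base R; the ids and estimate vectors are made up for illustration and do not come from the package datasets.

```r
# Hypothetical truth and estimate for 5 records.
ids      <- c(1, 1, 2, 3, 3)   # true entity IDs
estimate <- c(7, 7, 7, 9, 4)   # estimated partition, arbitrary labels

# All record pairs (i, j) with i < j.
pairs <- t(combn(length(ids), 2))

# A pair is a true link if the IDs agree, an estimated link if the labels agree.
true_links <- ids[pairs[, 1]] == ids[pairs[, 2]]
est_links  <- estimate[pairs[, 1]] == estimate[pairs[, 2]]

true_matches <- sum(est_links & true_links)
precision <- true_matches / sum(est_links)   # 1 of 3 estimated links is correct
recall    <- true_matches / sum(true_links)  # 1 of 2 true links is recovered

c(precision = precision, recall = recall)
```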

References

no_dup_data

No Duplicate Dataset

CRAN · 0.1.1 · data · multilink/man/no_dup_data.Rd · 2026-05-07

A dataset containing 730 simulated records from 3 files with no duplicate records in each file.

Aliases

no_dup_data

Keywords

datasets

Usage

no_dup_data

Format

Source

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.

Examples

data(no_dup_data)

# There are 500 entities represented in the records
length(unique(no_dup_data$IDs))

References

no_dup_data_small

Small No Duplicate Dataset

CRAN · 0.1.1 · data · multilink/man/no_dup_data_small.Rd · 2026-05-07

A dataset containing 71 simulated records from 3 files with no duplicate records in each file, subset from no_dup_data.

Aliases

no_dup_data_small

Keywords

datasets

Usage

no_dup_data_small

Format

Source

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.

Examples

data(no_dup_data_small)

# There are 71 entities represented in the records
length(unique(no_dup_data_small$IDs))

References

reduce_comparison_data

Reduce Comparison Data Size

CRAN · 0.1.1 · multilink/man/reduce_comparison_data.Rd · 2026-05-07

Use indexing to reduce the number of record pairs that are potential matches.

Aliases

reduce_comparison_data

Usage

reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)

Arguments

comparison_list

The output of a call to create_comparison_data.

pairs_to_keep

A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs should be kept as potential matches. These potential matches do not have to be transitive (see the argument cc).

cc

A numeric indicator of whether to find the transitive closure of pairs_to_keep, and use these potential matches instead of just those from pairs_to_keep. cc should be 1 if the transitive closure is being used, and cc should be 0 if the transitive closure is not being used. We recommend setting cc to 1.

Details

When using comparison-based record linkage methods, scalability is a concern, as the number of record pairs is quadratic in the number of records. To address this concern, it is common to declare certain record pairs to not be potential matches a priori, using indexing methods. The user is free to index using any method they like, as long as they can produce a logical vector that indicates which record pairs are potential matches according to their indexing method.

If the user-chosen indexing method does not output potential matches that are transitive, we recommend setting the cc argument to 1. By transitive we mean that, for any three records i, j, and k, if i and j are potential matches, and j and k are potential matches, then i and k are potential matches. Non-transitive indexing schemes can lead to poor mixing of the Gibbs sampler used for posterior inference, and suggest that the indexing method used may have been too stringent.

If indexing is used, it may be the case that some records are declared to not be potential matches to any other records. In this case, the indexing method has made the decision that these records have no matches, and thus we can remove them from the data set and relabel the remaining records; see the documentation for labels for information on how to go between the original labeling and the new labeling.

If indexing is used, comparisons for record pairs that aren't potential matches are still used during inference, where they inform the distribution of comparisons for non-matches.
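To make the quadratic growth and the transitivity condition above concrete, here is a small base-R sketch. It checks whether a set of candidate pairs is transitive and computes its transitive closure with repeated matrix products; the toy pairs, the adjacency-matrix approach, and the helper name is_transitive are assumptions for illustration, not how reduce_comparison_data is implemented.

```r
# The number of record pairs grows quadratically with the number of records.
choose(c(100, 1000, 10000), 2)

# Hypothetical candidate pairs among 4 records: (1,2) and (2,3) are kept,
# (1,3) is not, and record 4 has no candidate match at all.
n <- 4
pairs <- rbind(c(1, 2), c(2, 3))

# Symmetric 0/1 adjacency matrix with self-links on the diagonal.
adj <- diag(n)
adj[pairs] <- 1
adj[pairs[, 2:1, drop = FALSE]] <- 1

# Transitive means every two-step connection is already a direct connection.
is_transitive <- function(a) all((a %*% a > 0) == (a > 0))

# Transitive closure: keep adding two-step connections until nothing changes.
closure <- adj
repeat {
  nxt <- (closure %*% closure > 0) * 1
  if (identical(nxt, closure)) break
  closure <- nxt
}

is_transitive(adj)       # FALSE: (1,3) is missing
is_transitive(closure)   # TRUE after closure
which(rowSums(closure) == 1)  # records with no potential match: 4
```

In this toy case the closure adds the pair (1,3), which is the kind of change cc = 1 is described as making to pairs_to_keep.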

Value

a list containing:

record_pairs

A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels).

comparisons

A logical matrix, where each row contains the comparisons between the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.

K

The number of files, assumed to be of class numeric.

file_sizes

A numeric vector of length K, indicating the size of each file. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes now represents the sizes of each file after removing such records.

duplicates

A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates.

field_levels

A numeric vector indicating the number of disagreement levels for each field.

file_labels

An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.

fp_matrix

An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].

rp_to_fp

A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.

ab

An integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.

file_sizes_not_included

If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes_not_included indicates, for each file, the number of such records that were removed.

ab_not_included

For record pairs not included according to pairs_to_keep, this is an integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.

labels

If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled. labels provides a dictionary that indicates, for each of the new labels, which record in the original labeling the new label corresponds to. In particular, the first column indicates the record in the original labeling, and the second column indicates the new labeling.

pairs_to_keep

A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs were kept as potential matches. This may not be the same as the input pairs_to_keep if cc was set to 1.

cc

A numeric indicator of whether the connected components of the potential matches are closed under transitivity.
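The labels dictionary described above (first column the original label, second column the new label) can be used to move a per-record vector from the new labeling back to the original one. The following base-R sketch uses a made-up dictionary and made-up cluster labels; the column names and the use of NA for removed records are assumptions, and relabel_bayes_estimate is the package's own tool for this step.

```r
# Hypothetical dictionary: records 2 and 5 (of 6) were removed by indexing,
# so the 4 remaining records were relabeled 1 to 4.
labels <- cbind(original = c(1, 3, 4, 6), new = 1:4)

# Hypothetical cluster labels for the 4 retained records, in the new labeling.
estimate_new <- c(1, 1, 3, 4)

# Put each retained record's cluster label at its original position.
# Removed records are left as NA here purely as a placeholder.
estimate_original <- rep(NA_real_, 6)
estimate_original[labels[, 1]] <- estimate_new[labels[, 2]]

estimate_original
```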

Examples

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
    c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

References

relabel_bayes_estimate

Relabel the Bayes Estimate of a Partition

CRAN · 0.1.1 · multilink/man/relabel_bayes_estimate.Rd · 2026-05-07

Relabel the Bayes estimate of a partition, for use after using indexing to reduce the number of record pairs that are potential matches.

Aliases

relabel_bayes_estimate

Usage

relabel_bayes_estimate(reduced_comparison_list, bayes_estimate)

Arguments

reduced_comparison_list

The output from a call to reduce_comparison_data.

bayes_estimate

The output from a call to find_bayes_estimate.

Details

When the function reduce_comparison_data is used to reduce the number of record pairs that are potential matches, it may be the case that some records are declared to not be potential matches to any other records. In this case, the indexing method has made the decision that these records have no matches, and thus we can remove them from the data set and relabel the remaining records; see the documentation for labels in reduce_comparison_data for information on how to go between the original labeling and the new labeling. The purpose of this function is to relabel the output of find_bayes_estimate when the function reduce_comparison_data is used, so that the user doesn't have to do this relabeling themselves.

Value

A data.frame, with as many rows as sum(reduced_comparison_list$file_sizes + reduced_comparison_list$file_sizes_not_included), i.e. the number of records originally input to create_comparison_data, before indexing occurred. This data.frame has two columns, "original_labels" and "link_id". Given row i of records originally input to create_comparison_data, the linkage id according to bayes_estimate is given by the ith row of the link_id column. See the documentation for find_bayes_estimate for information on how to interpret this linkage id.

Examples

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
    c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
  seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
  L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  full_estimate)
partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
  partial_estimate)

# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
  full_estimate_id = full_estimate_relabel$link_id,
  partial_estimate_id = partial_estimate_relabel$link_id)

References

specify_prior

Specify the Prior Distributions

CRAN · 0.1.1 · multilink/man/specify_prior.Rd · 2026-05-07

Specify the prior distributions for the m and u parameters of the models for comparison data among matches and non-matches, and the partition.

Aliases

specify_prior

Usage

specify_prior(
  comparison_list,
  mus = NA,
  nus = NA,
  flat = 0,
  alphas = NA,
  dup_upper_bound = NA,
  dup_count_prior_family = NA,
  dup_count_prior_pars = NA,
  n_prior_family = NA,
  n_prior_pars = NA
)

Arguments

comparison_list

The output from a call to create_comparison_data or reduce_comparison_data. Note that in order to correctly specify the prior, if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used for this argument.

mus, nus

The hyperparameters of the Dirichlet priors for the m and u parameters for the comparisons among matches and non-matches, respectively. These are positive numeric vectors which have length equal to the number of columns of comparison_list$comparisons times the number of file pairs (comparison_list$K * (comparison_list$K + 1) / 2). If set to NA, flat priors are used. We recommend using flat priors for m and u.
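The required length of mus and nus follows directly from the formula above. As a rough worked example, assume K = 3 files and seven fields whose disagreement-level counts are 2, 4, 4, 4, 4, 2, 2; these counts are an assumption made for illustration (in practice they would be read from comparison_list$field_levels), and hyper_length is a hypothetical name.

```r
K <- 3                                   # number of files
field_levels <- c(2, 4, 4, 4, 4, 2, 2)   # assumed disagreement levels per field

n_cols <- sum(field_levels)              # columns of comparison_list$comparisons
n_file_pairs <- K * (K + 1) / 2          # file pairs, including within-file pairs

hyper_length <- n_cols * n_file_pairs    # required length of mus and of nus
hyper_length
```

Under these assumptions the comparison matrix has 22 columns and there are 6 file pairs, so an explicit mus or nus would need 132 positive entries.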

flat

A numeric indicator of whether a flat prior for partitions should be used. flat should be 1 if a flat prior is used, and flat should be 0 if a structured prior is used. If a flat prior is used, the remaining arguments should be set to NA. Otherwise, the remaining arguments should be specified. We do not recommend using a flat prior for partitions in general.

alphas

The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K - 1. The indexing of these hyperparameters is based on the comparison_list$K-bit binary representation of the inclusion patterns of the overlap table. To give a few examples, suppose comparison_list$K is 3. 1 in 3-bit binary is 001, so alphas[1] is the hyperparameter for the 001 cell of the overlap table, representing clusters containing only records from the third file. 2 in 3-bit binary is 010, so alphas[2] is the hyperparameter for the 010 cell of the overlap table, representing clusters containing only records from the second file. 3 in 3-bit binary is 011, so alphas[3] is the hyperparameter for the 011 cell of the overlap table, representing clusters containing only records from the second and third files. If set to NA, the hyperparameters will all be set to 1.
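The binary indexing described above can be tabulated for K = 3 with a few lines of base R. This sketch only reproduces the index-to-pattern mapping stated in the argument description; the variable names are hypothetical.

```r
K <- 3
idx <- seq_len(2^K - 1)   # positions 1..7 of alphas

# File k corresponds to the k-th bit from the left, i.e. the bit worth 2^(K - k).
included <- sapply(seq_len(K), function(k) (idx %/% 2^(K - k)) %% 2 == 1)

# Inclusion pattern as a K-bit string, and the files present in that cell.
pattern <- apply(included, 1, function(row) paste(as.integer(row), collapse = ""))
files   <- apply(included, 1, function(row) paste(which(row), collapse = ","))

data.frame(alphas_index = idx, pattern = pattern, files = files)
```

The first three rows match the worked examples in the text (001 for the third file only, 010 for the second file only, 011 for the second and third files), and the last row, 111, is the cell for clusters with records from all three files.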

dup_upper_bound

A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file. If set to NA, the upper bound for file k will be set to 1 if no duplicates are allowed for that file, or comparison_list$file_sizes[k] if duplicates are allowed for that file.

dup_count_prior_family

A character vector indicating the prior distribution family used for the number of duplicates in each cluster, for each file. Currently the only option is "Poisson" for a Poisson prior, truncated to lie between 1 and dup_upper_bound[k]. The mean parameter of the Poisson distribution is specified using the dup_count_prior_pars argument. If set to NA, a Poisson prior with mean 1 will be used.

dup_count_prior_pars

A list containing the parameters for the prior distribution for the number of duplicates in each cluster, for each file. For file k, when dup_count_prior_family[k]="Poisson", dup_count_prior_pars[[k]] is a positive constant representing the mean of the Poisson prior.
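To make the truncation concrete, the sketch below computes the log density of a Poisson distribution restricted to 1, ..., dup_upper_bound[k] and renormalized. log_truncated_poisson is a hypothetical helper, and the package's internal parameterization may differ (for example, it could shift the counts before applying the Poisson density).

```r
# Hypothetical helper (not part of multilink): log density of a
# Poisson(mean) distribution truncated to the support 1, ..., upper.
log_truncated_poisson <- function(upper, mean) {
  support <- seq_len(upper)
  log_unnormalized <- dpois(support, lambda = mean, log = TRUE)
  # Renormalize over the truncated support using log-sum-exp for stability.
  m <- max(log_unnormalized)
  log_unnormalized - (m + log(sum(exp(log_unnormalized - m))))
}

# One bound and one mean per file, mirroring the argument layout above.
dup_upper_bound <- c(10, 10, 10)
dup_count_prior_pars <- list(1, 1, 1)
log_prior <- Map(log_truncated_poisson, dup_upper_bound, dup_count_prior_pars)
```

When the upper bound is 1, the truncated prior puts all of its mass on a single record per cluster, which corresponds to the no-duplicates case.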

n_prior_family

A character indicating the prior distribution family used for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently there are two options: "uniform" for a uniform prior for n, i.e. p(n) ∝ 1, and "scale" for a scale prior for n, i.e. p(n) ∝ 1/n. If set to NA, a uniform prior will be used.
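The two families differ only in how they weight larger values of n. A minimal sketch of the unnormalized log priors follows; log_n_prior_sketch is a hypothetical helper and is not necessarily how the package computes its log_n_prior output.

```r
# Hypothetical helper (not part of multilink): unnormalized log prior for
# n = 1, ..., n_max under the two supported families.
log_n_prior_sketch <- function(n_max, family = c("uniform", "scale")) {
  family <- match.arg(family)
  n <- seq_len(n_max)
  switch(family,
         uniform = rep(0, n_max),  # p(n) proportional to 1
         scale = -log(n))          # p(n) proportional to 1/n
}

uniform_prior <- log_n_prior_sketch(5, "uniform")
scale_prior <- log_n_prior_sketch(5, "scale")
```

Under the scale prior each additional cluster is penalized, so it favors partitions with fewer clusters relative to the uniform prior.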

n_prior_pars

Currently set to NA. When more prior distribution families for n are implemented, this will be a vector of parameters for those priors.

Details

The purpose of this function is to specify prior distributions for all parameters of the model. Please note that if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used as input.

For the hyperparameters of the Dirichlet priors for the m and u parameters for the comparisons among matches and non-matches, respectively, we recommend using a flat prior. This is accomplished by setting mus=NA and nus=NA. Informative prior specifications are possible, but in practice they will be overwhelmed by the large number of comparisons.

For the prior for partitions, we do not recommend using a flat prior. Instead we recommend using our structured prior for partitions. By setting flat=0 and the remaining arguments to NA, one obtains the default specification for the structured prior that we have found to perform well in simulation studies.

The structured prior for partitions is specified in three steps.

First, specify a prior for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently, a uniform prior and a scale prior for n are supported. Our default specification uses a uniform prior.

Second, specify a prior for the overlap table (see the documentation for alphas for more information). Currently a Dirichlet-multinomial prior is supported. Our default specification sets all hyperparameters of the Dirichlet-multinomial prior to 1.

Third, for each file, specify a prior for the number of duplicates in each cluster. As a part of this prior, we specify the maximum number of records in a cluster for each file, through dup_upper_bound. When there are assumed to be no duplicates in a file, the maximum number of records in a cluster for that file is set to 1. When there are assumed to be duplicates in a file, we recommend setting the maximum number of records in a cluster for that file to be less than the file size, if prior knowledge allows. Currently, a Poisson prior for the number of duplicates in each cluster is supported. Our default specification uses a Poisson prior with mean 1.

Please contact the package maintainer if you need new prior families for n or the number of duplicates in each cluster to be supported.
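The default structured prior described above (flat = 0, everything else NA) might be requested as in the following sketch. It reuses the no_dup_data_small call from the Examples section and only runs the package calls when multilink is installed. The default_args list is an illustrative convenience, not something the package requires.

```r
# Default structured prior: flat = 0 and every other argument left as NA.
default_args <- list(mus = NA, nus = NA, flat = 0, alphas = NA,
                     dup_upper_bound = NA, dup_count_prior_family = NA,
                     dup_count_prior_pars = NA, n_prior_family = NA,
                     n_prior_pars = NA)

if (requireNamespace("multilink", quietly = TRUE)) {
  data("no_dup_data_small", package = "multilink")
  comparison_list <- multilink::create_comparison_data(
    no_dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
    file_sizes = no_dup_data_small$file_sizes,
    duplicates = c(0, 0, 0))
  # Pass the comparison data first, followed by the named defaults.
  prior_list <- do.call(multilink::specify_prior,
                        c(list(comparison_list), default_args))
}
```

If reduce_comparison_data has been applied, its output should replace comparison_list in the final call, as noted above.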

Value

A list containing:

mus

The hyperparameters of the Dirichlet priors for the m parameters for the comparisons among matches.

nus

The hyperparameters of the Dirichlet priors for the u parameters for the comparisons among non-matches. Includes data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data.

flat

A numeric indicator of whether a flat prior for partitions should be used. flat is 1 if a flat prior is used, and flat is 0 if a structured prior is used.

no_dups

A numeric indicator of whether no duplicates are allowed in all of the files.

alphas

The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K, where the first element is 0.

alpha_0

The sum of alphas.

dup_upper_bound

A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file.

log_dup_count_prior

A list containing the log density of the prior distribution for the number of duplicates in each cluster, for each file.

log_n_prior

A numeric vector containing the log density of the prior distribution for the number of clusters represented in the records.

nus_specified

The nus before data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data are added. Used for input checking.

Examples

# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)),
  n_prior_family = "uniform", n_prior_pars = NA)

Version history

Repository	Version	Published	First seen	Last seen	Docs
CRAN	0.1.0	2023-01-23	2026-05-31	2026-07-25
CRAN	0.1.1		2026-06-01	2026-07-10

Security

No OSV data to display.

Literature signals

No OpenAlex data to display.