Title: | Compare Similarity Across Text, Factors, or Numbers |
---|---|
Description: | Compare lists of texts, factors, or numerical values to measure their similarity. The motivating use case is evaluating the similarity of large language model responses across models, providers, or prompts. Approximate string matching is implemented using 'stringdist'. |
Authors: | Dylan Pieper [aut, cre] |
Maintainer: | Dylan Pieper <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-03-21 17:17:47 UTC |
Source: | https://github.com/dylanpieper/samesies |
Calculates and returns the average similarity score for each method used in the comparison.
average_similarity(x, ...)
x |
A similarity object |
... |
Additional arguments (not used) |
A named numeric vector of mean similarity scores for each method
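As a rough illustration of what a per-method average involves, the sketch below pools scores in base R. It assumes a nested list keyed by method and then by list pair. That layout and the names `toy_scores` and `method_means` are hypothetical and are not taken from the package internals.

```r
# Hypothetical score layout (assumption): method -> list pair -> numeric scores.
toy_scores <- list(
  exact = list(a_b = c(1, 0, 0)),
  exp   = list(a_b = c(1, 0.9, 0.8))
)

# Pool every pair's scores within a method, then take the mean.
method_means <- vapply(
  toy_scores,
  function(pairs) mean(unlist(pairs)),
  numeric(1)
)

print(round(method_means, 3))
```

The result is a named numeric vector with one entry per method, which matches the shape described above.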
Calculates and returns the average similarity scores for each pair of lists compared, broken down by method.
pair_averages(x, method = NULL, ...)
x |
A similarity object |
method |
Optional character vector of methods to include |
... |
Additional arguments (not used) |
A data frame containing:
method |
The similarity method used |
pair |
The pair of lists compared |
avg_score |
Mean similarity score for the pair |
A data frame containing pair-wise average scores
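The three columns described above can be pictured with a small base R sketch. It is not the package's code: the nested `toy_scores` list (method, then pair) and the pair labels are assumptions made for illustration only.

```r
# Hypothetical score layout (assumption): method -> list pair -> numeric scores.
toy_scores <- list(
  exact = list(a_b = c(1, 0, 0), a_c = c(1, 1, 0)),
  exp   = list(a_b = c(1, 0.9, 0.8), a_c = c(1, 1, 0.5))
)

# One data frame per method, one row per pair, then stack the pieces.
rows <- lapply(names(toy_scores), function(m) {
  data.frame(
    method = m,
    pair = names(toy_scores[[m]]),
    avg_score = vapply(toy_scores[[m]], mean, numeric(1)),
    row.names = NULL
  )
})
pair_avg <- do.call(rbind, rows)

# Optional filter, analogous in spirit to the method argument.
exact_only <- pair_avg[pair_avg$method %in% "exact", ]
print(pair_avg)
```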
Print a similarity object
## S3 method for class 'similar'
print(x, ...)
x |
A similarity object |
... |
Additional arguments (not used) |
The object invisibly
Print method for similar_factor objects
## S3 method for class 'similar_factor'
print(x, ...)
x |
A similar_factor object |
... |
Additional arguments (not used) |
The object invisibly
Print method for similar_number objects
## S3 method for class 'similar_number'
print(x, ...)
x |
A similar_number object |
... |
Additional arguments (not used) |
The object invisibly
Print method for similar_text objects
## S3 method for class 'similar_text'
print(x, ...)
x |
A similar_text object |
... |
Additional arguments (not used) |
The object invisibly
Print method for summary.similar objects
## S3 method for class 'summary.similar'
print(x, ...)
x |
A summary.similar object |
... |
Additional arguments (not used) |
The summary object invisibly
Print method for summary.similar_factor objects
## S3 method for class 'summary.similar_factor'
print(x, ...)
x |
A summary.similar_factor object |
... |
Additional arguments (not used) |
The object invisibly
Print method for summary.similar_number objects
## S3 method for class 'summary.similar_number'
print(x, ...)
x |
A summary.similar_number object |
... |
Additional arguments (not used) |
The object invisibly
Print method for summary.similar_text objects
## S3 method for class 'summary.similar_text'
print(x, ...)
x |
A summary.similar_text object |
... |
Additional arguments (not used) |
The object invisibly
Compare Factor Similarity Across Lists
same_factor(..., method = c("exact", "order"), levels, ordered = FALSE, digits = 3)
... |
Lists of categorical values (character or factor) to compare. The lists can be named. |
method |
Character vector of similarity methods. Choose from: "exact", "order" (default: all) |
levels |
Character vector of all allowed levels for comparison |
ordered |
Logical. If TRUE, treat levels as ordered (ordinal). If FALSE, the "order" method is skipped. |
digits |
Number of digits to round results (default: 3) |
An S3 object of class "similar_factor" containing:
scores: Numeric similarity scores by method and comparison
summary: Summary statistics by method and comparison
methods: Methods used for comparison
list_names: Names of compared lists
levels: Levels used for categorical comparison
list1 <- list("high", "medium", "low")
list2 <- list("high", "low", "medium")

# Using unnamed lists
result1 <- same_factor(list1, list2, levels = c("low", "medium", "high"))

# Using named lists for more control
result2 <- same_factor(
  "l1" = list1,
  "l2" = list2,
  levels = c("low", "medium", "high")
)
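To make the difference between the "exact" and "order" methods concrete, here is a base R sketch on the same example data. The "exact" part follows directly from its name. The "order" formula (one minus the rank distance divided by the largest possible rank distance) is an assumption chosen for illustration; the package may compute it differently.

```r
lvls <- c("low", "medium", "high")
list1 <- list("high", "medium", "low")
list2 <- list("high", "low", "medium")

# Position of each value within the ordered levels (low = 1, high = 3).
r1 <- match(unlist(list1), lvls)
r2 <- match(unlist(list2), lvls)

# Exact: 1 when the categories match, 0 otherwise.
exact_scores <- as.numeric(r1 == r2)

# Order (assumed formula): closer ranks score higher.
order_scores <- 1 - abs(r1 - r2) / (length(lvls) - 1)

print(exact_scores)
print(order_scores)
```

Under this assumed formula, swapping two adjacent levels is penalized less than it is under exact matching, which is the usual motivation for an ordinal comparison.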
Computes similarity scores between two or more lists of numeric values using multiple comparison methods.
same_number(
  ...,
  method = c("exact", "raw", "exp", "percent", "normalized", "fuzzy"),
  epsilon = 0.05,
  epsilon_pct = 0.02,
  max_diff = NULL,
  digits = 3
)
... |
Two or more lists containing numeric values to compare. The lists can be named. |
method |
Character vector specifying similarity methods (default: all) |
epsilon |
Threshold for fuzzy matching (default: 0.05, as shown in the usage signature) |
epsilon_pct |
Relative epsilon expressed as a proportion (default: 0.02, i.e., 2%). Only used when method is "fuzzy" |
max_diff |
Maximum difference for normalization (default: NULL for auto-calculation) |
digits |
Number of digits to round results (default: 3) |
The available methods are:
exact: Binary similarity (1 if equal, 0 otherwise)
percent: Percentage difference relative to the larger value
normalized: Absolute difference normalized by a maximum difference value
fuzzy: Similarity based on an epsilon threshold
exp: Exponential decay based on absolute difference (e^-diff)
raw: Returns the raw absolute difference (|num1 - num2|) instead of a similarity score
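The method descriptions above can be read as simple formulas. The base R sketch below is one plausible reading, not the package's implementation: the helper name `num_similarity_sketch` is hypothetical, "fuzzy" is assumed to be a hard threshold on the absolute difference, and `max_diff` is passed explicitly rather than auto-calculated.

```r
# Hypothetical helper (not part of samesies): one score per method for a pair.
num_similarity_sketch <- function(a, b, epsilon = 0.05, max_diff = 1) {
  diff <- abs(a - b)
  larger <- max(abs(a), abs(b))
  c(
    exact = as.numeric(a == b),                          # 1 if equal, else 0
    raw = diff,                                          # |a - b|, not a similarity
    exp = exp(-diff),                                    # exponential decay
    percent = if (larger == 0) 1 else 1 - diff / larger, # relative to larger value
    normalized = 1 - min(diff / max_diff, 1),            # scaled by max_diff
    fuzzy = as.numeric(diff <= epsilon)                  # assumption: hard cutoff
  )
}

scores <- num_similarity_sketch(2, 2.1)
print(round(scores, 3))
```

Every entry except `raw` stays within 0 and 1 under these assumptions, which is consistent with the note that "raw" is the one method that does not return a similarity score.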
An S3 object of class "similar_number" containing:
scores: A list of similarity scores for each method and list pair
summary: A list of statistical summaries for each method and list pair
methods: The similarity methods used
list_names: Names of the input lists
raw_values: The original input lists
list1 <- list(1, 2, 3)
list2 <- list(1, 2.1, 3.2)

# Using unnamed lists
result1 <- same_number(list1, list2)

# Using named lists for more control
result2 <- same_number("n1" = list1, "n2" = list2)
Compare Text Similarity Across Lists
same_text(
  ...,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
  q = 1,
  p = NULL,
  bt = 0,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  digits = 3
)
... |
Lists of character strings to compare. The lists can be named. |
method |
Character vector of similarity methods from 'stringdist' (default: all) |
q |
Size of q-gram for q-gram based methods (default: 1) |
p |
Winkler scaling factor for the "jw" method (documented default: 0.1; the usage signature shows NULL) |
bt |
Boost threshold for the "jw" method (default: 0) |
weight |
Vector of weights for operations: deletion (d), insertion (i), substitution (s), transposition (t) |
digits |
Number of digits to round results (default: 3) |
An S3 object of class "similar_text" containing:
scores: Numeric similarity scores by method and comparison
summary: Summary statistics by method and comparison
methods: Methods used for comparison
list_names: Names of compared lists
list1 <- list("hello", "world")
list2 <- list("helo", "word")

# Using unnamed lists
result1 <- same_text(list1, list2)

# Using named lists for more control
result2 <- same_text("l1" = list1, "l2" = list2)
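For intuition about how an edit distance turns into a score between 0 and 1, the sketch below uses base R's `utils::adist` (generalized Levenshtein distance) on the same example strings. It is a stand-in under stated assumptions: the package relies on 'stringdist', and dividing by the longer string's length is one common normalization that may differ from what each method actually uses.

```r
list1 <- list("hello", "world")
list2 <- list("helo", "word")

a <- unlist(list1)
b <- unlist(list2)

# Levenshtein distance for each aligned pair of strings.
d <- unname(mapply(function(x, y) utils::adist(x, y)[1, 1], a, b))

# Assumed normalization: 1 means identical, 0 means maximally different.
lv_sim <- 1 - d / pmax(nchar(a), nchar(b))

print(lv_sim)
```

Both pairs differ by a single deleted character out of five, so they receive the same score under this normalization.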
similar is an S3 class for all similarity comparison objects. This class defines common properties shared among child classes like similar_text, similar_factor, and similar_number.
similar(scores, summary, methods, list_names, digits = 3)
scores |
List of similarity scores per method and comparison |
summary |
Summary statistics by method and comparison |
methods |
Character vector of methods used for comparison |
list_names |
Character vector of names for the compared lists |
digits |
Number of digits to round results (default: 3) |
This class provides the foundation for all similarity comparison classes. It includes common properties:
scores: List of similarity scores per method and comparison
summary: Summary statistics by method and comparison
methods: Character vector of methods used for comparison
list_names: Character vector of names for the compared lists
digits: Number of digits to round results in output
An object of class "similar" with the following components:
scores: List of similarity scores per method and comparison
summary: Summary statistics by method and comparison
methods: Character vector of methods used for comparison
list_names: Character vector of names for the compared lists
digits: Number of digits to round results in output
The similarity scores are normalized values between 0 and 1, where 1 indicates perfect similarity and 0 indicates no similarity.
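The parent and child relationship described here follows the usual S3 pattern of a class vector with the most specific class first. The sketch below shows that pattern with deliberately different names (`new_similar_sketch`, `similar_sketch`, `similar_text_sketch`), all hypothetical, so it does not claim to reproduce the package's constructor.

```r
# Hypothetical constructor (assumption): stores the common fields in a list.
new_similar_sketch <- function(scores, summary, methods, list_names,
                               digits = 3, subclass = character()) {
  structure(
    list(scores = scores, summary = summary, methods = methods,
         list_names = list_names, digits = digits),
    class = c(subclass, "similar_sketch")  # child class first, parent last
  )
}

# Print method on the parent class; children fall back to it via dispatch.
print.similar_sketch <- function(x, ...) {
  cat("Lists:", paste(x$list_names, collapse = ", "), "\n")
  cat("Methods:", paste(x$methods, collapse = ", "), "\n")
  invisible(x)  # return the object invisibly, as the print methods here do
}

obj <- new_similar_sketch(
  scores = list(exact = list(a_b = c(1, 0, 1))),
  summary = list(exact = list(a_b = c(mean = 2 / 3))),
  methods = "exact",
  list_names = c("a", "b"),
  subclass = "similar_text_sketch"
)

returned <- print(obj)
```

Because no `print.similar_text_sketch` method is defined, dispatch falls through to the parent method, which is how shared behavior can live on the base class.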
similar_factor is an S3 class for categorical/factor similarity comparisons.
similar_factor(scores, summary, methods, list_names, levels, digits = 3)
scores |
List of similarity scores per method and comparison |
summary |
Summary statistics by method and comparison |
methods |
Character vector of methods used for comparison |
list_names |
Character vector of names for the compared lists |
levels |
Character vector of factor levels |
digits |
Number of digits to round results (default: 3) |
This class extends the similar class and implements similarity comparison methods specific to categorical data.
An object of class "similar_factor" (which inherits from "similar") containing:
scores: List of factor similarity scores per method and comparison
summary: Summary statistics by method and comparison
methods: Character vector of factor comparison methods used (exact, order)
list_names: Character vector of names for the compared factor lists
digits: Number of digits to round results in output
levels: Character vector of factor levels used in the comparison
The factor similarity scores are normalized values between 0 and 1, where 1 indicates identical factors and 0 indicates completely different factors based on the specific method used.
similar_number is an S3 class for numeric similarity comparisons.
similar_number(scores, summary, methods, list_names, raw_values, digits = 3)
scores |
List of similarity scores per method and comparison |
summary |
Summary statistics by method and comparison |
methods |
Character vector of methods used for comparison |
list_names |
Character vector of names for the compared lists |
raw_values |
List of raw numeric values being compared |
digits |
Number of digits to round results (default: 3) |
This class extends the similar class and implements similarity comparison methods specific to numeric data.
An object of class "similar_number" (which inherits from "similar") containing:
scores: List of numeric similarity scores per method and comparison
summary: Summary statistics by method and comparison
methods: Character vector of numeric comparison methods used (exact, percent, normalized, fuzzy, exp, raw)
list_names: Character vector of names for the compared numeric lists
digits: Number of digits to round results in output
raw_values: List of raw numeric values that were compared
The numeric similarity scores are normalized values between 0 and 1, where 1 indicates identical numbers and 0 indicates maximally different numbers based on the specific method used. The exception is the "raw" method, which returns the absolute difference between values.
similar_text is an S3 class for text similarity comparisons.
similar_text(scores, summary, methods, list_names, digits = 3)
scores |
List of similarity scores per method and comparison |
summary |
Summary statistics by method and comparison |
methods |
Character vector of methods used for comparison |
list_names |
Character vector of names for the compared lists |
digits |
Number of digits to round results (default: 3) |
This class extends the similar class and implements similarity comparison methods specific to text.
An object of class "similar_text" (which inherits from "similar") containing:
scores: List of text similarity scores per method and comparison
summary: Summary statistics by method and comparison
methods: Character vector of text similarity methods used (osa, lv, dl, etc.)
list_names: Character vector of names for the compared text lists
digits: Number of digits to round results in output
The text similarity scores are normalized values between 0 and 1, where 1 indicates identical text and 0 indicates completely different text based on the specific method used.
Summarize a similarity object
## S3 method for class 'similar'
summary(object, ...)
object |
A similarity object |
... |
Additional arguments (not used) |
A summary object
Summary method for similar_factor objects
## S3 method for class 'similar_factor'
summary(object, ...)
object |
A similar_factor object |
... |
Additional arguments (not used) |
A summary.similar_factor object
Summary method for similar_number objects
## S3 method for class 'similar_number'
summary(object, ...)
object |
A similar_number object |
... |
Additional arguments (not used) |
A summary.similar_number object
Summary method for similar_text objects
## S3 method for class 'similar_text'
summary(object, ...)
object |
A similar_text object |
... |
Additional arguments (not used) |
A summary.similar_text object