Title: | Batch Process LLM Text Completions Using a Data Frame |
---|---|
Description: | Batch process large language model (LLM) text completions using data frame rows, with support for OpenAI's 'GPT' (<https://chat.openai.com>), Anthropic's 'Claude' (<https://claude.ai>), and Google's 'Gemini' (<https://gemini.google.com>). Includes features such as local storage, metadata logging, API rate limiting delays, and a 'shiny' app addin. |
Authors: | Dylan Pieper [aut, cre, cph] |
Maintainer: | Dylan Pieper <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.0 |
Built: | 2024-11-18 02:53:05 UTC |
Source: | https://github.com/dylanpieper/batchllm |
Batch process large language model (LLM) text completions by looping across the rows of a data frame column. The package currently supports OpenAI's GPT, Anthropic's Claude, and Google's Gemini models, with built-in delays for API rate limiting. The package provides advanced text processing features, including automatic logging of batches and metadata to local files, side-by-side comparison of outputs from different LLMs, and integration of a user-friendly Shiny App Addin. Use cases include natural language processing tasks such as sentiment analysis, thematic analysis, classification, labeling or tagging, and language translation.
    batchLLM(
      df,
      df_name = NULL,
      col,
      prompt,
      LLM = "openai",
      model = "gpt-4o-mini",
      temperature = 0.5,
      max_tokens = 500,
      batch_delay = "random",
      batch_size = 10,
      case_convert = NULL,
      sanitize = FALSE,
      attempts = 1,
      log_name = "batchLLM-log",
      hash_algo = "crc32c",
      ...
    )
Argument | Description |
---|---|
df | A data frame that contains the input data. |
df_name | An optional string specifying the name of the data frame to log. This is particularly useful in Shiny applications or when the data frame is passed programmatically rather than explicitly. Default is NULL. |
col | The name of the column in the data frame to process. |
prompt | A system prompt for the LLM. |
LLM | A string for the name of the LLM, with the options "openai", "anthropic", and "google". Default is "openai". |
model | A string for the name of the model from the LLM. Default is "gpt-4o-mini". |
temperature | A temperature for the LLM. Default is 0.5. |
max_tokens | A maximum number of tokens to generate before stopping. Default is 500. |
batch_delay | A string for the batch delay, with the options "random", "min", and "sec". Numeric examples include "1min" and "30sec". Default is "random", which averages 10.86 seconds (n = 1,000 simulations). |
batch_size | The number of rows to process in each batch. Default is 10. |
case_convert | A string for the case conversion of the output, with the options "upper", "lower", or NULL (no change). Default is NULL. |
sanitize | Extract the LLM text completion from the model's response by returning only content in |
attempts | The maximum number of loop retry attempts. Default is 1. |
log_name | A string for the name of the log without the file extension. Default is "batchLLM-log". |
hash_algo | A string for a hashing algorithm from the 'digest' package. Default is "crc32c". |
... | Additional arguments to pass on to the LLM API function. |
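The `batch_delay` strings described above ("random", "1min", "30sec") can be turned into a number of seconds along the following lines. This is a minimal sketch in base R, not the package's own implementation: `parse_batch_delay()` is a hypothetical helper, it accepts whole numbers only, and the uniform 5 to 15 second range used for "random" is an assumption, since the distribution behind the documented 10.86 second average is not stated here.

```r
# Hypothetical helper (not part of batchLLM): convert a batch_delay string
# into a number of seconds.
parse_batch_delay <- function(batch_delay) {
  if (identical(batch_delay, "random")) {
    # ASSUMPTION: uniform draw between 5 and 15 seconds. The package's real
    # random delay distribution is not documented in this manual.
    return(stats::runif(1, min = 5, max = 15))
  }
  # Expect a whole number followed by "sec" or "min", e.g. "30sec" or "1min"
  m <- regmatches(batch_delay, regexec("^([0-9]+)(sec|min)$", batch_delay))[[1]]
  if (length(m) != 3) {
    stop("batch_delay must be \"random\" or look like \"30sec\" or \"1min\"")
  }
  value <- as.numeric(m[2])
  if (m[3] == "min") value * 60 else value
}

parse_batch_delay("30sec")  # 30
parse_batch_delay("1min")   # 60
```

A value parsed this way could then be passed to `Sys.sleep()` between batches.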
Returns the input data frame with an additional column containing the text completion output. The function also writes the output and metadata to the log file after each batch in a nested list format.
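To make the return value concrete, the sketch below processes a toy data frame in batches, adds an output column, and rewrites a nested-list log with `saveRDS()` after every batch. Everything here is an illustrative assumption: `fake_llm()` stands in for a real API call so the code runs offline, `process_in_batches()` and the `output` column name are hypothetical, the log layout is invented, and the column is passed as a string rather than unquoted as in `batchLLM()`.

```r
# Hypothetical stand-in for an LLM call so the sketch runs without API keys.
fake_llm <- function(text, prompt) toupper(text)

# Hypothetical batch loop (not batchLLM's implementation).
process_in_batches <- function(df, col, prompt, batch_size = 10,
                               log_file = tempfile(fileext = ".rds")) {
  n <- nrow(df)
  # Assign each row to a batch: rows 1..batch_size go to batch 1, and so on
  batch_id <- ceiling(seq_len(n) / batch_size)
  df$output <- rep(NA_character_, n)
  log <- list()
  for (b in unique(batch_id)) {
    rows <- which(batch_id == b)
    df$output[rows] <- vapply(df[[col]][rows], fake_llm, character(1),
                              prompt = prompt, USE.NAMES = FALSE)
    # ASSUMED log layout: one nested entry per batch with output and metadata
    log[[paste0("batch_", b)]] <- list(
      rows = rows,
      output = df$output[rows],
      metadata = list(prompt = prompt, batch_size = batch_size,
                      timestamp = Sys.time())
    )
    # Persist the whole log after each batch so partial progress survives
    saveRDS(log, log_file)
  }
  attr(df, "log_file") <- log_file
  df
}

toy <- data.frame(statement = c("a", "b", "c"), stringsAsFactors = FALSE)
result <- process_in_batches(toy, "statement", prompt = "demo", batch_size = 2)
result$output                             # "A" "B" "C"
names(readRDS(attr(result, "log_file")))  # "batch_1" "batch_2"
```

Writing the log inside the loop, rather than once at the end, is what lets an interrupted run keep the batches that already finished.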
    ## Not run:
    library(batchLLM)

    # Set API keys
    Sys.setenv(OPENAI_API_KEY = "your_openai_api_key")
    Sys.setenv(ANTHROPIC_API_KEY = "your_anthropic_api_key")
    Sys.setenv(GEMINI_API_KEY = "your_gemini_api_key")

    # Define LLM configurations
    llm_configs <- list(
      list(LLM = "openai", model = "gpt-4o-mini"),
      list(LLM = "anthropic", model = "claude-3-haiku-20240307"),
      list(LLM = "google", model = "1.5-flash")
    )

    # Apply batchLLM function to each configuration
    beliefs <- lapply(llm_configs, function(config) {
      batchLLM(
        df = beliefs,
        col = statement,
        prompt = "classify as a fact or misinformation in one word",
        LLM = config$LLM,
        model = config$model,
        batch_size = 10,
        batch_delay = "1min",
        case_convert = "lower"
      )
    })[[length(llm_configs)]]

    # Print the updated data frame
    print(beliefs)

    ## End(Not run)
This function provides a user interface using Shiny to interact with the batchLLM package. It allows users to configure and execute batch processing through an interactive dashboard.
batchLLM_shiny()
No return value. Launches a Shiny Gadget that allows users to interact with the batchLLM package.
The beliefs dataset consists of 20 statements representing opposing views on various scientific, environmental, and societal topics.
beliefs
A data frame with 20 rows and 1 variable:
statement | A character string with a statement representing a belief. |
head(beliefs)
This function provides an interface to interact with Claude AI models via Anthropic's API, allowing for flexible text generation based on user inputs. This function was adapted from the claudeR repository by yrvelez on GitHub (MIT License).
    claudeR(
      prompt,
      model = "claude-3-5-sonnet-20240620",
      max_tokens = 500,
      stop_sequences = NULL,
      temperature = 0.7,
      top_k = -1,
      top_p = -1,
      api_key = NULL,
      system_prompt = NULL
    )
Argument | Description |
---|---|
prompt | A string vector for Claude-2, or a list for Claude-3, specifying the input for the model. |
model | The model to use for the request. Default is "claude-3-5-sonnet-20240620". |
max_tokens | A maximum number of tokens to generate before stopping. Default is 500. |
stop_sequences | Optional. A list of strings upon which to stop generating. |
temperature | Optional. Amount of randomness injected into the response. Default is 0.7. |
top_k | Optional. Only sample from the top K options for each subsequent token. |
top_p | Optional. Cumulative probability cutoff for nucleus sampling. |
api_key | Your API key for authentication. |
system_prompt | Optional. A system role specification. |
The resulting completion up to and excluding the stop sequences.
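The phrase "up to and excluding the stop sequences" can be illustrated offline with a few lines of base R. This is only a sketch of the semantics: `truncate_at_stop()` is a hypothetical helper, and nothing here claims that `claudeR()` performs this step itself, since stop sequences are normally applied by the API.

```r
# Hypothetical helper: cut a completion at the earliest stop sequence,
# dropping the stop sequence itself.
truncate_at_stop <- function(text, stop_sequences = NULL) {
  if (is.null(stop_sequences)) return(text)
  # First position of each stop sequence in the text (-1 if absent)
  pos <- vapply(stop_sequences, function(s) {
    as.integer(regexpr(s, text, fixed = TRUE))
  }, integer(1))
  pos <- pos[pos > 0]
  if (length(pos) == 0) return(text)
  # Keep everything before the earliest match
  substr(text, 1, min(pos) - 1)
}

truncate_at_stop("Paris.\n\nHuman: next question", list("\n\nHuman:"))  # "Paris."
truncate_at_stop("Paris.", list("\n\nHuman:"))                          # "Paris."
```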
    ## Not run:
    library(batchLLM)

    # Set API in the env or use api_key parameter in the claudeR call
    Sys.setenv(ANTHROPIC_API_KEY = "your_anthropic_api_key")

    # Using Claude-2
    response <- claudeR(
      prompt = "What is the capital of France?",
      model = "claude-2.1",
      max_tokens = 50
    )
    cat(response)

    # Using Claude-3
    response <- claudeR(
      prompt = list(
        list(role = "user", content = "What is the capital of France?")
      ),
      model = "claude-3-5-sonnet-20240620",
      max_tokens = 50,
      temperature = 0.8
    )
    cat(response)

    # Using a system prompt
    response <- claudeR(
      prompt = list(
        list(role = "user", content = "Summarize the history of France in one paragraph.")
      ),
      system_prompt = "You are a concise summarization assistant.",
      max_tokens = 500
    )
    cat(response)

    ## End(Not run)
Get batches of generated output in a single data frame from the .rds log file.
get_batches(df_name = NULL, log_name = "batchLLM-log")
Argument | Description |
---|---|
df_name | A string to match the name of a processed data frame. |
log_name | A string specifying the name of the log without the extension. Default is "batchLLM-log". |
A data frame containing the generated output.
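As a rough illustration of collecting batches from an .rds log into one data frame, the sketch below writes a small nested list to a temporary file and row-binds the per-batch outputs. The log layout (data frame name, then batch name, then an `output` data frame), the `beliefs_demo` entry, and `collect_batches()` are all assumptions made for this example; the real log structure written by `batchLLM()` may differ.

```r
# ASSUMED log layout: df name -> batch name -> list(output = data.frame)
demo_log <- list(
  beliefs_demo = list(
    batch_1 = list(output = data.frame(id = 1:2,
                                       completion = c("fact", "misinformation"),
                                       stringsAsFactors = FALSE)),
    batch_2 = list(output = data.frame(id = 3L,
                                       completion = "fact",
                                       stringsAsFactors = FALSE))
  )
)
demo_file <- tempfile(fileext = ".rds")
saveRDS(demo_log, demo_file)

# Hypothetical helper: read the log and stack every batch for one data frame.
collect_batches <- function(log_file, df_name) {
  log <- readRDS(log_file)
  batches <- log[[df_name]]
  if (is.null(batches)) stop("No entry named '", df_name, "' in the log")
  combined <- do.call(rbind, lapply(batches, function(b) b$output))
  rownames(combined) <- NULL
  combined
}

batches <- collect_batches(demo_file, "beliefs_demo")
nrow(batches)  # 3
```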
    ## Not run:
    library(batchLLM)

    # Assuming you have a log file with data for "beliefs_40a3012b" (see batchLLM example)
    batches <- get_batches("beliefs_40a3012b")
    head(batches)

    # Using a custom log file name (given without the extension)
    custom_batches <- get_batches("beliefs_40a3012b", log_name = "custom-log")
    head(custom_batches)

    ## End(Not run)
Scrape metadata from the .rds log file.
scrape_metadata(df_name = NULL, log_name = "batchLLM-log")
Argument | Description |
---|---|
df_name | Optional. A string to match the name of a processed data frame. |
log_name | A string specifying the name of the log file without the extension. Default is "batchLLM-log". |
A data frame containing metadata.
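One way to picture a metadata data frame is one row per logged batch, optionally filtered by data frame name. The sketch below builds that from an in-memory nested list; the layout, the field names (`model`, `temperature`), the entry names, and `collect_metadata()` are hypothetical and chosen only to mirror the arguments documented above.

```r
# ASSUMED log layout: df name -> batch name -> list(metadata = list(...))
meta_log <- list(
  beliefs_demo = list(
    batch_1 = list(metadata = list(model = "gpt-4o-mini", temperature = 0.5)),
    batch_2 = list(metadata = list(model = "gpt-4o-mini", temperature = 0.5))
  ),
  other_demo = list(
    batch_1 = list(metadata = list(model = "claude-3-haiku-20240307",
                                   temperature = 0.7))
  )
)

# Hypothetical helper: flatten per-batch metadata into one data frame.
collect_metadata <- function(log, df_name = NULL) {
  # With df_name = NULL every data frame in the log is included
  if (!is.null(df_name)) log <- log[df_name]
  rows <- list()
  for (name in names(log)) {
    for (batch in names(log[[name]])) {
      meta <- log[[name]][[batch]]$metadata
      rows[[length(rows) + 1]] <- data.frame(
        df_name = name, batch = batch,
        model = meta$model, temperature = meta$temperature,
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

all_meta <- collect_metadata(meta_log)
one_meta <- collect_metadata(meta_log, "other_demo")
```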
    library(batchLLM)

    # Scrape metadata for all data frames in the default log file
    all_metadata <- scrape_metadata()
    head(all_metadata)

    # Scrape metadata for a specific data frame
    specific_metadata <- scrape_metadata("beliefs_40a3012b")
    head(specific_metadata)

    # Use a custom log file name
    custom_metadata <- scrape_metadata(log_name = "custom-log")
    head(custom_metadata)