combine retrieval and mapreduce together
Let's call it the gallantry mode: --gallantry|-g.
We first need to filter the whole retrieval database by a confidence threshold (using the Gaussian CDF) instead of doing a simple top-k retrieval. Then we send all the documents that pass the filter to the mapreducer.
What I want to achieve with this mode is to ask vague questions over the whole local retrieval database. Here I'll ask about things I'm familiar with so that I can verify correctness.
debgpt embed -f ldo:debian-devel/[1995:2024]/[1:12]
debgpt -Hg -a 'what was the SIMDebian project? what is its status? and what was the conclusion?'
debgpt -Hg -a 'what is ML-Policy? how is it created and what is the key points from it?'
Of course we can do the same with mapreduce alone, but that sends every document in the database to the LLM, which requires deep pockets.
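To get a rough feeling for the cost difference, here is a back-of-the-envelope sketch that counts LLM calls. Everything in it is an assumption for illustration: `mapreduce_calls` is a hypothetical helper, the chunk count is made up, and the reduce strategy (merging `group_size` partial answers per call until one remains) may differ from what the real mapreducer does. If the similarities really follow a Gaussian, a 0.90 threshold keeps roughly the top 10% of the chunks.

```python
import math


def mapreduce_calls(n_chunks: int, group_size: int = 4) -> int:
    """Count LLM calls: one map call per chunk, then reduce calls that
    merge `group_size` partial answers at a time until one remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if n_chunks <= 0:
        return 0
    calls = n_chunks          # map stage
    remaining = n_chunks
    while remaining > 1:      # reduce stage, level by level
        remaining = math.ceil(remaining / group_size)
        calls += remaining
    return calls


n_chunks = 100_000            # hypothetical database size, not a measured number
threshold = 0.90
kept = round(n_chunks * (1 - threshold))  # expected survivors under the Gaussian assumption

print("mapreduce alone:  ", mapreduce_calls(n_chunks), "calls")
print("filter + mapreduce:", mapreduce_calls(kept), "calls")
```

The filter itself only needs the embeddings that are already computed, so the saving is roughly proportional to the fraction of chunks that are dropped.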
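# Sketch of the confidence filter (an illustration, not the final implementation)
import numpy as np
from scipy.stats import norm

def select_relevant(similarities, documents, confidence_threshold=0.90):
    """Keep documents whose cosine similarity to the query lies in the upper tail of a fitted Gaussian."""
    # 'similarities' holds one cosine similarity per document, in the same order as 'documents'
    similarities = np.asarray(similarities, dtype=float)
    if len(similarities) != len(documents):
        raise ValueError("need exactly one similarity per document")
    # Fit a Gaussian distribution to the similarity scores
    mean = np.mean(similarities)
    std = np.std(similarities)
    if std == 0:
        return list(documents)  # all scores identical: nothing to rank, keep everything
    # Compute confidence scores based on the Gaussian CDF
    confidence_scores = norm.cdf(similarities, loc=mean, scale=std)
    # Select documents with a confidence level of at least the threshold (90% by default)
    return [doc for doc, score in zip(documents, confidence_scores) if score >= confidence_threshold]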
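Putting the two stages together, the mode could look roughly like the sketch below. This is not the actual implementation: `filter_by_confidence`, `map_reduce`, the `llm` callable, the prompt strings and `group_size` are all hypothetical names made up for illustration, and the reduce step (merging a few partial answers at a time until one remains) is only one plausible strategy. It uses `statistics.NormalDist` from the standard library in place of scipy so that it runs without extra dependencies.

```python
from statistics import NormalDist, fmean, pstdev
from typing import Callable, List, Sequence


def filter_by_confidence(similarities: Sequence[float],
                         documents: Sequence[str],
                         threshold: float = 0.90) -> List[str]:
    """Stage 1: keep documents in the upper tail of a Gaussian fitted to the similarities."""
    if len(similarities) != len(documents):
        raise ValueError("need exactly one similarity per document")
    if not similarities:
        return []
    mean = fmean(similarities)
    std = pstdev(similarities)
    if std == 0:
        return list(documents)  # identical scores: nothing to rank, keep everything
    dist = NormalDist(mean, std)
    return [doc for sim, doc in zip(similarities, documents)
            if dist.cdf(sim) >= threshold]


def map_reduce(chunks: Sequence[str], question: str,
               llm: Callable[[str], str], group_size: int = 4) -> str:
    """Stage 2: answer the question on each chunk (map), then merge answers (reduce)."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if not chunks:
        return ""
    # Map: one LLM call per surviving chunk
    answers = [llm(f"{question}\n\nContext:\n{chunk}") for chunk in chunks]
    # Reduce: merge group_size answers per call until a single answer remains
    while len(answers) > 1:
        answers = [llm(f"Merge these partial answers to: {question}\n\n"
                       + "\n---\n".join(answers[i:i + group_size]))
                   for i in range(0, len(answers), group_size)]
    return answers[0]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM frontend; returns only the length of the prompt."""
    return f"<answer to a {len(prompt)}-char prompt>"


docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
sims = [0.10, 0.12, 0.11, 0.13, 0.90]   # made-up similarities with one clear outlier
selected = filter_by_confidence(sims, docs)
print(selected)
print(map_reduce(selected, "what was the SIMDebian project?", stub_llm))
```

Any callable that takes a prompt string and returns a string can stand in for `llm`, so the same sketch works with a stub for testing or with a wrapper around a real frontend.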