
combine retrieval and mapreduce together

Let's call it the gallantry mode: --gallantry|-g. Instead of the simple top-k retrieval, we first filter the whole retrieval database with a confidence threshold (using the Gaussian CDF). Then we send all the retrieved documents to the mapreducer.

What I want to achieve with this mode is to ask vague questions about the whole local retrieval database. Here I'll ask about something I'm familiar with so I can verify correctness.

debgpt embed -f ldo:debian-devel/[1995:2024]/[1:12]
debgpt -Hg -a 'what was the SIMDebian project? what is its status? and what was the conclusion?'
debgpt -Hg -a 'what is ML-Policy? how is it created and what is the key points from it?'

Surely we can do the same with mapreduce alone, but that requires deep pockets, because every document in the database would be sent to the LLM.

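import numpy as np
from scipy.stats import norm


def filter_by_confidence(similarities, documents, confidence_threshold=0.90):
    """Keep documents whose similarity to the query is unusually high for this corpus."""
    # 'similarities' holds the cosine similarity between the query and each
    # document, in the same order as 'documents'.
    similarities = np.asarray(similarities, dtype=float)

    # Fit a Gaussian distribution to the similarity scores
    mean = np.mean(similarities)
    std = np.std(similarities)
    if std == 0:
        # All scores are identical, so no document stands out from the rest
        return []

    # Compute confidence scores based on the Gaussian CDF
    confidence_scores = norm.cdf(similarities, loc=mean, scale=std)

    # Select documents with a confidence level of at least 90% by default,
    # i.e. a similarity roughly 1.28 standard deviations above the mean
    return [doc for doc, score in zip(documents, confidence_scores)
            if score >= confidence_threshold]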
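
To show how the two stages could fit together, here is a minimal end-to-end sketch: filter by Gaussian-CDF confidence, then map-reduce over what survives. It is an illustration under assumptions, not DebGPT's implementation. The names gallantry_sketch, ask_llm, and group_size are hypothetical; ask_llm stands for any callable that takes a prompt string and returns an answer string.

```python
import numpy as np
from scipy.stats import norm


def gallantry_sketch(question, documents, similarities, ask_llm,
                     confidence_threshold=0.90, group_size=2):
    """Filter documents by Gaussian-CDF confidence, then map-reduce the rest.

    'ask_llm' is a hypothetical callable (prompt -> answer string); it is an
    assumption of this sketch, not a DebGPT function.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so the reduce step shrinks")

    # Stage 1: confidence filter instead of top-k retrieval.
    similarities = np.asarray(similarities, dtype=float)
    std = np.std(similarities)
    if std == 0:
        return None  # identical scores: nothing stands out
    confidence = norm.cdf(similarities, loc=np.mean(similarities), scale=std)
    selected = [d for d, c in zip(documents, confidence)
                if c >= confidence_threshold]
    if not selected:
        return None  # no document passed the threshold; no LLM calls are made

    # Stage 2 (map): answer the question against each selected document alone.
    partials = [ask_llm(f"{question}\n\n{doc}") for doc in selected]

    # Stage 3 (reduce): merge partial answers in small groups until one remains.
    while len(partials) > 1:
        partials = [
            ask_llm(f"{question}\n\n" + "\n\n".join(partials[i:i + group_size]))
            for i in range(0, len(partials), group_size)
        ]
    return partials[0]
```

In this sketch the number of LLM calls grows with the number of documents that pass the threshold, not with the size of the whole database, which is where the saving over mapreduce alone would come from.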