SCORE TYPE
eleuther_fuzz
Description
Asks a model whether a set of selected (highlighted) tokens should activate a feature, given an explanation of that feature. Activating contexts are sampled from different quantiles of the full distribution of activating contexts. In non-activating contexts, random tokens are highlighted in the same proportion as in the activating contexts.
Author
EleutherAI
URL
https://github.com/EleutherAI/sae-auto-interp
Score Calculation
Score = (true positive rate + true negative rate) / 2. The true positive rate is TP/P: the number of times the model correctly predicted the feature was active, divided by the number of times it was active. The true negative rate is TN/N: the number of times the model correctly predicted the feature was not active, divided by the number of times it was not active.
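The calculation above is a balanced accuracy, and a minimal Python sketch of it follows. The function name fuzz_score and the boolean-list inputs are assumptions made for illustration; they are not taken from the sae-auto-interp code.

```python
def fuzz_score(predictions, labels):
    """Mean of the true positive rate and the true negative rate.

    predictions: list of bools, the model's guess per context (hypothetical input format)
    labels:      list of bools, whether the feature was actually active
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    # TP: predicted active and actually active; TN: predicted inactive and actually inactive.
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    tn = sum(1 for p, y in zip(predictions, labels) if not p and not y)

    pos = sum(1 for y in labels if y)  # P: contexts where the feature was active
    neg = len(labels) - pos            # N: contexts where it was not active
    if pos == 0 or neg == 0:
        raise ValueError("need at least one activating and one non-activating context")

    return (tp / pos + tn / neg) / 2
```

For example, if the model is right on 80 of 100 activating contexts and on 90 of 100 non-activating contexts, the score works out to (0.8 + 0.9) / 2 = 0.85. Averaging the two rates means a model that always answers "active" scores 0.5 rather than being rewarded for class imbalance.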
Settings
Samples 10 contexts from each of 10 quantiles (100 activating contexts in total) and 100 non-activating contexts. Uses temperature 0.7 and a maximum of 500 returned tokens.
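The sampling in these settings could be sketched roughly as below. This is an illustration under stated assumptions, not the sae-auto-interp implementation: the function sample_contexts, the (context, activation) pair format, and the use of equal-count bins over contexts sorted by activation are all hypothetical choices, since the source does not say how the quantiles are formed.

```python
import random


def sample_contexts(activating, non_activating,
                    n_quantiles=10, per_quantile=10, n_negative=100, seed=0):
    """Draw per-quantile activating contexts plus a set of non-activating ones.

    activating:     list of (context, activation) pairs (assumed format)
    non_activating: list of contexts where the feature did not fire
    """
    rng = random.Random(seed)

    # Assumption: quantiles are equal-count bins over contexts sorted by activation.
    ranked = sorted(activating, key=lambda pair: pair[1])

    positives = []
    for q in range(n_quantiles):
        lo = q * len(ranked) // n_quantiles
        hi = (q + 1) * len(ranked) // n_quantiles
        bucket = ranked[lo:hi]
        # Take fewer than per_quantile only if the bucket is too small.
        positives.extend(rng.sample(bucket, min(per_quantile, len(bucket))))

    negatives = rng.sample(non_activating, min(n_negative, len(non_activating)))
    return positives, negatives


# Generation settings as listed above; the dictionary keys are illustrative names.
GENERATION_SETTINGS = {"temperature": 0.7, "max_tokens": 500}
```

With the defaults and enough data in every bin, this yields 100 activating and 100 non-activating contexts, so the two classes are balanced before scoring.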