INDEX
Explanations
phrases indicating knowledge or lack of knowledge in a subject
phrases indicating a lack of knowledge or understanding
Negative Logits
ramid       -0.80
hement      -0.75
odder       -0.74
uably       -0.74
sidx        -0.73
erate       -0.69
raught      -0.69
nir         -0.69
Featured    -0.69
rall        -0.68
Positive Logits
firsthand     0.78
whereabouts   0.77
beforehand    0.71
intimately    0.69
æĿ            0.66
secret        0.65
ä½            0.65
Orig          0.63
basics        0.63
LAB           0.62
Activations Density: 0.229%