Explanations
phrases related to knowledge or lack thereof
phrases indicating a lack of knowledge or certainty
Negative Logits
hement         -0.84
sidx           -0.80
erate          -0.78
igham          -0.73
odder          -0.72
uably          -0.71
ramid          -0.70
rative         -0.69
vertisement    -0.69
nir            -0.68
Positive Logits
whereabouts    0.78
firsthand      0.76
beforehand     0.73
intimately     0.73
secret         0.73
CHAT           0.72
secrets        0.68
æĿ             0.67
ledged         0.67
Orig           0.63
Activations Density 0.254%
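
The density figure above is not defined in this listing. One common reading, assumed here, is the fraction of token positions on which the feature's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all; the dashboard may define it differently.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up activations for illustration: the feature fires on 2 of 8 positions.
sample = [0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.4, 0.0]
print(f"{activation_density(sample):.3%}")
```

Under this reading, a density of 0.254% would mean the feature fires on roughly 1 in every 400 token positions.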