Explanations
words related to confirming or affirming statements
affirmative responses to questions or statements
Negative Logits
bage               -0.84
perial             -0.70
inese              -0.67
ensis              -0.65
RAW                -0.64
ILCS               -0.63
externalToEVAOnly  -0.63
isf                -0.63
enegger            -0.60
leted              -0.59
Positive Logits
terday  1.73
sir     0.81
Means   0.78
ZI      0.74
YES     0.72
eed     0.70
asar    0.67
Mi      0.66
ñ       0.65
matter  0.65
Activations Density: 0.020%