INDEX
Explanations
answers or responses within a sentence
assertions or statements that present information or answers
Negative Logits
ership       -0.76
rongh        -0.68
ombat        -0.68
idi          -0.66
Cutting      -0.66
Defenders    -0.64
roying       -0.63
ivities      -0.62
Keefe        -0.62
Samar        -0.62
Positive Logits
YES          0.99
yes          0.94
answer       0.85
affirmative  0.81
YES          0.80
yes          0.79
answer       0.78
QUI          0.77
answ         0.74
answers      0.73
Activations Density: 0.141%