Explanations
phrases related to certainty or emphasis in statements
Negative Logits
akable        -0.77
igers         -0.71
oulder        -0.71
ription       -0.70
uit           -0.69
efer          -0.69
cart          -0.68
Appearances   -0.67
alez          -0.67
itionally     -0.67
Positive Logits
unaware       0.86
unrelated     0.78
forgot        0.76
forgetting    0.76
incapable     0.75
oblivious     0.74
swayed        0.74
influenced    0.72
unaffected    0.71
lacking       0.68
Activations Density 0.034%