INDEX
Explanations
phrases indicating surprise or disbelief
phrases expressing a sense of denial or lack of awareness
Negative Logits
idon      -0.86
rend      -0.73
abre      -0.72
ubi       -0.69
rog       -0.69
ijing     -0.67
osity     -0.67
orm       -0.66
center    -0.65
aim       -0.65
Positive Logits
remotely     1.29
bothered     0.99
bother       0.97
bothering    0.95
pretend      0.88
halfway      0.79
close        0.78
kidding      0.78
mention      0.77
hint         0.77
Activations Density 0.049%