Explanations
commands or requests to stop doing something
Negative Logits
orthy     -0.82
ridge     -0.77
ocene     -0.77
eer       -0.77
ãĤĬ       -0.73
essee     -0.72
aceae     -0.72
Sov       -0.71
dds       -0.71
rocket    -0.69
Positive Logits
bothering     1.38
wasting       1.22
worrying      1.19
pretending    1.14
whining       1.11
messing       1.10
caring        1.09
behaving      1.06
talking       1.02
abusing       1.02
Activations Density: 0.041%