INDEX
Explanations
commands or phrases urging someone to stop an action
Negative Logits
Sov       -0.86
Versions  -0.77
orthy     -0.71
aceae     -0.69
amph      -0.68
eer       -0.67
ighth     -0.66
rium      -0.65
ieth      -0.65
esses     -0.64
Positive Logits
bothering    1.20
worrying     1.08
wasting      1.05
whining      1.01
raining      0.92
watching     0.92
blaming      0.92
pretending   0.91
messing      0.89
interfering  0.88
Activations Density 0.020%