INDEX
Explanations
phrases related to explosive or impactful actions
Negative Logits
Token               Logit
ummings             -0.16
olland              -0.16
ully                -0.15
untime              -0.14
ABCDEFGHIJKLMNOP    -0.13
merce               -0.13
amerate             -0.13
lest                -0.13
anel                -0.13
óa                  -0.13
Positive Logits
Token               Logit
torch               0.36
smoke               0.27
apart               0.26
kisses              0.25
blew                0.24
bubbles             0.24
torch               0.23
away                0.23
out                 0.23
tor                 0.22
Activations Density: 0.017%
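
To give a sense of scale for the density figure, here is a minimal sketch that assumes "activations density" means the fraction of token positions on which the feature's activation is above zero. That definition, the function name activation_density, and the sample data are all assumptions made for illustration; none of them come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumed definition: density = (positions where the feature fires) / (all positions).
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 100,000 token positions, 17 of which activate the feature.
sample = [0.0] * 99_983 + [1.5] * 17
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.017%
```

Under this reading, a density of 0.017% corresponds to the feature firing on roughly 17 out of every 100,000 tokens.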