INDEX
Explanations
similes and comparisons that describe actions or situations
Negative Logits
Token       Logit
imilation   -0.69
KEN         -0.68
kamp        -0.67
damn        -0.64
Honour      -0.61
endered     -0.61
GG          -0.60
oided       -0.60
hemy        -0.60
darn        -0.58
Positive Logits
Token       Logit
rats        0.68
fireworks   0.66
raft        0.64
TNT         0.63
confir      0.62
debating    0.62
reb         0.62
intrins     0.62
some        0.61
any         0.60
Activations Density: 0.128%