INDEX
Explanations
words related to action and violence
Negative Logits
Token      Logit
gling      -0.72
orf        -0.63
inately    -0.63
Fey        -0.61
olls       -0.61
porous     -0.61
owship     -0.60
oiler      -0.60
ringe      -0.60
mbuds      -0.59
Positive Logits
Token      Logit
ives       0.93
ivism      0.91
ivated     0.89
Replay     0.86
iveness    0.85
ality      0.80
able       0.80
ual        0.80
aries      0.79
uated      0.77
Activations Density: 0.482%
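
The page does not define the density figure. A common reading is the percentage of sampled token positions on which the feature's activation is nonzero. The sketch below illustrates that reading only: the function name `activation_density`, the `threshold` parameter, and the toy activations are all hypothetical and are not taken from this feature's data.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active. The source page does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy data (hypothetical): 1000 token positions, 5 of them active.
toy_activations = [0.0] * 995 + [1.3, 0.2, 4.1, 0.7, 2.5]
print(f"{activation_density(toy_activations):.3f}%")  # prints "0.500%"
```

Under this reading, a density of 0.482% would mean the feature is active on roughly 1 in 200 sampled tokens.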