Explanations
references to "defense" in various contexts
Negative Logits
erer        -0.17
hey         -0.16
ish         -0.15
ãĥ«ãĤ¯      -0.15
oras        -0.15
iquid       -0.15
essian      -0.15
ings        -0.15
icious      -0.14
affe        -0.14
Positive Logits
less        0.26
against     0.23
lessness    0.23
Against     0.21
mechanisms  0.20
mechanism   0.20
/off        0.19
against     0.19
contractor  0.19
LESS        0.18
Activations Density: 0.031%