INDEX
Explanations
violent actions or physical confrontations
Negative Logits
lr        -0.17
psilon    -0.16
inc       -0.16
neck      -0.15
ži        -0.14
atti      -0.14
throat    -0.14
esh       -0.14
ih        -0.14
Bols      -0.14
Positive Logits
against   0.49
against   0.46
Against   0.45
Against   0.44
tegen     0.29
gegen     0.27
contre    0.27
对        0.26
HARD      0.24
проти     0.24
Activations Density: 0.018%