INDEX
Explanations
references to violence, particularly in relation to media and societal issues
Negative Logits
endon    -0.17
esar     -0.16
iram     -0.16
elling   -0.15
ixels    -0.15
orias    -0.15
iron     -0.14
iven     -0.14
ied      -0.14
rollo    -0.14
Positive Logits
ence     0.24
ent      0.22
ENCE     0.21
ently    0.19
Consort  0.18
aceous   0.18
ations   0.17
ative    0.17
acea     0.17
ins      0.17
Activations Density: 0.004%