Explanations
instances of violence, breaches, and notable figures in reports and discussions
Negative Logits
arsing      -0.17
.labelX     -0.16
辦          -0.15
IQUE        -0.14
или         -0.14
chers       -0.14
Birch       -0.14
keiten      -0.14
egen        -0.14
εÏį         -0.13
Positive Logits
being       0.29
being       0.25
Being       0.20
Being       0.19
sendo       0.18
essere      0.16
_basename   0.15
therein     0.15
owell       0.14
být         0.14
Activations Density 0.101%