Negative Logits
violating       -0.78
harassing       -0.76
Viol            -0.73
baptism         -0.73
violated        -0.71
publiques       -0.70
SharedDtor      -0.69
démocr          -0.68
intimidate      -0.66
intimidating    -0.66
Positive Logits
control         0.52
agents          0.52
Autoritní       0.51
NUMX            0.51
controlled      0.49
agent           0.47
open            0.47
styles          0.47
minute          0.46
ERICA           0.46
Activations Density: 0.044%