INDEX
Explanations
references to violent or aggressive actions
Negative Logits
ality    -0.16
olu      -0.16
arity    -0.15
══       -0.15
.au      -0.15
いる     -0.15
verty    -0.14
erva     -0.14
ding     -0.14
ally     -0.14
Positive Logits
ively       0.17
insky       0.16
次数        0.15
iveness     0.15
InProgress  0.15
orney       0.15
&T          0.14
able        0.14
ilent       0.14
ademic      0.14
Activations Density 0.048%