INDEX
Explanations
phrases related to causing harm or destruction
references to acts of killing
Negative Logits
Collider                   -0.76
rial                       -0.73
BuyableInstoreAndOnline    -0.73
umn                        -0.67
Grad                       -0.66
concess                    -0.65
arity                      -0.65
anwhile                    -0.64
provided                   -0.62
Celest                     -0.61
Positive Logits
spree                       0.89
icides                      0.88
houses                      0.82
killer                      0.81
kill                        0.79
killing                     0.79
joy                         0.78
switch                      0.78
icide                       0.76
killers                     0.75
Activations Density: 0.025%