Explanations
references to harmful effects or substances
Negative Logits
Token         Logit
PRNewswire    -0.56
mustache      -0.48
elegance      -0.48
baptism       -0.48
zwe           -0.48
trow          -0.47
acquittal     -0.47
mystère       -0.47
Itr           -0.46
IZABETH       -0.46
Positive Logits
Token         Logit
harmful       1.91
Harmful       1.84
Harmful       1.82
injurious     1.14
有害          1.09
harm          1.02
dangerous     1.01
hurtful       0.95
dangerous     0.94
toxic         0.91
Activations Density 0.012%
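As a rough illustration of how to read a density figure like the one above, the sketch below assumes that density means the fraction of token positions at which the feature's activation is above zero. The source does not define it, so treat that definition, the function name `activation_density`, the `threshold` parameter, and the sample data as hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumed definition: the dashboard does not state how density is computed.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up sample: the feature fires on 3 of 25,000 token positions.
acts = [0.0] * 25000
for i in (10, 500, 20000):
    acts[i] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.012%
```

Under this reading, a density of 0.012% would mean the feature is active on roughly 1 in every 8,300 tokens, which is typical of a sparse, narrowly scoped feature.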