Explanations
and flag instances of toxic terms or descriptions
references to toxic substances and their effects
Negative Logits
gain      -0.83
hung      -0.76
FORE      -0.74
AUT       -0.73
bler      -0.71
ploma     -0.70
quart     -0.70
BO        -0.70
stand     -0.70
Month     -0.69
Positive Logits
poisoning     1.00
ologist       0.99
toxic         0.95
ologically    0.95
substances    0.91
ological      0.90
fumes         0.89
masculinity   0.89
ologists      0.88
ology         0.86
Activations Density 0.008%