INDEX
Explanations
mentions of toxic substances or situations
references to toxic substances and their effects
Negative Logits
gain        -0.80
hung        -0.79
quart       -0.74
through     -0.73
FORE        -0.71
held        -0.71
Month       -0.71
telling     -0.70
stand       -0.70
UTERS       -0.70
Positive Logits
toxic        1.13
masculinity  1.03
ologically   0.99
toxicity     0.99
substances   0.96
poisoning    0.95
fumes        0.95
ologist      0.94
ological     0.92
poisonous    0.92
Activations Density 0.009%