INDEX
Explanations
phrases related to toxicity or harmful substances
references to toxic substances and their effects
Negative Logits
FORE        -0.72
hung        -0.71
quart       -0.70
Untitled    -0.69
zzi         -0.68
wright      -0.68
ploma       -0.68
BO          -0.66
bler        -0.65
telling     -0.65
Positive Logits
masculinity  1.06
poisoning    0.96
algae        0.92
waste        0.92
ological     0.90
fumes        0.90
oxic         0.89
substances   0.89
ologist      0.87
ology        0.87
Activations Density: 0.017%