Explanations
phrases related to toxic substances
references to toxins or harmful substances
Negative Logits
noon      -0.88
dar       -0.77
pora      -0.76
aan       -0.69
quarter   -0.66
doms      -0.64
RM        -0.64
Merit     -0.63
romy      -0.63
ann       -0.62

Positive Logits
poisoning  1.07
poison     1.00
dart       0.95
poisoned   0.91
darts      0.90
iv         0.90
ously      0.85
Ivy        0.84
poisonous  0.84
gas        0.84
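The two tables above are the two ends of a single ranking of per-token scores: the most negative values on one side, the most positive on the other. The sketch below shows that selection. It assumes each score has already been computed for every vocabulary token (commonly by projecting the feature's direction onto the model's unembedding, which is an assumption here, not something this page states). The function name `split_extremes` and the small `scores` dictionary are hypothetical, reusing a few token/value pairs from the tables for illustration.

```python
# Hypothetical helper: pick the k lowest and k highest scoring tokens.
# Assumes `scores` maps each token to a precomputed logit score for one feature.
def split_extremes(
    scores: dict[str, float], k: int
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Return (k most negative pairs, k most positive pairs)."""
    ranked = sorted(scores.items(), key=lambda item: item[1])  # ascending by score
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return negative, positive


# Illustrative subset of values taken from the tables above.
scores = {"poisoning": 1.07, "poison": 1.00, "dart": 0.95,
          "gas": 0.84, "noon": -0.88, "dar": -0.77}
neg, pos = split_extremes(scores, 2)
print(neg)  # [('noon', -0.88), ('dar', -0.77)]
print(pos)  # [('poisoning', 1.07), ('poison', 1.0)]
```

With the full vocabulary and k = 10, this kind of selection would yield lists shaped like the two tables on this page.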

Activation Density: 0.012%
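A density of 0.012% is a fraction of 1.2 × 10⁻⁴, or roughly 12 tokens in every 100,000. The sketch below assumes "activation density" means the share of tokens on which the feature's activation is above zero; that definition is an assumption, not stated on this page. The function `activation_density` and the synthetic activation list are hypothetical.

```python
# Hypothetical helper: share of tokens on which the feature fires.
# Assumes "fires" means activation strictly greater than zero.
def activation_density(activations: list[float]) -> float:
    if not activations:
        return 0.0
    firing = sum(1 for a in activations if a > 0)
    return firing / len(activations)


# Synthetic example: 12 firing tokens out of 100,000.
acts = [0.0] * 100_000
for i in range(12):
    acts[i * 1000] = 1.5  # arbitrary positive activation value
density = activation_density(acts)
print(f"{density:.3%}")  # 0.012%
```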