Explanations
Heavily focuses on detecting mentions of poisonous substances
references to various types of poison
Negative Logits
noon        -0.83
dar         -0.72
pora        -0.67
dimension   -0.66
blance      -0.66
irc         -0.66
Raid        -0.64
aan         -0.64
stand       -0.63
Scouting    -0.63
Positive Logits
poisoning   1.11
poison      1.01
poisoned    0.97
poisonous   0.93
dart        0.93
gas         0.87
arsenic     0.84
ously       0.84
darts       0.83
Ivy         0.82
Activations Density 0.012%
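
To give a sense of scale for the density figure, here is a minimal sketch. It assumes that "activation density" means the fraction of sampled token positions on which the feature fires (activation above zero); the page itself does not define the term. The function name `activation_density` and the sample data are hypothetical, chosen only to reproduce a value of 0.012%.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: this is what the dashboard's density figure measures.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 12 of 100,000 token positions.
acts = [0.0] * 100_000
for i in range(12):
    acts[i * 1000] = 1.5  # arbitrary positive activation value

density = activation_density(acts)
print(f"{density:.3%}")  # 0.012%
```

Under that reading, a density of 0.012% corresponds to roughly 12 activating tokens per 100,000, which would fit a feature that responds only to a narrow topic such as poison.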