Explanations
references to toxic substances and their effects
Negative Logits
ufs         -0.14
ijken       -0.14
nego        -0.14
herpes      -0.14
sond        -0.14
.bias       -0.13
precious    -0.13
ụ           -0.13
neut        -0.13
enefit      -0.13
Positive Logits
poisoning   0.43
poison      0.43
poisonous   0.42
Poison      0.42
toxic       0.38
poisoned    0.35
toxicity    0.35
毒          0.34
Toxic       0.34
toxins      0.32
Activations Density 0.103%