INDEX
Explanations
mentions of the word "Horse" with varying levels of activation
references to "Horse" and related terms
Negative Logits
questioning   -0.69
sil           -0.66
reens         -0.66
middle        -0.64
theless       -0.63
fortune       -0.63
mistrust      -0.63
cursing       -0.62
tut           -0.62
semantic      -0.62
Positive Logits
Horse      3.75
Horses     1.59
horse      1.28
Elephant   1.26
Bunny      1.13
Goat       1.13
Cobra      1.11
Legs       1.05
Sheep      1.03
Toad       1.03
Activations Density 0.029%