Explanations
references to cats and their behavior
Negative Logits
Token     Logit
quito     -0.19
horses    -0.16
Pig       -0.16
horse     -0.16
PG        -0.15
MP        -0.15
ala       -0.14
pigs      -0.14
úb        -0.14
allo      -0.14
Positive Logits
Token     Logit
cat       0.32
Cat       0.31
-cat      0.30
cats      0.30
Cats      0.27
猫        0.27
Cat       0.26
(cat      0.25
/cat      0.25
.cat      0.24
Activation density: 0.039%
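Activation density is usually reported as the percentage of sampled tokens on which the feature fires. This is an assumption about how this dashboard defines it, with "fires" meaning an activation above zero. Under that reading, 0.039% corresponds to about 39 activating tokens per 100,000. The helper name `activation_density` below is hypothetical.

```python
import numpy as np

def activation_density(acts, threshold=0.0):
    """Hypothetical sketch: percentage of tokens whose activation exceeds threshold."""
    acts = np.asarray(acts)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size

# 39 active tokens out of 100,000 sampled tokens.
acts = np.zeros(100_000)
acts[:39] = 1.0
density = activation_density(acts)
```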