INDEX
Explanations
references to the word "cat"
references to cats
Negative Logits
mble         -0.76
eous         -0.73
htt          -0.72
undo         -0.71
indal        -0.71
Sacrament    -0.70
Seym         -0.69
oppable      -0.65
unda         -0.63
gur          -0.63
Positive Logits
aclysm       1.44
heter        1.29
fish         1.05
chers        1.01
alogue       0.97
apult        0.94
cat          0.89
wal          0.89
hawk         0.88
cher         0.87
Activations Density: 0.020%