INDEX
Explanations
questions that begin with "Are"
Negative Logits
664      -0.16
lore     -0.16
eur      -0.15
ills     -0.15
rzy      -0.15
orra     -0.15
pery     -0.14
ruc      -0.14
ré       -0.14
poons    -0.14
Positive Logits
zzo      0.25
ospace   0.25
nda      0.23
nds      0.22
ady      0.21
tha      0.20
obic     0.18
nts      0.18
psilon   0.18
tap      0.17
Activations Density: 0.033%