Explanations
references to "ab" or variations of it
Negative Logits
y        -0.24
yk       -0.21
i        -0.21
yi       -0.21
yb       -0.20
yre      -0.19
ÛĮ       -0.19
in       -0.18
lein     -0.17
uario    -0.17
Positive Logits
bing         0.29
ilitation    0.27
oard         0.24
bed          0.24
ulous        0.23
bling        0.23
ba           0.23
STRACT       0.21
on           0.21
ber          0.21
Activations Density: 0.027%