INDEX
Explanations
phrases indicating possession and attributes
Negative Logits
aba         -0.15
oy          -0.15
Lil         -0.14
receipt     -0.14
und         -0.14
782         -0.14
experience  -0.14
ik          -0.13
nings       -0.13
TT          -0.13
Positive Logits
ataire      0.16
htag        0.16
untu        0.16
eldorf      0.15
oud         0.15
coli        0.15
essaging    0.15
ouz         0.15
orre        0.15
ouble       0.15
Activations Density 0.313%