INDEX
Explanations
words or phrases indicating choices or alternatives
Negative Logits
nes      -0.15
både     -0.15
sWith    -0.15
Ïĥα      -0.15
acco     -0.14
spir     -0.14
sci      -0.14
każ      -0.14
Ìĥ       -0.14
adel     -0.14
Positive Logits
/or      0.25
-than    0.23
directly 0.19
wel      0.19
anges    0.18
ipse     0.18
/all     0.17
side     0.17
phans    0.17
alone    0.16
Activations Density: 0.030%
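
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed, assuming it means the fraction of token positions on which the feature's activation is above zero. That reading is an assumption, not something this page states. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and used only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all. Not confirmed by the source page.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 3 of 10,000 token positions.
sample = [0.0] * 9997 + [0.8, 1.3, 0.2]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.030%"
```

Under this reading, a density of 0.030% would correspond to roughly 3 active positions per 10,000 tokens, which is the sparse firing pattern expected of a feature that responds only to a narrow set of contexts.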