Explanations
contrasting phrases or opinions
Negative Logits
ocol       -0.16
ewis       -0.15
kaar       -0.15
.inline    -0.15
lexport    -0.14
μεν        -0.14
strup      -0.14
rott       -0.14
genden     -0.14
audi       -0.13
Positive Logits
enton      0.16
rin        0.14
rone       0.14
Ŀ          0.14
ruh        0.14
archy      0.14
bab        0.13
ler        0.13
Coupe      0.13
bro        0.13
Activations Density: 0.076%