INDEX
Explanations
phrases expressing negation and clarification
Negative Logits
zwar    -0.16
odied   -0.15
اع      -0.14
欠      -0.14
един    -0.14
emory   -0.14
airo    -0.14
vice    -0.14
olkien  -0.14
elin    -0.13
Positive Logits
还是         0.17
nonetheless  0.15
tera         0.15
GLOSS        0.15
iew          0.15
tering       0.15
ully         0.14
tiles        0.14
.geo         0.14
elsewhere    0.14
Activations Density 0.158%