INDEX
Explanations
the presence of the word "you" and its associated forms in various contexts
Negative Logits
pieces         -0.17
νε             -0.17
ustos          -0.15
enet           -0.15
ensa           -0.15
activations    -0.14
orsi           -0.14
ç¼ĺ            -0.14
سخ             -0.14
oud            -0.14
Positive Logits
ìĬĪ            0.15
á»ĩ            0.14
Drain          0.14
èĻ«            0.14
092            0.14
RAIN           0.14
anela          0.14
942            0.14
941            0.14
jeu            0.14
Activations Density: 0.002%
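
To make the density figure concrete, here is a minimal Python sketch. It assumes that "activations density" means the fraction of token positions on which the feature fires with a non-zero activation; that reading is an assumption, not something this page states. The function name `activation_density` and the sample data are hypothetical and only illustrate the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: a position counts as "active" when its activation is strictly
    greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 2 out of 100,000 token positions.
sample = [0.0] * 99_998 + [1.3, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.002%
```

Under that assumption, a density of 0.002% corresponds to roughly 2 active positions per 100,000 tokens, which marks this as a very sparse feature.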