Explanations
words or phrases related to actions and their impacts or characteristics
Negative Logits
Token    Logit
och      -0.18
eb       -0.17
their    -0.16
ox       -0.15
я        -0.15
um       -0.15
ik       -0.15
athi     -0.15
とした   -0.15
oth      -0.15
Positive Logits
Token    Logit
isque    0.16
itself   0.15
lique    0.14
uars     0.14
lick     0.14
licken   0.14
redo     0.14
ngle     0.14
dü       0.14
/***/    0.14
Activations Density: 0.846%