Explanations
words indicating significant impact or change
Negative Logits
Token       Logit
layer       -0.15
orer        -0.14
лам         -0.14
Clem        -0.13
anes        -0.13
aries       -0.13
Cri         -0.13
Mehr        -0.13
vik         -0.13
attendant   -0.13
Positive Logits
Token       Logit
emey        0.18
ãĥ¼ãĥį      0.16
bbc         0.16
é¾Ħ         0.15
etas        0.15
nicos       0.15
zos         0.15
ìłĪ         0.14
Calibri     0.14
owie        0.14
Activations Density: 0.001%