Explanations
phrases indicating transformation or change
Negative Logits
yy             -0.16
izz            -0.15
iry            -0.15
reece          -0.14
ves            -0.14
esi            -0.13
.normalized    -0.13
ses            -0.13
omp            -0.13
quo            -0.13
Positive Logits
ucz            0.17
part           0.16
ildo           0.16
ocard          0.14
-translate     0.14
ieve           0.14
Alle           0.13
azzi           0.13
etus           0.13
stride         0.13
Activations Density: 0.065%