INDEX
Explanations
references to people and their actions or attributes
Negative Logits
Token         Logit
402           -0.18
preference    -0.16
789           -0.15
stract        -0.15
ouble         -0.14
ÙĦÙģ          -0.14
326           -0.14
Ben           -0.14
encounter     -0.14
cran          -0.13
Positive Logits
Token         Logit
ادÙħ          0.15
eÄį           0.15
бо            0.15
تا            0.15
hq            0.14
anton         0.14
_SA           0.14
innen         0.14
ater          0.14
ause          0.13
Activations Density: 0.004%