INDEX
Explanations
phrases that repeatedly reference specific groups or individuals as "those."
Negative Logits
Token      Logit
enstein    -0.17
ayne       -0.15
主人       -0.14
aptops     -0.14
those      -0.14
outset     -0.14
Ë          -0.14
ुरस        -0.14
idan       -0.13
isode      -0.13
Positive Logits
Token      Logit
who        0.42
who        0.32
whom       0.29
same       0.29
curity     0.28
Who        0.26
Who        0.24
kinds      0.23
الذين      0.23
whose      0.22
Activations Density 0.060%