Explanations
phrases related to social norms and inequality
Negative Logits
raq           -0.16
olle          -0.16
jang          -0.15
LEASE         -0.15
prostitutas   -0.15
Ïīμα          -0.15
ytut          -0.14
rů            -0.14
боÑĤ          -0.14
emean         -0.14
Positive Logits
atur          0.17
ips           0.16
ipro          0.15
acht          0.15
conti         0.14
hoe           0.14
ky            0.14
gre           0.14
bard          0.14
ëĭī           0.13
Activations Density: 0.387%