INDEX
Explanations
words related to representation and social or ethical issues
Negative Logits
Token       Logit
o           -0.24
at          -0.23
ains        -0.20
ay          -0.19
im          -0.19
ings        -0.18
i           -0.18
an          -0.17
ives        -0.16
ap          -0.16
Positive Logits
Token       Logit
uze         0.16
ếu          0.16
eren        0.16
æĹıèĩªæ²»   0.15
)((((       0.15
amage       0.14
Ä©          0.14
Ế           0.14
kowski      0.14
imit        0.14
Activations Density 0.056%