INDEX
Explanations
references to living beings and their societal context
Negative Logits
osen     -0.15
erval    -0.15
ambi     -0.14
loor     -0.14
igung    -0.14
"'",     -0.14
odos     -0.13
Rosen    -0.13
etta     -0.13
Åĵ       -0.13
Positive Logits
Moy      0.16
_PID     0.16
icode    0.15
ÑĨвеÑĤ   0.15
vs       0.15
swire    0.15
ughter   0.15
strup    0.15
anko     0.15
atures   0.14
Activations Density 0.013%