Explanations
concepts related to identity and self-expression
Negative Logits
nen       -0.18
wert      -0.17
ampion    -0.15
nesia     -0.14
acht      -0.14
lernen    -0.14
esterday  -0.14
ipop      -0.13
fold      -0.13
iment     -0.13
Positive Logits
personal  0.25
personal  0.21
Personal  0.20
Personal  0.20
self      0.18
個人      0.17
pesso     0.17
лич       0.17
_self     0.17
Self      0.17
Activations Density 0.297%