INDEX
Explanations
expressions of agency and opportunity for engagement
Negative Logits
.Ui       -0.16
avax      -0.16
mania     -0.15
ropoda    -0.15
ndef      -0.15
oku       -0.15
Ĥ¨        -0.15
Ìī        -0.15
arium     -0.15
ngr       -0.14
Positive Logits
themselves    0.39
Their         0.20
thems         0.20
Their         0.19
their         0.19
flock         0.17
alike         0.16
their         0.16
pei           0.16
vers          0.16
Activations Density: 0.082%