INDEX
Explanations
references to individuals or groups in a supportive context
Negative Logits
itself     -0.16
eam        -0.16
.grpc      -0.14
ozo        -0.14
acht       -0.14
ington     -0.14
lein       -0.14
Tome       -0.14
loquent    -0.14
itude      -0.13
Positive Logits
/us        0.27
/her       0.25
zelf       0.23
self       0.21
/th        0.18
-même      0.17
iner       0.17
же         0.16
SELF       0.16
уществ     0.15
Activations Density 0.156%