INDEX
Explanations
phrases referring to people or entities in various roles or situations
Negative Logits
rane     -0.17
ruba     -0.16
gba      -0.16
gist     -0.14
dition   -0.14
LIABLE   -0.14
ract     -0.14
.getAs   -0.13
rž       -0.13
ã        -0.13
Positive Logits
themselves   0.18
otherwise    0.17
aid          0.15
otherwise    0.14
are          0.14
olly         0.14
Already      0.14
oo           0.14
803          0.14
already      0.14
Activations Density: 0.103%