INDEX
Explanations
phrases indicating actions or attributes of individuals
Negative Logits
itself      -0.22
acher       -0.16
abelle      -0.15
Stap        -0.15
igor        -0.14
ubat        -0.14
796         -0.14
ÃŃl         -0.14
themselves  -0.14
igi         -0.13
Positive Logits
/her       0.24
himself    0.19
/she       0.19
arken      0.19
eyse       0.17
radan      0.17
ulk        0.15
iat        0.15
MES        0.15
upported   0.15
Activations Density: 0.666%