INDEX
Explanations
statements of affirmation or claims of dominance
Negative Logits
esser             -0.19
.Restr            -0.15
ics               -0.15
.travel           -0.14
StateException    -0.14
INTR              -0.14
erman             -0.13
ειο               -0.13
stick             -0.13
TERM              -0.13
Positive Logits
245               0.17
alat              0.17
urd               0.15
baugh             0.15
ance              0.15
andalone          0.15
648               0.14
267               0.14
/question         0.14
inker             0.14
Activations Density 0.012%