Explanations
first-person statements indicating state or identity
Negative Logits
ãĥ¼ãĥĵ          -0.17
nuest           -0.15
cona            -0.15
itself          -0.14
ERGE            -0.14
-FIRST          -0.14
obao            -0.14
addCriterion    -0.14
inis            -0.14
loub            -0.14
Positive Logits
myself          0.30
sure            0.29
aze             0.24
sorry           0.22
Sure            0.21
hereby          0.21
Sure            0.20
guessing        0.19
not             0.19
privileged      0.18
Activations Density 0.145%
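
The first entry in the negative logits list, `ãĥ¼ãĥĵ`, looks like a byte-level BPE token shown in its raw vocabulary form, not encoding damage. The sketch below shows how such a string could be mapped back to readable text. It assumes the model's tokenizer uses the GPT-2 style byte-to-character table, which this page does not state; the helper names `bytes_to_unicode` and `decode_token` are illustrative, not part of any dashboard API.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Build a GPT-2 style byte -> printable-character table (assumed scheme)."""
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte is assigned the next code point from 256 upward.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


def decode_token(token: str) -> str:
    """Map a raw vocabulary string back to the UTF-8 text it encodes."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    raw = bytes(char_to_byte[c] for c in token)
    # A single token may hold only part of a multi-byte character,
    # so undecodable bytes are replaced instead of raising.
    return raw.decode("utf-8", errors="replace")


print(decode_token("ãĥ¼ãĥĵ"))
```

Under that assumption the token decodes to the katakana fragment `ービ`. Plain ASCII tokens such as `myself` pass through unchanged.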