INDEX
Explanations
references to personal identity and interpersonal relationships
Negative Logits
teenth      -0.20
phans       -0.17
ropolis     -0.17
uable       -0.17
xes         -0.17
alted       -0.17
instanc     -0.17
zers        -0.17
resse       -0.16
ımıza       -0.16
Positive Logits
gether      0.56
etheless    0.48
linear      0.48
existent    0.46
west        0.46
ductory     0.46
adays       0.45
neath       0.44
adecimal    0.43
selling     0.40
Activations Density 0.586%