INDEX
Explanations
words and phrases related to overt actions or manifestations
Negative Logits
er      -0.34
y       -0.32
oa      -0.30
erse    -0.28
oj      -0.28
eri     -0.27
erm     -0.26
ime     -0.26
ing     -0.26
ype     -0.26
Positive Logits
et      0.19
etik    0.17
an      0.17
บาท     0.17
ta      0.17
chal    0.16
te      0.16
g       0.16
old     0.16
anse    0.15
Activations Density 0.081%
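
The density figure above is usually read as the fraction of sampled token positions on which the feature fires. The sketch below assumes that reading; the function name `activation_density`, the zero threshold, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active; the dashboard's exact definition is not stated.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 8 of which are active.
sample = [0.0] * 9992 + [1.3] * 8
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low means the feature is silent on almost every token, so its top logits describe a rare event, not typical model behavior.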