INDEX
Explanations
expressions of preference or decision-making
Negative Logits
ito     -0.17
rac     -0.17
anes    -0.16
.lst    -0.15
oha     -0.14
esen    -0.14
อง      -0.14
ña      -0.14
ras     -0.14
خل      -0.14
Positive Logits
rather    0.37
rather    0.33
Rather    0.32
Rather    0.31
than      0.28
than      0.27
plutôt    0.26
предпоч   0.23
宁        0.22
prefer    0.22
Activations Density 0.184%