Explanations
sentences that express understanding or acknowledgment
Negative Logits
Token       Logit
stery       -0.15
ategorical  -0.15
_regular    -0.14
borg        -0.14
felt        -0.14
Ton         -0.14
ettes       -0.14
Lage        -0.14
urum        -0.13
cours       -0.13
Positive Logits
Token       Logit
üf          0.16
yz          0.15
.should     0.15
edes        0.14
sum         0.14
yect        0.14
правильно   0.14
ysql        0.14
üç          0.14
صنعت        0.14
Activations Density 0.111%