INDEX
Explanations
phrases that express disagreement or doubts about opinions and their significance
"not as" or similar negative comparisons
not nearly as much
Negative Logits
indeed        -0.64
بيها          -0.59
nonetheless   -0.47
любом         -0.44
อยู่            -0.44
indeed        -0.43
jopa          -0.43
ftagPool      -0.43
aver          -0.42
idemiology    -0.42
Positive Logits
tão           1.04
那麼          1.00
tantas        0.99
autant        0.99
столь         0.99
lika          0.98
那么          0.94
tantos        0.92
tanta         0.88
толкова       0.88
Activation density: 0.353%