INDEX
Explanations
negative affirmations or expressions of absence
"no" followed by a noun
"no" followed by negation
Negative Logits
the            -0.55
<bos>          -0.51
terakhir       -0.50
Some           -0.50
redor          -0.48
persino        -0.48
متعلقه         -0.47
culturelles    -0.47
فريبيس         -0.46
gangen         -0.46
Positive Logits
tably          1.00
doubt          0.93
longer         0.92
tifies         0.91
etheless       0.89
tifying        0.84
vartis         0.84
coda           0.81
odles          0.81
xious          0.81
Activations Density: 0.129%