Explanations
instances of dismissive or contradictory statements followed by discussions of challenges or issues
Negative Logits
ÑĢеб      -0.16
nackte    -0.14
undo      -0.14
elik      -0.14
ropp      -0.14
ìĬµ       -0.14
arella    -0.13
affairs   -0.13
eson      -0.13
iglia     -0.13
Positive Logits
being     0.20
ipa       0.17
recent    0.17
odds      0.17
fact      0.16
protest   0.16
contrary  0.16
apparent  0.15
previous  0.15
каж       0.15
Activations Density: 0.042%