Explanations
counter-narrative and counter-assertion
Negative Logits
ла    0.55
↵    0.52
et    0.48
り    0.47
т    0.46
να    0.46
एस    0.46
る    0.45
م    0.43
त    0.42
Positive Logits
(no visible token)    0.48
ien    0.40
h    0.37
\    0.37
ck    0.36
чувство    0.35
I    0.35
^    0.34
he    0.33
*    0.32
Activations Density 0.000%