INDEX
Explanations
introspection and questioning
Negative Logits
Vý    0.35
they    0.32
provides    0.31
اں    0.30
plated    0.30
contains    0.29
pathways    0.29
the    0.28
streamlined    0.28
hubo    0.28
Positive Logits
nghĩ    0.39
misog    0.38
Ironically    0.38
我现在    0.37
ощущение    0.37
философ    0.35
质疑    0.35
şunu    0.35
bertanya    0.34
लक्षात    0.34
Activations Density: 0.081%