Explanations
false beliefs and self-criticism
Negative Logits
практически    0.85
также          0.81
максимально    0.79
ань            0.77
લગભગ           0.76
developer      0.76
стный          0.75
myös           0.75
számos         0.74
ustan          0.74
Positive Logits
inferiority      1.06
disbelief        0.98
justifies        0.93
wrongdoing       0.93
homosexuality    0.93
untrue           0.92
superiority      0.91
wrongly          0.89
beliefs          0.89
falsely          0.89
Activations Density: 0.093%