INDEX
Explanations
challenging assumptions or ideas
Negative Logits
መል  0.45
squadra  0.41
ದ್ದರಿಂದ  0.41
rozpozn  0.40
スク  0.40
ккей  0.40
envisions  0.40
鴛  0.40
चीत  0.40
해결  0.40
Positive Logits
validity  0.58
assumptions  0.57
questioning  0.56
notion  0.55
abuse  0.52
excessive  0.51
unjust  0.51
质疑  0.51
tyranny  0.49
hegemony  0.47
Activations Density: 0.051%