Explanations
philosophical problems and evil
Negative Logits
berpikir  0.43
controversies  0.40
decisions  0.39
trends  0.39
Controversy  0.39
attitudes  0.38
謙  0.38
谨慎  0.38
গভ  0.37
멋  0.37
Positive Logits
Evil  0.63
evil  0.62
Evil  0.55
evil  0.51
Explain  0.49
why  0.49
鱻  0.47
邪  0.47
erklären  0.46
ải  0.43
Activations Density: 0.015%