Explanations
questions or phrases related to ethical considerations and societal issues, particularly those involving racism and harmful stereotypes.
Negative Logits
難しい    0.49
eventuali    0.47
結局    0.45
liệu    0.45
顰    0.45
ية    0.43
併    0.42
ließlich    0.41
ஏனெனில்    0.40
ন    0.40
Positive Logits
থাকত    0.50
olisi    0.45
нови    0.44
থাকিত    0.42
وقلنا    0.42
....    0.39
isher    0.39
were    0.38
就好了    0.38
Were    0.37
Activations Density 0.077%