Explanations
patterns related to social issues, particularly those involving race and discrimination
Negative Logits
Token      Logit
           -0.15
399        -0.14
ansk       -0.14
Harm       -0.14
inn        -0.14
harm       -0.13
hindsight  -0.13
steps      -0.13
âĢİ        -0.13
aska       -0.13
Positive Logits
Token       Logit
aspect      0.34
idea        0.30
factor      0.30
phenomenon  0.27
issue       0.27
concept     0.27
principle   0.26
aspect      0.26
thing       0.25
angle       0.25
Activations Density 0.337%