Explanations
arguments related to historical or social injustices
Negative Logits
Token          Logit
readcr         -0.16
portrayed      -0.16
roadcast       -0.15
denounced      -0.15
@student       -0.15
erves          -0.14
portray        -0.14
.operations    -0.14
Controls       -0.14
лÑİ            -0.14
Positive Logits
Token          Logit
says           0.61
states         0.49
Says           0.44
notes          0.44
says           0.43
claims         0.37
states         0.35
explains       0.35
recalls        0.35
notes          0.34
Activations Density: 0.519%