INDEX
Explanations
themes related to political discourse and criticism of societal norms
Negative Logits
entina      -0.17
iller       -0.17
stron       -0.15
illard      -0.15
ibold       -0.15
WithValue   -0.15
ailable     -0.14
irket       -0.14
alah        -0.14
achu        -0.14
Positive Logits
claim       0.23
claims      0.22
claiming    0.21
claimed     0.21
Claim       0.19
saying      0.18
CLAIM       0.18
Claims      0.18
argument    0.17
complain    0.17
Activations Density: 0.350%