INDEX
Explanations
references to political contexts and sentiments
Negative Logits
Decomp     -0.17
cí         -0.16
dera       -0.15
.imag      -0.15
chen       -0.15
urance     -0.15
705        -0.14
imitives   -0.14
ropa       -0.14
ždy        -0.14
Positive Logits
although     0.22
especially   0.21
although     0.20
especially   0.18
Although     0.17
Although     0.17
ES           0.16
surtout      0.15
aunque       0.15
oder         0.15
Activations Density 0.132%