INDEX
Explanations
discussions about social and political issues
Negative Logits
istically   -1.16
istic       -0.96
ists        -0.86
ism         -0.85
istical     -0.84
aries       -0.79
ist         -0.78
isers       -0.76
izes        -0.74
isation     -0.69
Positive Logits
riter       1.15
ards        1.11
bour        1.07
ARDS        1.02
flake       1.01
cloth       0.97
dh          0.96
pot         0.95
intosh      0.94
sie         0.92
Activations Density 3.415%