INDEX
Explanations
references to various societal issues and systems
Head Attr Weights
0:0.03
1:0.05
2:0.35
3:0.08
4:0.07
5:0.07
6:0.08
7:0.03
8:0.05
9:0.03
10:0.05
11:0.05
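The weights above can be read programmatically. The following is a minimal sketch, assuming each line has the form `head:weight` as listed; the names `RAW` and `parse_head_weights` are hypothetical and are not part of any dashboard API. It parses the twelve entries, picks the head with the largest weight, and sums the listed weights.

```python
# Head attribution weights copied verbatim from the listing ("head:weight" per line).
RAW = """0:0.03
1:0.05
2:0.35
3:0.08
4:0.07
5:0.07
6:0.08
7:0.03
8:0.05
9:0.03
10:0.05
11:0.05"""


def parse_head_weights(text):
    """Parse 'head:weight' lines into a {head_index: weight} dict (hypothetical helper)."""
    weights = {}
    for line in text.splitlines():
        head, value = line.split(":")
        weights[int(head)] = float(value)
    return weights


weights = parse_head_weights(RAW)

# Head with the largest attribution weight.
top_head = max(weights, key=weights.get)

# Sum of the listed weights; the displayed values are rounded, so this need not be exactly 1.
total = sum(weights.values())

print(top_head, weights[top_head], round(total, 2))
```

On these values, head 2 has the largest weight (0.35), while every other head is at 0.08 or below.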
Negative Logits
)=(        -1.63
ertodd     -1.52
rame       -1.51
ayan       -1.49
cand       -1.47
ivan       -1.45
rar        -1.44
Scand      -1.44
ograph     -1.41
ensical    -1.40
Positive Logits
cause       1.65
ngth        1.64
selves      1.61
◼           1.54
ylum        1.52
selves      1.52
outwe       1.51
repertoire  1.50
glers       1.50
asca        1.49
Activations Density 0.136%