Explanations
concepts related to social issues and structures
Head Attr Weights
0:0.18
1:0.03
2:0.01
3:0.11
4:0.28
5:0.05
6:0.03
7:0.02
8:0.11
9:0.07
10:0.01
11:0.03
Negative Logits
)",           -1.99
ラン          -1.97
ilaterally    -1.76
FINE          -1.72
ファ          -1.66
GER           -1.65
Victoria      -1.64
eteria        -1.59
Carrie        -1.58
actionDate    -1.54
Positive Logits
represents    3.24
extends       3.17
underscores   2.88
illustrates   2.75
reflects      2.60
lends         2.60
resembles     2.56
embodies      2.55
does          2.53
constitutes   2.53
Activations Density 0.054%