INDEX
Explanations
phrases expressing interests, preferences, and ethical concerns
Head Attr Weights
0:0.02
1:0.42
2:0.08
3:0.02
4:0.01
5:0.03
6:0.06
7:0.04
8:0.04
9:0.10
10:0.07
11:0.04
Negative Logits
�        -1.62
lio      -1.49
�        -1.48
Library  -1.42
pione    -1.41
武       -1.39
シャ     -1.37
CHA      -1.33
Globe    -1.33
ukemia   -1.31
Positive Logits
Governments  1.73
Poles        1.65
frogs        1.58
Bulgar       1.57
sew          1.54
raining      1.47
tigers       1.41
Regulations  1.36
frog         1.33
uphill       1.32
Activations Density 0.530%