INDEX
Explanations
phrases that express strong opinions or feelings about a subject
Head Attr Weights
0:0.01
1:0.03
2:0.06
3:0.05
4:0.15
5:0.02
6:0.34
7:0.11
8:0.03
9:0.02
10:0.07
11:0.06
Negative Logits
////////////////////////////////:-1.30
��:-1.26
edIn:-1.25
LOAD:-1.24
�:-1.24
Loading:-1.21
��:-1.20
Modes:-1.19
Schedule:-1.18
onomous:-1.16
Positive Logits
rison:1.65
Horowitz:1.51
aughs:1.39
nce:1.35
ipple:1.35
impression:1.34
ensical:1.32
onement:1.32
alks:1.31
arate:1.31
Activations Density 0.040%