INDEX
Explanations
conversations between different individuals, possibly with some conflict or disagreement
elements of dialogue, particularly interjections and speaker labels in conversations
Head Attr Weights
0:0.06
1:0.09
2:0.08
3:0.09
4:0.04
5:0.22
6:0.06
7:0.02
8:0.07
9:0.09
10:0.09
11:0.03
Negative Logits
MFT        -1.38
gradient   -1.38
versions   -1.28
Mexicans   -1.27
Grad       -1.26
Mexican    -1.24
cro        -1.24
suitable   -1.23
sensitive  -1.20
alist      -1.20
Positive Logits
answer     1.75
lein       1.50
Answer     1.50
utch       1.46
kie        1.44
rike       1.44
leen       1.44
clair      1.43
Answers    1.43
Puzzle     1.41
Activations Density 0.012%