INDEX
Explanations
phrases related to credibility and its assessment
Head Attr Weights
0:0.02
1:0.01
2:0.08
3:0.07
4:0.11
5:0.03
6:0.04
7:0.36
8:0.04
9:0.03
10:0.06
11:0.09
Negative Logits
ILCS:-1.65
RIP:-1.60
�:-1.50
ogram:-1.50
acht:-1.46
ograms:-1.44
Thumbnails:-1.43
cule:-1.38
lete:-1.35
ILA:-1.34
Positive Logits
assertions:1.87
assertion:1.69
denying:1.66
spurious:1.59
claim:1.57
unfounded:1.57
claims:1.55
anecdotal:1.54
fortified:1.52
unsupported:1.51
Activations Density 0.001%