INDEX
Explanations
claims and statements regarding actions or behaviors
Head Attribution Weights (head index: weight)
0:0.02
1:0.01
2:0.17
3:0.26
4:0.08
5:0.04
6:0.05
7:0.05
8:0.06
9:0.06
10:0.10
11:0.04
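Ranked by weight, head 3 (0.26) and head 2 (0.17) carry most of the attribution, followed by head 10 (0.10). The sketch below only sorts the values listed above. It does not reproduce how the weights were computed, because this page does not say. The listed weights sum to 0.94 rather than exactly 1, which may come from rounding. `HEAD_ATTR_WEIGHTS` and `top_heads` are hypothetical names used for illustration.

```python
# Head attribution weights copied from the list above (head index -> weight).
# HEAD_ATTR_WEIGHTS and top_heads are hypothetical names for illustration.
HEAD_ATTR_WEIGHTS = {
    0: 0.02, 1: 0.01, 2: 0.17, 3: 0.26, 4: 0.08, 5: 0.04,
    6: 0.05, 7: 0.05, 8: 0.06, 9: 0.06, 10: 0.10, 11: 0.04,
}


def top_heads(weights: dict[int, float], k: int = 3) -> list[tuple[int, float]]:
    """Return the k (head, weight) pairs with the largest weight, largest first."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


if __name__ == "__main__":
    for head, w in top_heads(HEAD_ATTR_WEIGHTS):
        print(f"head {head}: {w:.2f}")
    # The listed values do not sum exactly to 1 (possibly due to rounding).
    print(f"sum of listed weights: {sum(HEAD_ATTR_WEIGHTS.values()):.2f}")
```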
Negative Logits
Coordinator    -1.71
Commodore      -1.49
Chr            -1.44
Garry          -1.43
ainment        -1.41
Corporate      -1.39
POLITICO       -1.38
Restoration    -1.38
GY             -1.37
Roh            -1.37
Positive Logits
reply          1.75
itutes         1.56
trigger        1.54
prostitutes    1.53
votes          1.51
violates       1.50
ocaust         1.46
�              1.44
wrong          1.42
"...           1.42
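Feature dashboards often build negative and positive logit lists like these with a simple projection. They multiply the feature's decoder direction by the model's unembedding matrix, then take the smallest and largest entries. Whether this page uses exactly that method is an assumption. The sketch further assumes the decoder direction is already in the residual-stream basis. `w_dec_row`, `w_u`, `vocab`, and `logit_extremes` are hypothetical placeholders, not a specific library's API.

```python
import numpy as np


def logit_extremes(w_dec_row: np.ndarray, w_u: np.ndarray,
                   vocab: list[str], k: int = 10):
    """Project one feature's decoder direction (shape: d_model) through an
    unembedding matrix (shape: d_model x vocab_size) and return the k most
    negative and k most positive (token, logit) pairs.

    Assumption: the decoder direction is already in the residual-stream basis.
    """
    logits = w_dec_row @ w_u          # shape: (vocab_size,)
    order = np.argsort(logits)        # token indices, ascending by logit
    neg = [(vocab[i], float(logits[i])) for i in order[:k]]
    pos = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return neg, pos


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, vocab_size = 8, 20
    vocab = [f"tok{i}" for i in range(vocab_size)]     # hypothetical tokens
    w_u = rng.standard_normal((d_model, vocab_size))   # hypothetical unembedding
    w_dec_row = rng.standard_normal(d_model)           # hypothetical decoder row
    neg, pos = logit_extremes(w_dec_row, w_u, vocab, k=3)
    print("negative:", neg)
    print("positive:", pos)
```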
Activation Density: 0.021%
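Activation density is commonly defined as the fraction of token positions where the feature's activation is above zero. That this page uses the same definition is an assumption. Under it, 0.021% means the feature fires on about 2 of every 10,000 tokens (0.00021 × 10,000 = 2.1). The minimal sketch below computes the fraction from an activation array. `activation_density` and its `threshold` parameter are hypothetical.

```python
import numpy as np


def activation_density(acts: np.ndarray, threshold: float = 0.0) -> float:
    """Fraction of token positions where the feature activation exceeds threshold.

    Assumption: "density" means the share of tokens with activation > threshold
    (threshold 0 by default).
    """
    return float(np.mean(acts > threshold))


if __name__ == "__main__":
    acts = np.zeros(100_000)
    acts[:21] = 1.3  # feature fires on 21 of 100,000 positions
    print(f"{activation_density(acts) * 100:.3f}%")  # prints 0.021%
```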