INDEX
Explanations
phrases indicating a request for trustworthy information or feedback
questions regarding trustworthiness and subscriptions to news content
Negative Logits
beit       -0.77
naire      -0.67
misunder   -0.63
alist      -0.61
esome      -0.60
hof        -0.60
liest      -0.59
eni        -0.58
Calais     -0.57
helicop    -0.57
Positive Logits
(token not shown)   0.91
utm                 0.90
Subscribe           0.83
Attend              0.78
Become              0.74
Replay              0.74
Content             0.71
Want                0.71
Visit               0.70
Try                 0.69
Activations Density: 0.021%