Explanations
Social media or news-related content, such as tweets and news updates
Negative Logits
audits -0.71
involuntary -0.66
indemn -0.65
confidentiality -0.63
volunt -0.62
subsistence -0.61
tsun -0.61
disadvant -0.60
conformity -0.59
settlements -0.58
Positive Logits
[unrecovered token] 1.65
[unrecovered token] 1.14
imgur 1.09
[unrecovered token] 1.06
twitch 1.01
redd 0.96
youtube 0.93
[unrecovered token] 0.93
blogspot 0.90
github 0.89
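The two logit lists rank vocabulary tokens by how strongly this feature pushes their output logits up or down. A common way to produce such lists, assumed here rather than confirmed by this page, is to multiply the feature's decoder direction by the model's unembedding matrix and take the extremes. The sketch below does that with NumPy; `top_logit_effects`, `decoder_vec`, `W_U`, and the toy vocabulary are hypothetical names and data, and it ignores any final layer norm the real model may apply.

```python
import numpy as np

def top_logit_effects(decoder_vec, W_U, vocab, k=10):
    """Rank tokens by the direct logit effect of one feature (hypothetical helper).

    decoder_vec: (d_model,) decoder direction of the feature (assumed).
    W_U: (d_model, d_vocab) unembedding matrix (assumed; layer norm ignored).
    vocab: list of d_vocab token strings.
    """
    logits = decoder_vec @ W_U          # (d_vocab,) logit contribution per token
    order = np.argsort(logits)          # token indices, most negative first
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return positive, negative

# Toy example: a 2-dimensional residual stream and a 4-token vocabulary.
vocab = ["imgur", "github", "audits", "settlements"]
W_U = np.array([[1.0, 0.5, -0.7, -0.2],
                [0.0, 0.0, 0.0, 0.0]])
decoder_vec = np.array([1.0, 0.0])
pos, neg = top_logit_effects(decoder_vec, W_U, vocab, k=2)
print(pos)
print(neg)
```

Because only the decoder direction and the unembedding are involved, this measures the feature's direct effect on the next-token prediction, not its effect through later layers.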
Activation Density: 0.014%
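Activation density is read here as the share of sampled tokens on which the feature fires. That interpretation and the firing threshold of 0 are assumptions, not taken from this page. Under them, 0.014% means roughly 14 of every 100,000 tokens. A minimal sketch, with `activation_density` as a hypothetical helper:

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Percentage of token positions where the feature fires (hypothetical helper).

    activations: the feature's activations over a sample of tokens (assumed input).
    threshold: an activation must exceed this to count as firing (assumed 0).
    """
    acts = np.asarray(activations)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size

# Toy example: the feature fires on 14 of 100,000 tokens.
acts = np.zeros(100_000)
acts[:14] = 0.5
print(activation_density(acts))
```

A very low density like this one is typical of a narrowly specialized feature that stays silent on most text.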