INDEX
Explanations
names and dates in social media posts
references to Twitter handles or user mentions
Negative Logits
harms          -0.75
houses         -0.72
itory          -0.71
attractions    -0.68
pockets        -0.68
clothes        -0.68
criminals      -0.66
fruit          -0.64
medicines      -0.63
Ĥª             -0.63
Positive Logits
TPS            1.00
76561          0.83
Twe            0.78
>]             0.71
Ùħ             0.70
VIDEOS         0.69
Official       0.69
VERTIS         0.68
Patch          0.67
Originally     0.66
Activations Density 0.043%