INDEX
Explanations
references to social media
Negative Logits
so        -0.19
etro      -0.16
(es       -0.15
ning      -0.15
sei       -0.15
odian     -0.15
holds     -0.14
ìŀ¡       -0.14
rog       -0.14
-await    -0.14
Positive Logits
[token not shown]    0.20
/social              0.19
platforms            0.17
[token not shown]    0.17
outlets              0.17
eval                 0.17
[token not shown]    0.17
presence             0.16
IRROR                0.15
outlet               0.15
Activations Density 0.009%