INDEX
Explanations
phrases related to contentious or controversial topics or opinions
topics related to social issues and cultural perceptions
Negative Logits
Token      Logit
eatures    -0.65
swick      -0.59
VIDEOS     -0.59
ggles      -0.58
redes      -0.56
confir     -0.55
arthed     -0.54
acia       -0.53
everal     -0.53
OTOS       -0.53
Positive Logits
Token      Logit
because    1.09
somehow    1.08
unfairly   1.03
rather     0.96
?,         0.95
anymore    0.94
whereas    0.90
or         0.89
despite    0.87
deserving  0.86
Activations Density 0.490%
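
The density figure above is commonly read as the fraction of token positions on which the feature has a nonzero activation. The page itself does not state how the number was computed, so the following is only a minimal sketch under that assumption; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions where the feature fires.

    Assumption: a position counts as "active" when its activation is
    strictly greater than `threshold` (0.0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 1000 token positions, 5 of them active.
sample = [0.0] * 995 + [0.3, 1.2, 0.7, 2.1, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.490% would mean the feature fires on roughly 49 of every 10,000 tokens.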