Explanations
actions related to expressing opinions, debating critical issues, and discussing social justice topics
Negative Logits
Token                   Logit
Bucs                    -0.63
Uz                      -0.60
wont                    -0.59
Gh                      -0.55
Pis                     -0.55
Singh                   -0.54
--------------------    -0.54
Falcons                 -0.54
Goblin                  -0.53
Dub                     -0.53
Positive Logits
Token                   Logit
oneself                 1.49
ourselves               0.86
yourself                0.84
uate                    0.83
enance                  0.82
itate                   0.80
entious                 0.76
them                    0.72
yourselves              0.71
onom                    0.70
Activations Density: 0.187%