INDEX
Explanations
words related to politics and public figures
expressions of social justice concerns and systemic issues
Negative Logits
uca          -0.62
ction        -0.62
rad          -0.60
distraction  -0.57
imity        -0.56
activity     -0.56
heit         -0.55
uto          -0.55
wanting      -0.55
emies        -0.53
Positive Logits
³³³               0.83
³³³³³³³³³³³³³³³³  0.75
³³³³³³³³          0.74
³³³³              0.69
ECK               0.68
WD                0.62
reditary          0.61
↵Âł               0.61
eki               0.59
????????          0.59
Activations Density: 0.542%