Explanations
specific phrases and word combinations that don't seem to follow grammatical or contextual rules
expressions that indicate societal attitudes towards gender roles
Negative Logits
isSpecialOrderable    -0.72
é¾įå¥ij士              -0.70
cture                 -0.69
onymous               -0.65
successor             -0.64
ãĥ¯ãĥ³                -0.62
orney                 -0.61
çIJ                   -0.60
manent                -0.60
mere                  -0.60
Positive Logits
deserve               0.92
rejoice               0.92
alike                 0.86
behave                0.86
prefer                0.86
instinctively         0.85
notoriously           0.84
differ                0.83
thrive                0.82
flock                 0.82
Activations Density 0.444%