INDEX
Explanations
words related to physical actions or activities
phrases related to speed or performance metrics
Negative Logits
LGBTQ         -0.64
Sharia        -0.64
LGBT          -0.60
hijab         -0.60
transgender   -0.57
Pepe          -0.53
lesbian       -0.51
LGBT          -0.51
hammad        -0.51
Koran         -0.51
Positive Logits
lier          0.65
(>            0.63
pse           0.63
Bench         0.62
Interstellar  0.60
EStream       0.60
outl          0.60
pload         0.59
ivably        0.58
GHC           0.58
Activations Density: 1.941%