INDEX
Explanations
references to the concept of gangs and related terminology
Negative Logits
o       -0.19
edn     -0.17
oze     -0.17
ed      -0.17
ukes    -0.16
eck     -0.16
oq      -0.16
oise    -0.16
eded    -0.16
edb     -0.16
Positive Logits
aroo    0.35
ladesh  0.30
rove    0.28
bang    0.26
ue      0.26
alore   0.26
ster    0.25
reen    0.24
lobal   0.24
nam     0.23
Activations Density 0.031%