INDEX
Explanations
discussions around controversial social topics
Negative Logits
Anyway    -0.16
sez       -0.16
illon     -0.15
bbe       -0.14
Anyway    -0.14
aar       -0.14
kok       -0.14
pok       -0.14
milieu    -0.14
zb        -0.13
Positive Logits
apart     0.25
engr      0.23
void      0.20
priv      0.19
cater     0.19
able      0.17
ran       0.17
heavily   0.16
plaster   0.16
preca     0.16
Activations Density: 0.434%