INDEX
Explanations
words related to civilized behavior and respectability
language associated with social norms and civility
Negative Logits
burst   -0.78
thur    -0.72
oons    -0.69
arial   -0.68
wered   -0.68
oult    -0.67
anas    -0.67
falls   -0.67
pain    -0.67
scl     -0.65
Positive Logits
reputable     0.88
respectable   0.84
sane          0.80
citiz         0.77
@@            0.77
@@            0.77
bourgeois     0.74
abiding       0.72
Ô             0.72
Wiki          0.72
Activations Density: 0.027%