Explanations
references to social etiquette and courtesy
Negative Logits
olla      -0.14
127       -0.14
owi       -0.13
wend      -0.13
/is       -0.13
icc       -0.13
stren     -0.12
arel      -0.12
ност      -0.12
incons    -0.12
Positive Logits
courtesy    0.52
Courtesy    0.46
Courtesy    0.42
etiquette   0.42
manners     0.42
civ         0.41
courteous   0.41
polite      0.38
礼          0.37
polit       0.37
Activations Density 0.531%