Explanations
words related to communication and formalities, such as greetings and expressions of appreciation
Negative Logits
Token        Logit
ories        -0.75
mob          -0.67
snowball     -0.65
untreated    -0.65
debunked     -0.64
fights       -0.61
submerged    -0.61
crumbling    -0.61
myth         -0.61
fighting     -0.60
Positive Logits
Token              Logit
Thank              1.16
Honour             1.15
Thank              1.14
thank              1.11
thank              1.03
honour             1.01
gracious           1.00
congratulations    0.99
congratulate       0.99
Hon                0.99
Activations Density: 0.758%