INDEX
Explanations
text fragments containing communication patterns like email responses
instances of written correspondence or email formats
Negative Logits
adversaries    -0.68
principals     -0.68
æ©             -0.67
comprom        -0.66
marches        -0.64
lenders        -0.62
exting         -0.62
åĤ             -0.62
angering       -0.62
escal          -0.61
Positive Logits
Quote          1.18
Hi             1.17
Excellent      1.16
Quote          1.16
Nice           1.13
Hello          1.11
Originally     1.11
wow            1.10
nice           1.10
yeah           1.08
Activations Density 0.106%