INDEX
Explanations
words related to emphasis or clarification of a statement, often indicating a contrast between appearances and reality
words that indicate clarification or contradiction of a statement
Negative Logits
regate      -0.83
inav        -0.72
enaries     -0.71
idences     -0.69
might       -0.69
ortmund     -0.68
Awakens     -0.68
rug         -0.66
doms        -0.65
would       -0.65
Positive Logits
synonymous      0.95
supposed        0.93
irrelevant      0.93
considered      0.93
indicative      0.93
incompatible    0.90
regarded        0.89
going           0.88
problematic     0.87
worth           0.87
Activations Density 0.281%