Explanations
phrases that indicate personal agreement or alignment with opinions and assessments
Negative Logits
Token          Logit
ishable        -0.72
ificial        -0.70
resil          -0.69
Fior           -0.66
unsuspecting   -0.64
andise         -0.62
hidden         -0.62
staffed        -0.62
ornia          -0.61
Joined         -0.60
Positive Logits
Token              Logit
tenets             1.00
premise            0.94
conclusions        0.93
proposition        0.88
wording            0.87
sentiments         0.87
principle          0.87
characterization   0.86
thesis             0.84
Terms              0.81
Activation Density: 0.094%