INDEX
Explanations
statements or quotes within a text
instances of numerical ratings or scores associated with evaluations
Negative Logits
bear        -0.70
instit      -0.66
trades      -0.64
Jinping     -0.64
citiz       -0.63
retreat     -0.62
dilig       -0.62
lifes       -0.62
diseng      -0.62
goods       -0.61
Positive Logits
Unlike         0.98
Instead        0.98
Both           0.97
ccording       0.96
Specifically   0.95
Asked          0.94
Earlier        0.94
Although       0.93
Though         0.92
They           0.91
Activations Density: 0.431%