Explanations
phrases that contrast viewpoints or actions between different entities
references to "others" in various contexts
Negative Logits
"},"           -0.82
Opening        -0.69
Pen            -0.68
Awesome        -0.65
Alright        -0.64
United         -0.64
SI             -0.63
Rated          -0.63
Annotations    -0.62
RNA            -0.62
Positive Logits
merely         1.08
simply         1.05
prefer         0.94
succumb        0.89
remain         0.80
cling          0.79
rely           0.78
just           0.78
are            0.77
opt            0.76
Activations Density 0.124%