Explanations
phrases indicating contrast or difference
phrases that indicate differences or variations between subjects
Negative Logits
Token      Logit
icide      -0.71
VICE       -0.67
indust     -0.66
record     -0.64
phies      -0.64
vice       -0.63
wind       -0.63
ongyang    -0.62
stop       -0.62
ocious     -0.62
Positive Logits
Token          Logit
Different      0.78
differing      0.77
":"/           0.77
Differences    0.77
personalities  0.71
depending      0.70
timelines      0.70
Original       0.68
Same           0.67
Race           0.67
Activations Density 0.583%