INDEX
Explanations
references to contrasting options or aspects within a scenario
Negative Logits
Token        Logit
1915         -0.72
udence       -0.59
1903         -0.57
1919         -0.56
1906         -0.55
1918         -0.55
reintrodu    -0.55
1961         -0.55
2024         -0.54
1912         -0.54
Positive Logits
Token        Logit
worldly      1.90
than         1.09
wise         1.07
etheless     0.95
ials         0.89
Redd         0.88
than         0.87
ially        0.85
parts        0.83
Than         0.80
Activations Density: 0.049%