INDEX
Explanations
references to specific events or quotes
expressions of pain or discomfort related to emotional experiences
Negative Logits
unsurprisingly  -0.62
moreover        -0.57
similarly       -0.57
surprisingly    -0.56
anwhile         -0.54
predictably     -0.54
additionally    -0.52
meanwhile       -0.51
reportedly      -0.50
furthermore     -0.50
Positive Logits
…"      0.93
…"      0.90
..."    0.86
…."     0.82
..."    0.81
fuckin  0.73
gonna   0.70
-"      0.70
?"      0.67
?'      0.66
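A minimal sketch of how token rankings like the two lists above are commonly produced. It assumes the score is the dot product of the feature's decoder direction with each token's unembedding vector, so the most negative and most positive scores give the two lists. The names `decoder_dir`, `unembed`, `logit_effects`, and `top_and_bottom` are hypothetical, and the exact scaling this dashboard applies is not stated here.

```python
from typing import Dict, List, Tuple


def logit_effects(decoder_dir: List[float], unembed: Dict[str, List[float]]) -> Dict[str, float]:
    # Hypothetical: score each token by the dot product of the feature's
    # decoder direction with that token's unembedding vector.
    return {tok: sum(d * u for d, u in zip(decoder_dir, vec)) for tok, vec in unembed.items()}


def top_and_bottom(
    effects: Dict[str, float], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    # Sort ascending by score: the first k entries are the most negative,
    # and the last k (reversed) are the most positive.
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    bottom = ranked[:k]
    top = ranked[::-1][:k]
    return top, bottom


# Toy example with a 2-dimensional residual stream and three tokens.
effects = logit_effects([1.0, 0.0], {"a": [0.9, 0.1], "b": [-0.5, 2.0], "c": [0.1, 0.0]})
top, bottom = top_and_bottom(effects, k=1)
print(top, bottom)
```

In this toy case, token `a` lands in the positive list and `b` in the negative list. A real dashboard would run the same ranking over the full vocabulary.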
Activation density: 1.484%
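A short sketch of what an activation-density figure like the one above usually measures. It assumes density is the percentage of sampled tokens on which the feature's activation is strictly above zero. The function name `activation_density` and its `threshold` parameter are hypothetical and are not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation")
    # Count the tokens where the feature fires above the threshold.
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# One active token out of four gives a density of 25%.
print(activation_density([0.0, 0.0, 1.2, 0.0]))
```

Under this definition, 1.484% means the feature fires on roughly 15 of every 1,000 sampled tokens.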