Explanations
links and prompts to visit external websites
phrases indicating the purpose or intent of providing information
Negative Logits
Token      Logit
conn       -0.85
oln        -0.78
Factor     -0.73
orbit      -0.69
Bridge     -0.69
illin      -0.69
issan      -0.69
jam        -0.68
Collins    -0.67
itton      -0.66
Positive Logits
Token            Logit
example          1.06
details          1.05
instance         1.00
awhile           0.88
gotten           0.88
clarification    0.87
directions       0.86
reasons          0.86
inspiration      0.84
updates          0.84
Activations Density: 0.074%