INDEX
Explanations
language related to intentions, plans, or motives
references to intentions
Negative Logits
rooms          -0.82
ded            -0.78
thumbnails     -0.73
GS             -0.71
upon           -0.69
room           -0.69
sen            -0.67
Lear           -0.66
enegger        -0.66
Interstitial   -0.65
Positive Logits
intentions     0.97
omething       0.84
pring          0.77
motivations    0.75
uggest         0.75
behavi         0.73
afety          0.72
poons          0.72
intent         0.71
cape           0.71
Activations Density 0.029%