INDEX
Explanations
specific actions like reading, watching, examining, and talking
actions that involve reading, watching, or reviewing content
Negative Logits
Token         Logit
essert        -0.68
Els           -0.67
equal         -0.65
harbour       -0.65
amen          -0.63
interstitial  -0.62
arded         -0.62
unfairly      -0.62
Enjoy         -0.59
IVE           -0.58
Positive Logits
Token         Logit
hran          0.68
foregoing     0.67
assurances    0.67
¿½            0.64
acquaint      0.62
romy          0.61
reports       0.60
math          0.59
KH            0.59
KS            0.59
Activations Density: 0.491%