INDEX
Explanations
instances of a specific phrase, probably related to a specific event or action
Negative Logits
²            -0.65
ãĥ«          -0.65
ãĥ¼ãĥ        -0.62
ãĥĺ          -0.59
present      -0.59
é¾įå¥ij士    -0.59
968          -0.58
herent       -0.58
hips         -0.58
ãĤ±          -0.58
Positive Logits
!,           0.95
!.           0.88
chy          0.86
!            0.81
!'           0.78
alian        0.76
ÃĥÃĤ         0.72
chwitz       0.69
lla          0.68
self         0.68
Activations Density 0.064%