INDEX
Explanations
references to psychological concepts and diagnoses
Negative Logits
LEX         -0.17
ollar       -0.15
Goldberg    -0.15
Ì           -0.15
strike      -0.15
odega       -0.15
ãĥ¬ãĥ¼      -0.15
_SF         -0.14
strike      -0.14
Ĥ           -0.14

Positive Logits
Alice       0.57
Alice       0.50
alice       0.45
Wonderland  0.41
alice       0.39
Lewis       0.34
Lewis       0.30
Carroll     0.29
Alic        0.27
Jab         0.27
Activations Density 0.010%