Assignment #4: Correlation, Regression, Chi-Square
Due October 28, 2021
Don't forget your cheat sheet from class on correlation and regression with SPSS.
Jump to lecture on: correlation, linear regression, or nonparametric techniques
TASK #1: CORRELATION
Models of reading all make predictions about how humans identify individual letters and words, and a typical prediction is that more frequent letters should be faster to recognize. Similarly, in cryptanalysis, it is important to know the naturally occurring frequencies of letters in the English language to create algorithms that can break coded messages. Some models predict that it is repetition of the physical pattern itself that produces the frequency advantage in memory, whereas others predict that it is simply the identity of the letter, independent of how exactly it was written or what font it was in (for similar arguments in phonetics, see Bob Port's work).
I wish to test the assumption that letters of higher frequency in their lower-case form also tend to be higher frequency in their upper-case form. Further, I would like to know if simply counting the number of pages in a dictionary dedicated to a letter is a good measure of its true upper-case frequency in English (as many have assumed). Since words vary in frequency, a letter that begins a frequent word will be encountered many times even if this is the only word that begins with that letter. Further, sentences are more likely to begin with certain words than others (hence, the initial letter is more likely to be encountered in capital form), so anything is possible.
Download the data file from https://www.dropbox.com/s/9cwpabr769dchde/Letters.txt?dl=0
The file contains 26 cases: one for each letter of the English alphabet. For each letter, I have counted the number of times it appeared in upper-case or lower-case form across several large text sources. In addition, I have counted the number of pages in the dictionary that begin with the letter as a crude measure of upper-case frequency to compare to the text counts. For more details, see Jones & Mewhort (2004). The variables (in order) are:
U_NYT: Uppercase frequency from New York Times
L_NYT: Lowercase frequency from New York Times
U_Web: Uppercase frequency from a web-spider scanning web pages
L_Web: Lowercase frequency from the same web-spider
U_News: Uppercase frequency from newsgroups off Google
L_News: Lowercase frequency from newsgroups off Google
U_Ency: Uppercase frequency from an encyclopedia
L_Ency: Lowercase frequency from an encyclopedia
U_Wiki: Uppercase frequency from Wikipedia
L_Wiki: Lowercase frequency from Wikipedia
Dict: Number of dictionary pages that begin with the letter
- Compute a correlation matrix using Pearson's r including all of the variables. Would you say that the assumption that letter frequency is consistent across cases is accurate? (i.e., how well do the upper and lowercase counts correlate within text corpora?)
- Do the uppercase counts from a particular text source correlate better with the lowercase counts from the same text source, or with the uppercase counts from different text sources?
- Is there a significant correlation between the simple measure of counting pages in a dictionary (Dict) and the uppercase counts from each of the text sources?
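If you would like to check a few cells of your SPSS correlation matrix by hand, the following is a minimal Python sketch of Pearson's r and the t statistic used to test it (df = n - 2). The function names and the three toy columns are made up for illustration; the numbers are placeholders, not values from Letters.txt.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sp_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp_xy / math.sqrt(ss_x * ss_y)

def t_for_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every pair of named columns."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b]) for a in names for b in names}

# Made-up placeholder counts for five letters; NOT values from Letters.txt.
toy = {
    "U_toy": [12, 30, 7, 22, 15],
    "L_toy": [110, 290, 80, 200, 160],
    "Dict_toy": [40, 35, 10, 50, 20],
}
matrix = correlation_matrix(toy)
for (a, b), r in matrix.items():
    if a < b:  # print each pair of variables once
        print(f"r({a}, {b}) = {r:.3f}, t({len(toy[a]) - 2}) = {t_for_r(r, len(toy[a])):.2f}")
```

With the real data, each column would hold 26 values (one per letter), so each test has 24 degrees of freedom.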
References
Jones, M. N., & Mewhort, D. J. K. (2004). Case-sensitive letter and bigram frequency counts from large-scale English corpora. Behavior Research Methods, Instruments, and Computers, 36, 388-396.
TASK #2: MULTIPLE REGRESSION OF CHESS EXPERTISE
You believe that chess ability involves considering many different possibilities per move and a good short-term memory (STM) capacity. Your friend has an alternative opinion that chess ability is completely determined by crystallized intelligence (chess wisdom), which increases with age. You set up a booth at a chess tournament and measure eye movements while participants play a game of chess; you convert these measurements into the mean number of squares scanned per move (this is your estimate of how many possibilities a player considers each move). You also measure STM capacity using the digit span task, and record each participant's age. Each of your participants then competes in the tournament, and at the end of the day you calculate the proportion of wins (# of games won / total games played). You perform a linear regression using mean squares scanned, STM capacity, and age as predictors.
You can find the data located at https://www.dropbox.com/s/kt7klz8rbxcnoc3/Chess.txt?dl=0
The data are tab-delimited with variables in the following order: proportion of wins, squares scanned, STM capacity, and age. Read the data into SPSS and conduct a multiple regression to predict proportion of wins (dependent variable) from the predictor variables. Nothing fancy here, just enter all the variables at once as in our first regression example. Include your output and answer the following questions:
- Using the ANOVA table, is the overall regression equation significant?
- What is the multiple correlation (R) between the set of predictors and proportion of wins?
- Consulting the t-tests in the coefficients table, which predictors are contributing significant variance in predicting wins?
- Write out the regression equation predicting wins from the significant predictors (hint: use the B values, i.e., the unstandardized coefficients, to fill in the basic regression equation).
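To see where the B values come from, here is a rough Python sketch that fits an equation of the form wins' = b0 + b1(squares) + b2(STM) + b3(age) by ordinary least squares and reports the multiple correlation R. It solves the normal equations directly, which is an illustrative shortcut and not necessarily what SPSS does internally. The function names and the toy numbers are invented for this example and are not taken from Chess.txt.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back-substitution
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

def fit_ols(rows, y):
    """Return [b0, b1, ...] minimising squared error; b0 is the intercept."""
    design = [[1.0] + list(r) for r in rows]  # prepend the constant term
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def multiple_r(rows, y, coefs):
    """Multiple correlation R: square root of the proportion of variance explained."""
    predicted = [coefs[0] + sum(b * v for b, v in zip(coefs[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - yhat) ** 2 for yi, yhat in zip(y, predicted))
    return max(0.0, 1.0 - ss_resid / ss_total) ** 0.5

# Made-up rows of (squares scanned, STM capacity, age); NOT values from Chess.txt.
toy_rows = [(10, 5, 20), (14, 7, 35), (8, 6, 50), (16, 8, 28), (12, 4, 41), (18, 9, 60)]
toy_wins = [0.30, 0.55, 0.35, 0.70, 0.40, 0.80]

b0, b1, b2, b3 = fit_ols(toy_rows, toy_wins)
print(f"wins' = {b0:.3f} {b1:+.3f}(squares) {b2:+.3f}(STM) {b3:+.3f}(age)")
print(f"R = {multiple_r(toy_rows, toy_wins, [b0, b1, b2, b3]):.3f}")
```

The sketch does not compute the ANOVA F test or the t-tests on the coefficients; take those from your SPSS output.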
TASK #3: NOMINAL DATA
The chi-square procedure is particularly useful when observed data represent a nominal scale of measurement, that is, when we are dealing with categorical or frequency data. These nominal data are generated by sorting and counting: sorting the data into discrete, mutually exclusive categories and then counting the frequency of occurrence within each category. The statistical analysis of nominal data is sometimes also called categorical data analysis.
Below are the correct responses to 20 multiple-choice questions from a test in one of my undergraduate classes (1=A, 2=B, 3=C, 4=D).
2, 3, 3, 1, 2, 3, 3, 3, 3, 3, 2, 4, 3, 3, 2, 1, 1, 3, 3, 4
Enter the nominal data into SPSS as a single variable (named alternative in the syntax below) and conduct a Chi-Square test to determine if my selection of positions for the correct alternative is distributed randomly. In your syntax editor, enter:
NPAR TEST /CHISQUARE=alternative.
Write out a brief results section for this “study.” Can you conclude that my selection of the correct response is random?
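As a cross-check on the SPSS output, the goodness-of-fit statistic can be computed by hand. The Python sketch below assumes equal expected frequencies across the four alternatives (the null hypothesis that positions are chosen at random). The function names are my own, and the p-value helper uses a closed form that is valid only for 3 degrees of freedom.

```python
import math
from collections import Counter

# Answer key from the assignment (1=A, 2=B, 3=C, 4=D).
responses = [2, 3, 3, 1, 2, 3, 3, 3, 3, 3, 2, 4, 3, 3, 2, 1, 1, 3, 3, 4]
categories = [1, 2, 3, 4]

def chi_square_gof(data, cats):
    """Goodness-of-fit chi-square against equal expected frequencies."""
    counts = Counter(data)
    observed = [counts.get(c, 0) for c in cats]
    expected = len(data) / len(cats)  # same expected count in every category
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    return observed, expected, chi2, len(cats) - 1

def p_value_df3(chi2):
    """Upper-tail probability of a chi-square statistic; valid for df = 3 only."""
    return (math.erfc(math.sqrt(chi2 / 2))
            + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

observed, expected, chi2, df = chi_square_gof(responses, categories)
print("observed:", observed, "expected per category:", expected)
print(f"chi-square({df}) = {chi2:.2f}, p = {p_value_df3(chi2):.4f}")
```

The observed counts, the statistic, and the degrees of freedom should match the corresponding entries in the SPSS tables.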