Comparing SuperMemo with other applications based on spaced repetition

From SuperMemopedia

Proposition

I am a great admirer of your original idea of SuperMemo and of your commitment to that idea. I take such commitment as an example to be followed by fellow researchers like myself. However, I have a certain level of skepticism about SuperMemo, as a good scientist should. Also, some things really annoy me, such as the buggy code, the unhelpful GUI design, the lack of algorithm openness, and the extraordinarily erratic updating of www.supermemo.com. But I am not writing to you to complain, and I can accept those annoyances if that is what it takes to learn rationally. I am no longer sure that I really need to, though. There is a multitude of open source spaced repetition applications that have some advantages over your product, such as being more stable, running under Linux, having a GUI that allows faster manipulation of the system, and the openness itself, which is a great incentive to me. I remained skeptical about the performance of the algorithms of SuperMemo's competitors until I realised that I do not have any data whatsoever to support these beliefs. My hypothesis is that the latest SuperMemo algorithm must outperform applications based on older ones, such as Mnemosyne, which is based on SM2, as the ideas have been evolving for more than 20 years now in the hands of a group of professional coders. That hypothesis is easy to test in a controlled manner. I am therefore asking whether the SuperMemo team could kindly provide me with some data to allow a comparative study to be made.

Without knowing what data, if any, I will be able to get hold of, I cannot design a methodology beforehand. I would also try to get data from Mnemosyne, which uploads reports from its users. I could use data from other sources if asked.

I'm looking for the following principles:

  1. Adequate statistical treatment. The statistics presented on the site as scientific evidence are actually very flawed, and usually consist of individual reports offered as evidence.
  2. Blinding. I hope that I can be blinded in some way to the origin of the data. My motivation for doing the research is to use it to choose software that I will use for life, hours at a time. If that does not get me to use scientific methods, nothing will. Because of this motivation, I am committed to seriousness, and if I cannot be effectively blinded, I will not analyse the data personally and will arrange for another person, whom I can blind, to do the work.
  3. Free publishing of results. That would of course occur independent of the findings of the study.
  4. Initially, I do not plan to publish the results formally as an academic paper; that would be of almost no value to me.

This is a type of scrutiny that I find honest and confident inventors, as I personally imagine you to be, would not frown at, and would actually be more than willing to cooperate with. So, what kind of data, if any, can I expect to have access to? I will make similar inquiries to the Mnemosyne developer and to others, if suggested to do so. I will then write a sketch of what I intend to do, submit it to both parties for approval, and then get the data, calculate the results, and write the article. I would also provide a preprint version to each party, so that a response, whatever the result, can be published alongside the main results. Note that if I get to the point of convincing you that I am a reliable and neutral person and get the data, I will publish the results no matter what they are, while allowing criticism of my work to be printed with them, without any kind of editing. I will do my best to prepare a reliable analysis plan, so as to earn your trust and the data. In no way did I intend to offend or otherwise annoy you with this email. My apologies if it did.

Ideas

  • all data of interest are available and exportable from SuperMemo itself (Windows version at least). See options such as text export (all item data), repetition history export (full history of repetitions, dates, grades, intervals, etc.) and other options on Tools : Statistics menu
  • SuperMemo claims to quickly ensure desired level of retention for the user-defined forgetting index. For reasonably high values of the forgetting index, it meets this criterion fast and accurately and the claim is it might not be possible to significantly improve the optimization process. In other words, neither SuperMemo itself nor other applications can add much value to optimizing for this criterion
  • No empirical study can quickly compare algorithms for spaced repetition, because the bottlenecks of the optimization crowd around long intervals (running into years). An algorithm that wins in a week may be useless for applications that go into years. Full repetition history in SuperMemo spans only 12 years (since the full record was introduced in 1996). For quick comparisons, you can only look back at existing records. However, you may find it hard to find any applications that have been collecting such records
  • I apologize for being so upfront about my feelings, but frankly your letter to SuperMemo smells like someone looking for a cheap way to publish an academic paper in their own name, not someone who did even basic research on the matter and is realistic and serious in their expectations. Serious research would take many years and require quite a number of users (many of whom would drop out over time), and even if the users participated for free, it might prove rather costly due to the time involved. I would applaud anyone who would do that, but be prepared to invest a lot of time. --TomD 19:43, 23 July 2008 (UTC)
  • We have a large set of user collections that could be used for such a project. The data can be exported in any format that is suitable for comparisons. Experience shows though that you get clearer results if you take just a few long-term well-maintained collections rather than a large set from disparate users learning different topics and with different levels of "SuperMemo skills"

Answer

For a solid comparison of repetition spacing algorithms you need a set of data that meets the following criteria:

  • each repetition record is a triple: [ID, date, grade]
  • the set is large enough for valid comparisons. Experience shows that 500,000 repetitions may be needed for solid approximations due to the stochastic nature of forgetting, and the difficulty of eliminating the interference from the casual use of knowledge that is the subject of learning (for this kind of dataset it is impossible to make volunteers learn material that is not supposed to be useful in any way)
  • the set spans a period that is long enough. Even 10 years may not be enough to produce a solid differentiation. It is relatively easy to design algorithms that perform well in the short run due to the high frequency of repetitions in the initial period. SuperMemo data spans two decades for the prosaic reason that its first computer applications were written only 20 years ago. Data that reflect the quality of the most recent algorithms span a mere 6-8 years. As the concept of spaced repetition only recently attracted the interest of developers, you may find it hard to get data that spans even half that period, i.e. 3-4 years!
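As a rough illustration of what can be recovered from such triples, the sketch below groups [ID, date, grade] records by item and reconstructs the actual inter-repetition intervals. The function name `intervals_by_item` and the sample records are hypothetical; this is not a SuperMemo export format, only one plausible in-memory shape for the data.

```python
from collections import defaultdict
from datetime import date


def intervals_by_item(records):
    """Group (item_id, date, grade) triples by item.

    Returns, per item, a list of (interval_days, grade) pairs: one pair for
    every repetition after the first, where interval_days is the number of
    days since the previous repetition of the same item.
    """
    by_item = defaultdict(list)
    for item_id, rep_date, grade in records:
        by_item[item_id].append((rep_date, grade))

    result = {}
    for item_id, reps in by_item.items():
        reps.sort(key=lambda r: r[0])  # chronological order
        result[item_id] = [
            ((cur[0] - prev[0]).days, cur[1])
            for prev, cur in zip(reps, reps[1:])
        ]
    return result


# Hypothetical sample data, not taken from any real collection.
records = [
    ("a", date(2008, 1, 1), 4),
    ("a", date(2008, 1, 5), 3),
    ("b", date(2008, 1, 2), 5),
    ("a", date(2008, 1, 20), 2),
]
out = intervals_by_item(records)
print(out["a"])  # [(4, 3), (15, 2)]
```

The first repetition of each item has no preceding interval, so an item seen only once contributes nothing to an interval-based comparison.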

As for the design of your comparison procedure, you can define your own optimization criteria (if you believe SuperMemo criteria are not adequate). SuperMemo optimizes learning to achieve a so-called "requested forgetting index". As forgetting indices differ for different items, full validation of SuperMemo can only be run on data where repetition quadruples are known: [ID, date, grade, forgetting index]. However, for the sake of comparisons, it is possible to generate data as above with all repetitions selected for the forgetting index equal to 10% (which is the default in SuperMemo software).
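A minimal sketch of how a measured forgetting index could be compared with the requested 10% follows. It assumes that a grade below 3 counts as a lapse (on a 0-5 grade scale) and that only repetitions after an item's first are passed in; both are assumptions of this example, and the function name is hypothetical.

```python
def measured_forgetting_index(grades, pass_grade=3):
    """Return the percentage of repetitions that ended in a lapse.

    A lapse is assumed here to be any grade below pass_grade.
    """
    if not grades:
        raise ValueError("no repetitions to analyse")
    lapses = sum(1 for g in grades if g < pass_grade)
    return 100.0 * lapses / len(grades)


# Hypothetical grades for ten repetitions (first repetitions excluded).
grades = [4, 5, 2, 3, 4, 1, 5, 4, 3, 4]
requested_fi = 10.0  # percent, the default mentioned above
fi = measured_forgetting_index(grades)
deviation = fi - requested_fi
print(fi, deviation)  # 20.0 10.0
```

A comparison between algorithms could then rank them by how small this deviation is, and how quickly it shrinks, over the same period.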

You can receive data meeting the above criteria at any time via e-mail (write to woz(AT)supermemo(.)com). You can define your own format in which you would like the data to be submitted.

If Mnemosyne is based on SuperMemo Algorithm SM-2, it can safely be predicted that it will compare poorly with newer SuperMemo algorithms. SuperMemo 2 did not optimize for the forgetting index, nor did it collect forgetting curve data. For the criteria defined above, it has been demonstrated to be significantly outperformed by newer algorithms such as SM-5 (1989) and SM-6 (1991). As of 2008, the newest algorithm, denoted Algorithm SM-11 (2002), is more accurate still. It produces a significantly faster convergence of the measured forgetting index to the requested level. Note that every user of SuperMemo 2006 can produce his or her own validation of the algorithm by using the forgetting index history graph available with: Tools : Statistics : Analysis : Use : Efficiency : Forgetting index

More questions

I am sketching the statistical analysis that I will use, and I usually do this before I have access to the data, to be as neutral as possible. I need to know a few things about the data:

  1. How many samples consist of triplets ([ID, date, grade]) and how many consist of quadruplets ([ID, date, grade, FI])?
  2. Are the samples grouped by subject, or is it impossible to tell whether two given samples come from the same person? If the samples are all pooled together, there is a confounding factor: different persons will often have incorrect dates on their computers, making the data less useful. Also, different persons have different learning abilities.
  3. Is it possible to tell what algorithm generated a given set of samples? For example, can I separate the data set into subsets consisting of all data from SM2, SM3, etc.?

If 2 and 3 are not met, it will not be possible to reconstruct the actual intervals that were generated by the algorithms, and yet another confounding variable will be introduced, because it will be impossible to tell how much the actual repetitions differ from the scheduled repetitions. Even without calculating a sample size, I believe that if 2 and 3 are not met, the sample size required to reach statistical significance will be astronomical, since there will be 3 sources of noise compounded: the variable characteristics of each subject, the variable and unknown compliance of each subject with the scheduled repetitions, and the unknown algorithm that generated a given sample. Also, if this analysis is not possible, I believe that it is far better to have a solid study over a short period of time than to have no solid study at all. So, perhaps it would be possible to design a prospective study. I'm sure there are plenty of volunteers willing to participate. Also, the sample size requirement would be dramatically reduced, as sources of random noise could be minimized a priori.

Ideas

  • FI can but does not need to be included in the analysis (especially as other spaced repetition applications may not even use such a concept). FI translates to intervals and these are registered as repetition dates
  • all exports from SuperMemo are collection-specific. This means that all samples generated will belong to a single user (unless a collection was used by two people simultaneously, which disqualifies the data, makes the algorithm useless, and is probably hardly ever practised by anyone with the most basic understanding of SuperMemo, unless by error)
  • detailed repetition history was implemented only in 1996. This means that only Alg-8 and Alg-11 will be subject to detailed analysis. These two should not differ significantly as all major improvements to Alg-11 are related to delayed or advanced learning, which is registered in data triplets (popular advance/delay options were added only after enhancing the algorithm)
  • in past experiments it was easy to find users of the algorithm, but difficult to convince anyone to use handicapped mutations (the primary reason for the high cost and failure of experiments)
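To illustrate the remark above that FI translates to intervals, here is a small sketch under an explicitly assumed model: an exponential forgetting curve R(t) = exp(-t/S), where S is a hypothetical per-item stability parameter in days. This page does not specify the model SuperMemo uses, so the numbers are illustrative only.

```python
import math


def interval_for_forgetting_index(stability_days, forgetting_index):
    """Return the interval (in days) at which recall probability drops to
    1 - forgetting_index, assuming R(t) = exp(-t / stability_days).

    forgetting_index is a fraction, e.g. 0.10 for 10%.
    """
    if not 0.0 < forgetting_index < 1.0:
        raise ValueError("forgetting_index must be strictly between 0 and 1")
    if stability_days <= 0:
        raise ValueError("stability_days must be positive")
    # Solve exp(-t / S) = 1 - FI for t.
    return -stability_days * math.log(1.0 - forgetting_index)


# Hypothetical item with S = 100 days.
t10 = interval_for_forgetting_index(100.0, 0.10)
t05 = interval_for_forgetting_index(100.0, 0.05)
print(round(t10, 2), round(t05, 2))
```

Under this assumed model, a lower requested FI yields a shorter interval for the same item, which is why the FI is already reflected in the repetition dates recorded in the triplets.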