In the mid-80s, in my first full-time position after college, I worked for a now-defunct software company doing artificial intelligence, specifically natural-language processing. The most significant project I worked on there was a text categorization system; I was the tech lead (this was 1987ish). The client was Reuters, which at the time had literal rooms full of people whose job was to skim news stories coming over the wire, attach categories to them, and send them back out quickly. Our job was to automate that -- or, more realistically, to automate the parts that machines could do and send a much smaller set of "don't know" cases to humans. I'm writing this from memory; it's been more than 30 years and the details are fuzzy.
I left that company and went on to do other things. I was vaguely aware that, at some point, the corpus of news stories we used for training data had been released publicly, by agreement between Reuters and my then-employer. I wasn't a researcher, wasn't in the NLP business anymore, and lost touch. Technology moves on, and I figured our little project had long since faded into obscurity.
Tonight I got email with a question about that data set. My name is in the README file as one of the original compilers, and somebody tracked me down.
Somebody still cares about that data set.
I Googled it. Our data set was popular for close to a decade, during which time people improved the formatting (SGML, baby!) and cleaned up some other things. It spawned a child -- the original either had, or had acquired, some duplicate entries, and the new one removed them. (The question I got was actually about the child data set.) And now I'm curious too, because I either don't know or don't remember how it got that way.