Skip to content

Latest commit

 

History

History
321 lines (239 loc) · 15 KB

README.md

File metadata and controls

321 lines (239 loc) · 15 KB

NOTE: This repo is/was a clone of https://github.com/DallanQ/Names, which is no longer available. Although given an Apache license by its original author, any updates incorporated from WeRelate should respect its Creative Commons Attribution-ShareAlike license.

The purpose of this project is to create a comprehensive database of name variants to include in searches whenever a particular name is searched. It is a better name-matcher than Soundex. Read more about the Variant names project

This readme explains how to incorporate name variants into your own website.

Search module

The search module contains everything you need to incorporate name variants into your own website.

  • Normalizer.java - converts a user-entered text string to a normalized form - converts non-roman characters to corresponding roman characters, removes noise words, prepends surname prefixes, converts -dotter endings to -son, etc. You must call this on user-entered text before calling any other function. See NormalizerTest.java for an example.

  • Searcher.java - the two main functions are getAdditionalIndexTokens and getAdditionalSearchTokens. See SearcherTest.java for an example.

    • Each name should be indexed under its normalized form as well as any tokens returned by getAdditionalIndexTokens. getAdditionalIndexTokens returns the soundex code for the name when the name rare. It does not return any tokens otherwise. Thus, most names require just a single token (the normalized name itself), while some names require two tokens.

    • To return variants for a user-specified name at search time, search for the normalized form of the user-specified name as well as additional tokens returned by getAdditionalSearchTokens. On average, getAdditionalSearchTokens will return 25-35 tokens to include in searches.

  • surnamePrefixedNames.txt contains a list of prefixed-surnames (e.g., McDonald), and their unprefixed roots (e.g., Donald). According to the labeled pairs provided by Ancestry, unprefixed roots need to be included in searches for a prefixed surname, and vice-versa. But determining whether a name is prefixed is not easy. That is, Vandyke and Ohare are prefixed, but Vance and Olson are not. This table attempts to identify common prefixed surnames. Rare surnames use the prefix list in searcher.properties to estimate whether they are prefixed.

  • givenname_similar_names.csv is a table of variants for the 70,000 most-frequent given names in Ancestry.com's database.

  • surname_similar_names.csv is a table of variants for the 200,000 most-frequent surnames in Ancestry.com's database.

Rare names not in the above tables appear less than once in every 5,000,000 names in Ancestry's database, are indexed under their Soundex code by getAdditionalIndexTokens. getAdditionalSearchTokens includes these names by including the Soundex code of the searched-for name as one of the tokens to search.

The above tables will be updated periodically to incorporate user modifications made at the Variant names project. As part of this project, people are being encouraged not only to improve the names, but also to review the changes made by others. A changes log (see Changes log) will be included here so you can browse the changes.

Searcher.java reads the two above tables to determine whether a name must be indexed under a Soundex token (if it is not in the table), and what additional names to include in searches. If you don't want to use java, you could pretty easily write your own code to read the two tables.

Installing the tables into a database

By default, Searcher.java reads the above two tables into memory. If you want to read the tables from a database instead of reading them into memory, do the following:

create table givenname_similar_names (
name varchar(255) not null,
similar_names varchar(4096) not null,
primary key (name));

mysqlimport --fields-enclosed-by='"' --fields-terminated-by=','
--lines-terminated-by="\\n" --local <dbname> givenname_similar_names.csv

create table surname_similar_names (
name varchar(255) not null,
similar_names varchar(4096) not null,
primary key (name));

mysqlimport --fields-enclosed-by='"' --fields-terminated-by=','
--lines-terminated-by="\\n" --local <dbname> surname_similar_names.csv

In addition,

  • copy c3p0.properties.example to c3p0.properties, customize it as needed, and make sure it is on your classpath

  • copy db_memcache.properties.example to db_memcache.properties, customize it as needed, and make sure it is on your classpth

Building

mvn install creates the normal jar file as well as one with all dependencies

Score module

The score module contains code to score a pair of names: return a number indicating how similar they are.

  • Scorer.java - contains the scoring function. See ScorerTest.java for an example of use. Remember to normalize the names beforehand.

  • DMSoundex.java - a simplified implementation of Daitch-Mokotov soundex

  • Nysiis.java - an implementation of NYSIIS

  • WeightedEditDistance.java - an edit distance algorithm, like Levenstein, but where each edit is assigned a weight, where the weights have been learned using an Expectation Maximization algorithm over the positive labled examples provided by Ancestry. By weighting edits, we can make the edit distance between ACE and APE greater than the edit distance between ACE and ASE.

  • FeaturesGenerator.java - when name pairs are scored, a set of features is generated for each pair. These features include whether the NYSIIS codes match, the Soundex codes match, the Refined Soundex codes match, the Daitch-Mokotov codes match, the value of the Levenstein distance, and the value of the Weighted Edit Distance. The weights for each feature were learned by running the labeled training data provided by Ancestry.com through Weka A number of other features were evaluated, but these features seemed to provide the best results.

  • SimilarNameGenerator.java returns a set of names similar to the specified name. See SimilarNameGeneratorTest.java for an example.

  • givennameClusters.txt and surnameClusters.txt - these files can be used to speed up SimilarNameGenerator. By clustering the names, rather than scoring each name against the specified name to determine if it is similar, we can score the cluster "centroids", and score the names in a cluster only if the centroid is closer than a certain threshold.

  • cmulex_lts.bin - this file was created by people at Carnegie Mellon University, and is used by the FreeTTS library to convert name strings to phonetic values. The phonetic values are used in WeightedEditDistance - it calculates the distance between the phonemes.

You might be tempted to index each non-rare name under its cluster centroid, so that you'd only have one token to look up at search time. You could do this, and you'll get reasonable results. But you'll get better results using the approach described in the search module. The reason is that while name X might be similar to name Y, and name Y might be similar to name Z, name X may not be very similar to name Z. In other words, similarity is reflexive but not transitive. Attempting to cluster names and searching on cluster centroids only, assumes transitivity, which doesn't exist.

Building

Before building Scorer.java you must install the customized freetts.jar and William Cohen's secondstring library from the external directory into your local maven repository:

mvn install:install-file -Dfile=score/external/freetts.jar -DgroupId=com.sun.speech -DartifactId=freetts -Dversion=1.2.2-threadsafe -Dpackaging=jar

mvn install:install-file -Dfile=score/external/secondstring.jar -DgroupId=com.wcohen.ss -DartifactId=secondstring -Dversion=20101021 -Dpackaging=jar

mvn install creates the normal jar file as well as one with all dependencies

Eval module

Use this module if you want to evaluate the precision and recall of your own coder or name variants tables against Ancestry's labeled pairs.

  • CodeEvaluator.java shows how to evaluate a code-based approach; modify this to incorporate your coder.

  • TableEvaluator.java shows how to evaluate a table-based approach.

  • SimilarNameAdder.java is used to generate a core name-variants file based upon names scoring above a certain threshold.

  • SimilarNameAugmenter.java adds additional names to a name-variants file. If you had your own custom list of name variants, they could be added to the core name-variants file using this function. '''If you do have your own list of name variants, we hope you share them with us.'''

  • AncestryGivennamePairs.csv, AncestrySurnamePairs.csv, and BorderSurnamePairs.csv contain the labeled pairs provided by Ancestry. However, BorderSurnamePairs was found to not help, so it was not used.

  • givenname_ancestry.txt and surname_ancestry.txt contain the positively-labeled pairs from Ancestry's labeled data.

  • givenname_behindthename.txt contains a set of related givennames graciously donated by BehindTheName.com, an excellent resource for understanding how given names are related. In particular, the file lists all given names that are one "hop" away from the specified name in the "Family tree" view at BehindTheName, are of the same gender, and are not an ancient or medieval name.

  • givenname_nicknames.txt contains a set of nicknames.

  • givenname_werelate.txt and surname_werelate.txt contain name pairs given by users at WeRelate.org, the book "Dunkling, Leslie and William Gosling, The New American Dictionary of Baby Names, published by Signet (New American Library), 1985", and the book "Patrick Hanks and Flavia Hodges, A Dictionary of Surnames, Oxford University Press, 1990."

  • givenname_orig.csv and surname_orig.csv contain the original core of the names-variants tables.

To create the name-variants tables in the search module, first, SimilarNameAdder was used to add all names scoring above a threshold of 2.3 to each of the 70,000 most-frequent givennames and above a threshold of 0.7 to each of the 200,000 most-frequent surnames. For performance reasons, SimilarNameAdder adds considers only names greater than the specified name, so SimilarNameAugmenter was used with the -p option to add the reverse pairs.

The result is stored in givenname_orig.csv and surname_orig.csv.

At this point, the precision and recall of the name-variants tables against Ancestry's labeled pairs were:

  • Given names: P/R=97.4/71.6
  • Surnames: P/R=89.4/76.4

Compared to Soundex:

  • Given names: P/R=97.2/64.6
  • Surnames: P/R=88.6/67.8

This represents a 20% decrease in false-negatives (names that should be searched together but are not) for given names, and a 27% decrease in false-negatives for surnames.

Next, _givenname_behindthename.txt_, givenname_nicknames.txt, and givenname_werelate.txt were added to the given names table, and surname_werelate.txt was added to the surnames table, using SimilarNameAugmenter from the Score module. The options passed to SimilarNameAugmenter were:

  • behindthename: -p -r

  • nicknames: -a

  • werelate: -p At this point, the Precision and Recall of the name-variants tables against Ancestry's labeled pairs were:

  • Given names: P/R=96.8/74.4

  • Surnames: P/R=89.2/76.8

So we end up with a 28% decrease in false-negatives for both givennames and surnames.

Finally, we added the positively-labeled training pairs: givenname_ancestry.txt and surname_ancestry.txt so the initial tables would be as complete as possible to create the tables found in the Search module. User-modifications to these tables as part of the Variant names project will further modify, and hopefully improve, these tables.

If you want to re-create the P/R numbers above, you won't be able to test using the name-variants files in the Search module, since they include the training data. Instead, you'll need to start with the givenname_similar_names_orig.csv or surname_similar_names_orig.csv file in this module and follow the process outlined above. Then run TableEvaluator.java with the "-t" option to pass in the table you just created.

Also, if you use TableEvaluator.java or CodeEvaluator.java with AncestryGivennamePairs.csv or AncestrySurnamePairs.csv, you'll see some warnings about invalid lines. This is to be expected. Some lines in the training data contain invalid names, like "nn", which is an abbrevation for "no name," not an actual name. The evaluators recognize this, skip over the line, and list it as a warning.

Service module

Use this module if you want to generate search and index tokens or score names by issuing REST calls. This module allows you to use the Variant names project as a REST service instead of as a library.

The service endpoints are:

  • index/type/name returns all tokens to index for the name.
  • search/type/name returns all tokens to search for the name.
  • score/type/name1/name2 returns the score of name1 and name2

In all cases, type is either "givenname" or "surname". You can request the result as XML or JSON by requesting a content type of "application/xml" or "application/json".

Building

mvn package creates a war file that you can deploy to a servlet container like Tomcat

Roadmap

Overall, given names had an average of 32 variants, and surnames had an average of 26 variants. For search engines like Lucene, that's not too bad, because each variant lookup results in the union of a generally-small result set. However, the most-frequent 1000 given names and surnames had an average of 61 and 51 variants respectively, which means that the most-frequently searched names require a lot of extra lookups.

We can do better.

One thing that can be done to reduce the number of extra lookups necessary when searching for variants is to cluster the variants into lots of small clusters. Then instead of looking up each variant separately, Searcher.getAdditionalIndexTokens would return the cluster code for each name, and Searcher.getAdditionalSearchTokens would return the distinct cluster codes for each of the variants. By clustering the names according to how frequently they appear together as variants, we would expect the number of distinct clusters for each set of variants to be much smaller than the number of variants themselves.

Another approach would be to index names using a high-precision, low-recall coder like Refined Soundex. Then at search time look up each of the distinct refined soundex codes for the variants.

I'd love some help tackling this problem. Mahout looks like a good tool to generate clusters.

Other projects

Check out other genealogy projects