This is the second genealogy standards proposal I am submitting to the FHISO Call For Papers. This current document was started back in August after submitting my last proposal (Asynchronous Collaboration), but life intervened and I didn’t finish the last 10% of the document until today.
The basic premise of this proposal is that names are very complicated and current genealogical standards are woefully inadequate in covering all naming conventions. The name of a single person might have over a dozen fields, and might evolve over time depending on where the person moves, or what titles the person acquires. If the person moves to a new country, the order of their name could change, or they might translate their name to adapt to the new country. Understanding all of these possibilities and being able to document them in such a way that they can be displayed properly and transferred between disparate applications and services is very important. The full proposal is attached in PDF format. I appreciate any feedback on the proposal in the comments below.
As part of the FHISO Call for Papers, I am submitting the following proposal as my contribution to furthering genealogy. The full proposal is available as a downloadable PDF at the end of this post, but let me give a brief outline here, excerpted from the introduction:
This proposal is a method for exchanging data between researchers that:
- Allows researchers to sync data between their trees without requiring the acceptance of all differences.
- Allows researchers to receive updates on changes to collaborator trees, even if their trees are not completely in sync.
- Enables the sharing of images and documents, even when those media files are very large (i.e. bigger than one could reasonably send via e-mail).
- Creates a decentralized system of sharing that is not dependent on researchers using the same application or service.
- Can facilitate finding other researchers that are researching the same individuals.
In addition to these benefits, the proposal also takes a look at the use of places, sources and events in genealogy, and how we can use external databases to assist us with these categories of data.
I welcome comments on the proposal and indeed hope it will spur a conversation on the ways we collaborate in genealogy. Please contribute your thoughts on the proposal below in the comments. Thank you.
When putting together a dictionary of names, it is useful to know the origin of names as well as their relative popularity. No book (in print, anyway) can have an unlimited number of names, and thus popularity can be useful in determining whether a particular name should be included. One very useful resource for name popularity in the US is the Social Security Administration (SSA)’s Popular Baby Names section of its web site.
The site includes names that have been registered with the SSA for the purpose of receiving a social security number. There are a few caveats to the data. First, the data does not include the names of everyone born in the US, only those who registered for a social security number. While most babies do receive social security numbers today, it’s not a complete database (though statistically it’s probably good enough). Second, Social Security didn’t start until 1935, and even then was geared more toward those working. Thus while most people in the US today receive social security numbers, the earlier data skews heavily toward men and is incomplete (not everyone registered). Data on names goes back to 1880, probably because that year is 55 years before the starting year, and I would guess people over the age of 55 probably could not register at the time. Lastly, the database is strict in terms of spelling, which means that if a name has five different spellings, it will show up five different times in the database. This obviously lowers the overall ranking of names with many spellings, but it does help show which specific versions of names are trending. I’ll give some good examples of that in a minute.
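The spelling strictness is easy to work around in code. The SSA publishes its national data as one file per year (yobYYYY.txt) with name,sex,count lines, so variant spellings can be summed into a combined total. Here is a minimal sketch in Python; the sample counts are illustrative rather than actual SSA figures, and the variant groupings are an assumption supplied by hand, since the data itself has no notion of two spellings being the "same" name:

```python
import csv
import io

# Sample rows in the SSA yobYYYY.txt format: Name,Sex,Count
# (illustrative counts, not the actual figures for any year)
sample = """Madison,F,14789
Maddison,F,918
Madisyn,F,589
Madyson,F,480
Sophia,F,22158
Sofia,F,6796"""

# Variant groups are supplied by the researcher; the SSA data
# treats every spelling as a distinct name.
variants = {
    "Madison": ["Madison", "Maddison", "Madisyn", "Madyson"],
    "Sophia": ["Sophia", "Sofia"],
}

# Parse the per-spelling counts.
counts = {}
for name, sex, count in csv.reader(io.StringIO(sample)):
    counts[name] = int(count)

# Sum each group into a combined total under one canonical spelling.
combined = {canon: sum(counts.get(v, 0) for v in group)
            for canon, group in variants.items()}

for canon, total in sorted(combined.items()):
    print(canon, total)
```

Ranking by the combined totals rather than the raw per-spelling counts would show how much a many-spellings name like Madison is undercounted by the official list.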
The reason this post is about girls’ names is that it turns out girls’ names are much more interesting than boys’ names (I hope to write about boys’ names in a future post, however). Out of the top 50 boys’ names, 40 were in the top 1000 names in the SSA database a hundred years ago. Out of the top 50 girls’ names, only 26 were in the top 1000. In other words, many more girls’ names came into common usage recently than boys’ names. One thing more recent names enable is figuring out the origin of a name’s popularity. While I’m sure one can figure out why names became popular 100 years ago (they had celebrities then too), it is considerably easier to figure out why a name became popular within one’s own lifetime.
Let’s start with a nice chart. This chart shows the popularity of the 10 most popular baby names for girls (as of 2012) over the past century.
You’ll notice there are some gaps in some of the lines, and some lines don’t start until fairly recently. That reflects the fact that names like Isabella and Ava are old names but dropped out of the top 1000 for extended periods (hence the gaps), and that names like Mia only entered the chart in 1964 and Madison in 1985.
I mentioned earlier that the spellings are strict, and thus multiple spellings show up as separate names. Let’s look at a few examples. Take a look at the table at the bottom of this article to see all the names in the top 50.
Number 9 on the 2012 list is Madison. Also on the list are Maddison (350), Madisyn (504) and Madyson (598). That follows a common pattern of one spelling of a name being much higher than other spellings.
I did notice a few exceptions to this pattern, however. Sophia is number 1 on the list, while Sofia is number 18. Zoey is number 20, while Zoe is number 30. Interestingly, Zoe (the less popular version) is an old name, while Zoey (the more popular version) only debuted in the top 1000 in 1995.
Chloe is number 11 on the list, and Khloe is number 55. Chloe shows up in the first year of the data, 1880, and declined in popularity until it disappeared from the list in 1940. It reappeared in 1982 and rose to its current position of 11. Khloe only emerged on the list in 2006. Oddly, the most famous Khloe, Khloe Kardashian, debuted on the television show Keeping Up with the Kardashians in 2007. Was she famous a year before her family’s television show? Is there another explanation for the emergence of this rare variant just one year before the show? I’m not enough of an expert on popular culture to determine that, but it seems reasonable that the rise of Khloe as a name and the emergence of Khloe Kardashian as a celebrity are linked.
One interesting girls’ name is Brooklyn. The name debuted on the list in 1990. In 1995 the variation Brooklynn shows up as well. Brooklyn is currently at number 29 on the list, and Brooklynn is at number 137. Interestingly, the most famous Brooklyn is Brooklyn Beckham, the son of famous footballer David Beckham and his wife Victoria Adams (famous as Posh Spice of the 1990s girl band the Spice Girls). Brooklyn Beckham was born in 1999. Brooklyn does not appear on the popularity chart as a boy’s name, however.
The origins of two more interesting stories of name popularity seem clearer.
In 1970 Eric Clapton and his band Derek and the Dominos released the love song Layla. In 1972 the name Layla debuted on the popularity list. By 1979 the name had dropped off the list. In 1992 Eric Clapton released his very popular album Unplugged, which includes a version of Layla. In 1993 the name Layla re-emerged on the popularity chart, where it currently sits at number 31.
Another music-connected name is Aaliyah. The singer Aaliyah came out with her first album in 1994, and her unusual name debuted on the name list the same year. Her name was so unusual that on her first album cover, its pronunciation is shown in the top right corner. In 2001 Aaliyah released her final album to great acclaim before dying in a plane crash the same year. She was a rising star in both music and acting, and the popularity of the name made a huge jump, from number 211 in 2000 to 95 in 2001. It has continued to rise, and currently sits at number 36.
Movies also have an influence on names. Savannah is an old name that shows up in the earliest years of the database, but by 1932 it had disappeared from the top 1000 list. In 1983 the name suddenly re-emerged at position 446 and rose to a peak of 30 in 2006–2007. So what happened to bring back this traditional name? In 1982 the movie Savannah Smiles, about a girl named Savannah, was released. The name currently sits at number 42 on the list.
What is amazing in all of these examples is the staying power of names boosted by media releases. The original run of Layla lasted only eight years on the list and peaked at number 741. The current streak, which began in 1993, is 21 years long, and the name peaked this year at number 31. Aaliyah has now been on the list for 19 years and is also at its peak this year, at number 36. After dropping off the list for more than 50 years, Savannah has now been on it for 30 years, peaking in 2006–2007 at number 30.
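Streaks like these are easy to compute once you have a name's year-by-year presence in the top 1000. A small sketch that finds the longest consecutive run of years (the year lists below are reconstructed from the runs described above, not pulled from the SSA data directly):

```python
def longest_streak(years_present):
    """Return (start, end) of the longest consecutive run of years."""
    best = (None, None)
    best_len = 0
    run_start = None
    prev = None
    for y in sorted(years_present):
        # A gap of more than one year starts a new run.
        if prev is None or y != prev + 1:
            run_start = y
        prev = y
        if y - run_start + 1 > best_len:
            best_len = y - run_start + 1
            best = (run_start, y)
    return best

# Layla's two runs as described above: 1972-1979, then 1993 onward
# (through 2013, the most recent year at the time of writing).
layla_years = list(range(1972, 1980)) + list(range(1993, 2014))
print(longest_streak(layla_years))  # → (1993, 2013)
```

The same function applied to every name in the database would make it simple to find other names with interrupted histories like Chloe and Savannah.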
So what do you think? Share your favorite girls’ baby name stories in the comments. A table of the top 50 names from 2012 is shown below, with each name’s position in the top 1000 from 10, 50 and 100 years earlier.
Top 50 Female Baby Names in 2012: a look at the top 50 female baby names in the US in 2012, with a look back 10, 50 and 100 years at where in the rankings these names fell in 2002, 1962 and 1912.

| Name | 2012 | 2002 | 1962 | 1912 |
| --- | --- | --- | --- | --- |
| Isabella | 3 | 14 | Not in top 1000 | 361 |
| Mia | 8 | 43 | Earliest 1964 | Earliest 1964 |
| Madison | 9 | 2 | Earliest 1985 | Earliest 1985 |
| Chloe | 11 | 25 | Not in top 1000 | 583 |
| Avery | 13 | 132 | Earliest 1989 | Earliest 1989 |
| Addison | 14 | 220 | Earliest 1994 | Earliest 1994 |
| Aubrey | 15 | 196 | Earliest 1973 | Earliest 1973 |
| Sofia | 18 | 112 | Not in top 1000 | 899 |
| Zoey | 20 | 268 | Earliest 1995 | Earliest 1995 |
| Harper | 24 | Earliest 2004 | Earliest 2004 | Earliest 2004 |
| Samantha | 26 | 9 | Not in top 1000 | Earliest 1958 |
| Brooklyn | 29 | 152 | Earliest 1990 | Earliest 1990 |
| Zoe | 30 | 60 | Not in top 1000 | 832 |
| Layla | 31 | 290 | Earliest 1972 | Earliest 1972 |
| Hailey | 32 | 31 | Earliest 1982 | Earliest 1982 |
| Kaylee | 34 | 57 | Earliest 1984 | Earliest 1984 |
| Aaliyah | 36 | 64 | Earliest 1994 | Earliest 1994 |
| Gabriella | 37 | 76 | Earliest 1974 | Earliest 1974 |
| Nevaeh | 39 | 189 | Earliest 2001 | Earliest 2001 |
| Savannah | 42 | 39 | Not in top 1000 | 657 |
| Alyssa | 44 | 12 | Earliest 1963 | Earliest 1963 |
| Taylor | 46 | 18 | Earliest 1979 | Earliest 1979 |
| Riley | 47 | 77 | Earliest 1990 | Earliest 1990 |
| Camila | 48 | 455 | Earliest 1997 | Earliest 1997 |
| Arianna | 49 | 114 | Earliest 1982 | Earliest 1982 |
| Ashley | 50 | 6 | Earliest 1964 | Earliest 1964 |
One of the biggest tasks in planning the dictionary I want to make is simply collecting and organizing the data I’m basing my definitions on, and putting it into some kind of searchable interface. Part of this is building a corpus, and part is organizing other sources and references that can help me. Some of the data sources I have are only available as print books, so in order to integrate them into my overall workflow, I’ve started scanning many of the books I’m using so I can search them on my computer. One example is a book in Hebrew on Jewish names. The main steps in digitizing the book are:
- Scanning the book
- Splitting the scanned pages into two pages (since each scan covers two pages)
- Running OCR on the book to make it searchable
So let’s take a look at how I go about these steps.
I start by scanning the book. I have a Brother multi-function scanner/printer which is particularly good because it offers full A3/Tabloid scanning (i.e. I can scan a full spread of two pages of a letter-size book). The scanner also has wireless networking, which is nice, although when scanning many pages it’s better to use a USB cable, which is faster. I use VueScan software for all my scan work. It’s an amazing piece of software that I’ve used for many years, and it works with almost every scanner in existence. I actually have three scanners, all from different manufacturers, and they all work flawlessly with VueScan. VueScan lets you scan multiple pages into a single PDF, so I use that option to create a single PDF of the entire book.
In general, scanning a book for OCR needs to be done at no less than 300dpi, and is better done at 600dpi. The higher the resolution, the more information the OCR software has when interpreting which letters it is seeing. In the case of Acrobat Pro at least, you can’t use a file scanned at more than 600dpi – if you do, it will first downsample the file to 600dpi and then apply the OCR.
This next step is one I’m still hoping to find a better solution for, but the current approach is a pretty neat way to solve the problem. The problem is that when scanning a book you usually scan two pages at once. So how do you split the scans so your document only shows one page at a time? If you’re planning on loading your book onto an iPad or other tablet, this step is very important.
What I discovered is a program called Briss that lets you crop PDF pages into multiple pages. Yes, that’s BRISS as in the Jewish circumcision ceremony (snip snip). What Briss does which is pretty neat is that it takes a look at all the pages in your PDF, and combines them into one overlaid image that shows you where the boundaries of all your pages are – in other words it allows you to crop the pages without losing any text just because one page was slightly shifted from the other pages. Here’s what that looks like:
Once you can see all the pages together, you just draw boxes over the pages so they will crop in the right places. See here:
The original idea for Briss was actually to let people scan books for digital readers, not only splitting the pages but also cropping out the extra white space (i.e. the margins), which is not necessary on an e-reader. In most cases you would put the first crop box on the left and the second on the right. In this case, as the book is in Hebrew and Hebrew is read right-to-left, the crop boxes are reversed. The next step is to simply output a cropped file, which creates a new PDF with the cropped pages in the correct order (and double the number of pages).
There are a few problems with Briss. First, I’ve tried to crop larger files (my scanner can scan full tabloid) and it has not loaded them properly. I’m not sure yet if the problem is size or resolution. The other problem is that there doesn’t seem to be a way to ensure the cropped pages are all the same size; it’s odd when every other page is slightly different in size. Some other programs try to address this problem, such as ScanTailor on Windows, but I haven’t found anything that splits pages without hiccups. I hope better software emerges for this in the future (or that Briss improves).
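At its core, the splitting step is just arithmetic on the page box, which is one reason the equal-page-size problem feels so solvable. Here is a minimal Python sketch that computes the two crop boxes for a spread, with an rtl flag to reverse their reading order for a Hebrew book; actually applying the boxes to a PDF would require a PDF library (pypdf, for example), which is outside this sketch:

```python
def split_boxes(width, height, rtl=False, gutter=0.0):
    """Return crop boxes (x0, y0, x1, y1) for the two halves of a
    two-page spread, in reading order.

    gutter trims a strip from the inner edge of each half (the
    book's binding area). For right-to-left books set rtl=True so
    the right-hand page comes first."""
    mid = width / 2.0
    left = (0.0, 0.0, mid - gutter, height)
    right = (mid + gutter, 0.0, width, height)
    return (right, left) if rtl else (left, right)

# A landscape tabloid scan is 1224 x 792 points (17 x 11 inches at
# 72 points per inch); for a Hebrew book the right half reads first.
first, second = split_boxes(1224, 792, rtl=True)
print(first)   # → (612.0, 0.0, 1224, 792)
print(second)  # → (0.0, 0.0, 612.0, 792)
```

Because every page gets boxes derived from the same width and height, the output halves are guaranteed to be uniform in size, which is exactly the consistency Briss doesn't enforce.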
Making the document searchable
So now the book is scanned, and the pages are split up so it looks like a normal digital book. The only problem is that the PDF is essentially just a series of images; there is no searchable text in the document. What we need now is OCR (Optical Character Recognition) software to generate text from the images of the words. VueScan actually has built-in OCR software, although Hebrew is not one of its currently supported languages (I’ve asked them about that, and they’ve said they will look into it). OCR software for English is relatively easy to find, with options ranging from essentially free to more expensive packages with more advanced features. One of the advanced features one can pay for is the ability to process more than one language at once. Some of the documents I’m looking at contain some combination of English, Hebrew, Russian, Polish and Yiddish; finding a program that can handle all of that at once is probably not realistic.
Most OCR programs will ask you what the ‘primary’ language is in the document. Part of the reason for this is that the OCR software can use dictionaries to improve its accuracy, but it needs to know which dictionary to use in order to do that…
Hebrew is not one of the better-supported languages in OCR software (it adds complexity by being read right-to-left, like Arabic). Of the major OCR software companies, few have full products on the Mac. In something all too typical, ABBYY, which makes the well-regarded FineReader OCR software, provides a Mac version based on version 8 of their software, which debuted in 2005. The current version, 11, which is much improved and supports many more languages including Hebrew, is not available on the Mac. Am I wrong not to want to pay for software that is 8 years out of date?
Luckily for a document that is solely Hebrew, there are some options. Adobe Acrobat Pro offers its own OCR software (apparently based on the Readiris engine), which does a reasonable job.
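Another option worth mentioning for a Hebrew-only document is the open-source Tesseract engine, which supports Hebrew (language code heb) and can emit a searchable PDF directly. The sketch below only builds the command line rather than running it, since Tesseract must be installed separately; whether the heb and eng traineddata files are present on a given machine is an assumption:

```python
def tesseract_cmd(image, outbase, langs=("heb",), searchable_pdf=True):
    """Build a Tesseract command line for OCRing a scanned page.

    langs are Tesseract language codes; multiple codes are joined
    with '+' so the engine considers all of them at once."""
    cmd = ["tesseract", image, outbase, "-l", "+".join(langs)]
    if searchable_pdf:
        cmd.append("pdf")  # emit outbase.pdf with an invisible text layer
    return cmd

# A page with mixed Hebrew and English text:
print(tesseract_cmd("page-001.png", "page-001", langs=("heb", "eng")))
```

The resulting list could be handed to subprocess.run() in a loop over all the page images, giving a free, scriptable alternative to the desktop OCR packages discussed above.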
Acrobat also offers a feature called ‘Optimize PDF’ which does OCR, then compresses the images so the file is much smaller. Once the text is recognized, there is no good reason to keep a very high resolution, uncompressed version of the pages in the document. It does something else as well: it analyzes each page and rotates it if it thinks the page is not straight. This is a common problem when scanning books, since the book doesn’t always lie perfectly flat. The result is a fully searchable document formatted to be viewable on both a computer and a tablet, if you want to take the document with you.
Some OCR software can detect double-page spreads and crop them automatically, which would be nice as it would eliminate the need for using a second program like Briss. I’d like to see if programs like Readiris 14 Pro can do that, and whether they support multiple languages (English, Hebrew, Yiddish, Polish, Russian, etc.) and multiple character sets (Latin, Hebrew, Cyrillic, etc.). Unfortunately, Readiris 14 Pro only supports scanning 50 pages at a time, not something conducive to scanning books. It seems an odd restriction for something labelled ‘Pro’. To scan unlimited pages you need to buy their $599 Corporate version, not something I’m planning to do.
In short there isn’t an ideal solution to the problem of digitizing books for research and corpus inclusion. I’ll continue to look at options available, and post about solutions I find that help streamline the process. If you scan books and run OCR on them, what solution have you come up with?
Web design has come a long way since the days of black text in a single font on a white background. Over the years typography on the web has jumped ahead, first to what were called ‘web safe’ fonts (a group of 18 fonts found on both Mac and Windows machines), and later to ‘Web Fonts’, which are downloaded from servers whenever a page loads. Web Fonts brought hundreds, even thousands, of new font options to web sites, without requiring users to have the fonts installed on their computers.
One of the major breakthroughs in Web Fonts was Google Web Fonts, which provided a large font server populated with what Google described as ‘open source fonts’ which people could use without any restrictions. Over time, other free web font services also came online, such as the recent offering from Adobe, called Adobe Edge Web Fonts.
One thing I noticed when looking at Google Web Fonts in the past was that there was no reason the fonts couldn’t be used on your desktop and in print as well. Many commercial web fonts restrict their usage and require additional licenses for print, but Google’s fonts are all free to use. This obviously occurred to other people as well: Google has renamed Google Web Fonts to the simpler Google Fonts, and made the fonts available on the desktop via Monotype’s SkyFonts service. SkyFonts is a program you install on your computer that syncs fonts from fonts.com (a Monotype site) to your desktop. If a font is improved by adding new characters, for example, SkyFonts will download and automatically update the font on your desktop.
This is an interesting reversal. First, desktop fonts were made available on the web. Now fonts created specifically for the web are making their way onto the desktop.
Will any of these fonts be useful for dictionary-making? Probably not. Still, finding the right font or fonts for a readable dictionary will be a challenge, and any and all resources that help make that happen are welcome.
I’ve been blogging about genealogy for a few years on my other blog, Blood and Frogs: Jewish Genealogy and More. I’ve found that when I go off-topic too much, some readers get annoyed, so I’m starting this new blog as a way to document my combined interest in the fields of Lexicography, Genealogy and Technology.
Merriam-Webster defines these three terms as:

Lexicography
1 : the editing or making of a dictionary
2 : the principles and practices of dictionary making

Genealogy
1 : an account of the descent of a person, family, or group from an ancestor or from older forms
2 : regular descent of a person, family, or group of organisms from a progenitor or older form : pedigree
3 : the study of family pedigrees
4 : an account of the origin and historical development of something

Technology
1 a : the practical application of knowledge especially in a particular area : engineering 2 <medical technology>
b : a capability given by the practical application of knowledge <a car’s fuel-saving technology>
2 : a manner of accomplishing a task especially using technical processes, methods, or knowledge <new technologies for information storage>
3 : the specialized aspects of a particular field of endeavor <educational technology>
I have a goal in combining these interests, which is to publish a dictionary of Jewish first names. While I’ve toyed with this idea for many years, and have been collecting data towards that end for some time, I have not yet begun to compile anything remotely looking like a dictionary. That’s because the art of compiling a dictionary, lexicography, is not something one reads a Dummies book about and then jumps into. If anything, lexicography is something of an arcane art. The big dictionary publishers like Merriam-Webster in the US and Oxford in the UK train their employees directly, and while there is some outside training available, the total number of lexicographers worldwide is small, and the knowledge is not widespread.
You might be thinking that a dictionary of names actually falls under a fourth term, onomastics:
1 a : the science or study of the origins and forms of words especially as used in a specialized field
b : the science or study of the origin and forms of proper names of persons or places
2 : the system underlying the formation and use of words especially for proper names or of words used in a specialized field
Technically that’s true, but I personally associate the term Onomastics with a scientific study of names, and I make no claims to my own study being scientific, even if I plan on using some scientific techniques for processing the data I collect.
Lexicography has undergone tremendous upheavals in recent decades, first through the introduction of computerized lexicography using massive electronic corpora, and second through the ongoing elimination of the printed dictionary as the primary method of accessing dictionary content. Another shift that has been a long time coming is the shift from prescriptive to descriptive dictionaries. While dictionaries in the 18th and 19th centuries tried to define the proper usage of all words in the English language, more recent descriptive dictionaries, exemplified by Webster’s Third New International, which came out in 1961, describe the current usage of words instead of trying to prescribe their proper usage.
Let’s take a look at all three changes in reference to my work here.
The concept of an electronic corpus, which collects large amounts of source literature and allows quick searching for real usage of specific words, is something that can be very useful for researching names as well. Certainly in terms of Jewish names, just making the bible and other Jewish texts searchable by name is very useful. The problem with the traditional corpus is the assumption that all data comes from original texts (i.e. that it is built by throwing in massive collections of written works such as books, newspapers and magazines). I’ve spent a number of years developing my own curated sources on names, which are already in database form. Integrating those into a traditional corpus is not so simple, and how I use my current research in conjunction with building a corpus is something that will take some time to work out.
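A name corpus doesn't need heavyweight tooling to start: even a simple keyword-in-context search over plain-text sources gives quick access to real usage. A minimal sketch using only the standard library (the sample text is illustrative):

```python
import re

def concordance(text, name, width=30):
    """Return keyword-in-context snippets for each occurrence of name.

    width is the number of characters of context kept on each side."""
    hits = []
    for m in re.finditer(r"\b%s\b" % re.escape(name), text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        hits.append(text[start:end].strip())
    return hits

sample = ("And Sarah said, God hath made me to laugh. "
          "And Sarah saw the son of Hagar.")
for snippet in concordance(sample, "Sarah"):
    print(snippet)
```

Real corpus software adds lemmatization, frequency counts and much more, but even this level of search, run over a folder of digitized texts, starts to answer the question "how is this name actually used?"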
As a fan of printed books, I find it sad to see the continual shift to electronic publishing for everything from novels to magazines to newspapers to dictionaries. In the case of encyclopedias, the general-purpose encyclopedia has essentially been eliminated by the rise of Wikipedia. People read their books on their Kindles, peruse their magazines on their iPads, and read their news on the web. Print dictionaries are still selling, but with printed encyclopedias almost extinct (specialist encyclopedias aside), is it reasonable to expect that print dictionaries won’t similarly disappear? Certainly there is a place for well-curated dictionaries, even online, but the perceived value of one dictionary versus another is falling all the time. People just search for definitions online, or use the built-in dictionary look-ups in their operating system. On Mac OS X, for example, any word can be selected, and with a simple right-click, a menu item allows the user to look up the word in a dictionary, a thesaurus and Wikipedia simultaneously. Who provides the definitions in this built-in dictionary and thesaurus? I don’t even know. Luckily for me, my pursuit is very specialist, so I think there’s more time for me to publish a printed book. I can also publish the dictionary on Kindle, as an iBook, or as an iPad app.
The concept of prescriptive and descriptive dictionaries is an interesting one, and certainly has a lot of relevance to name dictionaries. In some countries (as I described a bit tongue-in-cheek in a recent article on my other blog) parents can only name their children according to very specific rules, and in some cases from specific lists of approved names. You might consider that the prescriptive approach. A book that attempted to choose what was an appropriate name, and what wasn’t, would be prescriptive. A book that just documented what names are being chosen, would be descriptive.
One of the goals of the dictionary will be to look at the usage of Jewish given names over the past century, and how they have changed. This can be very useful for genealogists who might know the name of a relative in one country, but lost track of them when they moved to another. Jews who moved from Eastern Europe to the United States and Israel in the past century often changed their names, and trying to figure out what their new names became can be a guessing game. Hopefully the work I am doing will make that an educated guessing game.
So here this effort begins. I will be documenting my efforts to build a name corpus, to learn how to craft a dictionary, and to figure out the intricacies of taking the final data and exporting it to a publishing program like Adobe InDesign for printing. Hopefully other people out there will find some of my efforts useful for their own projects.