
I need help with a complex regex

Challenge:
I have a file with genealogy information which I would like to extract (in Google Sheets) using regex.

Data:
Each cell contains one text entry. An entry has four main parts, two of which are optional and can vary slightly in format and content.

First there is always a number followed by a period. (This is the generation number.)
Second comes the name. It consists of one or more first and last names.
These two parts are always present.

They can be followed by birth and/or death information.
If there is birth information, it always comes directly after the name and starts with "b. ".
It can have a date and/or a location.
The date can be preceded by "circa", "before", or "before circa". The date itself is either a 4-digit year or, more commonly, the month name, day, and year. Example: "March 4, 1888"
After the year, a location (free text) may follow.
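
The date rules above can be sketched as a small pattern. This is only a rough illustration in Python, assuming the three qualifiers listed are the only ones and that a month name is a capitalized word; `QUALIFIER`, `DATE`, and `parse_date` are made-up names. It avoids lookarounds and backreferences, so it should carry over to the RE2 syntax Google Sheets uses, but that has not been verified there.

```python
import re

# Assumption: these three qualifiers are the only ones that occur.
# "before circa" is listed first so it is tried before plain "before".
QUALIFIER = r"before circa|circa|before"

# Either "Month D, YYYY" or a bare four-digit year.
DATE = rf"(?:({QUALIFIER}) )?((?:[A-Z][a-z]+ \d{{1,2}}, )?\d{{4}})"


def parse_date(text):
    """Return (qualifier, date) if text is a date in the described format, else None."""
    m = re.fullmatch(DATE, text)
    return m.groups() if m else None


print(parse_date("March 4, 1888"))
print(parse_date("before circa 1706"))
```

The qualifier and the date land in separate capture groups, so a missing qualifier simply comes back empty instead of shifting the other fields.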

If there is death information, it starts with "d. " and can contain the same information as above, i.e. a date and/or a location.
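
Putting the whole description together, one possible shape for a single pattern is sketched below in Python. It is not a tested Sheets formula. It assumes that birth and death are separated by "; " when both are present, that places never contain a semicolon, and that the qualifiers are limited to the three mentioned; all names (`ENTRY`, `FIELDS`, `parse_entry`) are hypothetical.

```python
import re

# Assumed building blocks; only RE2-friendly syntax is used.
QUAL = r"before circa|circa|before"
DATE = r"(?:[A-Z][a-z]+ \d{1,2}, )?\d{4}"    # "March 4, 1888" or "1888"
EVENT = rf"(?:({QUAL}) )?({DATE})?(?:, )?"    # qualifier, date, separator

ENTRY = re.compile(
    rf"^(\d+)\. (.+?)"              # generation number, name
    rf"(?: b\. {EVENT}([^;]*))?"    # optional birth: qualifier, date, place
    rf"(?:;? d\. {EVENT}(.*))?$"    # optional death: qualifier, date, place
)

FIELDS = ("generation", "name",
          "birth_qualifier", "birth_date", "birth_place",
          "death_qualifier", "death_date", "death_place")


def parse_entry(line):
    """Split one entry into named fields; missing or empty parts become None."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    return {k: (v or None) for k, v in zip(FIELDS, m.groups())}


print(parse_entry("9. Lussa Elofsdotter b. circa 1680; d. May 16, 1758, Bredsättra"))
```

Keeping the qualifier in its own optional group, and making both the date and the trailing ", " optional, is what lets "circa 1680" and "1742, Sverige (Sweden)" go through the same pattern.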

My best attempt is close, but it does not handle the special cases of "before" etc. well:
=ARRAYFORMULA(IFERROR(SPLIT(REGEXREPLACE(A:A,"^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$","$1|$2|$3|$4|$5|$6|$7|$8|$9"),"|")))


So the regex part of it is:
^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$


It works well for entries like this one:
2. Gunnar Helg Andersson b. October 22, 1921, Ormöga No. 3, Bredsättra, Kalmar, Sweden; d. January 1, 2021, Köpingsvik


But not for entries like:
7. Kierstin Danielsdotter b. before circa 1706
9. Lussa Elofsdotter b. circa 1680; d. May 16, 1758, Bredsättra
7. Olof Jönsson b. 1742, Sverige (Sweden); d. September 4, 1811
9. Nils Knutsson b. circa 1676, Istad, Alböke; d. circa April 17, 1729
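
To see where these entries go wrong, the pattern from the post can be run unchanged in Python as a rough check. Python's `re` is not the RE2 engine that Sheets uses, so treat this as an approximation; `re.ASCII` is passed on the assumption that it brings `\w`, `\d`, and `\s` closer to RE2's ASCII-only classes. `ORIGINAL` and `groups_of` are made-up names.

```python
import re

# The pattern from the post, unchanged (split over two lines only for width).
ORIGINAL = re.compile(
    r"^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?"
    r"(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$",
    re.ASCII,
)


def groups_of(line):
    """Return the nine capture groups for one entry, or None if there is no match."""
    m = ORIGINAL.match(line)
    return m.groups() if m else None


g = groups_of("7. Olof Jönsson b. 1742, Sverige (Sweden); d. September 4, 1811")
print("birth date :", g[4])   # group 5
print("birth place:", g[5])   # group 6
print("death part :", g[6])   # group 7
```

In this check the entry still matches, but the birth-date group stays empty because it only accepts the "Month D, YYYY" form, and the lazy birth-place group then absorbs everything up to the end of the line. The death part is lost as well, since the pattern requires a literal ", " after the death date and this entry has no death location.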
