Data structure and file formats for kinship data
Kinship data can be stored in files of different formats:
Open/Save data
Import/Export
- BAR Text, OpenOffice and Excel format (".bar.txt", ".bar.ods", ".bar.xls")
- Pajek Network format (".paj")
- Gedcom format (".ged")
- Kinship editor xml format
- Prolog format
The file extension is required so that Puck and Kinsources can load the dataset correctly.
Puc format (file extension .puc)
The Puc format is a compressed XML format. Proposed by the ANR Project Kinsources Team, this data model was developed to represent rich kinship information in a dataset. It is hosted on the GitLab repository manager: https://gitlab.com/kinsources/Puc_format
All the information coded with Puck software's editor can be stored in ".puc" format.
The XML format was chosen for its simplicity, openness, flexibility, extensibility, and durability. XML 1.0 dates from 1998, and files from the '90s are still perfectly readable by a wide range of software, whatever the operating system (Windows, Mac, Linux, Unix...). Moreover, the XML format is widely used and supported by the W3C, and almost all programming languages include libraries, packages, or special functions for XML management (create, read, modify, query...).
The Puc format can encode very large datasets (there is no size limit), so we chose to compress the XML file in a zip archive (about 10 times smaller). A .puc file can therefore be opened with any archive manager: the uncompressed file is the XML file.
Figure: .puc file ⇒ uncompress with archive manager software ⇒ XML file
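As a minimal sketch of the same step done programmatically, the following Python snippet opens a .puc archive with the standard `zipfile` module and returns the XML text. It assumes that the archive holds exactly one XML member encoded in UTF-8; the function name `read_puc_xml` and the member name `exemple.xml` are hypothetical and not part of Puck.

```python
import io
import zipfile


def read_puc_xml(source) -> str:
    """Return the XML text stored in a .puc archive (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        # Assumption: the archive contains a single XML member.
        names = archive.namelist()
        if len(names) != 1:
            raise ValueError(f"expected one member, found {names}")
        # Assumption: the XML is encoded in UTF-8.
        return archive.read(names[0]).decode("utf-8")


# Build a tiny in-memory archive so the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("exemple.xml", '<corpus version="PUCK-1.0"/>')
buffer.seek(0)

print(read_puc_xml(buffer))
```

With a real dataset, a file path such as `read_puc_xml("exemple.puc")` can be passed instead of the in-memory buffer.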
The XML follows an XML Schema Definition available on the GitLab repository: https://gitlab.com/kinsources/Puc_format
The XML file looks like this:
    <?xml version="1.0" encoding="UTF-8"?>
    <corpus version="PUCK-1.0" generator="PUCK" date="2013-05-17T04:04:32.672+02:00" filename="exemple.puc">
      <individuals size="20">
        <individual id="1">
          <name>H 1</name>
          <gender>MALE</gender>
          <attributes size="2">
            <attribute>
              <label>SECTOR</label>
              <value>F</value>
            </attribute>
            <attribute>
              <label>HOUSE</label>
              <value>1</value>
            </attribute>
          </attributes>
        </individual>
        <individual id="2">
          <name>F 2</name>
          <gender>FEMALE</gender>
          <attributes size="2">
            <attribute>
              <label>SECTOR</label>
              <value>F</value>
            </attribute>
            <attribute>
              <label>HOUSE</label>
              <value>1</value>
            </attribute>
          </attributes>
        </individual>
        (...)
      </individuals>
      <families size="103">
        <family id="2">
          <unionStatus>MARRIED</unionStatus>
          <father>1</father>
          <mother>2</mother>
          <children>3 4</children>
          <attributes />
        </family>
        (...)
      </families>
      <relationModels size="2">
        <relationModel>
          <name>Parrainage</name>
          <roles size="2">
            <role>
              <name>PARRAIN</name>
            </role>
            <role>
              <name>FILLEUL</name>
            </role>
          </roles>
        </relationModel>
        (...)
      </relationModels>
      <relations size="8">
        <relation>
          <id>1</id>
          <name>alpha</name>
          <model>Parrainage</model>
          <actors size="2">
            <actor>
              <role>PARRAIN</role>
              <individualId>10</individualId>
            </actor>
            <actor>
              <role>FILLEUL</role>
              <individualId>20</individualId>
            </actor>
          </actors>
          <attributes size="2">
            <attribute>
              <label>LIEU</label>
              <value>Paris</value>
            </attribute>
            <attribute>
              <label>STATUS</label>
              <value>Civil</value>
            </attribute>
          </attributes>
        </relation>
        (...)
      </relations>
    </corpus>
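To illustrate how such a file might be read, here is a hedged Python sketch based on the standard `xml.etree.ElementTree` module. The element names are taken from the sample above; the shortened `SAMPLE` document and the function `load_corpus` are assumptions made for this illustration and do not replace validation against the XML Schema Definition.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical document using the element names of the sample above.
SAMPLE = """
<corpus version="PUCK-1.0">
  <individuals size="2">
    <individual id="1"><name>H 1</name><gender>MALE</gender>
      <attributes size="1">
        <attribute><label>SECTOR</label><value>F</value></attribute>
      </attributes>
    </individual>
    <individual id="2"><name>F 2</name><gender>FEMALE</gender>
      <attributes size="0"/>
    </individual>
  </individuals>
  <families size="1">
    <family id="2">
      <unionStatus>MARRIED</unionStatus>
      <father>1</father><mother>2</mother>
      <children>3 4</children>
      <attributes/>
    </family>
  </families>
</corpus>
"""


def load_corpus(xml_text):
    """Return (individuals, families) dictionaries keyed by id."""
    root = ET.fromstring(xml_text.strip())
    individuals = {}
    for node in root.iter("individual"):
        attributes = {
            a.findtext("label"): a.findtext("value")
            for a in node.iter("attribute")
        }
        individuals[int(node.get("id"))] = {
            "name": node.findtext("name"),
            "gender": node.findtext("gender"),
            "attributes": attributes,
        }
    families = {}
    for node in root.iter("family"):
        families[int(node.get("id"))] = {
            "status": node.findtext("unionStatus"),
            "father": node.findtext("father"),
            "mother": node.findtext("mother"),
            # In the sample, children are a whitespace-separated list of ids.
            "children": [int(c) for c in (node.findtext("children") or "").split()],
        }
    return individuals, families


individuals, families = load_corpus(SAMPLE)
print(individuals[1]["name"], families[2]["children"])
```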
Tabular formats (file extensions .txt , .ods and .xls)
The table format is the simplest format for entering data "manually" (without using the Puck data entry form or another genealogy program).
The text format (.txt) is a tab-delimited text file, composed of two (or more) blocks separated by an empty line. It can be opened with any text editor. The table format can be opened with OpenOffice (.ods) or Excel (.xls).
Table and text formats are organized in exactly the same way, except that the different spreadsheets of the table format become different blocks in the text format.
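The block structure of the text format can be sketched in a few lines of Python. The function `split_blocks` is a hypothetical helper written for this illustration (it is not part of Puck); it treats a line without any content as a block separator and splits every other line on tab characters.

```python
def split_blocks(text):
    """Split a tab-delimited text file into blocks of rows (lists of cells)."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip() == "":
            # An empty line closes the current block.
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(line.split("\t"))
    if current:
        blocks.append(current)
    return blocks


text = "Id\tName\tGender\n34\tJohn / Smith\tM\n\nId\tBIRT_DATE\n34\t2/12/1934\n"
blocks = split_blocks(text)
print(len(blocks), blocks[0][1])
```

Each block returned by the sketch corresponds to one spreadsheet of the table format.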
There are currently two different data structures for the table (or text) format:
IUR-Format (Individuals-Unions-Relations) - (File extensions ".iur.txt", ".iur.ods", ".iur.xls")
The first sheet / block contains information on each individual, organized in separate columns:
- A unique identity number (ID)
- Name(s), where different name parts are separated by a slash (/)
- Gender: M or H (man), W or F (woman), X (gender unknown). Gender letters are not case sensitive
- Supplementary information concerning the attributes of the individual. These items are the values of individual properties, columns being labelled according to individual property codes
Example:
Id | Name | Gender | BIRT_DATE | OCCU |
---|---|---|---|---|
34 | John / Smith | M | 2/12/1934 | Taxi driver |
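A row of the individuals sheet could be interpreted as in the following Python sketch. It assumes a header row with the labels `Id`, `Name` and `Gender` as in the example above; the helper `parse_individual` and the output labels (`male`, `female`, `unknown`) are hypothetical choices made for this illustration.

```python
# Gender letters as listed above; lookup is done on the upper-cased letter.
GENDER_CODES = {"M": "male", "H": "male", "W": "female", "F": "female", "X": "unknown"}


def parse_individual(header, row):
    """Turn one row of the individuals sheet into a dictionary."""
    record = dict(zip(header, row))
    return {
        "id": int(record.pop("Id")),
        # Name parts are separated by a slash.
        "name_parts": [part.strip() for part in record.pop("Name").split("/")],
        "gender": GENDER_CODES[record.pop("Gender").strip().upper()],
        # Remaining columns are property values, keyed by property code.
        "attributes": record,
    }


header = ["Id", "Name", "Gender", "BIRT_DATE", "OCCU"]
person = parse_individual(header, ["34", "John / Smith", "m", "2/12/1934", "Taxi driver"])
print(person)
```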
The second sheet / block contains information on each union, organized in separate columns:
- A unique identity number (ID)
- Status: M (Married), U (Unmarried), D (Divorced)
- Husband's ID number
- Wife’s ID number
- Children’s ID numbers, separated by semicolons
- Supplementary information concerning the attributes of the union. These items are the values of union properties, columns being labelled according to the corresponding property codes
Example:
Id | Status | HusbandID | WifeID | ChildrenID |
---|---|---|---|---|
28 | M | 4 | 22 | 23;24;25;26 |
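The union row above could be decoded as in this Python sketch. The column order follows the list above; the helper `parse_union`, the output labels and the tolerance for lower-case status letters are assumptions made for this illustration.

```python
UNION_STATUS = {"M": "married", "U": "unmarried", "D": "divorced"}


def parse_union(row):
    """Turn one row of the unions sheet into a dictionary."""
    union_id, status, husband, wife, children = row[:5]
    return {
        "id": int(union_id),
        # Assumption: status letters are accepted in either case.
        "status": UNION_STATUS[status.strip().upper()],
        "husband": int(husband) if husband.strip() else None,
        "wife": int(wife) if wife.strip() else None,
        # Children's ID numbers are separated by semicolons.
        "children": [int(c) for c in children.split(";") if c.strip()],
        # Any further columns hold union property values.
        "attributes": row[5:],
    }


union = parse_union(["28", "M", "4", "22", "23;24;25;26"])
print(union)
```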
The following sheets / blocks contain information on relations, organized in separate columns:
- A unique identity number (ID)
- Name
- Information on the individuals who take part in the relation: separate columns correspond to separate roles; in each column the ID numbers of the individuals are separated by semicolons
- Supplementary information concerning the attributes of the relational node. These items are the values of relation properties, columns being labelled according to relation property codes
Examples:
Relation “Baptism”
Id | Name | Candidate | Godparent | #DATE |
---|---|---|---|---|
271 | John’s Baptism | 2 | 34;45 | 4/11/2012 |
Relation “Director’s Board”
Id | Name | Chairman | Members | #SEAT |
---|---|---|---|---|
22 | Lehman Brothers | 8 | 22;45;553;672 | New York |
Relation “Age set”
Id | Name | Members | Initiators | #YEAR |
---|---|---|---|---|
34 | Generation X | 2;4;22;44;45;67 | 34;450 | 1968 |
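The relation rows above could be read as in the following Python sketch. It assumes, based only on the examples, that column labels starting with `#` denote relation properties and that all other columns after `Name` denote roles; the helper `parse_relation` is hypothetical.

```python
def parse_relation(header, row):
    """Split one relation row into roles (lists of IDs) and attributes."""
    relation = {"id": int(row[0]), "name": row[1], "roles": {}, "attributes": {}}
    for label, cell in zip(header[2:], row[2:]):
        if label.startswith("#"):
            # Assumption: a leading '#' marks a relation property column.
            relation["attributes"][label[1:]] = cell
        else:
            # Role columns hold ID numbers separated by semicolons.
            relation["roles"][label] = [int(i) for i in cell.split(";") if i.strip()]
    return relation


header = ["Id", "Name", "Candidate", "Godparent", "#DATE"]
baptism = parse_relation(header, ["271", "John's Baptism", "2", "34;45", "4/11/2012"])
print(baptism)
```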
BAR-Format (Basic Information-Attributes-Relations) - (File extensions ".bar.txt", ".bar.ods", ".bar.xls")
The first sheet / block contains the basic information for each individual (including its genealogical links) in separate columns:
- A unique identity number (ID)
- Name(s), where different name parts are separated by a slash (/)
- Gender: M or H (man), W or F (woman), X (gender unknown). Gender letters are not case sensitive
- Father's ID number
- Mother's ID number
- Spouse(s) ID number, where the ID numbers of different spouses appear in different columns
A header row may be convenient for data entry, but is not necessary for Puck.
Attention! If you use a header row, do not call the column of identity numbers "ID" (use "Id" or "Nr" or something similar instead), otherwise the file will not be opened by Microsoft Excel. This has nothing to do with Puck: Excel mistakes a text file that begins with "ID" for a SYLK file.
In the case of multiple spouses, there are two possibilities (which may be combined):
- either the ID numbers of an individual's spouses appear in one single line but different columns (from the sixth column on). This is the output produced by Puck and the most convenient solution if an individual's spouses are immediately known
- or the ID numbers of an individual's spouses appear in the same column (the sixth) but in different lines, which means that the individual's ID number has to be entered several times. This solution may be more comfortable if information on an individual's spouses is dispersed.
Example:
Id | Name | Gender | FatherID | MotherID | SpouseID | - | - |
---|---|---|---|---|---|---|---|
34 | John / Smith | M | 4 | 22 | 48 | 73 | |
34 | | | | | 53 | | |
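Because the two ways of recording multiple spouses described above may be combined, a reader has to merge all lines that share the same ID number. The Python sketch below shows one way to do this; the helper `merge_bar_rows` and its output layout are assumptions made for this illustration, not Puck's actual implementation.

```python
def merge_bar_rows(rows):
    """Merge BAR rows that share an ID, collecting all spouse IDs."""
    individuals = {}
    for row in rows:
        cells = [c.strip() for c in row]
        cells += [""] * (6 - len(cells))  # pad short rows to six columns
        person = individuals.setdefault(
            int(cells[0]),
            {"name": "", "gender": "", "father": None, "mother": None, "spouses": []},
        )
        # Basic information is taken from whichever line provides it.
        if cells[1]:
            person["name"] = cells[1]
        if cells[2]:
            person["gender"] = cells[2].upper()
        if cells[3]:
            person["father"] = int(cells[3])
        if cells[4]:
            person["mother"] = int(cells[4])
        # Spouse IDs start in the sixth column and may continue to the right.
        for cell in cells[5:]:
            if cell and int(cell) not in person["spouses"]:
                person["spouses"].append(int(cell))
    return individuals


rows = [
    ["34", "John / Smith", "M", "4", "22", "48", "73"],
    ["34", "", "", "", "", "53"],
]
merged = merge_bar_rows(rows)
print(merged[34])
```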
In the second sheet / block, each line contains an individual's ID number and, in successive columns, items of supplementary information concerning the attributes of the individual. These items are the values of individual properties, columns being labelled according to individual property codes.
Example:
Id | BIRT_DATE | BIRT_ORD | OCCU |
---|---|---|---|
34 | 2/12/1934 | 3 | Taxi driver |
Pajek network format (file extension .paj)
Kinship data in Pajek format can be transformed, manipulated and analyzed with the computer program Pajek.
Pajek can be downloaded free of charge.
For an introduction to Pajek see the Pajek manual (in PDF format).
For a series of macros that can be used to analyze kinship data with Pajek see Tip4Pajek.
Puck exports data into a pajek project file (that is, a package of network, partition and vector files), which contains the basic information for each individual in a network, and the supplementary information in partitions and vectors.
There are three different versions of Pajek kinship files:
Ore-Graph-Format
The basic information is stored as a network which consists of two parts:
- A vertex list, where each individual is represented by a line containing
  - A running index (which must be continuous)
    Warning! This index is generally not identical with the individual's ID number. In particular, it is never identical with it if the network represents a subcorpus of the original corpus. In order to save original ID numbers when exporting to a paj file, use the "numbered" option in the Export window.
  - The individual's name between apostrophes, where different name parts are separated by a slash (/)
    Warning! Make sure that there are no apostrophes within any individual's name! Pajek interprets them as the name's end!
  - A geometrical figure name indicating the individual's gender: triangle for male, ellipse or circle for female, square for unknown gender
- An arc list representing parent-child links: each arc of the network is represented by three numbers: the indices of the two vertices connected by the arc, and the value of the arc (which is always 1 in this format)
- An edge list representing marriages: each edge of the network is represented by three numbers: the indices of the two vertices connected by the edge, and the value of the edge (which is always 1 in this format)
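The three parts of an Ore-graph can be sketched in memory as follows. This Python example only builds the index, vertex, arc and edge tuples; it does not write Pajek file syntax. The helper `build_ore_graph`, the parent-to-child arc direction and the choice of `ellipse` for female are assumptions made for this illustration.

```python
SHAPES = {"male": "triangle", "female": "ellipse", "unknown": "square"}


def build_ore_graph(individuals, parent_child, marriages):
    """individuals: {id: (name, gender)}; the other arguments are lists of ID pairs."""
    # The running index is continuous and generally differs from the ID number.
    index = {ident: i for i, ident in enumerate(sorted(individuals), start=1)}
    vertices = []
    for ident, i in index.items():
        name, gender = individuals[ident]
        if "'" in name:
            raise ValueError(f"apostrophe in name of individual {ident}")
        vertices.append((i, name, SHAPES[gender]))
    # Assumption: arcs point from parent to child; the value is always 1.
    arcs = [(index[p], index[c], 1) for p, c in parent_child]
    edges = [(index[a], index[b], 1) for a, b in marriages]
    return index, vertices, arcs, edges


people = {10: ("A / Smith", "male"), 20: ("B / Jones", "female"), 35: ("C / Smith", "unknown")}
index, vertices, arcs, edges = build_ore_graph(people, [(10, 35), (20, 35)], [(10, 20)])
print(index, arcs, edges)
```

Keeping the `index` mapping makes it possible to translate vertex indices back to the original ID numbers.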
P-Graph-Format
In P-graph format, vertices represent unions and arcs represent individuals. The two parts of the network thus have a different meaning:
- A vertex list, where each union is represented by a line containing
  - A running index (which must be continuous)
  - The name of the union between apostrophes, which is identical to the name of the husband followed by the name of the spouse
- An arc list representing individuals linking their family of origin and their family of destination: each arc of the network is represented by three numbers: the indices of the two vertices (families) connected by the arc, and the value of the arc: 1 for men, -1 for women
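The arc list of a P-graph can be derived from a list of families, as in this Python sketch. It uses family ID numbers instead of continuous indices, lists each arc as (family of origin, family of destination, value) without claiming a particular arc direction in the file, and skips individuals who do not appear as a spouse in any family. The helper `p_graph_arcs` is hypothetical.

```python
def p_graph_arcs(families):
    """families: {family_id: {"father": id, "mother": id, "children": [ids]}}."""
    # Family of origin of each individual (the family where he or she is a child).
    origin = {}
    for fam_id, fam in families.items():
        for child in fam["children"]:
            origin[child] = fam_id
    arcs = []
    for fam_id, fam in families.items():
        # Value 1 for men (here: the father), -1 for women (here: the mother).
        for spouse, value in ((fam["father"], 1), (fam["mother"], -1)):
            if spouse in origin:
                arcs.append((origin[spouse], fam_id, value))
    return arcs


families = {
    1: {"father": 1, "mother": 2, "children": [3, 4]},
    2: {"father": 3, "mother": 5, "children": []},
    3: {"father": 6, "mother": 4, "children": []},
}
print(p_graph_arcs(families))
```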
Tip-Graph-Format
The TIP-Format has exactly the same vertex part as the ORE-Graph-Format, but a different arc part:
There are five arc lists, where the value of the arc (at the same time the relation number of the arc list) corresponds to the type of kinship link:
- 1 for an arc connecting wife and husband
- 2 for an arc connecting mother and daughter
- 3 for an arc connecting mother and son
- 4 for an arc connecting father and daughter
- 5 for an arc connecting father and son
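The five arc values can be expressed as a small lookup table. In the Python sketch below, the helper `tip_arcs`, the gender labels and the direction of the tuples (wife to husband, parent to child) are assumptions made for this illustration; individuals of unknown gender are not handled.

```python
MARRIAGE_VALUE = 1
# (parent gender, child gender) -> arc value, as listed above.
PARENT_CHILD_VALUES = {
    ("female", "female"): 2,  # mother and daughter
    ("female", "male"): 3,    # mother and son
    ("male", "female"): 4,    # father and daughter
    ("male", "male"): 5,      # father and son
}


def tip_arcs(genders, marriages, parent_child):
    """Return (first, second, value) triples for marriages and parent-child links."""
    arcs = [(wife, husband, MARRIAGE_VALUE) for wife, husband in marriages]
    for parent, child in parent_child:
        value = PARENT_CHILD_VALUES[(genders[parent], genders[child])]
        arcs.append((parent, child, value))
    return arcs


genders = {1: "male", 2: "female", 3: "female", 4: "male"}
arcs = tip_arcs(genders, [(2, 1)], [(1, 3), (2, 3), (1, 4), (2, 4)])
print(arcs)
```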
Any supplementary information is stored in partitions (.clu) or vectors (.vec): These are simple lists of numbers, where each number corresponds to a different cluster of the partition or a different value of the vector. Each partition or vector should be named according to individual property codes.
Vectors are used for properties with numeric values, partitions for all others. As a consequence, a cluster value is just a label to identify clusters without any intrinsic meaning.
Warning: Original cluster labels (property values) are lost in pajek format!
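Since the original labels are lost, it can be useful to keep a legend next to the exported partition. The following Python sketch turns a list of property values into cluster numbers and returns the legend alongside; the helper `encode_partition` and the numbering from 1 in order of first appearance are assumptions made for this illustration, not Puck's actual numbering.

```python
def encode_partition(values):
    """Map property values to cluster numbers and return (clusters, legend)."""
    legend = {}
    clusters = []
    for value in values:
        if value not in legend:
            # Assumption: clusters are numbered from 1 in order of first appearance.
            legend[value] = len(legend) + 1
        clusters.append(legend[value])
    return clusters, legend


clusters, legend = encode_partition(["Taxi driver", "Farmer", "Taxi driver"])
print(clusters, legend)
```

Storing `legend` in a separate file is one way to recover the property values after an analysis in Pajek.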
Gedcom format (file extension .ged)
This is the format used by most genealogy programs (commercial and noncommercial).
Introductions to the GEDCOM formats are available online. The individual property codes used by Puck correspond, as far as possible, to standard GEDCOM codes.
Please note that the GEDCOM format should not be used for saving a dataset coded in Puck with complex information (relations and some attributes). Please use the PUC or IUR formats for this purpose.
- You can use Puck to import GEDCOM files produced by several genealogy programs (Heredis, Geneatique,...). Genealogical information (individuals and families) is normally imported, but some GEDCOM properties and objects may be missing.
- The Puck GEDCOM export functionality ensures the export of genealogical data, but several attributes and all the relations coded with Puck are currently not exported.
XML format for Kinship Editor (file extension .xml)
XML (eXtensible Markup Language) is a widely used language for the representation of arbitrary data structures, for example in web services. It is used as the standard output format of the Kinship Editor.
For kinship datasets, the following markup terminology has been defined:
- <kindata>: all data other than metadata. <kindata> is made up of data on <people> and <unions>.
  - <people>: all data related to individuals: <people> is made up of data on each individual <person>
    - <person>: all data related to a single individual: <person> is made up of the individual's <id>, <name> and <sex>, as well as statistical data (<stats>) and data on <location> on the screen
      - <id>: a person's identity number
      - <name>: a person's name
      - <sex>: a person's gender
      - <stats>: statistical data on a person. By default, <stats> includes data on the dates when the person was <born> and when she <died>
        - <born>: a person's birth date
        - <died>: a person's death date
      - <location>: the location of the vertex representing the person on the screen. A <location> consists of an <x> and a <y> coordinate
        - <x>: a person's location on the x axis
        - <y>: a person's location on the y axis
  - <unions>: all data related to unions (the equivalent of "families" in gedcom format): <unions> is made up of data on each individual <union>
    - <union>: all data related to a single union. <union> is made up of the union's <id>, statistical data (<stats>), data on <location> on the screen, and data on the <partners> of the union and the <siblings> issued from the union
      - <id>: the union's identity number
      - <stats>: statistical data on a union. By default, <stats> includes data on the <begin> and the <end> of the union
        - <begin>: the union's begin date (for example, the date of marriage)
        - <end>: the union's end date (for example, the date of divorce)
      - <location>: the location of the equality sign representing the union on the screen. A <location> consists of an <x> and a <y> coordinate
        - <x>: a union's location on the x axis
        - <y>: a union's location on the y axis
      - <partners>: the persons united in the union (for example, spouses). <partners> is made up of the identity numbers of each individual <partner>
        - <partner>: the partner's identity number
      - <siblings>: the siblings issued from the union (for example, children of the same parents). <siblings> is made up of the identity numbers of each individual <sibling>
        - <sibling>: the sibling's identity number
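The markup terminology above can be illustrated with a small Python sketch. The `SAMPLE` document is hypothetical: it is built only from the element names listed above, omits metadata, and uses invented values, so a real Kinship Editor file may differ in details such as date format or the enclosing root element. The helper `load_kinship_editor` is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical document built from the element names described above.
SAMPLE = """
<kindata>
  <people>
    <person>
      <id>1</id><name>Mary</name><sex>F</sex>
      <stats><born>1950</born><died></died></stats>
      <location><x>120</x><y>80</y></location>
    </person>
    <person>
      <id>2</id><name>John</name><sex>M</sex>
      <stats><born>1948</born><died></died></stats>
      <location><x>200</x><y>80</y></location>
    </person>
  </people>
  <unions>
    <union>
      <id>1</id>
      <stats><begin>1972</begin><end></end></stats>
      <location><x>160</x><y>80</y></location>
      <partners><partner>1</partner><partner>2</partner></partners>
      <siblings><sibling>3</sibling></siblings>
    </union>
  </unions>
</kindata>
"""


def load_kinship_editor(xml_text):
    """Return (people, unions) dictionaries keyed by the <id> text."""
    root = ET.fromstring(xml_text.strip())
    people = {}
    for person in root.iter("person"):
        people[person.findtext("id")] = {
            "name": person.findtext("name"),
            "sex": person.findtext("sex"),
            "born": person.findtext("stats/born"),
            "position": (
                float(person.findtext("location/x")),
                float(person.findtext("location/y")),
            ),
        }
    unions = {}
    for union in root.iter("union"):
        unions[union.findtext("id")] = {
            "partners": [p.text for p in union.iter("partner")],
            "siblings": [s.text for s in union.iter("sibling")],
        }
    return people, unions


people, unions = load_kinship_editor(SAMPLE)
print(people["1"], unions["1"])
```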
Prolog format (file extension .pl)
Prolog is a general-purpose logic programming language. The program logic is expressed in terms of relations.
Introductions to the Prolog language and to its use for representing kinship relations are available online.
Free Prolog implementations are available for download.
In Prolog format, all relations and attributes are represented as pairs of the form r(a, b), where r is the name of the attribute or relation, a is the identity number of ego (preceded by the letter p), and b is either an attribute value (between single quotes) or the identity number of alter (preceded by the letter p).
Example
daughter(p1,p2) means that p1 is p2's daughter
gname(p1,'Mary') means that p1 has the name 'Mary'
The Prolog format for kinship data readable by Puck or the Kinship Editor uses the following terminology:
relations: father, mother, daughter, son, husband, wife
attributes: gname, sex, info1, info2 etc.
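Facts of this form can be read with a short Python sketch. The regular expression below covers only the r(a, b) pattern described above and is not a Prolog parser; the helper `parse_facts` is hypothetical, and any fact whose name is not one of the six relation names is treated as an attribute.

```python
import re

# Relation names listed above; every other fact name is treated as an attribute.
RELATIONS = {"father", "mother", "daughter", "son", "husband", "wife"}

# name(pN, pM) or name(pN, 'value'); a bare word is also accepted as a value.
FACT = re.compile(r"(\w+)\(\s*(p\d+)\s*,\s*('[^']*'|\w+)\s*\)")


def parse_facts(text):
    """Return (relations, attributes) as lists of (name, ego, other) triples."""
    relations, attributes = [], []
    for name, ego, other in FACT.findall(text):
        if name in RELATIONS:
            relations.append((name, ego, other))
        else:
            attributes.append((name, ego, other.strip("'")))
    return relations, attributes


relations, attributes = parse_facts("daughter(p1,p2).\ngname(p1,'Mary').")
print(relations, attributes)
```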