Setup Environment

Load API credentials and install required libraries.

%%capture capt  

Create and Populate Graph

The code below extracts all of the parties from the dataset and incrementally builds a graph. Nodes are created for partyIds, names, addresses, accounts and identifications.
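As a minimal sketch of the node and edge layout described above: the record fields, the sample values and the `add_party` helper are assumptions for illustration, not the notebook's actual extraction code. Only the `type|value` node id convention is taken from the outputs shown in this notebook.

```python
import networkx as nx

# Hypothetical party records; field names and values are illustrative only.
parties = [
    {"partyId": "p1", "name": "Acme Ltd", "address": "1 High St|Sydney|2000|NSW|AU",
     "account": "paypal|111", "identification": "lei|ABC"},
    {"partyId": "p2", "name": "Acme Ltd", "address": "9 Low Rd|Perth|6000|WA|AU",
     "account": "paypal|111", "identification": None},
]

ATTRIBUTE_TYPES = ("name", "address", "account", "identification")


def add_party(G, party):
    """Add one partyId node plus an edge to each non-empty attribute node."""
    party_node = f"partyId|{party['partyId']}"
    G.add_node(party_node, type="partyId")
    for attr_type in ATTRIBUTE_TYPES:
        value = party.get(attr_type)
        if not value:
            continue  # skip missing attributes
        attr_node = f"{attr_type}|{value}"
        # Parties with the same value reuse the same attribute node.
        G.add_node(attr_node, type=attr_type)
        G.add_edge(party_node, attr_node)


G = nx.Graph(name="Party Resolution Graph")
for party in parties:
    add_party(G, party)
```

Because attribute nodes are keyed by their value, two parties with the same name or account end up connected through a shared node, which is what the later contraction steps rely on.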


API Calls :  10  Graph Nodes Created:  456851  Graph Edges Created:  399826
API Calls :  20  Graph Nodes Created:  881494  Graph Edges Created:  799646
API Calls :  30  Graph Nodes Created:  1290597  Graph Edges Created:  1199456
API Calls :  40  Graph Nodes Created:  1689368  Graph Edges Created:  1599276
API Calls :  50  Graph Nodes Created:  2080133  Graph Edges Created:  1999102
API Calls :  60  Graph Nodes Created:  2462866  Graph Edges Created:  2398950
API Calls :  70  Graph Nodes Created:  2839956  Graph Edges Created:  2798792
API Calls :  80  Graph Nodes Created:  3210793  Graph Edges Created:  3198626
API Calls :  90  Graph Nodes Created:  3575489  Graph Edges Created:  3598464
API Calls :  100  Graph Nodes Created:  3935553  Graph Edges Created:  3998288
API Calls :  110  Graph Nodes Created:  4290451  Graph Edges Created:  4398110
API Calls :  120  Graph Nodes Created:  4640102  Graph Edges Created:  4797912
API Calls :  130  Graph Nodes Created:  4985210  Graph Edges Created:  5197768
API Calls :  140  Graph Nodes Created:  5325816  Graph Edges Created:  5597592
API Calls :  150  Graph Nodes Created:  5661076  Graph Edges Created:  5997416
API Calls :  160  Graph Nodes Created:  5993539  Graph Edges Created:  6397238
API Calls :  170  Graph Nodes Created:  6321746  Graph Edges Created:  6797072
API Calls :  180  Graph Nodes Created:  6646588  Graph Edges Created:  7196910
API Calls :  190  Graph Nodes Created:  6967955  Graph Edges Created:  7596730
API Calls :  200  Graph Nodes Created:  7285693  Graph Edges Created:  7996608
Extraction Runtime : 54.07 minutes

Network Summary Stats

The resulting graph is relatively large: roughly 7.5 million nodes and 8.3 million edges. Its summary statistics are printed below.

print(nx.info(G))
## Alternatively
print('Network Nodes : ', G.number_of_nodes())
print('Network Edges : ', G.number_of_edges())

Name: Party Resolution Graph
Type: Graph
Number of nodes: 7489246
Number of edges: 8255566
Average degree:   2.2046
Graph Density:  0.0000002944
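The average degree and density figures follow directly from the node and edge counts, so they can be checked by hand. The sketch below is an illustration rather than the notebook's own code; the helper names `graph_stats` and `summarise` are hypothetical. It may also be useful because, to our knowledge, `nx.info` is no longer available from NetworkX 3.0 onwards.

```python
def graph_stats(num_nodes, num_edges):
    """Average degree and density of a simple undirected graph."""
    if num_nodes < 2:
        return 0.0, 0.0
    avg_degree = 2 * num_edges / num_nodes
    density = 2 * num_edges / (num_nodes * (num_nodes - 1))
    return avg_degree, density


def summarise(G):
    """Print the summary for any graph exposing NetworkX's counting methods."""
    n, m = G.number_of_nodes(), G.number_of_edges()
    avg_degree, density = graph_stats(n, m)
    print("Number of nodes:", n)
    print("Number of edges:", m)
    print(f"Average degree:   {avg_degree:.4f}")
    print(f"Graph Density:  {density:.10f}")


# Check against the counts printed above.
avg_degree, density = graph_stats(7489246, 8255566)
print(f"Average degree:   {avg_degree:.4f}")
print(f"Graph Density:  {density:.10f}")
```

The density is tiny because each party links to only a handful of attribute nodes, while the number of possible edges grows with the square of the node count.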

Select Example Reported Party

Grab an example Reported Party by searching through the nodes by type and degree.
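One way such a search could look is sketched below. The `min_degree` threshold and the function name are assumptions; the only detail taken from the notebook is that partyId node ids start with `partyId|`.

```python
import networkx as nx


def find_party_with_degree(G, min_degree=4):
    """Return the first partyId node with at least min_degree neighbours, else None."""
    for node, degree in G.degree():
        if str(node).startswith("partyId|") and degree >= min_degree:
            return node
    return None


# Tiny illustrative graph: p1 has four attributes, p2 has one.
example = nx.Graph()
example.add_edges_from([
    ("partyId|p1", "name|Acme Ltd"),
    ("partyId|p1", "address|1 High St|Sydney|2000|NSW|AU"),
    ("partyId|p1", "account|paypal|111"),
    ("partyId|p1", "identification|lei|ABC"),
    ("partyId|p2", "name|Acme Ltd"),
])
print("Example Node Id Found : ", find_party_with_degree(example))
```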


Example Node Id Found :  partyId|0000138293b4e10d395cceca432d06f20a6cee8772563dd919e4143b53d72550

Visualise Example Reported Party

Visualise the example Reported Party, with nodes for the partyId, name, address, account and identification.
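A sketch of how the neighbourhood of a single party could be extracted for plotting, assuming the `type|value` node id convention seen in the outputs. The colour mapping and the `party_subgraph` helper are hypothetical.

```python
import networkx as nx

# Hypothetical colour per node type, keyed by the prefix before the first "|".
TYPE_COLOURS = {
    "partyId": "tab:red",
    "name": "tab:blue",
    "address": "tab:green",
    "account": "tab:orange",
    "identification": "tab:purple",
}


def party_subgraph(G, party_node, radius=1):
    """Return the party's neighbourhood and one colour per node in it."""
    sub = nx.ego_graph(G, party_node, radius=radius)
    colours = [TYPE_COLOURS.get(str(n).split("|", 1)[0], "grey") for n in sub.nodes]
    return sub, colours


demo = nx.Graph()
demo.add_edges_from([
    ("partyId|p1", "name|Acme Ltd"),
    ("partyId|p1", "account|paypal|111"),
    ("partyId|p1", "identification|lei|ABC"),
    ("partyId|p2", "name|Acme Ltd"),
])
sub, colours = party_subgraph(demo, "partyId|p1")
```

With matplotlib installed, the result can be drawn with `nx.draw(sub, node_color=colours, with_labels=True)`. Using `radius=2` would also pull in other parties that share an attribute with the selected one.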


Select Two Linked Reported Parties

Grab an example of two Reported Parties that share two characteristics, indicating that they might be the same real-world entity.
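The search for such a pair could be sketched as follows. This is an illustration under assumptions: `find_linked_pair` and `min_shared` are hypothetical names, and the notebook's actual search may differ.

```python
from itertools import combinations

import networkx as nx


def find_linked_pair(G, min_shared=2):
    """Return (party_a, party_b, shared_nodes) for the first pair of partyId
    nodes that have at least min_shared attribute nodes in common, else None."""
    for node in G.nodes:
        if str(node).startswith("partyId|"):
            continue  # start the search from attribute nodes only
        parties = [n for n in G.neighbors(node) if str(n).startswith("partyId|")]
        for a, b in combinations(parties, 2):
            shared = set(G.neighbors(a)) & set(G.neighbors(b))
            if len(shared) >= min_shared:
                return a, b, shared
    return None


linked = nx.Graph()
linked.add_edges_from([
    ("partyId|a", "name|Acme"), ("partyId|b", "name|Acme"),
    ("partyId|a", "account|x"), ("partyId|b", "account|x"),
    ("partyId|c", "name|Acme"),
])
result = find_linked_pair(linked)
```

Enumerating all pairs around an attribute node is quadratic in that node's degree, so on a large graph this is only practical for low-degree attribute nodes.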


Example Node Id Found :  account|paypal||paypal|535404706

Visualise the Two Example Linked Reported Parties

These two Reported Parties appear to be the same real-world party, as they share the same name and account details.


Breakdown of Degree

Let's create a chart that breaks down the number of nodes per degree, i.e. a count of how many nodes are linked to 1 other node, 2 other nodes, 3 other nodes, etc.
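The counts behind such a breakdown can be computed with a `Counter` over the degree view. This is a minimal sketch; the `degree_breakdown` helper is a hypothetical name and plotting is left out.

```python
from collections import Counter

import networkx as nx


def degree_breakdown(G):
    """Map each degree to the number of nodes that have it, sorted by degree."""
    counts = Counter(degree for _, degree in G.degree())
    return dict(sorted(counts.items()))


small = nx.Graph([("a", "x"), ("b", "x"), ("c", "x"), ("a", "y")])
breakdown = degree_breakdown(small)
```

The resulting dictionary can be passed to a bar chart, ideally with a logarithmic y-axis, since low-degree nodes typically dominate this kind of graph.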


Review Some High Degree Nodes

Take a look at some of the nodes with high degrees to determine whether they should be included in the analysis or whether they are noise. As expected, they are noise: generic city-level addresses shared by thousands of parties.
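Listing the highest-degree nodes does not require sorting the whole graph. A sketch using `heapq.nlargest` follows; the `top_degree_nodes` helper is a hypothetical name.

```python
import heapq

import networkx as nx


def top_degree_nodes(G, n=10):
    """Return the n (node, degree) pairs with the highest degree."""
    return heapq.nlargest(n, G.degree(), key=lambda pair: pair[1])


small = nx.Graph([("a", "x"), ("b", "x"), ("c", "x"), ("a", "y")])
for node, degree in top_degree_nodes(small, n=2):
    print("Degree: ", degree, "NodeId:  ", node)
```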


Degree:  4218 NodeId:   address|unknown|Sydney|2000|NSW|AU
Degree:  4067 NodeId:   address||Hong Kong|||HK
Degree:  4016 NodeId:   address||Singapore|||SG
Degree:  3373 NodeId:   address||Taipei|||TW
Degree:  2682 NodeId:   address||Istanbul|||TR
Degree:  1952 NodeId:   address||Moscow|||RU
Degree:  1699 NodeId:   address||Tokyo|||JP
Degree:  1689 NodeId:   address||Jakarta|||ID
Degree:  1458 NodeId:   address||Zürich|||CH
Degree:  1415 NodeId:   address||Cairo|||EG

Remove Nodes

In this step we reduce the graph to only the nodes required for contraction. First, we remove nodes with a high degree: a node with many linkages is poor evidence that the parties it connects should be contracted. Second, we drop nodes with short addresses, as a shared short address generally means little (e.g. two people who both have an address of just "Sydney"). Finally, we remove nodes with a degree of one or zero: if a name, address, account or identification is linked to only one partyId, it cannot be used for contraction.
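The three pruning steps could be sketched as below. The thresholds `max_degree` and `min_address_length`, the way address length is measured (characters after the `address|` prefix) and the single pass over low-degree nodes are all assumptions; the notebook does not state the values it used.

```python
import networkx as nx


def prune_graph(G, max_degree=1000, min_address_length=20):
    """Prune G in place and return it. Threshold defaults are illustrative."""
    # 1. High-degree nodes are weak evidence that two parties are the same.
    high = [n for n, d in G.degree() if d > max_degree]
    G.remove_nodes_from(high)

    # 2. Short addresses (e.g. just a city) say little about identity.
    prefix = "address|"
    short = [
        n for n in G.nodes
        if str(n).startswith(prefix) and len(str(n)) - len(prefix) < min_address_length
    ]
    G.remove_nodes_from(short)

    # 3. Nodes with degree one or zero cannot link two parties together.
    #    Single pass: degrees are measured once, after steps 1 and 2.
    low = [n for n, d in G.degree() if d <= 1]
    G.remove_nodes_from(low)
    return G
```

The node lists are materialised before each removal because a NetworkX graph cannot be modified while one of its views is being iterated.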


Name: Party Resolution Graph
Type: Graph
Number of nodes: 2515419
Number of edges: 2961816
Average degree:   2.3549
Graph Density:     0.0000009362

Review the Degree Breakdown

See how the degree breakdown has changed as a result of eliminating the nodes above.


Review Highest Degree Nodes (again)

After the removal of nodes, the highest-degree nodes look a lot less like noise.


Degree:  646 NodeId:   name|Lafarge Building Materials Inc
Degree:  646 NodeId:   identification|lei|549300IESMY44ZM8D969
Degree:  646 NodeId:   address|C/O The Prentice Hall Corporation System, 150 S Perry St.|Montgomery|36104|US-AL|US
Degree:  609 NodeId:   address|C/O The Corporation Trust Company Corporation Trust Center 1209 Orange Street|Wilmington|19801|US-DE|US
Degree:  590 NodeId:   address|C/O The Corporation Trust Company Corporation Trust Center 1209 Orange St|Wilmington|19801|US-DE|US
Degree:  459 NodeId:   address|C/O Corporation Service Company 251 Little Falls Drive New Castle|Wilmington|19808|US-DE|US
Degree:  427 NodeId:   address|Via V. Alfieri, 1|Conegliano|31015|IT-TV|IT
Degree:  404 NodeId:   address|C/O The Corporation Trust Company, Corporation Trust Center 1209 Orange St|Wilmington|19801|US-DE|US
Degree:  377 NodeId:   address|Theodor-Heuss-Allee 70 C/O Universal-Investment-Gesellschaft Mit Beschränkter Haftung|Frankfurt Am Main|60486|DE-HE|DE
Degree:  371 NodeId:   name|HPET REIT I, LLC

Identify Reported Parties For Consolidation

Loop through each partyId node to determine whether it can be consolidated with its neighbouring partyIds, then create a series of files containing the partyIds to be consolidated. Writing the results out to files is a resource-management measure.
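A sketch of this loop is given below. It assumes that two parties qualify for consolidation when they share at least `min_shared` attribute nodes, which mirrors the "share two characteristics" example earlier but is not confirmed as the notebook's actual rule. The file naming, the JSON format and the batch size are likewise hypothetical.

```python
import json
from pathlib import Path

import networkx as nx


def candidate_pairs(G, min_shared=2):
    """Yield (party_a, party_b) pairs sharing at least min_shared attribute nodes."""
    for party in G.nodes:
        if not str(party).startswith("partyId|"):
            continue
        shared_counts = {}
        for attr in G.neighbors(party):
            for other in G.neighbors(attr):  # parties two hops away
                if other != party and str(other).startswith("partyId|"):
                    shared_counts[other] = shared_counts.get(other, 0) + 1
        for other, count in shared_counts.items():
            if count >= min_shared and party < other:  # emit each pair once
                yield party, other


def _flush(batch, out_dir, index):
    """Write one batch of pairs to a numbered JSON file."""
    path = out_dir / f"pairs_{index:05d}.json"
    path.write_text(json.dumps(batch))
    return path


def write_batches(pairs, out_dir, batch_size=100_000):
    """Stream pairs to disk so the full list is never held in memory."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    batch, paths = [], []
    for pair in pairs:
        batch.append(pair)
        if len(batch) >= batch_size:
            paths.append(_flush(batch, out_dir, len(paths)))
            batch = []
    if batch:
        paths.append(_flush(batch, out_dir, len(paths)))
    return paths
```

A call such as `write_batches(candidate_pairs(G), "pairs_out")` would then leave one file per batch. Since `candidate_pairs` is a generator, only the current batch is in memory at any time.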


Nodes to process : 2515419
Processed all nodes. Count of Nodes to contract :  4539774

Create List of Reported Party Groups

Load the files created in the previous step and create a list of consolidated parties.
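One possible way to turn pairwise matches into groups is a union-find pass, sketched below. It assumes each file holds a JSON list of `[partyId, partyId]` pairs named `pairs_*.json` (a hypothetical layout), and that consolidation is transitive: if A matches B and B matches C, all three land in one group. The notebook does not confirm either assumption.

```python
import json
from pathlib import Path


def load_pairs(in_dir):
    """Yield (party_a, party_b) tuples from every pairs_*.json file in in_dir."""
    for path in sorted(Path(in_dir).glob("pairs_*.json")):
        for pair in json.loads(path.read_text()):
            yield tuple(pair)


def build_groups(pairs):
    """Merge pairs into groups of partyIds using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return [sorted(members) for members in groups.values()]


groups = build_groups([("a", "b"), ("b", "c"), ("d", "e")])
```

Union-find only needs one dictionary entry per partyId, so it avoids rebuilding a graph just to compute the groups.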


Select a Contracted Party Id for Review

Select a consolidated node and review the consolidation.


Following Example PartyIds Selected :  ['1f7e630ac73a0028fdb87a020dedbcf42042aa974aa7097e52989018c06ab28f', '0000159169c938cf6a867691680374ed9f6030d27ac0023ef96f26c5dc0e754e', '0000403cb5e1a52bed96b0248c8b97d1e3cdae0052b2ac90d1c6268feef4015c', 'c732628093843dd10f30fc243921deea0b9abed26bbff7052547961f5bafe2b2', '00006d92f76adefce58a3810cec86320cdfb66576eadb417bfe6f447a124324f', '206e795282759b8ec2e8b166544ac29db1e45e480dc0a8515b4460dcda65a48a', '68fbf66a1ca4f5b410920733e5d50cd9070cc1a571698249698fef7f6cc7a95a', '00007988ce08705b9a18c96d20318a6af9ce7e6fbd643178bb5ac0559b19fd59', '00007aa0cd8f9869365fa27bd5077fc10074d1853481d99385cac554ca13c6aa', '1b08436958e35e98df2ffe58c1e119b3608f841064c1df859961eb58302c355f', '13936989241b85f7167bc09c073c764876305a5705ab713b1c8998bb23453867', '000084d58e84a3aa60d51b0a0d30a6fb12412a6f8810df290b477c8e68f90708', 'eab87d8b478ddbd53571b6537a441f462cc1717a081d0195db8eb4581b0e4734', '0000c98193502a4154269771aefe511f43504d75a45a77794fec2e6eb6bd888a', '0000f8dcdad09cf17f67d2a76f3d89e3cb0da134de88db6e0daa5753abc84fef', '64195fa695eb4ddd1bfe9bccbd421411de5cc62c85e0807f0b2f8ca144dc086b', '59c994961ccbda05ef63d89cfa88da52b6056fd8e72506a1b255ef9518859971', '000177eb7ca5aed1c0866d792b2438b38da7e48ae5ebd3b2bdde0099f273a76b', '2527d9b337ea83d5e5edadeff7e7e2a97ba04c6888f45261cd8c52556ef57f94', '000181bc9d4bedf69b9251ed2370b0878bf320de13ffa7ff5f3dc055f04f530b', 'bc85d9ccb7ada17ecc122cb53e9e49499897b30533c99a55581994b9bf6f1fb6', '0001b8ed601a12a9165da0f47f0617762acdf5d60133b3c766978bd87dcff603', '0001bc74be085ce4d85255754523a341073c3765312bb82650038b5843c4eb70', 'c84d206c628dac7bcc61cbfaae6c7f0d6551c54b087167c6a8910bd65e60ef21', '00021a65b5039131cf6f048a1cb0849848e778e489b6138ebcd2006c478528e3', '617df7644494842300f9a09c0e2abd3128525dd4d448029b7107ccc262637ff6']

Visualise Reported Parties That Have Been Consolidated

Visualise some example Reported Parties that have been consolidated.


Load the Consolidated Parties

Load the consolidated parties so that they can be leveraged in profile analysis.


Conclusion

The above approach works logically; however, it needs to load all the nodes and edges into memory and then sequentially process each node to find potential party records for consolidation. Significant optimisations are therefore required to scale this type of approach to billions of records.