Modeling Identity in Large Collection of Email: A Preliminary ...

Modeling Identity in Archival Collections of Email: A Preliminary Study
Tamer Elsayed and Douglas W. Oard
Institute for Advanced Computer Studies
Department of Computer Science
College of Information Studies
Conference on Email and Anti-Spam (CEAS), July 28th, 2006

Real Problem
  Clinton White House: 32 million emails
  Search request: Tobacco Policy
  National Archives hired 25 persons for 6 months
  [Diagram: stacks of messages labeled 80,000 and 200,000]

Email Search
  Searcher: Participant, Non-participant
  Personal: my own emails; Shneiderman's; Postel's
  Organizational: CS; UMIACS; White House; Enron; TREC Enterprise
  Public: Usenet news; W3C

Modeling Meaning
  Content
  People (Modeling Identity)

Identity
  [Diagram: an Email connects a Sender and Receivers, and may mention other people. Sender, receivers, and mentioned people are each described by an Email Address, a Name, and a Nickname. Relations labeled in the diagram: sent, received, mentions.]

Outline
  Problem
  Identity Resolution Architecture
  Evaluation
  Conclusion

Entity Example
  Name: Robert Bruce
  Nickname: Bob
  Email Address: [email protected]
  Evidence sources (counts): Main Headers (915), Quoted Headers (8), Salutations (7), Free Signatures (9), Static Signature (140)
  Signature Block:
    Robert E. Bruce
    Senior Counsel
    Enron North America Corp.
    T (713) 345-7780
    F (713) 646-3393
    [email protected]

Enron Collection
  Example of a large organizational collection
  CMU version: about half a million emails
  133,581 unique email addresses
  ~52% of emails are duplicates (same address, subject, body)

Typical Enron Email
  Message Header:
    Message-ID: <[email protected]>
    Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
    From: [email protected]
    To: [email protected]
    Subject: RE: Shhhh.... it's a SURPRISE !
    X-From: Sager, Elizabeth
    X-To: '[email protected]@ENRON'
  Message Body:
    Salutation:
      Hi Shari
    Main Body:
      Hope all is well. Count me in for the group present. See ya next week if not earlier
    Signature Block:
      Liza
      Elizabeth Sager
      713-853-6349
  Quoted Text:
    Quoted Header:
      -----Original Message-----
      From: [email protected]@ENRON
      Sent: Monday, July 30, 2001 2:24 PM
      To: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]
      Cc: [email protected]
      Subject: Shhhh.... it's a SURPRISE !
    Quoted Main Body:
      Please call me (713) 207-5233
    Quoted Signature:
      Thanks! Shari

Identity Resolution Architecture
  Pipeline stages shown in the diagram, from input to output:
    Duplicate Detection (produces unique emails)
    Body and Quoted Text Separation (produces main body and quoted headers)
    Salutation Line Detection and Signature Line Detection (on the main body; produce salutation lines and signature lines)
    Extraction from Main Header, Extraction from Quoted Header, and Nickname Extraction
    Associations: Address-Address, Address-Name, Address-Nickname
    Clustering (produces Entities)

Extraction From Main Headers
  Message-ID: <[email protected]>
  Date: Wed, 26 Sep 2001 09:25:19 -0700 (PDT)
  From: [email protected]
  To: [email protected], [email protected], [email protected], o'[email protected], [email protected]
  Subject: New Email Address
  X-From: Jim Mathes
  X-To: Vandini, Mark , Urbon Steve , Tony Sapienza , Tom O'Rourke , Tom Lyons , Tom Hodgson
  X-cc:
  X-bcc:

  We have just launched our "New & Improved Website", www.newbedfordchamber.com and I have a new email address: [email protected]
  Please make the appropriate changes in your email address book.
  Thank you,
  Jim Mathes, President
  New Bedford Area Chamber of Commerce

  Annotations on the slide: Name-Address Association (marked twice); Address-Address Association (marked at the new email address announced in the body).

Extraction From Quoted Headers
  Hi Jeff,
  Did you get our registration packet? If not, stop by and pick one up because you need it. Make sure you get the one for new students.
  Shawn

  On Wednesday, November 03, 1999 11:18 AM, Jeff Dasovich [SMTP:[email protected]] wrote:
  >
  >
  > ok, don't shoot me, but what's the deadline for scheduling for classes?
  >
  > signed,
  > clueless

  ---------------------- Forwarded by Elizabeth Sager/HOU/ECT on 02/09/2000 12:02 PM ---------------------------
  "Patricia Young" on 02/09/2000 08:50:59 AM
  To: Elizabeth Sager/HOU/[email protected]
  cc:
  Subject: If possible, would you forward your resume to me electronically? Thanks.

  If possible, would you forward your resume to me electronically? Thanks.

  Annotations on the slide: Name-Address Association (marked on each quoted header).

Signature & Salutation Detection
  [Slide overlays three example emails from the same sender; their text is interleaved in this transcript.]
  From: [email protected]

  First example:
    Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime. Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long. Have a good afternoon!

  Text recoverable from the other two examples:
    The week is going OK. All the tennis and swimming has left me with sore muscles so this is my night off. Am planning to do some more house chores so I do not end up with another weekend like the last.
    The kiddies are going back to school already so now would be a good time to plan that trip to D.C. at last. Maybe early Sept?
    I'm still planning on coming to Austin next weekend, I'm just not sure when, but I'll let you know. Also I'd be game for a girls' trip to Destin.
    Call if you get lonely!
    Time to work!

  Closings and signature lines across the three examples:
    love, sooz
    Love, -Sooz
    Love, Sooz

  Static signature:
    Procurement, Logistics, and Contracts
    Enron Broadband Services, Inc.
    1400 Smith, Suite EB-4573A
    Houston, TX 77002
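The header and nickname extraction steps above can be illustrated with a minimal Python sketch. It assumes Enron-style From and X-From headers. The greeting and closing patterns, the function names, and the addresses and names in the usage lines are all hypothetical choices made for illustration; they are not the hand-tuned heuristics used in the study.

```python
import re
from email import message_from_string

# Hypothetical patterns: a greeting word followed by one name token,
# and a closing word alone on its line.
SALUTATION = re.compile(r"^(?:hi|hello|hey|dear)\s+([A-Za-z]+)\s*[,!:]?$", re.IGNORECASE)
CLOSING = re.compile(r"^(?:love|thanks|thank you|regards|cheers)\s*[,!.]*$", re.IGNORECASE)


def name_address_association(msg):
    """Pair the sender address (From) with the display name (X-From)."""
    address = (msg["From"] or "").strip().lower()
    name = (msg["X-From"] or "").strip()
    return (address, name) if address and name else None


def salutation_nickname(lines):
    """Return the name greeted on the first non-empty body line, if any."""
    for line in lines:
        text = line.strip()
        if text:
            match = SALUTATION.match(text)
            return match.group(1) if match else None
    return None


def signature_nickname(lines):
    """Return a single-token line that directly follows a closing line."""
    texts = [line.strip() for line in lines if line.strip()]
    for closing, following in zip(texts, texts[1:]):
        candidate = following.lstrip("-").strip()
        if CLOSING.match(closing) and re.fullmatch(r"[A-Za-z]+", candidate):
            return candidate
    return None


# Made-up message for illustration only.
raw = """From: sender@example.com
To: friend@example.com
X-From: Example, Pat

Hi Sam
Hope all is well.
love,
Patty
"""
message = message_from_string(raw)
body = message.get_payload().splitlines()
print(name_address_association(message))
print(salutation_nickname(body))
print(signature_nickname(body))
```

In this sketch a salutation nickname would be associated with the recipient's address and a signature nickname with the sender's address; how the study handled messages with several recipients is not stated on the slides.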

Nickname Extraction
  From: [email protected]

  Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime. Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long. Have a good afternoon!
  love,
  sooz   (nickname)
  Procurement, Logistics, and Contracts
  Enron Broadband Services, Inc.
  1400 Smith, Suite EB-4573A
  Houston, TX 77002

  3,151 address-nickname associations

Identifying Entities
  Name: Robert Bruce
  Nickname: Bob
  Email Address: [email protected]
    Main Headers (915), Quoted Headers (8)
    Salutations (7), Free Signatures (9)
    Static Signature (140)
    Signature Block:
      Robert E. Bruce
      Senior Counsel
      Enron North America Corp.
      T (713) 345-7780
      F (713) 646-3393
      [email protected]
  Email Address: [email protected]
    Main Headers (7)
    Quoted Headers (5)
  Name: Robert
  Collection totals: 82,084 addr-name, 3,151 addr-nickname, and 19,708 addr-addr associations; 66,715 entities

Outline
  Problem
  Identity Resolution Architecture
  Evaluation
  Conclusion
  Future Work

Stratified Sampling
  Each cell gives sampled / total associations in the stratum.

                                   Weakest Evidence   Stronger Evidence
  Address-Name Associations
    Main headers only              50 / 29677         50 / 31248
    Quoted headers only            50 / 8042          50 / 3828
    Both headers                   50 / 9289 (one stratum)
  Address-Nickname Associations
    Salutations only               50 / 272           50 / 465
    Signatures only                50 / 172           50 / 1754
    Both                           50 / 490 (one stratum)
  Address-Address Associations     50 / 6514          50 / 4194

Judgment Process
  Incorrect:
    [email protected]   home email
    [email protected]   alexis james-petty
  Correct but not informative:
    [email protected]   june deadrick
    [email protected]   robbie lewis
  Correct and somewhat informative:
    [email protected]   terrie covarrubias
    [email protected]   randy
  Correct and very informative:
    [email protected]   phyllis
    [email protected]   tom

Evaluation Measures
  Terms on the slide: Correct; Judged Associations; Very Informative; Informative
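The sampling and scoring steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the label strings, and the reading of accuracy as correct associations divided by judged associations (with unjudged items excluded) are assumptions, since the slides list the terms but not the formulas.

```python
import random
from collections import defaultdict


def stratified_sample(associations, stratum_of, per_stratum=50, seed=0):
    """Draw up to per_stratum associations from each stratum without replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for association in associations:
        strata[stratum_of(association)].append(association)
    return {
        key: rng.sample(items, min(per_stratum, len(items)))
        for key, items in strata.items()
    }


def accuracy(judgments):
    """Assumed measure: correct / judged, where None marks an unjudged item."""
    judged = [label for label in judgments if label is not None]
    if not judged:
        return None
    correct = sum(1 for label in judged if label != "incorrect")
    return correct / len(judged)


# Hypothetical strata: 200 associations seen only in main headers,
# 30 seen only in quoted headers.
associations = [("main headers only", i) for i in range(200)]
associations += [("quoted headers only", i) for i in range(30)]
sample = stratified_sample(associations, stratum_of=lambda a: a[0])
print({key: len(items) for key, items in sample.items()})

labels = ["incorrect", "not informative", "very informative", None]
print(accuracy(labels))
```

A fixed number of samples per stratum keeps small strata, such as the nickname sources, from being swamped by the much larger header-based strata.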

Accuracy
  [Figure: bar charts of percent accuracy (0 to 100) for weakest, average, and stronger evidence. Panels: Address-Name Associations (Main Headers, Quoted Headers, Both, Overall); Address-Nickname Associations (Salutation, Signature, Both, Overall); Address-Address Associations (Main Headers).]
  100% accuracy with multiple sources of evidence
  Address-name association was nearly perfect
  80% minimum accuracy in address-nickname
  96.7% entity accuracy

Informativeness
  [Figure: bar charts of percent informative and percent very informative (0 to 100) for weakest, average, and stronger evidence. Panels: Address-Name Associations (Main Headers, Quoted Headers, Both, Overall); Address-Nickname Associations (Salutation, Signature, Both, Overall); Address-Address Associations (Main Headers).]

Outline
  Problem
  Identity Resolution Architecture
  Evaluation
  Conclusion

Conclusion
  Introduced a computational model of identity
  A set of simple techniques, put together, provides a useful baseline
  Assessed its potential utility in the context of one fairly complex email collection
  Automatic detection of nicknames in salutations and signature lines
  Most informative results came from the weakest and least accurate evidence
  Accuracy and informativeness are both important

Limitations
  Email address associated with a single identity
  Strength of evidence not exploited
  Heuristics hand-tuned for the Enron collection
  Focus on personal attributes
  No reconciliation of multiple identities for a single person
  No attempt to classify identities as machines or groups
  Recall?

Thank You! Questions?

Backup

Future Work
  Extend the model to exploit temporal features and behavioral evidence
  Implement machine learning techniques
  Perform ablation studies
  Characterize the coverage of our methods in more detail
  Replicate this work in other contexts
  Integrate these techniques with the ultimate applications for which computational models of identity are needed (e.g., social network analysis)

Helping in Judgments

Identity Framework
  [Diagram labels: Entity; Group, Person, Machine; Identity; Candidates]

Modeling Identity
  Attributes (stable explicit features): email addresses, names, nicknames, contact info
  Associations: link attributes together; based on observations
  Entities: representation of an identity; a set of attributes in an undirected graph, linked by weighted associations

Identifying Entities
  First round: limited transitive closure
  Merging associations based on unique attributes
  Address-address associations
  No use of strength of evidence yet
  66,715 entities, covering 77,420 unique email addresses (58% of all addresses)

Related Work
  Attribute/association extraction
  Name recognition and reference resolution
  Applications: social network analysis; finding experts

Unjudged Associations
  [Figure: bar chart of the number of unjudged associations (0 to 5) for weakest and stronger evidence, by source: Main Headers and Quoted Headers (Address-Name Associations); Salutations and Signatures (Address-Nickname Associations); Main Headers (Address-Address Associations).]
  Only 19 (~3%)
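The entity-identification step from the Identifying Entities backup slide (merging through address-address associations and shared attributes, without using strength of evidence) can be sketched with a disjoint-set structure. This is a hypothetical illustration: the slides do not say how the transitive closure was limited, so the sketch performs an unrestricted closure over address-address links, and the class, the function, and the addresses in the usage lines are invented for the example.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find over hashable items, with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def build_entities(address_links, address_attributes):
    """Group addresses joined by address-address links, then attach attributes.

    address_links: iterable of (address, address) pairs.
    address_attributes: iterable of (address, (kind, value)) pairs.
    Evidence strength is ignored, as on the slide.
    """
    groups = DisjointSet()
    for first, second in address_links:
        groups.union(first, second)
    for address, _ in address_attributes:
        groups.find(address)  # register addresses that have no links

    entities = defaultdict(lambda: {"addresses": set(), "attributes": set()})
    for address in list(groups.parent):
        entities[groups.find(address)]["addresses"].add(address)
    for address, attribute in address_attributes:
        entities[groups.find(address)]["attributes"].add(attribute)
    return list(entities.values())


# Made-up addresses; the name and nickname echo the Entity Example slide.
entities = build_entities(
    address_links=[("a@example.org", "b@example.org")],
    address_attributes=[
        ("a@example.org", ("name", "Robert Bruce")),
        ("b@example.org", ("nickname", "Bob")),
        ("c@example.org", ("name", "Jim Mathes")),
    ],
)
print(len(entities))
```

A weighted variant would keep the association counts on the edges instead of discarding them, which is the direction the Limitations slide points to.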
