Team members


Date: 17.10.2017




  • Elisee Habimana

  • Jicong Wang

  • Sridevi Maharaj

  • Ronald Doku

  • Mingjia Zhang

  • Tobias Kin Hou Lei

  • Ravi Khadiwala

  • Duber Gomez

  • Rui Yang

  • Project leaders:

  • Yizhou Sun

  • Rui Li



          Real Time



An earthquake struck Chile at 03:34 local time on Saturday, February 27, 2010

    • Traditional communication was almost impossible for 2-3 hours; the first video images became available 6-7 hours after the quake
  • Source: Carlos Castillo et al.



A tweet was posted at 2:22 pm on June 28th, 20 minutes after the shooting, while the first news report appeared almost 3 hours later


Twitter reshapes the way people spread and receive information

    • Its real-time nature makes Twitter a good source of breaking news
    • Official and verified accounts on Twitter provide reliable information
    • We propose to build a web application that provides reliable, real-time, crime-related information


 







    • Major Challenges
    • Crime Focused Crawling
    • Tweet Classification
    • Event Extraction
    • Tweet Ranking 
    • Clustering
    • Tools
    • Summary


Most tweet content is useless for our purposes

      • Pointless babble – 40%
      • Conversational – 38%
      • Pass-along value – 9%
      • Self-promotion – 6%
      • Spam – 4%
      • News – 4%
      • Crime related – 0.005%
    • Roughly 10,000 crime-related tweets each day
    • Information like location and time is not always explicit
    • Display only the most important tweets
    • Present results in an organized fashion
  • Source: Kelly, Ryan, ed. (August 12, 2009)
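Read together, the 0.005% share and the roughly 10,000 crime-related tweets per day imply a total daily tweet volume. The short check below is only a consistency calculation on those two figures; the resulting total is derived here and is not a number stated on the slides.

```python
from fractions import Fraction

# Figures quoted on the slide above.
crime_tweets_per_day = 10_000
crime_share = Fraction(5, 1000) / 100  # 0.005 % expressed as a fraction

# Implied total volume if both figures describe the same stream of tweets.
implied_total_per_day = crime_tweets_per_day / crime_share

print(int(implied_total_per_day))
```

Exact fractions are used instead of floats so that the division does not pick up rounding noise.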



    


  • USERID    43893075

  • ID    68542312782905344

  • TEXT    Break shooting scene 1 "No More" with @dindamanda @yuyayuyi http://lockerz.com/s/100883315

  • LOCATION    GeoLocation latitude=-6.196612, longitude=106.829552 

  • PLACE

  • TIME     Thu May 12 00:05:35 CDT 2011

  • URLS      url=http://lockerz.com/s/100883315,

  • MentionedEntities: 37623286    66072730    

  • Hashtags:

  • Also available: number of followers, number of friends, user name, etc.
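The crawled fields listed above could be held in a small record type. This is a minimal sketch, not the project's actual schema: the class name `TweetRecord`, the field names, and the `has_gps` helper are all illustrative assumptions. The sample values are the ones shown in the record above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TweetRecord:
    """One crawled tweet. Field names are hypothetical, chosen to mirror the listing."""
    user_id: int
    tweet_id: int
    text: str
    time: str  # raw timestamp string, kept exactly as crawled
    location: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    place: Optional[str] = None
    urls: List[str] = field(default_factory=list)
    mentioned_entities: List[int] = field(default_factory=list)
    hashtags: List[str] = field(default_factory=list)

    def has_gps(self) -> bool:
        """True when the tweet carries an explicit GPS coordinate."""
        return self.location is not None


example = TweetRecord(
    user_id=43893075,
    tweet_id=68542312782905344,
    text='Break shooting scene 1 "No More" with @dindamanda @yuyayuyi '
         'http://lockerz.com/s/100883315',
    time="Thu May 12 00:05:35 CDT 2011",
    location=(-6.196612, 106.829552),
    urls=["http://lockerz.com/s/100883315"],
    mentioned_entities=[37623286, 66072730],
)
```

Note that the sample text contains the word "shooting" without describing a crime, so a record like this would pass a naive keyword match.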





 


    • Repeat the above procedures until an ideal rule is obtained


However, there are STILL many "fake" crime tweets





  • Single Keyword

  • Combination of Keywords

  • Key Phrases
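The three rule types above can be sketched as simple matchers over a tokenized tweet. This is an illustration under assumptions: the function names, the tokenizer, and the sample keywords are hypothetical and are not the rules the team actually used.

```python
import re
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def match_single(text: str, keyword: str) -> bool:
    """Single keyword: the keyword appears as a token."""
    return keyword.lower() in tokenize(text)


def match_combination(text: str, keywords: Iterable[str]) -> bool:
    """Combination of keywords: every keyword appears, in any order."""
    tokens = set(tokenize(text))
    return all(k.lower() in tokens for k in keywords)


def match_phrase(text: str, phrase: str) -> bool:
    """Key phrase: the words appear consecutively and in order."""
    tokens = tokenize(text)
    words = tokenize(phrase)
    if not words:
        return False
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))
```

A single keyword is the most permissive rule: `match_single("Break shooting scene 1", "shooting")` is true even though the tweet is not about a crime, which is why combinations and phrases are worth having.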











    • Concept clusters
      • Natural disaster: {earthquake, tornado, ...}
      • Weapon: {weapon, weapons, gun, guns, gunshot, ...}
      • Injure: {...}
      • Burglar: {...}
      • ...
      • Non-Fire: {hilarious, weather, red, moon, sun, ..., musician, pizza, cook, music, dance, justin bieber}
    • Can generalize to unseen words, e.g. a model trained on "tornado warning" could also predict "earthquake warning".
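One way to see why concept clusters generalize to unseen words: if every word is replaced by its cluster name before classification, "tornado warning" and "earthquake warning" become the same feature sequence. The sketch below assumes this substitution approach; the dictionary holds only words quoted above, and the names `CONCEPT_CLUSTERS` and `to_concepts` are hypothetical.

```python
import re
from typing import Dict, List, Set

# Only cluster members quoted on the slide; the real clusters are larger.
CONCEPT_CLUSTERS: Dict[str, Set[str]] = {
    "natural_disaster": {"earthquake", "tornado"},
    "weapon": {"weapon", "weapons", "gun", "guns", "gunshot"},
}

# Inverted index: word -> cluster name, for constant-time lookup.
WORD_TO_CONCEPT: Dict[str, str] = {
    word: name for name, words in CONCEPT_CLUSTERS.items() for word in words
}


def to_concepts(text: str) -> List[str]:
    """Replace each known word by its cluster tag; keep other words unchanged."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [f"<{WORD_TO_CONCEPT[t]}>" if t in WORD_TO_CONCEPT else t for t in tokens]
```

With this mapping, `to_concepts("Tornado warning")` and `to_concepts("earthquake warning")` produce identical feature lists, so a classifier trained on one can recognize the other.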


Only Text Classification

  • But tweets are short and noisy:

    • at most 140 characters
    • contain noisy words
    • contain URLs and tags


Special tags:

      • #hpd
      • #breaking news


 






Naive Bayes

      • A simple model that performs well for online classification.
      • Given many meaningful features and enough training data, different classification models perform similarly.
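To make the "online" aspect concrete, here is a minimal multinomial Naive Bayes with add-one (Laplace) smoothing, written from scratch so that one labeled example can be added at a time. It is a sketch under assumptions: the class name, the smoothing constant, and the tiny training set are illustrative and do not reproduce the team's features or data.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


class NaiveBayes:
    """Multinomial Naive Bayes over token counts, updatable one example at a time."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha                      # Laplace smoothing constant
        self.class_counts: Counter = Counter()  # examples seen per class
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: set = set()

    def update(self, tokens: Iterable[str], label: str) -> None:
        """Online update: fold one labeled example into the counts."""
        tokens = list(tokens)
        self.class_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def log_scores(self, tokens: List[str]) -> Dict[str, float]:
        """Unnormalized log posterior for each class seen so far."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + self.alpha * vocab_size
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens: List[str]) -> str:
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
model = NaiveBayes()
model.update(["man", "shot", "robbery"], "crime")
model.update(["police", "arrest", "burglar"], "crime")
model.update(["pizza", "music", "dance"], "other")
model.update(["sunny", "weather", "music"], "other")
```

Because training only increments counters, newly labeled tweets can be added without retraining from scratch.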


Crawled from Twitter over different time periods

    • Manually labeled by our team
    • 2000 samples for training, among them:
      • 65% positive samples
      • 35% negative samples
    • 1000 samples for testing


About 100 concept clusters cover different areas of the feature space

    • Average accuracy on the test set is 83.788%




The text of an individual tweet may contain information not previously found through data crawling

    • This information is often useful to the user
      • Allows the user to visualize where a crime occurred
      • Allows the user to filter the view by category
      • Reduces the number of raw tweets the user must read
    • This information is also useful for improving performance




Five potential sources of locations, listed in descending order of perceived usefulness:

    • GPS-tagged tweets, e.g. latitude=57.8433342, longitude=12.6506338
    • 'Place'-tagged tweets, e.g. the bounding box (57.6190897,12.427637), (57.6190897,12.7635394), (57.8653997,12.7635394), (57.8653997,12.427637)
    • User location
    • Textual location extraction
      • Named Entity Recognition
      • Regular expressions, e.g. for street addresses:

"[0-9]+ ([A-Z][A-Za-z]* )+ (ALLEE|ALLEY|ALLY|ALY|ANEX|ANNEX|ANNX|ANX|ARC|ARCADE|AV|AVE|AVEN|AVENU|AVENUE|AVN|AVNUE|BAYOO|BAYOU|BCH|BEACH|BEND|BND|BLF|BLUF|BLUFF|BLUFFS|BOT|BOTTM|BOTTOM|BTM|BLVD|BOUL|BOULEVARD|BOULV|BR|BRANCH|BRNCH|BRDGE|BRG|BRIDGE|BRK|BROOK|BROOKS|BURG|BURGS|BYP|BYPA|BYPAS|BYPASS|BYPS|CAMP|CMP|CP|CANYN|CANYON|CNYN|CYN|CAPE|CPE|CAUSEWAY|CAUSWAY|CSWY|CEN|CENT|CENTER|CENTR|CENTRE|CNTER|CNTR|CTR|CENTERS|CIR|CIRC|CIRCL|CIRCLE|CRCL|CRCLE|CIRCLES|CLF|CLIFF|CLFS|CLIFFS|CLB|CLUB|COMMON|COR|CORNER|CORNERS|CORS|COURSE|CRSE|COURT|CRT|CT|COURTS|CTS|COVE|CV|COVES|CK|CR|CREEK|CRK|CRECENT|CRES|CRESCENT|CRESENT|CRSCNT|CRSENT|CRSNT|CREST|CROSSING|CRSSING|CRSSNG|XING|CROSSROAD|CURVE|DALE|DL|DAM|DM|DIV|DIVIDE|DV|DVD|DR|DRIV|DRIVE|DRV|DRIVES|EST|ESTATE|ESTATES|ESTS|EXP|EXPR|EXPRESS|EXPRESSWAY|EXPW|EXPY|EXT|EXTENSION|EXTN|EXTNSN|EXTENSIONS|EXTS|FALL|FALLS|FLS|FERRY|FRRY|FRY|FIELD|FLD|FIELDS|FLDS|FLAT|FLT|FLATS|FLTS|FORD|FRD|FORDS|FOREST|FORESTS|FRST|FORG|FORGE|FRG|FORGES|FORK|FRK|FORKS|FRKS|FORT|FRT|FT|FREEWAY|FREEWY|FRWAY|FRWY|FWY|GARDEN|GARDN|GDN|GRDEN|GRDN|GARDENS|GDNS|GRDNS|GATEWAY|GATEWY|GATWAY|GTWAY|GTWY|GLEN|GLN|GLENS|GREEN|GRN|GREENS|GROV|GROVE|GRV|GROVES|HARB|HARBOR|HARBR|HBR|HRBOR|HARBORS|HAVEN|HAVN|HVN|HEIGHT|HEIGHTS|HGTS|HT|HTS|HIGHWAY|HIGHWY|HIWAY|HIWY|HWAY|HWY|HILL|HL|HILLS|HLS|HLLW|HOLLOW|HOLLOWS|HOLW|HOLWS|INLET|INLT|IS|ISLAND|ISLND|ISLANDS|ISLNDS|ISS|ISLE|ISLES|JCT|JCTION|JCTN|JUNCTION|JUNCTN|JUNCTON|JCTNS|JCTS|JUNCTIONS|KEY|KY|KEYS|KYS|KNL|KNOL|KNOLL|KNLS|KNOLLS|LAKE|LK|LAKES|LKS|LAND|LANDING|LNDG|LNDNG|LA|LANE|LANES|LN|LGT|LIGHT|LIGHTS|
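The street-address pattern above is cut off on the slide, so the following is a shortened, runnable variant rather than the original. It assumes a handful of suffixes taken from the list, matches the suffix case-insensitively, and drops the extra literal space before the suffix group; `STREET_SUFFIXES`, `ADDRESS_RE`, and `extract_addresses` are hypothetical names.

```python
import re
from typing import List

# Small illustrative subset of the suffix alternation shown above.
STREET_SUFFIXES = ["AVENUE", "AVE", "BOULEVARD", "BLVD", "DRIVE", "DR",
                   "HIGHWAY", "HWY", "LANE", "LN"]

# House number, one or more capitalized words, then a street suffix.
# (?i:...) makes only the suffix group case-insensitive (Python 3.6+).
ADDRESS_RE = re.compile(
    r"\b[0-9]+ (?:[A-Z][A-Za-z]* )+(?i:" + "|".join(STREET_SUFFIXES) + r")\b"
)


def extract_addresses(text: str) -> List[str]:
    """Return every substring of text that looks like a street address."""
    return [m.group(0) for m in ADDRESS_RE.finditer(text)]


print(extract_addresses("Shots fired near 1200 North Main Avenue tonight"))
```

Requiring capitalized street-name words keeps phrases such as "3 slices of pizza" from matching, at the cost of missing addresses typed entirely in lower case.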



Look up extracted locations in a city-to-GPS lookup table

    • Many American city names are repeated (Atlanta, IL vs. Atlanta, GA)
      • Check for well-formatted locations (city, state)
      • If not, resolve by selecting the matching city with the largest population
    • Give preference to other location sources (like user location and GPS) when there are multiple matches
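The disambiguation rules above can be sketched as a small lookup function. Everything concrete here is an assumption for illustration: the `City` type, the two-row table, and the `hint` parameter are hypothetical, and the coordinates and populations are rough placeholder values, not data from the project.

```python
from typing import List, NamedTuple, Optional, Tuple


class City(NamedTuple):
    name: str
    state: str
    lat: float
    lon: float
    population: int


# Placeholder rows: approximate values, for illustration only.
CITY_TABLE: List[City] = [
    City("Atlanta", "GA", 33.75, -84.39, 500_000),
    City("Atlanta", "IL", 40.26, -89.23, 1_700),
]


def resolve_city(query: str, hint: Optional[Tuple[float, float]] = None) -> Optional[City]:
    """Resolve 'city' or 'city,state' to one table row, or None if unknown.

    hint is an optional (lat, lon) from another source such as GPS or the
    user's profile location; it is preferred over population when present.
    """
    name, _, state = (part.strip().lower() for part in query.partition(","))
    matches = [c for c in CITY_TABLE if c.name.lower() == name]
    if state:  # well-formatted "city,state": the state decides
        matches = [c for c in matches if c.state.lower() == state]
    if not matches:
        return None
    if len(matches) == 1:
        return matches[0]
    if hint is not None:  # prefer the match closest to the other location source
        return min(matches, key=lambda c: (c.lat - hint[0]) ** 2 + (c.lon - hint[1]) ** 2)
    return max(matches, key=lambda c: c.population)  # fall back to largest city
```

The squared difference in degrees is a crude closeness measure, but it is enough to separate candidates that lie in different states.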






We only want to display the best "n" tweets

      • The nature of Twitter may result in an extremely variable amount of data
      • Serves as another way to filter out non-crime tweets
      • May be able to highlight important events
    • Summarize the most important data points
      • Avoid overwhelming the user with results
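Once each tweet has an importance score, keeping only the best n is a small selection step. The sketch below assumes tweets arrive as `(score, text)` pairs; the function name `top_n` and the sample scores are hypothetical.

```python
import heapq
from typing import List, Tuple


def top_n(scored_tweets: List[Tuple[float, str]], n: int) -> List[Tuple[float, str]]:
    """Return the n highest-scoring tweets, best first."""
    return heapq.nlargest(n, scored_tweets, key=lambda pair: pair[0])


# Hypothetical scores, for illustration only.
sample = [(2.5, "tweet a"), (4.1, "tweet b"), (1.0, "tweet c")]
print(top_n(sample, 2))
```

The output size stays bounded by n no matter how many tweets arrive, which is what keeps the display manageable when volume spikes.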


Goal: Learn a function f: X -> r

  •         where X is a vector of features

  •         and r is an importance score

  • Strategy:

  •     Take a pointwise approach: use a sample of manually scored data to find the curve that fits our labeled data

  •     We use linear regression with the simple least-squares method to find weights such that

  •         r = w1x1 + w2x2 + w3x3 + ... + wnxn
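The least-squares fit of r = w1x1 + ... + wnxn can be sketched in plain Python by solving the normal equations (XᵀX)w = Xᵀr. This is a minimal illustration, assuming no intercept term (the formula above has none); the function names and the toy data are hypothetical, and a numerical library would normally be used instead.

```python
from typing import List


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features may be linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(xs: List[List[float]], rs: List[float]) -> List[float]:
    """Return the weights w that minimize the sum of (r - w·x)^2 over the samples."""
    k = len(xs[0])
    xtx = [[sum(row[i] * row[j] for row in xs) for j in range(k)] for i in range(k)]
    xtr = [sum(row[i] * r for row, r in zip(xs, rs)) for i in range(k)]
    return solve_linear_system(xtx, xtr)


def score(weights: List[float], x: List[float]) -> float:
    """Importance score r = w1*x1 + ... + wn*xn."""
    return sum(w * v for w, v in zip(weights, x))


# Toy data generated from w = [2, -1], so the fit should recover those weights.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ratings = [2.0, -1.0, 1.0, 3.0]
weights = fit_least_squares(features, ratings)
```

In the setting described above, each row of `features` would be the normalized feature vector of one manually scored tweet and `ratings` would hold the corresponding manual scores.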



Selected from a large pool of potential features

    • Social
    • Contextual
      • Tweet length, category, mentioned locations
    • User Credibility
      • Age of user account, friends, followers, status count, verification
    • Classifier Confidence



    • Labeled ~500 tweets with a ranking (integer from 1 to 5)
    • Linear regression on all features (normalized)
      • Examined correlation coefficients
      • Examined weights
      • Pruned features
    • Repeated until we had an adequate feature set with logical weights


Weights              Features
-0.996904004778      category
 2.87974471144       account age
 1.71671010105       favorites
 1.17242993534       status count
 2.67005302808       followers
-3.97882564778       confidence
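Applying the learned weights is a weighted sum over a tweet's normalized features. In the sketch below the weights are the ones listed above, while the `importance` function, the dictionary layout, and the sample feature values are assumptions; how each feature (for example, category) is encoded and normalized is not stated in the source.

```python
from typing import Dict

# Weights as reported in the table above.
WEIGHTS: Dict[str, float] = {
    "category": -0.996904004778,
    "account age": 2.87974471144,
    "favorites": 1.71671010105,
    "status count": 1.17242993534,
    "followers": 2.67005302808,
    "confidence": -3.97882564778,
}


def importance(features: Dict[str, float]) -> float:
    """Weighted sum of normalized features; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


# Hypothetical normalized feature values, for illustration only.
established_account = {"account age": 1.0, "followers": 1.0}
new_account = {"account age": 0.0, "followers": 0.0}
```

With these weights, a tweet from an older account with more followers scores higher than one from a new account when all other features are equal.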





Clustering tweets means grouping overlapping tweets found in the same location into one category.


Clustered tweets inform the user about where most events are happening at a particular time.

    • The size of a cluster also conveys how relevant or important its tweets are.
    • E.g., a user may want to find out how far a wildfire has spread. Clusters of wildfire tweets on the map show the user where the fire is or has spread to.
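The source does not say which clustering algorithm is used. As one simple possibility, tweets can be bucketed into fixed-size latitude/longitude grid cells, with the largest buckets shown first. The function name, the cell size, and the sample points below are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_by_location(points: List[Tuple[float, float, str]],
                        cell_deg: float = 0.01) -> List[List[str]]:
    """Group (lat, lon, text) tweets by grid cell; return clusters, largest first."""
    clusters: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for lat, lon, text in points:
        # Tweets whose coordinates round to the same cell share a cluster.
        key = (round(lat / cell_deg), round(lon / cell_deg))
        clusters[key].append(text)
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical coordinates, for illustration only.
tweets = [
    (29.7601, -95.3701, "fire spreading near downtown"),
    (29.7603, -95.3699, "smoke visible from downtown"),
    (29.8000, -95.4000, "car break-in reported"),
]
clusters = cluster_by_location(tweets)
```

Sorting by cluster size reflects the point above that larger clusters signal more relevant events. A grid is the simplest choice; a density-based method would handle events that straddle cell borders better.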


 




  • Conceptual Level

    • Detects and monitors crime via a popular 21st-century social medium

  • Technical Level

    • Developed a crawler to obtain data
    • Identified and explored useful social network features to rank and classify crime-related tweets

  • System Level




