Team members: Elisee Habimana, Jicong Wang, Mingjia Zhang, Tobias Kin Hou Lei, Ravi Khadiwala, Duber Gomez, Rui Yang
Project leaders: Yizhou Sun, Rui Li
Real Time
An earthquake struck Chile at 03:34 local time on Saturday, Feb 27, 2010
- Traditional communication was almost impossible for 2-3 hours; the first video images were available 6-7 hours after the quake
Source: , by Carlos Castillo et al.
Tweet posted at 2:22pm, June 28th, 20 minutes after the shot, while the first news report appeared almost 3 hours later
Twitter reshapes the way people spread and receive information
- The real-time feature makes Twitter a good source of breaking news
- The official and verified accounts on Twitter provide reliable information
- We propose to build a web application that provides reliable, real-time, crime-related information
Major Challenges
- Crime Focused Crawling
- Tweet Classification
- Event Extraction
- Tweet Ranking
- Clustering
- Tools
- Summary
Most tweet content is useless for us
- Pointless babble – 40%
- Conversational – 38%
- Pass-along value – 9%
- Self-promotion – 6%
- Spam – 4%
- News – 4%
- Crime related – 0.005%
- Roughly 10,000 crime related tweets each day
- Information like location and time not always explicit
- Display only the most important tweets
- Present results in an organized fashion
Source: Kelly, Ryan, ed (August 12, 2009)
USERID: 43893075
ID: 68542312782905344
TEXT: Break shooting scene 1 "No More" with @dindamanda @yuyayuyi http://lockerz.com/s/100883315
LOCATION: GeoLocation latitude=-6.196612, longitude=106.829552
PLACE:
TIME: Thu May 12 00:05:35 CDT 2011
URLS: url=http://lockerz.com/s/100883315
MentionedEntities: 37623286 66072730
Hashtags:
Also: number of followers, number of friends, name of user, etc.
- Repeat the above procedures until an ideal rule is obtained
However, there are STILL many "fake" crime tweets
- Single Keyword
- Combination of Keywords
- Key Phrases
- Concept clusters
- Natural disaster: {earthquake,tornado, ...}
- Weapon: {weapon,weapons,gun,guns,gunshot, ...}
- Injure: {...}
- Burglar: {...}
- ...
- Non-Fire: {hilarious, weather, red, moon, sun, ..., musician, pizza, cook, music, dance, justin bieber}
- Can generalize to unseen words, e.g. a model trained on "tornado warning" can recognize "earthquake warning", because both words map to the same cluster.
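One way to picture the concept-cluster idea is sketched below: each token that belongs to a cluster is replaced by the cluster name before classification, so two tweets that use different words from the same cluster produce identical features. The cluster contents are limited to the members listed on the slide, and the names `CONCEPT_CLUSTERS` and `cluster_features` are hypothetical, chosen only for this illustration.

```python
import re
from collections import Counter

# Hypothetical clusters, restricted to the members shown on the slide.
CONCEPT_CLUSTERS = {
    "natural_disaster": {"earthquake", "tornado"},
    "weapon": {"weapon", "weapons", "gun", "guns", "gunshot"},
}

# Reverse index: word -> name of the cluster that contains it.
WORD_TO_CLUSTER = {
    word: cluster
    for cluster, words in CONCEPT_CLUSTERS.items()
    for word in words
}


def cluster_features(text):
    """Count tokens, replacing each clustered word by its cluster name."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(WORD_TO_CLUSTER.get(token, token) for token in tokens)
```

With this mapping, "tornado warning" and "earthquake warning" yield the same feature counts, which is why a model trained on one can score the other.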
Only Text Classification
But tweets are short and noisy:
- at most 140 characters
- contain noisy words
- contain URLs and tags
Special tags:
Naive Bayes
- A simple model that performs well for online classification.
- With many meaningful features and enough training data, different classification models perform similarly.
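As a rough illustration of why Naive Bayes suits online classification, the sketch below keeps only running counts, so each newly labeled tweet updates the model in place. It is a minimal multinomial variant with Laplace smoothing, not the project's actual classifier; the class name `NaiveBayes`, the `alpha` parameter and the tokenized inputs are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial Naive Bayes that can be updated one sample at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant (assumed)
        self.class_counts = Counter()             # label -> number of samples seen
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.vocab = set()

    def update(self, tokens, label):
        """Incorporate one labeled sample; no retraining pass is needed."""
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        """Return the label with the highest log posterior."""
        total_samples = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_samples in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_samples / total_samples)
            for token in tokens:
                score += math.log((counts[token] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In use, `update` would be called once per manually labeled tweet (after mapping words to concept clusters) and `predict` once per incoming tweet.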
Crawled from Twitter over different periods of time
- Manually labeled by our team
- 2000 samples for training, among them:
- 1000 samples for testing
- 65% positive samples
- 35% negative samples
About 100 concept clusters cover different areas of the feature space
- Average accuracy on test set is 83.788%
Within the text of an individual tweet there may be information not previously found through data crawling
- This information is often useful to the user
- Allows user to visualize where crime occurred
- Allows user to filter the view by category
- Decreases the number of raw tweets the user must read
- This information is also useful to improve performance
Five potential sources of locations, listed in descending order of perceived usefulness:
- GPS-tagged tweets, e.g. latitude=57.8433342, longitude=12.6506338
- 'Place'-tagged tweets, e.g. a bounding box with corners (57.6190897,12.427637), (57.6190897,12.7635394), (57.8653997,12.7635394), (57.8653997,12.427637)
- User location
- Textual Location Extraction
- Named Entity Recognition
- Regular Expressions
"[0-9]+ ([A-Z][A-Za-z]* )+ (ALLEE|ALLEY|ALLY|ALY|ANEX|ANNEX|ANNX|ANX|ARC|ARCADE|AV|AVE|AVEN|AVENU|AVENUE|AVN|AVNUE|BAYOO|BAYOU|BCH|BEACH|BEND|BND|BLF|BLUF|BLUFF|BLUFFS|BOT|BOTTM|BOTTOM|BTM|BLVD|BOUL|BOULEVARD|BOULV|BR|BRANCH|BRNCH|BRDGE|BRG|BRIDGE|BRK|BROOK|BROOKS|BURG|BURGS|BYP|BYPA|BYPAS|BYPASS|BYPS|CAMP|CMP|CP|CANYN|CANYON|CNYN|CYN|CAPE|CPE|CAUSEWAY|CAUSWAY|CSWY|CEN|CENT|CENTER|CENTR|CENTRE|CNTER|CNTR|CTR|CENTERS|CIR|CIRC|CIRCL|CIRCLE|CRCL|CRCLE|CIRCLES|CLF|CLIFF|CLFS|CLIFFS|CLB|CLUB|COMMON|COR|CORNER|CORNERS|CORS|COURSE|CRSE|COURT|CRT|CT|COURTS|CTS|COVE|CV|COVES|CK|CR|CREEK|CRK|CRECENT|CRES|CRESCENT|CRESENT|CRSCNT|CRSENT|CRSNT|CREST|CROSSING|CRSSING|CRSSNG|XING|CROSSROAD|CURVE|DALE|DL|DAM|DM|DIV|DIVIDE|DV|DVD|DR|DRIV|DRIVE|DRV|DRIVES|EST|ESTATE|ESTATES|ESTS|EXP|EXPR|EXPRESS|EXPRESSWAY|EXPW|EXPY|EXT|EXTENSION|EXTN|EXTNSN|EXTENSIONS|EXTS|FALL|FALLS|FLS|FERRY|FRRY|FRY|FIELD|FLD|FIELDS|FLDS|FLAT|FLT|FLATS|FLTS|FORD|FRD|FORDS|FOREST|FORESTS|FRST|FORG|FORGE|FRG|FORGES|FORK|FRK|FORKS|FRKS|FORT|FRT|FT|FREEWAY|FREEWY|FRWAY|FRWY|FWY|GARDEN|GARDN|GDN|GRDEN|GRDN|GARDENS|GDNS|GRDNS|GATEWAY|GATEWY|GATWAY|GTWAY|GTWY|GLEN|GLN|GLENS|GREEN|GRN|GREENS|GROV|GROVE|GRV|GROVES|HARB|HARBOR|HARBR|HBR|HRBOR|HARBORS|HAVEN|HAVN|HVN|HEIGHT|HEIGHTS|HGTS|HT|HTS|HIGHWAY|HIGHWY|HIWAY|HIWY|HWAY|HWY|HILL|HL|HILLS|HLS|HLLW|HOLLOW|HOLLOWS|HOLW|HOLWS|INLET|INLT|IS|ISLAND|ISLND|ISLANDS|ISLNDS|ISS|ISLE|ISLES|JCT|JCTION|JCTN|JUNCTION|JUNCTN|JUNCTON|JCTNS|JCTS|JUNCTIONS|KEY|KY|KEYS|KYS|KNL|KNOL|KNOLL|KNLS|KNOLLS|LAKE|LK|LAKES|LKS|LAND|LANDING|LNDG|LNDNG|LA|LANE|LANES|LN|LGT|LIGHT|LIGHTS|...
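A street-address pattern of this shape (house number, one or more capitalized words, a street suffix) could be applied roughly as follows. This is a sketch under stated assumptions: `SUFFIXES` is a small subset of the list above, single spaces separate the parts, the suffix is matched case-insensitively, and the function name `find_addresses` is hypothetical.

```python
import re

# Small subset of the suffix list in the pattern above; the full list is much longer.
SUFFIXES = [
    "ALLEY", "ALY", "AVE", "AVENUE", "BLVD", "BOULEVARD",
    "CT", "COURT", "DR", "DRIVE", "HWY", "HIGHWAY", "LN", "LANE",
]

# House number, one or more capitalized words, then a street suffix.
# (?i:...) makes only the suffix case-insensitive (Python 3.6+).
ADDRESS_RE = re.compile(
    r"\b[0-9]+ (?:[A-Z][A-Za-z]* )+(?i:" + "|".join(SUFFIXES) + r")\b"
)


def find_addresses(text):
    """Return every substring of text that looks like a street address."""
    return [match.group(0) for match in ADDRESS_RE.finditer(text)]
```

For example, `find_addresses("Robbery reported at 500 North Michigan Avenue last night")` picks out the address span, while text with no house number and suffix yields an empty list.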
Search extracted locations through a city-to-GPS lookup table
- Many American city names are repeated (Atlanta, IL vs. Atlanta, GA)
- Check for well-formatted locations (city, state)
- If not, resolve by selecting the matched city with the largest population
- Give preference to other location sources (like user location and GPS) when there are multiple matches
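The disambiguation rules above can be sketched as a small lookup function. The table `GAZETTEER`, its approximate coordinates and populations, and the `hint_state` parameter (standing in for a state obtained from another source such as user location or GPS) are all assumptions for illustration, not the project's actual data.

```python
# Hypothetical city-to-GPS table; coordinates and populations are
# approximate and included for illustration only.
GAZETTEER = {
    "atlanta": [
        {"state": "GA", "lat": 33.75, "lon": -84.39, "population": 498_000},
        {"state": "IL", "lat": 40.26, "lon": -89.23, "population": 1_700},
    ],
}


def resolve_city(location, hint_state=None):
    """Resolve 'city' or 'city, state' to a single gazetteer entry, or None."""
    city, _, state = (part.strip() for part in location.partition(","))
    candidates = GAZETTEER.get(city.lower(), [])
    if state:
        # Well-formatted "city, state": keep only the exact state match.
        candidates = [c for c in candidates if c["state"] == state.upper()]
    elif hint_state and len(candidates) > 1:
        # Multiple matches: prefer the state suggested by another source.
        hinted = [c for c in candidates if c["state"] == hint_state.upper()]
        candidates = hinted or candidates
    if not candidates:
        return None
    # Fall back to the most populous remaining match.
    return max(candidates, key=lambda c: c["population"])
```

A bare "Atlanta" resolves to the Georgia entry by population, while "Atlanta, IL" or a hint from the user's profile selects the Illinois one.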
We only want to display the best "n" tweets
- Nature of twitter may result in an extremely variable amount of data
- Serves as another way to filter non-crime tweets
- May be able to highlight important events
- Summarize the most important data points
- Avoid overwhelming the user with results
Goal: Learn a function f: X -> r, where X is a vector of features and r is an importance score
Strategy: Take a pointwise approach and use a sample of manually scored data to find the curve that fits our labeled data
We use linear regression with the simple least squares method to find weights such that r = w1x1 + w2x2 + w3x3 + ... + wnxn
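To make the least squares step concrete, the sketch below solves the normal equations (X^T X) w = X^T y with plain Gaussian elimination and then scores a tweet as the weighted sum of its features. It assumes the feature columns are linearly independent (otherwise a pivot is zero), includes no intercept term, and uses the hypothetical names `fit_least_squares` and `importance`.

```python
def fit_least_squares(X, y):
    """Return weights w minimizing sum((w . x - r)^2) over the labeled rows."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y] of the normal equations.
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * target for row, target in zip(X, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def importance(weights, features):
    """r = w1*x1 + w2*x2 + ... + wn*xn"""
    return sum(w * x for w, x in zip(weights, features))
```

Once fitted on the manually scored sample, `importance` can be evaluated on every incoming tweet and the top "n" scores displayed.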
Selected from a large pool of potential features
- Social
- Contextual
- Tweet length, category, mentioned locations
- User Credibility
- Age of user account, friends, followers, status count, verification
- Classifier Confidence
Labeled ~500 tweets with a ranking (integer from 1 to 5)
- Linear regression on all features (normalized)
- Examined correlation coefficients
- Examined weights
- Pruned features
- Repeated until we had an adequate feature set with logical weights
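Two of the steps in this loop, normalizing a feature and examining its correlation with the manual ranking, might look like the sketch below. The slides do not say which normalization or which correlation coefficient was used; min-max scaling and the Pearson coefficient are assumptions, as are the function names.

```python
import math


def min_max_normalize(values):
    """Scale a feature column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def pearson(xs, ys):
    """Pearson correlation between a feature column and the labeled rankings."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / (spread_x * spread_y)
```

A feature whose correlation with the ranking stays near zero is a natural candidate for pruning before the regression is rerun.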
Weights           Features
2.87974471144     account age
1.71671010105     favorites
1.17242993534     status count
2.67005302808     followers
-3.97882564778    confidence
Clustering tweets means grouping overlapping tweets found in the same location into one category.
Clustered tweets inform the user about where most events are happening at a particular time.
- The sizes of the clustered tweets also convey how relevant or important the tweets are.
- e.g. A user may want to find out how far a wildfire has spread. Clustered tweets about the wildfire on the map show the user where the fire is or has spread to.
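The slides do not name the clustering algorithm, so the following is only one simple possibility: tweets are bucketed into fixed-size latitude/longitude grid cells, and the size of each bucket indicates how much activity a location has. The `cell_degrees` parameter, the `lat`/`lon` dictionary keys and the function name `cluster_by_grid` are assumptions for this sketch.

```python
import math
from collections import defaultdict


def cluster_by_grid(tweets, cell_degrees=0.01):
    """Group tweets whose coordinates fall in the same grid cell.

    Returns the clusters as lists of tweets, largest cluster first.
    """
    cells = defaultdict(list)
    for tweet in tweets:
        cell = (
            math.floor(tweet["lat"] / cell_degrees),
            math.floor(tweet["lon"] / cell_degrees),
        )
        cells[cell].append(tweet)
    return sorted(cells.values(), key=len, reverse=True)
```

Drawing one map marker per cluster, sized by the number of tweets it holds, would give the user the at-a-glance view described above.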
Conceptual Level
- Detects and monitors crime via a popular 21st-century social medium
Technical Level
- Developed a crawler to obtain data
- Identified and explored useful features from the social network to rank and classify crime
System Level