Talk:Social Network Intelligence BenchMark

From W3C Wiki
=========== Ideas/Suggestions from Orri ==================

- Hello Kitty mind share over time, compared to other Japanese productized fictional characters.


There is a Dbpedia entity for this, linking to some similar branded characters. We look at their frequency of occurrence, grouping by geography and year.

This could also make a cube.
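As a rough illustration of the cube idea, the sketch below counts mentions per character, region and year and also stores the roll-ups over each dimension. The input layout (a list of `(character, region, year)` tuples) and the function name `mention_cube` are assumptions made for this example, not part of the proposal.

```python
from collections import Counter
from itertools import product


def mention_cube(mentions):
    """Count mentions for every cell of a small data cube.

    mentions: iterable of (character, region, year) tuples (assumed layout).
    None in a cell key means "all values" of that dimension.
    """
    cube = Counter()
    for character, region, year in mentions:
        # Each mention contributes to its exact cell and to all roll-ups.
        for cell in product((character, None), (region, None), (year, None)):
            cube[cell] += 1
    return cube


mentions = [
    ("Hello Kitty", "JP", 2010),
    ("Hello Kitty", "JP", 2011),
    ("Totoro", "JP", 2010),
    ("Hello Kitty", "US", 2010),
]
cube = mention_cube(mentions)
print(cube[("Hello Kitty", "JP", None)])  # Hello Kitty in JP, all years
print(cube[(None, None, 2010)])           # all characters, all regions, 2010
```

Storing the roll-ups eagerly keeps the example short; a real benchmark query would more likely compute them with a GROUP BY over the generated data.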


- Is Hello Kitty a prime mover?


Which online account associated with Hello Kitty initiates the longest conversation threads?

An account is associated if the owner is tagged with Hello Kitty or mentions it in more than 10 posts. For such accounts, we count the length of threads initiated by the account and take the top k, returning name, longest thread length and number of posts. The thread length calculation is a count over a transitive subquery.
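A minimal sketch of this query follows, assuming posts are given as `(post_id, author, parent_id)` tuples and that the set of associated accounts has already been computed. Thread length is taken here to mean the number of posts reachable from the initial post through replies, which is one reading of "a count over a transitive subquery"; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def top_thread_starters(posts, associated, k=3):
    """Return (name, longest thread length, number of posts) for the top k.

    posts: list of (post_id, author, parent_id); parent_id is None for a
           post that starts a thread (assumed layout).
    associated: set of account names already classified as Hello Kitty fans.
    """
    children = defaultdict(list)
    for post_id, _author, parent_id in posts:
        if parent_id is not None:
            children[parent_id].append(post_id)

    def thread_size(root):
        # Transitive closure over the reply relation, counted iteratively.
        count, stack = 0, [root]
        while stack:
            node = stack.pop()
            count += 1
            stack.extend(children.get(node, []))
        return count

    stats = {}
    for post_id, author, parent_id in posts:
        if author not in associated:
            continue
        entry = stats.setdefault(author, {"longest": 0, "posts": 0})
        entry["posts"] += 1
        if parent_id is None:
            entry["longest"] = max(entry["longest"], thread_size(post_id))

    # Longest thread first; ties broken by account name for stable output.
    ranked = sorted(stats.items(), key=lambda kv: (-kv[1]["longest"], kv[0]))
    return [(name, s["longest"], s["posts"]) for name, s in ranked[:k]]


posts = [
    (1, "kitty_fan", None),
    (2, "bob", 1),
    (3, "carol", 2),
    (4, "bob", None),
    (5, "kitty_fan", 4),
]
print(top_thread_starters(posts, {"kitty_fan"}, k=1))
```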

- Challengers of Hello Kitty - Similar fictional entities occurring together in posts, e.g. is Totoro more cute?


- Where and when to advertize Hello Kitty hairbells?

Find geographical areas, i.e. IP domains and times when Hello Kitty fans are online. Additionally find candidate ad words relating to the target demographic, e.g. frequent tags in posts, accounts or groups.

Eliminate the tags/concepts which are not selective, e.g. "likes music" is not selective but "likes Totoro" is selective.
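One possible way to make "selective" concrete is lift: how much more common a tag is among Hello Kitty fans than among all users. The sketch below uses that measure; the lift threshold, the function name and the `user -> set of tags` input layout are assumptions for illustration only.

```python
from collections import Counter


def selective_tags(fan_tags, all_tags, min_lift=1.5):
    """Return tags that are over-represented among fans.

    fan_tags, all_tags: dict mapping user -> set of tags (assumed layout);
    fan_tags should be the subset of all_tags belonging to fans.
    """
    def share(tag_sets):
        counts = Counter(tag for tags in tag_sets.values() for tag in tags)
        total = len(tag_sets)
        return {tag: n / total for tag, n in counts.items()}

    fan_share = share(fan_tags)
    base_share = share(all_tags)
    # Lift of 1.0 means the tag is as common among fans as among everyone.
    lifts = {t: fan_share[t] / base_share[t] for t in fan_share if t in base_share}
    return sorted(t for t, lift in lifts.items() if lift >= min_lift)


all_tags = {
    "a": {"likes music", "likes Totoro"},
    "b": {"likes music", "likes Totoro"},
    "c": {"likes music"},
    "d": {"likes music"},
}
fan_tags = {u: all_tags[u] for u in ("a", "b")}
print(selective_tags(fan_tags, all_tags))
```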

- Wildfire

Find the first mentions of a concept in the last day such that the concept was not mentioned before and the concept occurs in over 10% of new posts in groups involved with politics.

The data is organized so that a post is a repost or a reply to another post. Thus the path of propagation can be traced. Take the 'leaf' posts mentioning the event, e.g. the Haiti earthquake, and find the top 10 propagation paths accounting for the leaf posts. The propagation path is defined as the top 5 levels of the total path leading from the initial story to the leaf. We see which sources are the most influential in the early stages of the propagation of the story. The sources are the parties posting the initial reposts or replies.
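The path tracing described above could be sketched as follows, assuming each post has at most one parent so that the reply/repost structure is a forest. The dictionaries `parent_of` and `author_of` and the function name are hypothetical stand-ins for the benchmark data.

```python
from collections import Counter


def top_propagation_paths(parent_of, author_of, leaves, depth=5, k=10):
    """Rank the first `depth` levels of the paths that lead to leaf posts.

    parent_of: dict post_id -> parent post_id, or None for the initial story.
    author_of: dict post_id -> account that made the post.
    leaves:    post ids of leaf posts mentioning the event.
    """
    paths = Counter()
    for leaf in leaves:
        chain = []
        node = leaf
        while node is not None:
            chain.append(author_of[node])
            node = parent_of.get(node)
        chain.reverse()  # initial story first, leaf last
        paths[tuple(chain[:depth])] += 1
    return paths.most_common(k)


parent_of = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}
author_of = {1: "news", 2: "alice", 3: "bob", 4: "carol", 5: "dave"}
print(top_propagation_paths(parent_of, author_of, leaves=[3, 4, 5], depth=2, k=1))
```

Paths are keyed by account rather than by post id here, so that two leaves reached through the same early sources count toward the same path.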


- iPhone vs. Blackberry and Hello Kitty with professionals?

We guess which users might be lawyers based on the tags of their posts and compare the mentions of iPhone vs. Blackberry as either user agent or word in posts. From Dbpedia we take a cloud of law or business related concepts, e.g. intellectual property or credit default swap, and look for mentions of these in online content.

We look for relations between this group and Hello Kitty fans. We expect an anti-correlation first but a certain number of hits in close proximity, e.g. a certain percentage of children of lawyers, especially in Asia, will like Hello Kitty. Family relationships are usually not explicit, so we infer this from being a different account from the same IP, probably meaning household.

We evaluate the target market for exclusive platinum Hello Kitty hairbells based on how many fans we find in proximity to professional groups with presumably high income.


We further evaluate market size for Hello Kitty iPhone and Blackberry covers based on the penetration of these brands into the target households.

Here we have follow-up questions based on fairly selective but complex queries, so there is an opportunity for result recycling.


- Troublemakers and Duplicates

We are looking for duplicate identities based on behavior patterns. A zealot is defined as a person who starts threads which may get a large number of fast replies due to inflammatory content but where the threads do not last long. Having identified one, we look for others with a similar behavior pattern and subject matter. The most points are given for overlapping content in the initial firebrand post, a bit less for similarity of content in replies. The content is the set of concepts, i.e. tags.

Extra points are given for repetitive behavior and even more points for leaving/being banned from the groups where the behavior occurs. Repetitive behavior is posting a second time with the same concepts and getting a similar reaction.
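The point scheme above might be sketched like this, using Jaccard overlap of tag sets as the similarity measure. The weights, the record fields and the function names are all assumptions chosen only to respect the stated ordering (initial post over replies, ban over repetition).

```python
def jaccard(a, b):
    """Overlap of two tag sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def duplicate_score(known, candidate,
                    w_initial=3.0, w_reply=2.0, w_repeat=1.0, w_banned=2.0):
    """Score how closely a candidate account resembles a known zealot.

    known, candidate: dicts with 'initial_tags' and 'reply_tags' (sets);
    candidate also has boolean 'repeats' and 'banned' (assumed fields).
    """
    score = w_initial * jaccard(known["initial_tags"], candidate["initial_tags"])
    score += w_reply * jaccard(known["reply_tags"], candidate["reply_tags"])
    if candidate["repeats"]:
        score += w_repeat   # same concepts posted again, similar reaction
    if candidate["banned"]:
        score += w_banned   # left or was banned from the group
    return score


known = {"initial_tags": {"politics", "taxes"}, "reply_tags": {"outrage"}}
candidate = {
    "initial_tags": {"politics", "taxes"},
    "reply_tags": {"agreement"},
    "repeats": True,
    "banned": False,
}
print(duplicate_score(known, candidate))
```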


A related profile is the attention seeker who generates a lot of traffic in a group right after joining and whose activity drops off after a time, only to reoccur in a related group. Here the activity may be more in the form of replies.


Use of many different accounts by one person is more likely with these profiles.


- Expert Finding

Given two specialities, e.g. law and medicine, we are looking for an expert in both. An expert has contributed to discussions in both domains for a long time. Points are given for many social connections to other people involved in the same domains. Points are subtracted for volatility in social connections or group memberships.
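A hedged sketch of such a score is given below. It assumes per-account summaries are already available (years of activity per domain, number of connections per domain, and a count of dropped connections or memberships as the volatility measure); the field names and weights are hypothetical.

```python
def expert_score(account, domains, w_tenure=1.0, w_links=0.1, w_churn=0.5):
    """Score an account as an expert in all of the given domains.

    account: dict with 'years_active' (domain -> years), 'connections'
             (domain -> number of contacts in that domain) and 'churn'
             (dropped connections or group memberships); assumed fields.
    """
    # An expert must have contributed to every domain, so use the minimum.
    tenure = min(account["years_active"].get(d, 0) for d in domains)
    if tenure == 0:
        return 0.0
    links = sum(account["connections"].get(d, 0) for d in domains)
    return w_tenure * tenure + w_links * links - w_churn * account["churn"]


account = {
    "years_active": {"law": 4, "medicine": 2},
    "connections": {"law": 10, "medicine": 10},
    "churn": 2,
}
print(expert_score(account, ["law", "medicine"]))
```

Taking the minimum tenure across domains is a design choice: it prevents a long record in one speciality from compensating for no record in the other.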


As a follow-up question we look for bridges between the lawyer and the doctor social networks, as defined by profession and social connections.


The data generation should favor social connections between people in the same occupation. Accounts may be tagged by industry or occupation, but these tags may not always be filled in. Not all queries need to use this information; some will rather infer it from discussion content.


- Classifying forums

Using Yago or Umbel ontologies with Dbpedia, derive for each discussion group and account a set of characteristic top level topics. A professional discussion usually uses specialty terms and does not mention the top level concepts since these are implicit in the context.

For market research however the distinction between divorce and intellectual property lawyers is most often not essential, so the top level tags are more useful.


For generating conversation with relevant correlation in content, the Dbpedia "influences" and "influenced by" links may be useful, as well as the 200K or so classifications of things in Yago, Umbel and Open CYC. In this way, by excerpting pieces from Dbpedia articles, one can make a conversation that approximately stays on topic. Especially if the conversation concerns people, e.g. movie stars, pop music artists or athletes, there will be many such links, labels of sports, genres, historical periods etc. Picking pieces of text and the entities occurring therein will make almost realistic conversation.


With this information, one can also easily generate off-topic or disruptive posts.


If one decides the general subject matter for each thread in advance, correlated content is sure to be generated. Some threads might be allowed to drift, as with a broken telephone effect, where only the 2-3 previous posts guide the content of the reply.
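The drifting thread could be generated along the following lines. This sketch assumes a small dictionary of related concepts standing in for the Dbpedia links; each reply picks one of the last few posts as its guide and moves to a concept linked from it. The function name, the `related` structure and the fixed seed are illustrative assumptions.

```python
import random


def generate_thread(related, start, length, window=3, seed=0):
    """Generate a sequence of post topics that may drift away from `start`.

    related: dict topic -> list of linked topics (stand-in for Dbpedia links).
    window:  number of previous posts that may guide the next reply.
    """
    rng = random.Random(seed)  # seeded so that generated data is repeatable
    topics = [start]
    while len(topics) < length:
        guide = rng.choice(topics[-window:])
        neighbours = related.get(guide, [])
        # Stay on the guide's topic when it has no outgoing links.
        topics.append(rng.choice(neighbours) if neighbours else guide)
    return topics


related = {
    "Hello Kitty": ["Sanrio", "My Melody"],
    "Sanrio": ["Hello Kitty", "Japan"],
    "My Melody": ["Sanrio"],
    "Japan": ["Totoro", "Sanrio"],
    "Totoro": ["Studio Ghibli"],
}
print(generate_thread(related, "Hello Kitty", length=6))
```

Setting `window=1` gives the strongest drift, while a thread that must stay on topic could instead always use the first post as the guide.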


- The purpose of the query for "Is Hello Kitty a prime mover" is not clear. Why do we need to find which account associated with Hello Kitty initiates the longest conversation threads?

For "iPhone vs. Blackberry and Hello Kitty with professionals?"

Identifying a family relationship can be difficult because:

- Most users do not have a static IP address. In addition, users can browse SNs from many different places, and there is a high probability that they browse from their workplaces. Thus, relying on the IP address may give the misleading impression that people working in the same place are a family.

(Update: Solution for this problem. First, we need to know the region from which each user browses the SN. From the regional information, we can tell the local time and hence whether the user is browsing the SN during working hours or not. If not, there is a high probability that she is browsing the SN from home. Moreover, by considering the IP address over 10 consecutive browsing sessions, if the IP address is unchanged, it can be considered a static IP address.)

- There are not many families in which parents and their children both join the same SN. Each SN may be favored by a particular age group. For example, nearly 70% of Facebook users are aged 18-34.
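The update above about static IP addresses can be sketched as follows. The sketch assumes that sessions already carry the local hour derived from the user's region, that working hours are 9 to 18, and that "10 continuous browsing times" means 10 consecutive sessions outside working hours; all names are hypothetical.

```python
from collections import defaultdict


def home_ip(sessions, run_length=10, work_hours=range(9, 18)):
    """Guess a user's home IP, or return None if no stable one is found.

    sessions: chronological list of (ip, local_hour) for one user.
    """
    off_hours = [ip for ip, hour in sessions if hour not in work_hours]
    run_ip, run = None, 0
    for ip in off_hours:
        run = run + 1 if ip == run_ip else 1
        run_ip = ip
        if run >= run_length:
            return ip  # unchanged over enough sessions: treat as static
    return None


def households(sessions_by_user, **kwargs):
    """Group users who share a home IP; groups of one are dropped."""
    groups = defaultdict(set)
    for user, sessions in sessions_by_user.items():
        ip = home_ip(sessions, **kwargs)
        if ip is not None:
            groups[ip].add(user)
    return {ip: users for ip, users in groups.items() if len(users) > 1}


sessions_by_user = {
    "parent": [("1.1.1.1", 20)] * 10,
    "child": [("1.1.1.1", 19)] * 10,
    "coworker": [("2.2.2.2", 10)] * 10,  # only seen during working hours
}
print(households(sessions_by_user))
```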

Difficulties in creating Social graphs

Is there any solution for creating a social graph that can guarantee a given distribution of social degree and a given clustering coefficient?