- Online social network platofroms always find ways to connect more people inside their network. These new connections could be based on various factors like age, qualification, profession, likes-dislikes or common friends. Finding new frineds based on common friends is a very popular approach.
- Common friends approach: If two people who are not connected on a social network platform yet, but have got many common frieds, then they are are highly likeyly to be connected very soon. But people may not be aware of such friends in the network or may take very long time to disover and then get connected. So, social network platforms themselves run algorithms to find out such people in the network (auto discovery) and make frequent recommendations to connect.
- When the networks become very large and have hundreds of millions people in the network, discovering such implicit connections become very challenging. In this tutorial we will explain a simple map-reduce based algorithms to discover such connections and make recommendations. This algorithm is highly scalable as this is based on map reduce paradigm and can run on hundred of millions of connections.
- The example dataset contains names of the people and their explicit connections in the network. Every line in the file represents information about one person in the network, where the first columns is the name of the person and the rest of the columns represent the name of the persons whom he is already connected to.
- Based on this we need to make atleast 5 new friends recommendations to each person, in the order of number of common friends they have.
friends_rdd = sc.textFile( 'file:///home/hadoop/lab/data/friendsnetwork.csv')
friends_rdd.take( 10 )
['Kristina Chung,Glenda Baker,Bradley Duke,Deborah French,Priscilla Hayes', 'Paige Chen,Faye Sparks,Christopher McKay,Katie Katz,Sue Ellis,Pauline Coley,Mike Hill,Carl Nixon', 'Sherri Melton,Luis Hirsch,Marian Crane,Jessica Hester,Evan Grant,Chris Scott,Beth Hayes,Bobby Case', 'Gretchen Hill,Samantha Hardin,Maureen Kemp,Maurice Gibbons,Franklin Vick', 'Karen Puckett,Brenda Fitzpatrick,Chris Scott,Dawn Lutz,', 'Patrick Song,Bernard Pate,Peggy Schroeder,Edna Hoyle,Anne Little,Kristen Fisher,Melanie Diaz,Katharine Pappas', 'Elsie Hamilton,Franklin Vick', 'Hazel Bender,Gary Scarborough,Floyd Schroeder,Anne Little,Clyde Kenney', 'Malcolm Wagner,Annie Lindsay,Ricky Casey', 'Dolores McLaughlin,James Floyd,Ruth Tyson,Tara Wolfe,Sheryl Love,Priscilla Hayes,Lois Holmes,Sandra Kang,Jason Bowling,Carrie Watson,Brooke Boswell']
- Here Kristina Chung is connected to Glenda Baker,Bradley Duke,Deborah French,Priscilla Hayes
- Similarily there will be some persons who will be connected to Glenda Baker or Bradley Duke or Deborah French or Priscilla Hayes, but not connected to Kristina Chung
- Now, can we exploit this informationto find if there are any new friends that can be recommeded to Kristina Chung
- There are some extra commas at the end of each line, that need to be removed using rstrip() call
- Modify the data into a tuple of ( person name , [list of direct connectsion] )
friends_final_rdd = friends_rdd.map(lambda line: line.rstrip(',')) \ .map(lambda line: line.split(',')) \ .map(lambda oneFriend: (oneFriend, oneFriend[1:]))
friends_final_rdd.take( 4 )
[('Kristina Chung', ['Glenda Baker', 'Bradley Duke', 'Deborah French', 'Priscilla Hayes']), ('Paige Chen', ['Faye Sparks', 'Christopher McKay', 'Katie Katz', 'Sue Ellis', 'Pauline Coley', 'Mike Hill', 'Carl Nixon']), ('Sherri Melton', ['Luis Hirsch', 'Marian Crane', 'Jessica Hester', 'Evan Grant', 'Chris Scott', 'Beth Hayes', 'Bobby Case']), ('Gretchen Hill', ['Samantha Hardin', 'Maureen Kemp', 'Maurice Gibbons', 'Franklin Vick'])]
- If we look at each persons and his friends list, we can create some map key values that would represent the exisitng connections and possible connections.
- For example,
- A key value (('Kristina Chung','Glenda Baker'), 0), where 0 represents Kristina Chung and Glenda Baker are already connected.
- A key value (('Glenda Baker', 'Bradley Duke'), 1), where 1 repesents Glenda Baker and Bradley Duke are possible future friends (recommendations) as they have a common friend called Kristina Chung
- Now, we need to generate these candidate key and value combinations to represent all existing connections and possible future connections.
import itertools def createFrindsConnections( a_friend ): person = a_friend friends_list = list( set( a_friend ) ) ## Create existing connections all_connections = [ ( ( each_friend, person ), 0) \ if person > each_friend else ( ( person, each_friend ), 0) \ for each_friend in friends_list ] ## Create possible future connections for friends_pair in itertools.combinations( friends_list, 2 ): if friends_pair > friends_pair: all_connections.append( ( ( friends_pair, friends_pair ), 1 ) ) else: all_connections.append( ( friends_pair, 1 ) ) return all_connections
friend_connections_rdd = friends_final_rdd.flatMap( lambda rec: createFrindsConnections( rec ) )
PythonRDD at RDD at PythonRDD.scala:43
friend_connections_rdd.take( 10 )
[(('Glenda Baker', 'Kristina Chung'), 0), (('Kristina Chung', 'Priscilla Hayes'), 0), (('Bradley Duke', 'Kristina Chung'), 0), (('Deborah French', 'Kristina Chung'), 0), (('Glenda Baker', 'Priscilla Hayes'), 1), (('Bradley Duke', 'Glenda Baker'), 1), (('Deborah French', 'Glenda Baker'), 1), (('Bradley Duke', 'Priscilla Hayes'), 1), (('Deborah French', 'Priscilla Hayes'), 1), (('Bradley Duke', 'Deborah French'), 1)]
- Now, we can aggregate the 0's and 1's for each pair. If a 0 exists in the list for any pair, that means the pair is already connected and it can be ignored or need not be processes further.
- But if there is no 0, then we can sum up all the 1's and that would be the number of common friends they have got and they are not connected yet. This signifies possible future connections.
num_common_friends_rdd = friend_connections_rdd.groupByKey() \ .filter(lambda rec: 0 not in rec) \ .map(lambda x: (x, sum(x))) \ .sortBy(lambda rec: rec, ascending=False)
PythonRDD at RDD at PythonRDD.scala:43
num_common_friends_rdd.take( 10 )
[(('Jay Wang', 'Marion Puckett'), 3), (('Julian Woods', 'Melinda Weeks'), 3), (('Jamie Gallagher', 'Theodore McKay'), 3), (('Benjamin Hensley', 'Lewis Currin'), 3), (('Bob Fischer', 'Stephanie Hawkins'), 3), (('Harry Foster', 'Lois Joseph'), 2), (('Katharine Pappas', 'Peggy Schroeder'), 2), (('Crystal Powers', 'Eric Kane'), 2), (('Anne Stephens', 'Patsy Parrott'), 2), (('Florence Carver', 'Hannah Mueller'), 2)]
- It shows that Jay Wang and Marion Puckett have 3 common friends, but they are not connected yet. So, they can be recommended to each other.
- To make top 5 recommedations for each user, we need to separately create recommedation list for each user from the paired information above.
- For example, (('Jay Wang', 'Marion Puckett'), 3) should create two individual recommendations as below.
- ('Jay Wang', ('Marion Puckett', 3))
- ('Marion Puckett', ('Jay Wang', 3))
- That means Marion Puckett is recommeded to Jay Wang because of 3 common friends and Jay Wang recommended to Marion Puckett because of 3 common friends.
individual_recommends = num_common_friends_rdd.flatMap(lambda rec: [(rec, (rec, rec)), \ (rec, (rec, rec))] )
individual_recommends.take( 10 )
[('Jay Wang', ('Marion Puckett', 3)), ('Marion Puckett', ('Jay Wang', 3)), ('Julian Woods', ('Melinda Weeks', 3)), ('Melinda Weeks', ('Julian Woods', 3)), ('Jamie Gallagher', ('Theodore McKay', 3)), ('Theodore McKay', ('Jamie Gallagher', 3)), ('Benjamin Hensley', ('Lewis Currin', 3)), ('Lewis Currin', ('Benjamin Hensley', 3)), ('Bob Fischer', ('Stephanie Hawkins', 3)), ('Stephanie Hawkins', ('Bob Fischer', 3))]
- Making top 5 final recommendations for each users.
top_n_recommends = 5
final_recommends = individual_recommends.groupByKey().mapValues(lambda rec: \ sorted(rec, key=lambda x: x, \ reverse=True)[:top_n_recommends])
final_recommends.cache() final_recommends.take( 5 )
[('Regina Ellington', [('Annie Lindsay', 1), ('Suzanne Gould', 1), ('Patrick Song', 1), ('Zachary Lindsay', 1), ('Gladys Davidson', 1)]), ('Kim Hensley', [('Norman Lam', 1), ('Melinda Weeks', 1), ('Brenda Barbour', 1), ('Susan Abbott', 1), ('Robyn Shapiro', 1)]), ('Norman Arthur', [('Constance Black', 1), ('Marc Wallace', 1), ('Pauline Chandler', 1), ('Chris Chu', 1), ('Vincent Woodward', 1)]), ('Tom Park', [('Marc Wallace', 2), ('Ronald Odom', 1), ('Jessica Rich', 1), ('Vicki Mills', 1), ('Paige Cherry', 1)]), ('Frances Wagner', [('Randall Hinson', 1), ('Nelson Pierce', 1), ('Beth Willis', 1), ('Theresa Dyer', 1), ('Gretchen Goldstein', 1)])]
- That's the final recommendations! For example Marc Wallace is the top recommendation for Tom park as they have 2 common friends.