How do I design an HBase schema for Twitter data?

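I have the following Twitter data and want to design a schema for it. The queries I need to run are: tweet volume for a time interval, tweets with their user information, tweets with their topic information, and so on. Based on the data below, can anyone tell me what the schema design should look like (for example a row key of id + timestamp, the user as a column family, the other fields grouped into columns, ...)? Any suggestions?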

{ 
    "created_at":"Tue Feb 19 11:16:34 +0000 2013", 
    "id":303825398179979265, 
    "id_str":"303825398179979265", 
    "text":"Unleashing Innovation Conference Kicks Off - Wall Street Journal (India)    http:\/\/t.co\/3bkXJBz1", 
    "source":"\u003ca href=\"http:\/\/dlvr.it\" rel=\"nofollow\"\u003edlvr.it\u003c\/a\u003e", 
    "truncated":false, 
    "in_reply_to_status_id":null, 
    "in_reply_to_status_id_str":null, 
    "in_reply_to_user_id":null, 
    "in_reply_to_user_id_str":null, 
    "in_reply_to_screen_name":null, 
    "user":{ 
     "id":948385189, 
     "id_str":"948385189", 
     "name":"Innovation Plaza", 
     "screen_name":"InnovationPlaza", 
     "location":"", 
     "url":"http:\/\/tinyurl.com\/ee4jiralp", 
     "description":"All the latest breaking news about Innovation", 
     "protected":false, 
     "followers_count":136, 
     "friends_count":1489, 
     "listed_count":1, 
     "created_at":"Wed Nov 14 19:49:18 +0000 2012", 
     "favourites_count":0, 
     "utc_offset":28800, 
     "time_zone":"Beijing", 
     "geo_enabled":false, 
     "verified":false, 
     "statuses_count":149, 
     "lang":"en", 
     "contributors_enabled":false, 
     "is_translator":false, 
     "profile_background_color":"131516", 
     "profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/781710342\/17a75bf22d9fdad38eebc1c0cd441527.jpeg", 
     "profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/781710342\/17a75bf22d9fdad38eebc1c0cd441527.jpeg", 
     "profile_background_tile":true, 
     "profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3205718892\/8126617ac6b7a0e80fe219327c573852_normal.jpeg", 
     "profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3205718892\/8126617ac6b7a0e80fe219327c573852_normal.jpeg", 
     "profile_link_color":"009999", 
     "profile_sidebar_border_color":"FFFFFF", 
     "profile_sidebar_fill_color":"EFEFEF", 
     "profile_text_color":"333333", 
     "profile_use_background_image":true, 
     "default_profile":false, 
     "default_profile_image":false, 
     "following":null, 
     "follow_request_sent":null, 
     "notifications":null 
    }, 
    "geo":null, 
    "coordinates":null, 
    "place":null, 
    "contributors":null, 
    "retweet_count":0, 
    "entities":{ 
     "hashtags":[ 

     ], 
     "urls":[ 
     { 
      "url":"http:\/\/t.co\/3bkXJBz1", 
      "expanded_url":"http:\/\/dlvr.it\/2yyG5C", 
      "display_url":"dlvr.it\/2yyG5C", 
      "indices":[ 
       73, 
       93 
      ] 
     } 
     ], 
     "user_mentions":[ 

     ] 
    }, 
    "favorited":false, 
    "retweeted":false, 
    "possibly_sensitive":false 
}

Source

2013-03-07 anups

If you are 100% sure that the ID is unique, you can use it as the row key under which the bulk of the data is stored:

303825398179979265 -> data_CF

Your column family data_CF would then be defined along these lines:

 
"created_at":"Tue Feb 19 11:16:34 +0000 2013" 
"id_str":"303825398179979265" 
... 
"user_id":948385189 { take note here I'm denormalizing your dictionary } 
"user_name":"Innovation Plaza"
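
The denormalization above can be sketched in a few lines of Python. This is a minimal sketch, not the answerer's code: `flatten_tweet` is a hypothetical helper, the underscore-joined qualifier naming is an assumption extrapolated from `user_id` and `user_name`, and storing values as JSON text is just one possible encoding. Lists are deliberately skipped because they need separate handling.

```python
import json

def flatten_tweet(tweet, prefix=""):
    """Flatten nested dicts into one level of column qualifiers.

    Nested keys are joined with "_" (an assumed convention), so
    tweet["user"]["id"] becomes the qualifier "user_id".
    Lists are skipped here; None values are dropped, since HBase
    simply stores no cell for an absent column.
    """
    columns = {}
    for key, value in tweet.items():
        qualifier = prefix + key
        if isinstance(value, dict):
            columns.update(flatten_tweet(value, qualifier + "_"))
        elif isinstance(value, list) or value is None:
            continue
        else:
            # HBase stores raw bytes; JSON text is one simple encoding choice.
            columns[qualifier] = json.dumps(value).encode("utf-8")
    return columns

# A trimmed-down version of the tweet from the question.
tweet = json.loads('{"id": 303825398179979265, "in_reply_to_user_id": null, '
                   '"user": {"id": 948385189, "name": "Innovation Plaza"}, '
                   '"entities": {"hashtags": []}}')

row_key = str(tweet["id"]).encode("utf-8")   # the tweet ID is the row key
columns = flatten_tweet(tweet)               # cells for the data_CF family
```

Each entry of `columns` would become one cell in `data_CF` under `row_key`, so a whole tweet can be written as a single row mutation.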

Lists are a bit trickier, for example the URLs:

 
"entities_hashtags_":"\x00" { Here \x00 is a dummy value }

The solution is to use a category prefix followed by something that makes the column qualifier unique. If order is not important, you can use a UUID for that; it guarantees that the qualifier is unique.
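
One way to read this advice, sketched in Python under stated assumptions: `list_qualifiers` is a hypothetical helper, and both the `entities_urls_` prefix and the choice to store each item as the cell value are illustrative rather than something the answer specifies. When order matters, a zero-padded index serves as the unique part; otherwise a random UUID does.

```python
import json
import uuid

def list_qualifiers(prefix, items, ordered=True):
    """Map each list item to its own uniquely named column.

    prefix  -- category prefix, e.g. "entities_urls_" (assumed naming)
    ordered -- True: a zero-padded index keeps list order under HBase's
               lexicographic qualifier sort; False: random UUID suffix.
    """
    columns = {}
    for index, item in enumerate(items):
        suffix = f"{index:04d}" if ordered else uuid.uuid4().hex
        value = json.dumps(item, sort_keys=True).encode("utf-8")
        columns[prefix + suffix] = value
    return columns

# The single URL entity from the tweet in the question (two fields only).
urls = [
    {"url": "http://t.co/3bkXJBz1", "expanded_url": "http://dlvr.it/2yyG5C"},
]
ordered_cols = list_qualifiers("entities_urls_", urls)
unordered_cols = list_qualifiers("entities_urls_", urls, ordered=False)
```

For simple values such as hashtags, the item itself could instead go into the qualifier with a dummy `\x00` value, which makes a duplicate entry overwrite itself rather than create a second column.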

The advantage of this approach is that HBase guarantees atomicity at the row level, so if you need to update a field of this data, the update is applied atomically.

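For the second question: if you need aggregate information on the fly, you have to precompute it and store it, as you said, in another table. If you want this data to be generated via M/R, you can use timestamp + row id as the key when the query is time-based. For topics it would be something like topic + row id. This lets you write prefix scans that cover only the start time or the topic you are interested in.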
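
The key layout for such secondary tables can be sketched as follows. This is an illustration in plain Python, not HBase client code: the helper names, the 8-byte big-endian timestamp encoding, and the `\x00` separator are all assumptions. A sorted list stands in for HBase's lexicographically ordered row keys, and `bisect` plays the role of a scan with start and stop rows.

```python
import bisect
import struct
from datetime import datetime, timezone

def parse_created_at(text):
    """Parse Twitter's created_at format into epoch seconds."""
    parsed = datetime.strptime(text, "%a %b %d %H:%M:%S %z %Y")
    return int(parsed.timestamp())

def time_key(epoch_seconds, tweet_id):
    """timestamp + row id; big-endian so byte order equals time order."""
    return struct.pack(">Q", epoch_seconds) + str(tweet_id).encode("utf-8")

def topic_key(topic, tweet_id):
    """topic + row id; the separator keeps a scan for 'java'
    from also matching rows of the topic 'javascript'."""
    return topic.encode("utf-8") + b"\x00" + str(tweet_id).encode("utf-8")

def scan(sorted_keys, start_row, stop_row):
    """Return keys in [start_row, stop_row), like an HBase range scan."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

ts = parse_created_at("Tue Feb 19 11:16:34 +0000 2013")
index = sorted([
    time_key(ts - 7200, 303825398179970000),   # hypothetical earlier tweet
    time_key(ts, 303825398179979265),          # the tweet from the question
    time_key(ts + 7200, 303825398179990000),   # hypothetical later tweet
])

# Tweet volume for the hour starting at 11:00:00 UTC on 2013-02-19.
hour_start = int(datetime(2013, 2, 19, 11, tzinfo=timezone.utc).timestamp())
in_hour = scan(index,
               struct.pack(">Q", hour_start),
               struct.pack(">Q", hour_start + 3600))
volume = len(in_hour)
```

Because the timestamp leads the key, all tweets of an interval sit next to each other and a single range scan returns them. The trade-off is that purely time-ordered keys send all new writes to the same region, so a real design might add a salt or bucket prefix.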

Have fun!

Source

2013-03-14 19:59:25
