Hive sentiment analysis project


#1

Can you tell me how you defined the schema here?

CREATE EXTERNAL TABLE tweets_raw (
    id BIGINT,
    created_at STRING,
    source STRING,
    favorited BOOLEAN,
    retweet_count INT,
    retweeted_status STRUCT<
        text:STRING,
        users:STRUCT<screen_name:STRING,name:STRING>>,
    entities STRUCT<
        urls:ARRAY<STRUCT<expanded_url:STRING>>,
        user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
        hashtags:ARRAY<STRUCT<text:STRING>>>,
    text STRING,
    user STRUCT<
        screen_name:STRING,
        name:STRING,
        friends_count:INT,
        followers_count:INT,
        statuses_count:INT,
        verified:BOOLEAN,
        utc_offset:STRING, -- was INT but nulls are strings
        time_zone:STRING>,
    in_reply_to_screen_name STRING,
    year int,
    month int,
    day int,
    hour int
)

I have looked at the data file and found many other fields in each record before the 'id' field. How did we direct Hive to ignore those fields of the data file?


#2

Hi, Adarsh.

You can see the data set for Iron Man 3 in HDFS at "/data/SentimentFiles/upload/data", and the polarity of common words in the dictionary file in HDFS at "/data/SentimentFiles/SengtimentFiles/upload/data/dictionary/dictionary.tsv".
Once you look through them, you will see why we chose these fields. We declare only the fields we need for our own understanding or for the business requirements; which fields we take from the data is up to us.
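To make the "ignoring" concrete, here is a minimal sketch. It assumes the raw tweets are JSON lines read through a JSON SerDe such as org.openx.data.jsonserde.JsonSerDe; the DDL posted above does not show its ROW FORMAT clause, so that is an assumption. The table name tweets_demo and the location /tmp/tweets_demo are hypothetical. A JSON SerDe matches columns to JSON keys by name, not by position, so any key that is not declared in the schema is simply never read.

```sql
-- Hypothetical demo table: declares only two of the many keys in each tweet.
-- Assumes the openx JSON SerDe jar is available on Hive's classpath.
CREATE EXTERNAL TABLE tweets_demo (
    id BIGINT,
    text STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/tmp/tweets_demo';  -- hypothetical HDFS directory of JSON lines

-- Given a record such as
--   {"contributors": null, "coordinates": null, "id": 1, "text": "hello", "lang": "en"}
-- the SerDe looks up "id" and "text" by name; the other keys are skipped.
SELECT id, text
FROM tweets_demo
LIMIT 5;
```

So nothing special is needed to skip the keys that appear before 'id' in the file: leaving them out of the column list is enough.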
All the best!