Comprehensive case: Analysis of website user profiles - DataWorks

This topic uses a simple website user profile analysis as an example to help you become familiar with the main features and common tasks of DataWorks, including data synchronization, data processing, data management, and data consumption operations.

Case objective

Case expectation

After you perform operations described in this case, you will be able to independently complete common data-related tasks in DataWorks, such as data synchronization, data development, and task O&M.

Target audience

This case is suitable for personnel who need to obtain data from data warehouses for analysis and insights, such as developers, data analysts, and product operations personnel.

Case design

To formulate business strategies, you need to extract basic user profiles from website behavioral data. For example, you need to obtain information such as the geographical and social attributes of user groups, schedule the analysis to run periodically, and use the results for fine-grained operations on website traffic. You complete these operations by using DataWorks.

Services involved

The website user profile analysis process involves databases used to store raw data, computing storage databases, and the platform for developing the entire process. The following table lists the services that are involved in this case.

Service category	Service name	Description
Database	ApsaraDB RDS for MySQL	This case provides an ApsaraDB RDS for MySQL database to store basic user information.
Database	Object Storage Service (OSS)	This case provides an OSS object to store log information.
Compute engine	MaxCompute, EMR Serverless StarRocks, E-MapReduce (EMR), or EMR Serverless Spark	In this case, you can use a MaxCompute, EMR, EMR Serverless StarRocks, or EMR Serverless Spark computing resource based on DataWorks for development to process the raw data and store the processed data in the desired data warehouse.
Data mid-end	DataWorks	In this case, DataWorks serves as the data mid-end and is used for data synchronization, data processing, data quality monitoring, data consumption, and task scheduling.

Important

The databases and DataWorks are shared by the website user profile analysis processes for all compute engines. You only need to associate the compute engine that you want to use with your DataWorks workspace as a computing resource.
If you use an EMR or EMR Serverless Spark computing resource, you must prepare an OSS data source to receive the basic user information and log information of the case. If you use EMR Serverless StarRocks as the computing and storage service, you must prepare an OSS data source to store the .jar package used to register a function in StarRocks. In addition, you must make sure that the OSS data sources have sufficient storage space and you have the required permissions on them.

Scenario design

You must add the required databases to your DataWorks workspace as data sources and associate the required compute engines with your DataWorks workspace as computing resources. This way, you can process the data in the computing resources to obtain the required geographical attributes, social attributes, and other information of user groups, and manage and consume the data information.

Process design

In this case, you can select the appropriate website user profile analysis process based on the compute engine that you use. Documentation is provided for four types of compute engines: User profile analysis (MaxCompute), User profile analysis (StarRocks), User profile analysis (EMR), and User profile analysis (Spark). The process contains the following steps:

Use Data Integration to extract basic user information and website access logs of users from different data sources to a compute engine.
Process and split the website access logs of users in the compute engine into fields that can be analyzed.
Aggregate the basic user information and the processed website access logs of users in the compute engine.
Further process the data to produce basic user profiles.

Operations

The following table describes the operations that are involved in this case.

Step	Operation	Phased objective
Synchronize data	Synchronize user information data from MySQL and user access log data stored in OSS to different computing resources. MaxCompute and StarRocks: You can directly use Data Integration to synchronize the raw data to the related computing resources. EMR and Spark: You need to use the prepared OSS object to store the synchronized raw data and use EMR and Spark tables to read the data.	Learn the following items: How to synchronize data from different data sources to MaxCompute, EMR, StarRocks, or Spark. How to create a table for the related data source. How to quickly trigger a task to run. How to view task logs.
Process data	Use Data Studio to split log data into analyzable fields by using methods such as functions and regular expressions, and then process and aggregate the fields with the user information table to produce basic user profile data.	Learn the following items: How to create and configure tasks in a DataWorks workflow. How to run a workflow.
Manage data	Use Data Map to manage and view the metadata of source tables. Monitor the dirty data that is generated when the source data changes. If an error occurs, stop the related task to prevent the error from affecting downstream processing.	Learn how to obtain the metadata of the data source table based on DataWorks, search for the data source table, and view the detailed information of the data source table. Learn how to configure data quality monitoring rules for the table generated by a DataWorks task to quickly identify the dirty data that is generated when the source data changes and prevent the dirty data from affecting descendant tasks.
Consume data	Use DataAnalysis to perform SQL queries and analysis on the final result table for website user profile analysis. For example, you can analyze the geographical distribution of users and the rankings of cities by the number of registered users. Use the DataService Studio API feature to create API services from the final result table.	Learn how to present data in a visualized manner and create APIs based on DataWorks.
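
The dirty-data check described in the Manage data step can be pictured with a small sketch. This is not the DataWorks Data Quality rule API. It only illustrates the idea of counting rows that fail a rule and stopping the task when a threshold is exceeded. The rule (a non-empty uid), the threshold, and all function and class names are assumptions made for illustration.

```python
from typing import Iterable, Mapping


class DirtyDataError(Exception):
    """Raised when a table contains more dirty rows than allowed."""


def count_dirty_rows(rows: Iterable[Mapping], required_field: str = "uid") -> int:
    # A row is treated as dirty when the required field is missing or empty.
    # Which field to check is an assumption, not a rule taken from DataWorks.
    return sum(1 for row in rows if not row.get(required_field))


def enforce_quality(rows: Iterable[Mapping],
                    required_field: str = "uid",
                    max_dirty: int = 0) -> int:
    """Return the dirty-row count, or raise to block descendant processing."""
    rows = list(rows)
    dirty = count_dirty_rows(rows, required_field)
    if dirty > max_dirty:
        raise DirtyDataError(
            f"{dirty} of {len(rows)} rows have an empty {required_field!r}"
        )
    return dirty
```

Raising an exception here plays the role of stopping the related task, so that later steps never read the dirty rows.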

Case data

The data structures in this section will be used in subsequent data synchronization, processing, and management steps to generate user profiles.

Log data structure

Before you perform the operations in this case, make sure that you are familiar with the existing business data, the data format, and the basic user profile data structure that is required for business background analysis.

The following line shows the format of the raw log data in the OSS file user_log.txt:

$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];

The following effective information can be obtained from the raw log data.

Field name	Field description
$remote_addr	The IP address of the client that sends the request.
$remote_user	The username that is used to log on to the client.
$time_local	The local time of the server.
$request	The HTTP request. An HTTP request consists of the request type, request URL, and HTTP version number.
$status	The status code that is returned by the server.
$body_bytes_sent	The number of bytes returned to the client. The number of bytes of the header is not included in the field value.
$http_referer	The source URL of the request.
$http_user_agent	The information about the client that sends the request, such as the browser that is used.
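
As a rough illustration of how a log line in this format can be split into the fields listed above, the following Python sketch uses a regular expression. It is not the SQL or function used in the DataWorks tutorials. The pattern, the function name parse_log_line, and the sample line are assumptions invented for this example.

```python
import re
from typing import Optional

# Pattern for the log layout described above. The \s* between the byte count
# and the referer tolerates lines with or without a separating space.
LOG_PATTERN = re.compile(
    r'^(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+|-)\s*'
    r'"(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)"'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # $request consists of the request type, the URL, and the HTTP version.
    parts = fields["request"].split(" ")
    if len(parts) == 3:
        fields["method"], fields["url"], fields["protocol"] = parts
    return fields


# Hypothetical sample line, made up for illustration only.
sample = (
    '203.0.113.7 - user_0001 [10/Aug/2015:14:33:59 +0800] '
    '"GET /index.html HTTP/1.1" 200 1024 "http://example.com/" '
    '"Mozilla/5.0 (Windows NT 10.0)" [unknown_content];'
)
record = parse_log_line(sample)
```

Lines that do not match return None, so they can be counted or set aside as dirty data instead of stopping the parse.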

User information data structure

The following table lists the table structure of the MySQL user information data table ods_user_info_d.

Field name	Field description
uid	The username.
gender	The gender.
age_range	The age range.
zodiac	The zodiac sign.

Final data structure

The following table lists the final data table structure that can be obtained based on analysis of the raw data. You can confirm the final data table structure based on your business requirements.

Field name	Field description
uid	The username.
region	The region.
device	The terminal type.
pv	The number of page views.
gender	The gender.
age_range	The age range.
zodiac	The zodiac sign.
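
To make the relationship between the source data and this final table concrete, the following Python sketch counts page views per user, region, and device, and then attaches the user attributes. It assumes that region and device have already been derived from the log fields. That derivation, the inner-join behavior, the function name build_profiles, and all sample values are assumptions for illustration, not the logic used in the DataWorks tutorials.

```python
from collections import Counter
from typing import Dict, List


def build_profiles(log_records: List[dict], user_info: Dict[str, dict]) -> List[dict]:
    """Count page views per (uid, region, device) and attach user attributes."""
    pv_counts = Counter(
        (rec["uid"], rec["region"], rec["device"]) for rec in log_records
    )
    profiles = []
    for (uid, region, device), pv in sorted(pv_counts.items()):
        info = user_info.get(uid)
        if info is None:
            # Assumption: users missing from the user table are dropped,
            # which corresponds to an inner join.
            continue
        profiles.append({
            "uid": uid,
            "region": region,
            "device": device,
            "pv": pv,
            "gender": info["gender"],
            "age_range": info["age_range"],
            "zodiac": info["zodiac"],
        })
    return profiles


# Hypothetical sample rows, invented for illustration only.
log_records = [
    {"uid": "user_0001", "region": "Hangzhou", "device": "windows_pc"},
    {"uid": "user_0001", "region": "Hangzhou", "device": "windows_pc"},
    {"uid": "user_0002", "region": "Beijing", "device": "iphone"},
    {"uid": "user_0003", "region": "Shanghai", "device": "android"},
]
user_info = {
    "user_0001": {"gender": "female", "age_range": "20-30", "zodiac": "Leo"},
    "user_0002": {"gender": "male", "age_range": "30-40", "zodiac": "Aries"},
}
profiles = build_profiles(log_records, user_info)
```

Each resulting dictionary has the same seven fields as the final table, with pv holding the number of log records in the group.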